
Dinesh DM

The document is a submission for a Data Warehousing and Data Mining course at Tribhuvan University, detailing various features and functionalities of the WEKA toolkit for machine learning. It covers the WEKA interfaces, data preprocessing options, the ARFF file format, and practical applications of algorithms like Apriori, J48, Naive Bayes, and K-means clustering. Additionally, it includes Python program implementations for the Apriori algorithm, Naive Bayes classification, and K-means clustering.

Uploaded by

shresthadinesh71555


NEPALAYA COLLEGE

Tribhuvan University

Institute of Science and Technology

Data Warehousing and Data Mining - CSC-410

Submitted by:
Dinesh Shrestha
(26672/077)

Submitted to:
Narayan Chalise
Department of Computer Science and Information Technology
Kalanki, Kathmandu Nepal

In partial fulfilment of the requirements for the degree of Bachelor of Science in
Computer Science and Information Technology (B.Sc. CSIT)

1. Understand the features of WEKA tool kit such as Explorer,
Knowledge flow interface, Experimenter, command-line interface.

Answer: WEKA (Waikato Environment for Knowledge Analysis) is a popular open-source
machine learning toolkit that provides a collection of tools for data mining and machine
learning tasks. It was developed at the University of Waikato in New Zealand. WEKA
offers various interfaces and tools to cater to different user preferences and
requirements. Here are some of the key features of WEKA:

Explorer:
➢ The Explorer is a graphical user interface (GUI) that provides an interactive
environment for exploring and analyzing datasets.
➢ It allows users to load datasets, preprocess data, apply various machine learning
algorithms, and evaluate model performance.
➢ Users can visualize the results, such as confusion matrices, ROC curves, and more.

Knowledge Flow Interface:

➢ The Knowledge Flow Interface is a visual programming environment that allows users
to create, modify, and execute machine learning workflows graphically.
➢ Users can drag and drop components representing data preprocessing, learning
algorithms, and evaluation methods to create a data analysis or machine learning
pipeline.
➢ This interface is useful for users who prefer a more visual and intuitive way of building
and experimenting with machine learning workflows.

Experimenter:
➢ The Experimenter is a tool for conducting experiments and comparing the performance
of multiple machine learning algorithms on one or more datasets.
➢ Users can design experiments, specify datasets, algorithms, and evaluation metrics,
and run multiple configurations in a batch mode.
➢ The Experimenter facilitates the systematic comparison of different models and
configurations to identify the most suitable approach for a given problem.

Command-Line Interface:
➢ WEKA also provides a command-line interface (CLI) for users who prefer scripting or
automation.
➢ Users can execute various WEKA functionalities using command-line commands,
making it convenient for batch processing and integration with other tools or scripts.
➢ This interface is particularly useful for advanced users or those who want to
incorporate WEKA into larger data processing workflows.
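
As a rough illustration of the scripting use case, the sketch below builds a WEKA command line from Python. It assumes that Java is installed and that the WEKA jar is available as weka.jar (a hypothetical path); the helper name build_weka_command is also made up for this example. The options -t (training file) and -x (number of cross-validation folds) are standard WEKA classifier options.

```python
import shlex

def build_weka_command(classifier, train_file, folds=10, weka_jar="weka.jar"):
    """Return the argument list for running a WEKA classifier from the shell."""
    return [
        "java", "-cp", weka_jar,   # weka_jar is an assumed location of the WEKA jar
        classifier,                # fully qualified class name of the scheme
        "-t", train_file,          # -t: training file (ARFF)
        "-x", str(folds),          # -x: number of cross-validation folds
    ]

cmd = build_weka_command("weka.classifiers.trees.J48", "data/iris.arff")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) when a batch of runs has to be automated.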

2. Navigate the options available in the WEKA such as select attributes
panel, preprocess panel, classify panel, associate panel and visualize.

Dataset (Vote)

Preprocess Panel:
➢ The Preprocess panel provides options for data preprocessing tasks.
➢ Users can apply filters for data transformation, normalization, handling missing values,
and other preprocessing steps to prepare the data for machine learning.

Classify Panel:
➢ The Classify panel is used for building and evaluating machine learning models for
classification tasks.
➢ Users can choose from various classification algorithms, set their parameters, and
assess the performance of the models using different evaluation metrics.

Select Attributes Panel:
➢ This panel in WEKA allows users to choose or deselect specific attributes (features)
from the dataset.
➢ Feature selection is crucial for improving model performance, reducing
dimensionality, and eliminating irrelevant or redundant information.

Associate Panel:
➢ The Associate panel is dedicated to association rule mining.
➢ It allows users to discover interesting relationships or patterns in the data, often used
in market basket analysis or recommendation systems.

Visualize Panel:
➢ The Visualize option in WEKA enables users to visualize the dataset, model
performance, and various statistical metrics.
➢ It includes tools for visualizing decision trees, ROC curves, confusion matrices,
and more.

3. Study the ARFF file format.

Answer: The ARFF (Attribute-Relation File Format) is a plain text file format used for
representing datasets in WEKA and other machine learning software. ARFF files describe the
data and its attributes, making it easy to exchange datasets between different machine learning
tools. Here's an overview of the ARFF file format:
Structure of ARFF File:
a. Header Section:
➢ The header section contains information about the dataset, including the name of the
relation and a list of attributes.
➢ It starts with the @relation keyword, followed by the dataset name.
Example: @relation Iris
b. Attribute Section:
➢ The attribute section specifies the attributes (features) of the dataset.
➢ Each attribute is defined using the @attribute keyword, followed by the attribute
name and its type.
Example:
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
In the example above, the attributes are sepal_length, sepal_width, petal_length,
petal_width (numeric), and class (categorical with three possible values).
c. Data Section:
➢ The data section begins with the @data keyword and contains the actual instances or
examples of the dataset.
➢ Each instance is represented as a comma-separated list of attribute values.

Example:
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
Each line corresponds to a data instance, with values in the order of attributes specified in the
attribute section.
Key Points:
➢ Numeric attributes can take real or integer values, while nominal attributes represent
categorical values.
➢ Missing values are represented by a question mark (?).
➢ Comments can be added using the % symbol.
Example ARFF File (Iris Dataset):
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
In this example, we have a dataset named "Iris" with four numeric attributes and one nominal
attribute representing the class. The data section contains instances of the dataset with
corresponding attribute values. ARFF files provide a standardized and human-readable way to
represent datasets, making it easy to share and use data across different machine learning
platforms that support the ARFF format.
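
To make the three sections concrete, the following is a minimal Python sketch of a reader for simple ARFF text. It is not WEKA's own parser: the function name parse_arff is hypothetical, and quoted names, sparse instances and date or string attributes are deliberately not handled. The sample text reuses the Iris example above, with one value replaced by "?" purely to show how a missing value is read.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):       # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # "?" marks a missing value; store it as None
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):               # nominal: list of allowed values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

SAMPLE = """% Iris sample (second row altered to show a missing value)
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,?,Iris-setosa
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

Values are kept as strings; converting numeric columns with float() would be the natural next step.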

4. Explore the available data sets in WEKA.

Answer: In WEKA, there are several built-in datasets that you can use for experimenting
with various machine learning algorithms. These datasets cover a wide range of domains and
are often used for educational purposes, testing algorithms, and exploring machine learning
concepts.
Here's how you can explore the available datasets in WEKA:
a. Launch WEKA:
➢ Open the WEKA GUI (Explorer) or use the command line to access WEKA.
b. Load Dataset:
➢ In the Explorer, go to the "Preprocess" tab.
➢ Click on the "Open file" button and navigate to the "data" folder within the WEKA
installation directory.
c. Select Dataset:
➢ In the "data" folder, you will find various datasets in ARFF format.
➢ Select a dataset of interest and open it. Alternatively, you can use the "Open"
button to load a dataset.
d. Explore Attributes:
➢ Once a dataset is loaded, go to the "Preprocess" panel to explore the attributes.
➢ You can view statistics, distributions, and visualize attribute values using various
tools in WEKA.
e. Classify and Visualize:
➢ Move to the "Classify" panel to apply machine learning algorithms on the dataset.
➢ You can choose a classifier, set its parameters, and evaluate its performance.
➢ Use the "Visualize" panel to explore the results graphically.
f. Knowledge Flow Interface:
➢ Alternatively, if you prefer the Knowledge Flow interface, you can use it to load
datasets and build workflows visually.

These datasets cover different types of problems and are useful for practicing with various
machine learning algorithms. They can be accessed and explored within the WEKA
environment to gain hands-on experience with data analysis and model building.

5. Load a data set and observe the following:
• List attribute names and their types.
• Number of records in the dataset.
• Identify the class attribute (if any).
• Visualize the data in various dimensions.

Answer: Dataset (breast-cancer)

6. Explore various options in WEKA for preprocessing data and apply
Discretization filter and Resample filter etc. in each dataset.

Answer: Dataset (Glass)

Before filter:

After Discretization filter:

Discretization in WEKA involves converting numeric attributes into nominal (categorical)
attributes by dividing the range of numeric values into intervals (or bins).
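
The idea of binning can be sketched in a few lines of Python. The example below uses equal-width bins, which is only one possible scheme; the function name equal_width_bins, the bin labels and the sample values are assumptions made for illustration and are not taken from the Glass dataset.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to a nominal bin label using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if width == 0:                  # all values identical: single bin
            idx = 0
        else:
            # the maximum value would fall into bin number `bins`, so clamp it
            idx = min(int((v - lo) / width), bins - 1)
        labels.append(f"bin{idx + 1}")
    return labels

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical attribute values
print(equal_width_bins(sample, bins=3))
```

After this step the attribute can be treated as nominal, which is what algorithms such as Apriori require.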

After Resample filter:

The Resample filter in WEKA produces a random subsample of a dataset. It is useful when you
need to adjust the dataset size, balance class distributions, or create a random sample for
training/testing.
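
A minimal Python sketch of the same idea is given below. It draws a random sample with replacement whose size is a percentage of the original data; the function name resample, its parameters and the stand-in rows are assumptions for illustration and do not reproduce WEKA's filter.

```python
import random

def resample(rows, percent=100.0, seed=1):
    """Return a random sample (with replacement) of `percent` % of the rows."""
    rng = random.Random(seed)                       # fixed seed for repeatable samples
    n = int(round(len(rows) * percent / 100.0))
    return [rng.choice(rows) for _ in range(n)]

data = list(range(10))                              # stand-in for 10 instances
half = resample(data, percent=50.0)
print(len(half))
```

Because the sampling is done with replacement, the same instance may appear more than once in the result.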

7. Load each dataset into WEKA and run the Apriori algorithm with
different support and confidence values. Study the rules generated.

The Apriori algorithm is a classical algorithm in data mining and machine learning used for
association rule mining in transactional databases. It aims to find interesting relationships or
associations among a set of items in large datasets. The most common application of the
Apriori algorithm is in market basket analysis, where it helps identify associations between
products that are frequently purchased together.

Dataset (Supermarket)

8. Load each dataset into WEKA and run j48 classification algorithm.
Study the classifier output. Compute entropy values, Kappa statistics.

The J48 algorithm is a classification technique based on the C4.5 decision tree algorithm. It
generates a decision tree for a dataset by splitting it into subsets based on attribute values. This
process uses statistical measures such as information gain or gain ratio to decide the best way
to divide the data at each node. The resulting decision tree can classify new data instances by
traversing the tree based on attribute values.
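
The two quantities asked for in this exercise can be checked by hand with a short Python sketch. The function names entropy and cohen_kappa are chosen for this example. The class counts used (9 "yes" and 5 "no") are those of the standard weather data; the confusion matrix is hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def cohen_kappa(matrix):
    """Kappa statistic from a square confusion matrix (rows = actual, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # chance agreement: sum over classes of (row total * column total) / total^2
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / (total * total)
    return (observed - expected) / (1 - expected)

play = ["yes"] * 9 + ["no"] * 5          # class distribution of the weather data
print(round(entropy(play), 3))           # about 0.940 bits
print(cohen_kappa([[7, 2], [3, 2]]))     # hypothetical confusion matrix
```

A kappa of 1 means perfect agreement between predicted and actual classes, while a value near 0 means the classifier does no better than chance. The formula is undefined when the chance agreement equals 1.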

Dataset (Weather.symbolic)

Decision Tree:

9. Extract the if-then rule from the decision tree generated by the
classifier.
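
Each path from the root of a decision tree to a leaf gives one if-then rule. The Python sketch below walks a tree stored as nested dictionaries and prints the rules. The tree shown is the one commonly obtained for the weather data and is written out here as an assumption; the function name extract_rules and the dictionary layout are also made up for this example.

```python
def extract_rules(tree, target="play", conditions=()):
    """Return one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):                   # leaf: a class label
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {target} = {tree}"]
    rules = []
    (attribute, branches), = tree.items()            # each node tests one attribute
    for value, subtree in branches.items():
        rules.extend(
            extract_rules(subtree, target, conditions + (f"{attribute} = {value}",))
        )
    return rules

# Assumed tree for the weather data (not copied from the classifier output)
WEATHER_TREE = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"TRUE": "no", "FALSE": "yes"}},
    }
}

for rule in extract_rules(WEATHER_TREE):
    print(rule)
```

For this tree the sketch gives five rules, one for each leaf.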
The Naive Bayes classifier is a probabilistic machine learning algorithm used for classification
tasks. It is based on Bayes' Theorem, which calculates the probability of a class given a set of
features (also called evidence). The algorithm assumes that all features are conditionally
independent given the class label, which simplifies computation.

Dataset (Vote):

10. Load a dataset into WEKA and perform Naive-bayes classification
and K-Nearest Neighbor classification.
Naive Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem, used for
classification tasks. It assumes that the features (attributes) of a dataset are independent of each
other given the class label. Despite this "naive" independence assumption, the algorithm often
performs surprisingly well in practice, especially for text classification and spam filtering.

k-Nearest Neighbor (k-NN) is one of the simplest and most intuitive machine learning
algorithms used for both classification and regression tasks. The idea behind k-NN is based on
the assumption that similar instances exist in close proximity to each other in the feature space.
It is an instance-based learning algorithm, meaning it does not construct an explicit model
during training but rather uses the entire dataset to make predictions.
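
The instance-based idea behind k-NN can be sketched in Python as follows. The function name knn_predict and the toy points are assumptions for illustration; they are not taken from the Glass dataset. Euclidean distance is used here, although other distance measures are possible.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # train is a list of (point, label) pairs; no model is built beforehand
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional training data
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (8.5, 8.5)))
```

All the work happens at prediction time, which is why k-NN is often described as a lazy learner.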

Dataset (Glass)

Naive-bayes classification

K-Nearest Neighbor classification

11. Load dataset and perform k-means clustering algorithm with
different values of k.
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning
a dataset into a predetermined number of clusters. The goal of K-means is to minimize the
within-cluster variance, also known as inertia or sum of squared distances from each point in
the cluster to the centroid of that cluster.

Dataset (Iris)

K=2

K=4

12. Load dataset and build Linear Regression model.

Dataset (CPU)

13. Write a python program to implement Apriori algorithm.
Question: (2080)
Transaction ID    Items Purchased
1                 Bread, Cheese, Egg, Juice
2                 Bread, Cheese, Juice
3                 Bread, Milk, Yogurt
4                 Bread, Juice, Milk
5                 Cheese, Juice, Milk
Assuming min. support is 50% and min confidence is 75%.

Source Code:
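
The listing below is a minimal Python sketch of one way to solve this exercise; it is an illustration written for the transactions in the question and may differ from the program originally submitted. The names TRANSACTIONS, count, apriori and rules are chosen for this example. Support counts are kept as integers so that the comparison with the 75% confidence threshold is exact.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Cheese", "Egg", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Yogurt"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

def apriori(transactions, min_support):
    """Return {frequent itemset: support count} using level-wise search."""
    min_count = min_support * len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]          # candidate 1-itemsets
    frequent, k = {}, 1
    while level:
        counted = {c: count(c, transactions) for c in level}
        current = {c: n for c, n in counted.items() if n >= min_count}
        frequent.update(current)
        k += 1
        # join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = n / frequent[lhs]       # count(X and Y) / count(X)
                if confidence >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), confidence))
    return out

frequent = apriori(TRANSACTIONS, min_support=0.5)
for lhs, rhs, conf in rules(frequent, min_confidence=0.75):
    print(sorted(lhs), "->", sorted(rhs), f"confidence = {conf:.2f}")
```

Working through the data by hand, the frequent itemsets at 50% support are {Bread}, {Cheese}, {Juice}, {Milk}, {Bread, Juice} and {Cheese, Juice}. The rules meeting 75% confidence are Bread -> Juice (0.75), Juice -> Bread (0.75), Cheese -> Juice (1.00) and Juice -> Cheese (0.75).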

Output:

14. Write a python program to implement Naive-bayes classification.

Naive Bayes Classification is a probabilistic machine learning algorithm based on Bayes'
Theorem. It is used for classification tasks and is particularly effective for text classification,
spam detection, and sentiment analysis. The algorithm assumes that the features are
conditionally independent given the class label, which simplifies the computation.

Question: (2078)
Confident    Studied    Sick    Result
Yes          No         No      Fail
Yes          No         No      Pass
No           Yes        Yes     Fail
No           Yes        Yes     Pass
Yes          Yes        Yes     Pass
Find out whether the object with attribute Confident = Yes, Sick = No will Fail or Pass using
Bayesian classification.

Source Code:
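
The listing below is a minimal Python sketch of one way to answer this question; it may differ from the program originally submitted. The names DATA, COLUMNS and naive_bayes_scores are chosen for this example. Probabilities are estimated as plain relative frequencies, without Laplace smoothing, which is an assumption of this sketch.

```python
from collections import Counter

COLUMNS = ("Confident", "Studied", "Sick")
DATA = [
    # (Confident, Studied, Sick, Result)
    ("Yes", "No", "No", "Fail"),
    ("Yes", "No", "No", "Pass"),
    ("No", "Yes", "Yes", "Fail"),
    ("No", "Yes", "Yes", "Pass"),
    ("Yes", "Yes", "Yes", "Pass"),
]

def naive_bayes_scores(rows, evidence):
    """Return {class: P(class) * product of P(value | class)} for the given evidence."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)                      # prior P(class)
        for name, value in evidence.items():
            col = COLUMNS.index(name)
            matches = sum(1 for r in rows if r[-1] == label and r[col] == value)
            score *= matches / n_label                   # likelihood P(value | class)
        scores[label] = score
    return scores

scores = naive_bayes_scores(DATA, {"Confident": "Yes", "Sick": "No"})
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

By hand, P(Pass) x P(Confident = Yes | Pass) x P(Sick = No | Pass) = 3/5 x 2/3 x 1/3 = 0.133, while the corresponding value for Fail is 2/5 x 1/2 x 1/2 = 0.100, so the object is classified as Pass.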

Output:

15. Write a python program to implement k-means algorithm.

Source Code:
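
The listing below is a minimal Python sketch of the k-means algorithm; it may differ from the program originally submitted. The function names kmeans and inertia, the random initialisation from the data points and the sample points are all assumptions made for illustration, since the exercise does not specify a dataset.

```python
import random
from math import dist

def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # pick k distinct points as initial centroids
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:           # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters

def inertia(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)

# Hypothetical two-dimensional points forming two visible groups
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(POINTS, k=2)
print(centroids)
print(inertia(centroids, clusters))
```

Running the function with different values of k and comparing the inertia is a simple way to see how the within-cluster variance falls as k grows.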

Output:
