Dinesh DM

The document is a submission for a Data Warehousing and Data Mining course at Tribhuvan University, detailing various features and functionalities of the WEKA toolkit for machine learning. It covers the WEKA interfaces, data preprocessing options, the ARFF file format, and practical applications of algorithms like Apriori, J48, Naive Bayes, and K-means clustering. Additionally, it includes Python program implementations for the Apriori algorithm, Naive Bayes classification, and K-means clustering.
NEPALAYA COLLEGE

Tribhuvan University

Institute of Science and Technology

Data Warehousing and Data Mining - CSC-410

Submitted by:
Dinesh Shrestha
(26672/077)

Submitted to:
Narayan Chalise
Department of Computer Science and Information Technology
Kalanki, Kathmandu Nepal

In partial fulfilment of the requirements for the Degree of Bachelor's in
Computer Science and Information Technology (B.Sc. CSIT)

1. Understand the features of WEKA tool kit such as Explorer,
Knowledge flow interface, Experimenter, command-line interface.

Answer: WEKA (Waikato Environment for Knowledge Analysis) is a popular open-source
machine learning toolkit that provides a collection of tools for data mining and machine
learning tasks. It was developed at the University of Waikato in New Zealand. WEKA
offers various interfaces and tools to cater to different user preferences and
requirements. Here are some of the key features of WEKA:

Explorer:
➢ The Explorer is a graphical user interface (GUI) that provides an interactive
environment for exploring and analyzing datasets.
➢ It allows users to load datasets, preprocess data, apply various machine learning
algorithms, and evaluate model performance.
➢ Users can visualize the results, such as confusion matrices, ROC curves, and more.

Knowledge Flow Interface:


➢ The Knowledge Flow Interface is a visual programming environment that allows users
to create, modify, and execute machine learning workflows graphically.
➢ Users can drag and drop components representing data preprocessing, learning
algorithms, and evaluation methods to create a data analysis or machine learning
pipeline.
➢ This interface is useful for users who prefer a more visual and intuitive way of building
and experimenting with machine learning workflows.

Experimenter:
➢ The Experimenter is a tool for conducting experiments and comparing the performance
of multiple machine learning algorithms on one or more datasets.
➢ Users can design experiments, specify datasets, algorithms, and evaluation metrics,
and run multiple configurations in a batch mode.
➢ The Experimenter facilitates the systematic comparison of different models and
configurations to identify the most suitable approach for a given problem.

Command-Line Interface:
➢ WEKA also provides a command-line interface (CLI) for users who prefer scripting or
automation.
➢ Users can execute various WEKA functionalities using command-line commands,
making it convenient for batch processing and integration with other tools or scripts.
➢ This interface is particularly useful for advanced users or those who want to
incorporate WEKA into larger data processing workflows.

2. Navigate the options available in the WEKA such as select attributes
panel, preprocess panel, classify panel, associate panel and visualize.

Dataset (Vote)

Preprocess Panel:
➢ The Preprocess panel provides options for data preprocessing tasks.
➢ Users can apply filters for data transformation, normalization, handling missing values,
and other preprocessing steps to prepare the data for machine learning.

Classify Panel:
➢ The Classify panel is used for building and evaluating machine learning models for
classification tasks.
➢ Users can choose from various classification algorithms, set their parameters, and
assess the performance of the models using different evaluation metrics.

Select Attributes Panel:
➢ This panel in WEKA allows users to choose or deselect specific attributes (features)
from the dataset.
➢ Feature selection is crucial for improving model performance, reducing
dimensionality, and eliminating irrelevant or redundant information.

Associate Panel:
➢ The Associate panel is dedicated to association rule mining.
➢ It allows users to discover interesting relationships or patterns in the data, often used
in market basket analysis or recommendation systems.

Visualize Panel:
➢ The Visualize option in WEKA enables users to visualize the dataset, model
performance, and various statistical metrics.
➢ It includes tools for visualizing decision trees, ROC curves, confusion matrices,
and more.

3. Study the ARFF file format.

Answer: The ARFF (Attribute-Relation File Format) is a plain text file format used for
representing datasets in WEKA and other machine learning software. ARFF files describe the
data and its attributes, making it easy to exchange datasets between different machine learning
tools. Here's an overview of the ARFF file format:
Structure of ARFF File:
a. Header Section:
➢ The header section contains information about the dataset, including the name of the
relation and a list of attributes.
➢ It starts with the @relation keyword, followed by the dataset name.
Example: @relation Iris
b. Attribute Section:
➢ The attribute section specifies the attributes (features) of the dataset.
➢ Each attribute is defined using the @attribute keyword, followed by the attribute
name and its type.
Example:
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
In the example above, the attributes are sepal_length, sepal_width, petal_length,
petal_width (numeric), and class (categorical with three possible values).
c. Data Section:
➢ The data section begins with the @data keyword and contains the actual instances or
examples of the dataset.
➢ Each instance is represented as a comma-separated list of attribute values.

Example:
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
Each line corresponds to a data instance, with values in the order of attributes specified in the
attribute section.
Key Points:
➢ Numeric attributes can take real or integer values, while nominal attributes represent
categorical values.
➢ Missing values are represented by a question mark (?).
➢ Comments can be added using the % symbol.
Example ARFF File (Iris Dataset):
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
In this example, we have a dataset named "Iris" with four numeric attributes and one nominal
attribute representing the class. The data section contains instances of the dataset with
corresponding attribute values. ARFF files provide a standardized and human-readable way to
represent datasets, making it easy to share and use data across different machine learning
platforms that support the ARFF format.
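
The structure described above can be read with a few lines of Python. The following is a minimal sketch, not a full ARFF reader: it assumes a dense file with unquoted attribute names and values, and the function name parse_arff is hypothetical. All values are kept as strings, and "?" is mapped to None.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # "?" marks a missing value
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>"; the type may contain spaces
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """\
% Two instances from the Iris example
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

Quoted names, sparse instances, and date or string attributes are deliberately left out of this sketch.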

4. Explore the available data sets in WEKA.

Answer: In WEKA, there are several built-in datasets that you can use for experimenting
with various machine learning algorithms. These datasets cover a wide range of domains and
are often used for educational purposes, testing algorithms, and exploring machine learning
concepts.
Here's how you can explore the available datasets in WEKA:
a. Launch WEKA:
➢ Open the WEKA GUI (Explorer) or use the command line to access WEKA.
b. Load Dataset:
➢ In the Explorer, go to the "Preprocess" tab.
➢ Click on the "Open file" button and navigate to the "data" folder within the WEKA
installation directory.
c. Select Dataset:
➢ In the "data" folder, you will find various datasets in ARFF format.
➢ Select a dataset of interest and open it. Alternatively, you can use the "Open URL"
or "Open DB" button to load a dataset from a web address or a database.
d. Explore Attributes:
➢ Once a dataset is loaded, go to the "Preprocess" panel to explore the attributes.
➢ You can view statistics, distributions, and visualize attribute values using various
tools in WEKA.
e. Classify and Visualize:
➢ Move to the "Classify" panel to apply machine learning algorithms on the dataset.
➢ You can choose a classifier, set its parameters, and evaluate its performance.
➢ Use the "Visualize" panel to explore the results graphically.
f. Knowledge Flow Interface:
➢ Alternatively, if you prefer the Knowledge Flow interface, you can use it to load
datasets and build workflows visually.

These datasets cover different types of problems and are useful for practicing with various
machine learning algorithms. They can be accessed and explored within the WEKA
environment to gain hands-on experience with data analysis and model building.

5. Load a data set and observe the following:
• List attribute names and their types.
• Number of records in the dataset.
• Identify the class attribute (if any).
• Visualize the data in various dimensions.

Answer: Dataset (breast-cancer)

6. Explore various options in WEKA for preprocessing data and apply
Discretization filter and Resample filter etc. in each dataset.

Answer: Dataset (Glass)

Before filter:

After Discretization filter:

Discretization in WEKA involves converting numeric attributes into nominal (categorical)
attributes by dividing the range of numeric values into intervals (or bins).
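
The binning idea can be sketched in Python as follows. This is a minimal illustration of equal-width binning (the default mode of WEKA's unsupervised Discretize filter, as far as we are aware), not WEKA's own code; the function name and the sample values are hypothetical.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to a bin index in 0..bins-1 (equal-width intervals)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: everything falls into one bin
        return [0 for _ in values]
    width = (hi - lo) / bins
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]


sample_values = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]  # hypothetical numeric attribute
bin_indices = equal_width_bins(sample_values, bins=3)
print(bin_indices)
```

With three bins over the range 1 to 10, each interval is 3 units wide, so the first three values share a bin and the outlier 10.0 sits alone in the last one.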

After Resample filter:

The Resample filter in WEKA is useful when you need to adjust the dataset size, balance
class distributions, or create a random sample for training/testing.
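
A minimal Python sketch of the sampling idea, assuming random sampling with replacement and a fixed seed for reproducibility. The function and parameter names are hypothetical and only mirror the kind of options the filter exposes (output size as a percentage, random seed).

```python
import random


def resample(rows, percent=100.0, seed=1):
    """Draw a random sample, with replacement, sized as a percentage of the input."""
    rng = random.Random(seed)  # fixed seed gives a reproducible sample
    n = int(round(len(rows) * percent / 100.0))
    return [rng.choice(rows) for _ in range(n)]


records = list(range(10))  # stand-in for dataset instances
half = resample(records, percent=50.0)
print(half)
```

Because sampling is with replacement, the same instance may appear more than once in the output.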

7. Load each dataset into WEKA and run the Apriori algorithm with
different support and confidence values. Study the rules generated.

The Apriori algorithm is a classical algorithm in data mining and machine learning used for
association rule mining in transactional databases. It aims to find interesting relationships or
associations among a set of items in large datasets. The most common application of the
Apriori algorithm is in market basket analysis, where it helps identify associations between
products that are frequently purchased together.

Dataset (Supermarket)

8. Load each dataset into WEKA and run the J48 classification algorithm.
Study the classifier output. Compute entropy values and the Kappa statistic.

The J48 algorithm is a classification technique based on the C4.5 decision tree algorithm. It
generates a decision tree for a dataset by splitting it into subsets based on attribute values. This
process uses statistical measures such as information gain or gain ratio to decide the best way
to divide the data at each node. The resulting decision tree can classify new data instances by
traversing the tree based on attribute values.
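
The entropy, information gain and Kappa computations can be sketched in Python as below. This is an illustration under stated assumptions: the rows reproduce only the outlook and play columns of the well-known 14-instance weather data (9 yes, 5 no), and the confusion matrix is hypothetical, not taken from any WEKA run.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[label_index] for r in rows]) - remainder


def kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    size = len(confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# (outlook, play) pairs of the classic weather data
weather = (
    [("sunny", "no")] * 3 + [("sunny", "yes")] * 2
    + [("overcast", "yes")] * 4
    + [("rainy", "yes")] * 3 + [("rainy", "no")] * 2
)
class_entropy = entropy([play for _, play in weather])
outlook_gain = information_gain(weather, attr_index=0)
example_kappa = kappa([[5, 1], [2, 6]])  # hypothetical confusion matrix
print(round(class_entropy, 3), round(outlook_gain, 3), round(example_kappa, 3))
```

Note that C4.5 ranks splits by gain ratio (information gain divided by the split information); the sketch stops at plain information gain.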

Dataset (Weather.symbolic)

Decision Tree:

9. Extract the if-then rule from the decision tree generated by the
classifier.
Each path from the root of a decision tree to a leaf corresponds to one if-then rule: the
attribute tests along the path form the antecedent (the IF part), and the class label at the
leaf forms the consequent (the THEN part). A tree with n leaves therefore yields n rules.
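
A minimal sketch of rule extraction from a decision tree, where every root-to-leaf path is written out as one rule. It assumes the tree is held as a nested dictionary, which is a hypothetical representation (WEKA prints J48 trees as indented text). The toy tree is illustrative only and is not the classifier output for the Vote dataset.

```python
def extract_rules(tree, conditions=()):
    """Yield one 'IF ... THEN ...' string per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: a class label
        antecedent = " AND ".join(conditions) or "TRUE"
        yield f"IF {antecedent} THEN class = {tree}"
        return
    # An internal node is a one-entry dict: {attribute: {value: subtree, ...}}
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + (f"{attribute} = {value}",))


# Illustrative tree in the style of the weather example
toy_tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"TRUE": "no", "FALSE": "yes"}},
    }
}
rule_list = list(extract_rules(toy_tree))
for rule in rule_list:
    print(rule)
```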

Dataset (Vote):

10. Load a dataset into WEKA and perform Naive Bayes classification
and K-Nearest Neighbor classification.
Naive Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem, used for
classification tasks. It assumes that the features (attributes) of a dataset are independent of each
other given the class label. Despite this "naive" independence assumption, the algorithm often
performs surprisingly well in practice, especially for text classification and spam filtering.

k-Nearest Neighbor (k-NN) is one of the simplest and most intuitive machine learning
algorithms used for both classification and regression tasks. The idea behind k-NN is based on
the assumption that similar instances exist in close proximity to each other in the feature space.
It is an instance-based learning algorithm, meaning it does not construct an explicit model
during training but rather uses the entire dataset to make predictions.
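
The instance-based idea can be sketched in a few lines of Python. This is a minimal illustration assuming Euclidean distance and majority voting; the training points are hypothetical, and the code is not WEKA's k-NN implementation (IBk).

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (feature vector, class label)
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

No model is built in advance: all the work happens at prediction time, when distances to every stored instance are computed.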

Dataset (Glass)

Naive Bayes classification

K-Nearest Neighbor classification

11. Load dataset and perform k-means clustering algorithm with
different values of k.
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning
a dataset into a predetermined number of clusters. The goal of K-means is to minimize the
within-cluster variance, also known as inertia or sum of squared distances from each point in
the cluster to the centroid of that cluster.
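
A minimal Python sketch of the standard (Lloyd's) k-means iteration, assuming Euclidean distance, random initial centroids drawn from the data with a fixed seed, and a small hypothetical set of 2-D points. It is not WEKA's SimpleKMeans implementation.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids, inertia) after Lloyd's iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Inertia: sum of squared distances from each point to its centroid
    inertia = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids, inertia = kmeans(points, k=2)
print(labels, centroids, round(inertia, 3))
```

Running the function with different values of k and comparing the inertia is one simple way to study the effect of k.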

Dataset (Iris)

K=2

K=4

12. Load dataset and build Linear Regression model.

Dataset (CPU)

13. Write a Python program to implement the Apriori algorithm.
Question: (2080)
Transaction ID    Items Purchased
1                 Bread, Cheese, Egg, Juice
2                 Bread, Cheese, Juice
3                 Bread, Milk, Yogurt
4                 Bread, Juice, Milk
5                 Cheese, Juice, Milk
Assume a minimum support of 50% and a minimum confidence of 75%.
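
One way the task could be approached is sketched below. This is a minimal level-wise Apriori written for the five transactions above, not the program originally submitted; the function names are hypothetical. Support is kept as an integer count so that the 50% and 75% thresholds are compared without floating-point rounding problems.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, built level by level."""
    n = len(transactions)

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support * n}
    frequent, k = {}, 1
    while level:
        frequent.update({s: count(s) for s in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then check support
        level = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and count(c) >= min_support * n
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for every rule meeting the confidence threshold."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((tuple(sorted(lhs)), tuple(sorted(itemset - lhs)), confidence))
    return sorted(found)


transactions = [
    frozenset({"Bread", "Cheese", "Egg", "Juice"}),
    frozenset({"Bread", "Cheese", "Juice"}),
    frozenset({"Bread", "Milk", "Yogurt"}),
    frozenset({"Bread", "Juice", "Milk"}),
    frozenset({"Cheese", "Juice", "Milk"}),
]
frequent = apriori(transactions, min_support=0.5)
found_rules = association_rules(frequent, min_confidence=0.75)
for lhs, rhs, confidence in found_rules:
    print(lhs, "->", rhs, confidence)
```

With five transactions, 50% support means an itemset must occur in at least three of them, so only {Bread, Juice} and {Cheese, Juice} survive among the pairs.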

Source Code:

Output:

14. Write a Python program to implement Naive Bayes classification.

Naive Bayes classification is a probabilistic machine learning algorithm based on Bayes'
Theorem. It is used for classification tasks and is particularly effective for text classification,
spam detection, and sentiment analysis. The algorithm assumes that the features are
conditionally independent given the class label, which simplifies the computation.

Question: (2078)
Confident    Studied    Sick    Result
Yes          No         No      Fail
Yes          No         No      Pass
No           Yes        Yes     Fail
No           Yes        Yes     Pass
Yes          Yes        Yes     Pass
Find out whether the object with attribute Confident = Yes, Sick = No will Fail or Pass using
Bayesian classification.
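
The calculation for this question can be sketched as follows. This is a minimal illustration, not the program originally submitted: it assumes plain relative-frequency estimates with no Laplace smoothing, and the function name is hypothetical. Each class score is the prior P(class) multiplied by P(attribute = value | class) for every attribute in the query.

```python
from collections import Counter

COLUMNS = ("Confident", "Studied", "Sick", "Result")
DATA = [
    ("Yes", "No", "No", "Fail"),
    ("Yes", "No", "No", "Pass"),
    ("No", "Yes", "Yes", "Fail"),
    ("No", "Yes", "Yes", "Pass"),
    ("Yes", "Yes", "Yes", "Pass"),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def naive_bayes_scores(rows, query, target="Result"):
    """Return {class: prior * product of conditional probabilities} (unnormalised)."""
    class_counts = Counter(r[target] for r in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(class)
        for attr, value in query.items():
            matches = sum(1 for r in rows if r[target] == label and r[attr] == value)
            score *= matches / count  # P(attr = value | class)
        scores[label] = score
    return scores


scores = naive_bayes_scores(rows, {"Confident": "Yes", "Sick": "No"})
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

By hand: the score for Pass is 3/5 × 2/3 × 1/3 = 2/15 ≈ 0.133 and the score for Fail is 2/5 × 1/2 × 1/2 = 0.1, so the object is classified as Pass.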

Source Code:

Output:

15. Write a Python program to implement the k-means algorithm.

Source Code:

Output:
