DHW Lab (Ex1 To 3)

INTRODUCTION TO WEKA TOOL

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.

Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature. Weka is open source software issued under the GNU General Public License.

Downloading and/or installation of WEKA data mining toolkit

1. Go to the Weka website, http://www.cs.waikato.ac.nz/ml/weka/, and download the software. On the left-hand side, click on the link that says Download.
2. Select the appropriate link corresponding to the version of the software for your operating system.
3. Download the software from one of the listed sites. Save the self-extracting executable to disk and then double-click on it to install Weka. Answer Yes or Next to the questions during the installation.
4. Click Yes to accept the Java agreement if necessary. After you install the program, Weka should appear on your Start menu under Programs.
5. To run Weka, from the Start menu select Programs, then Weka. You will see the Weka GUI Chooser. Select Explorer. The Weka Explorer will then launch.

Understand the features of WEKA toolkit such as Explorer, Knowledge Flow interface, Experimenter, command-line interface.

The Weka GUI Chooser provides a starting point for launching Weka's main GUI applications and supporting tools. The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus.

The buttons can be used to start the following applications:


Explorer: An environment for exploring data with WEKA.
a) Click on the Explorer button to bring up the Explorer window.
b) Make sure the Preprocess tab is highlighted.
c) Open a file by clicking on Open file... and choosing a file with the .arff extension from the data directory.
d) The attributes appear in the window below.
e) Click on an attribute to see its visualization on the right.
f) Click Visualize All to see all of them.

Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
a) The Experimenter is for comparing results.
b) Under the Setup tab click New.
c) Click Add new under the Datasets panel. Choose a couple of ARFF files from the data directory, one at a time.
d) Click Add new under the Algorithms panel. Choose several algorithms, one at a time, by clicking OK in the window and then Add new.
e) Under the Run tab click Start.
f) Wait for WEKA to finish.
g) Under the Analyse tab click Experiment to see the results.

Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.

Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.

Navigate the options available in WEKA (e.g. the Preprocess panel, Classify panel, Cluster panel, Associate panel, Select attributes panel and Visualize panel)

When the Explorer is first started only the first tab is active. This is because it is
necessary to open a data set before starting to explore the data.
The tabs are as follows:

 Preprocess: Choose and modify the data being acted on.
 Classify: Train and test learning schemes that classify or perform regression.
 Cluster: Learn clusters for the data.
 Associate: Learn association rules for the data.
 Select attributes: Select the most relevant attributes in the data.
 Visualize: View an interactive 2D plot of the data.
1. Preprocessing

Loading Data: The first four buttons at the top of the preprocess section enable you
to load data into WEKA:
 Open file... Brings up a dialog box allowing you to browse for the data file on the local file system.
 Open URL... Asks for a Uniform Resource Locator address where the data is stored.
 Open DB... Reads data from a database.
 Generate... Enables you to generate artificial data from a variety of Data Generators.
Using the Open file ... button you can read files in a variety of formats:

WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
2. Classification:

Selecting a Classifier: At the top of the Classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box with the left mouse button brings up a Generic Object Editor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right click (or Alt+Shift+left click) you can copy the setup string to the clipboard or display the properties in a Generic Object Editor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options: The result of applying the chosen classifier will be tested according
to the options that are set by clicking in the Test options box.

There are four test modes:


1. Use training set: The classifier is evaluated on how well it predicts the class of
the instances it was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the
number of folds that are entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held out
depends on the value entered in the % field.
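The percentage-split mode can be sketched in plain Python. The function name, the shuffle, and the default 66% training share below are illustrative assumptions, not Weka's implementation:

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Hold out (100 - train_pct)% of the data for testing,
    as Weka's 'Percentage split' test option does."""
    data = instances[:]
    random.Random(seed).shuffle(data)   # randomize before splitting
    cut = int(len(data) * train_pct / 100)
    return data[:cut], data[cut:]

train, test = percentage_split(list(range(10)), train_pct=66)
```

With ten instances and a 66% split, six instances are used for training and four are held out for testing.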

3. Clustering:

Cluster Modes: The Cluster mode box is used to choose what to cluster and how to
evaluate the results. The first three options are the same as for classification: Use
training set, Supplied test set and Percentage split.
4. Associating:

Setting Up: This panel contains schemes for learning association rules, and the
learners are chosen and configured in the same way as the clusterers, filters, and
classifiers in the other panels.

5. Selecting Attributes:
Searching and Evaluating: Attribute selection involves searching through all
possible combinations of attributes in the data to find which subset of attributes
works best for prediction. To do this, two objects must be set up: an attribute
evaluator and a search method. The evaluator determines what method is used to
assign a worth to each subset of attributes. The search method determines what style
of search is performed.
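The evaluator/search pairing can be illustrated with a toy exhaustive search in Python. Both `best_subset` and the scoring rule are hypothetical stand-ins: the evaluator plays the role of a Weka attribute evaluator, and the loop over all combinations corresponds to an exhaustive search method:

```python
from itertools import combinations

def best_subset(attributes, evaluate):
    """Score every non-empty subset of attributes with the supplied
    evaluator and return the highest-scoring subset."""
    best, best_score = None, float("-inf")
    for r in range(1, len(attributes) + 1):
        for subset in combinations(attributes, r):
            score = evaluate(subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score

# Toy evaluator: rewards subsets containing 'petalwidth', penalizes size.
score = lambda s: ("petalwidth" in s) - 0.1 * len(s)
subset, _ = best_subset(["sepallength", "petalwidth", "sepalwidth"], score)
```

Here the search settles on the single attribute `petalwidth`, since adding more attributes only lowers the toy score.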

6. Visualizing:

WEKA's visualization section allows you to visualize 2D plots of the current relation.
Study the arff file format
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a
list of instances sharing a set of attributes. ARFF files were developed by the
Machine Learning Project at the Department of Computer Science of The University
of Waikato for use with the Weka machine learning software.
Overview: ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes
(the columns in the data), and their types. An example header on the standard IRIS
dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
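The format is simple enough to parse by hand. A minimal sketch, handling only the subset of ARFF shown above (comments, case-insensitive @relation/@attribute/@data, comma-separated rows); `parse_arff` is an illustrative helper, not part of Weka:

```python
def parse_arff(text):
    """Minimal ARFF reader: returns (relation name, attribute names, data rows)."""
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split()[1])    # second token is the name
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            data.append(line.split(","))
    return relation, attributes, data

sample = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor}
@DATA
5.1,Iris-setosa"""
rel, attrs, rows = parse_arff(sample)
```

For the sample above, this yields relation `iris`, attributes `sepallength` and `class`, and one data row.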

Explore the available data sets in WEKA


There are 23 different datasets available in Weka (C:\Program Files\Weka-3-6\) by default for testing purposes. All the datasets are in .arff format and are listed in that directory.
EX:No:1 DATA EXPLORATION AND INTEGRATION USING WEKA
AIM:

To explore and integrate Data using weka tool

Description:

Step 1: Load Your Data

 Open WEKA Tool.


 Click on the "Explorer" tab.
 In the "Preprocess" panel, click on the "Open file" button to load your dataset. WEKA
supports various file formats like CSV, ARFF, etc.

Step 2: Explore Your Data

 Once the dataset is loaded, explore it in the "Preprocess" panel.
 View summary statistics and information about your dataset by clicking on the "Summary" button.
 This gives a quick overview of the data's distribution, missing values, and other statistics.
 You can also visualize your data using the "Visualize" button. This allows you to generate various plots and charts to understand the data's patterns and relationships.

Step 3: Preprocess Data

Data preprocessing is a critical step in data integration and exploration. You may need to clean, transform, and preprocess your data to make it suitable for machine learning. Here are some common preprocessing steps.

Data Preprocessing Steps:


Handling Missing Values: Use the "Filter" option in the "Preprocess" panel to apply filters like
"ReplaceMissingValues" to handle missing data.

Feature Selection: WEKA provides various feature selection methods to choose the most
relevant features for your machine learning model.

Data Transformation: You can use filters like "Normalize" or "Standardize" to scale your
features. This ensures that all features are on the same scale, which can be important for many
machine learning algorithms.

Data Discretization: If you have continuous variables, you may want to discretize them into
bins using filters like "Discretize."
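As a rough sketch of what the Normalize and Discretize filters compute, assuming min-max scaling and equal-width binning (the function names here are illustrative, not Weka API):

```python
def normalize(values):
    """Rescale numeric values to [0, 1] (min-max scaling),
    like Weka's Normalize filter."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins=3):
    """Equal-width binning: map each value to a bin index 0..bins-1,
    a simple form of what Weka's Discretize filter does by default."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]

petal_widths = [0.2, 0.4, 1.3, 1.8, 2.5]
bins = discretize(petal_widths, bins=3)
```

Normalizing `[0, 5, 10]` gives `[0.0, 0.5, 1.0]`; the petal widths above fall into three equal-width bins of width (2.5 - 0.2) / 3.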
Step 4: Integration

Integration often involves combining data from multiple sources. In WEKA, you can load multiple datasets and merge or append them using the "Merge Two Files" filter, or use external tools to combine data before loading it into WEKA.

Example Scenario:

Let's consider two datasets: one containing student details (StudentDetails.csv) and another containing weather details (WeatherDetails.csv). Now integrate these datasets. Preprocess each dataset separately to handle missing values, feature selection, and transformation. Use the "Merge Two Files" and "Append Two Files" filters in WEKA. Once the datasets are integrated, proceed with clustering or classification tasks to segment the data.

Load datasets in WEKA.

PROCEDURE 1:
1. Open the Weka tool.

2. Click the Explorer button.

3. Click the Open file button under the Preprocess tab.

4. Choose the file weather.nominal.arff and click Open.

5. Select the outlook attribute and observe the attributes and the charts.

6. Uncheck all the attributes.

7. Select the play attribute.

8. Click the Visualize All button to view the different charts.

9. Stop the process.

Data Integration after Loading

PROCEDURE 2:
STEP 1: Create a new dataset Sample1.arff

@relation sample1
@attribute col1 numeric
@attribute col2 numeric
@attribute result {Yes, No}
@data
10, 20, Yes
20, 30, No
STEP 2: Create the new dataset Sample2.arff

@relation sample2
@attribute col1 numeric
@attribute col2 numeric
@attribute result {Yes, No}
@data
30, 40, Yes
40, 50, No

STEP 3: Open the weka tool

STEP 4: Click simple CLI button

STEP 5: In the Simple CLI, run: java weka.core.Instances append z:/sample1.arff z:/sample2.arff > z:/sample3.arff
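What the append command does can be sketched in Python. `append_arff` below is an illustrative helper that simply concatenates the @data rows of the second document under the first document's header; Weka's `weka.core.Instances append` additionally checks that the two headers are compatible:

```python
def append_arff(text1, text2):
    """Concatenate the @data rows of two ARFF documents that share
    the same attribute structure, keeping the first header."""
    def split(text):
        lines = text.strip().splitlines()
        # find the @data line; everything after it is data rows
        i = next(n for n, l in enumerate(lines)
                 if l.strip().lower().startswith("@data"))
        return lines[:i + 1], lines[i + 1:]
    header, rows1 = split(text1)
    _, rows2 = split(text2)
    return "\n".join(header + rows1 + rows2)

sample1 = "@relation sample1\n@attribute col1 numeric\n@data\n10,20"
sample2 = "@relation sample2\n@attribute col1 numeric\n@data\n30,40"
merged = append_arff(sample1, sample2)
```

The merged document keeps the first relation's header and ends with the rows of both inputs, mirroring the sample3.arff produced by the CLI step.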

RESULT:

Thus, the data exploration and integration using Weka has been completed successfully.
Ex:No:2 Apply Weka tool for data validation

Aim:

To apply weka tool for data validation.

Description

Cross Validation (Using 10 folds)

 Suppose Weka is given 100 labelled instances.
 It divides them into 10 equal-sized folds. In each round, 90 instances are used for training and the remaining 10 for testing.
 It builds a classifier with an algorithm from the 90 training instances and applies it to the 10 test instances of fold 1.
 It does the same for folds 2 to 10, producing 9 more classifiers, and averages the performance of the 10 classifiers produced from the 10 (90 training / 10 testing) splits.
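The fold construction described above can be sketched as follows; `cross_validation_folds` is an illustrative helper, not Weka code (Weka also stratifies the folds by class, which this sketch omits):

```python
def cross_validation_folds(instances, k=10):
    """Split data into k folds; each iteration yields a (train, test)
    pair where one fold is held out and the rest are used for training."""
    folds = [instances[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(100))
splits = list(cross_validation_folds(data, k=10))
```

With 100 instances and k=10, every round trains on 90 instances and tests on the remaining 10, and each instance is tested exactly once.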

Procedure 1:

1. Open Weka and click on Explorer.
2. Under the Preprocess tab click "Open file" and load the "credit-g.arff" dataset.
3. Under the Classify tab select the J48 classifier under trees, and set the test option to Cross-validation with 10 folds.
4. Click Start and note the result.
5. Repeat steps 3 and 4, changing the test option to Cross-validation with 3 and then 5 folds.
6. Compare the generated results.
7. Visualize the results by right-clicking the entry in the result list and clicking Visualize tree.

Training Set and Test Set

Training data is the (often large) dataset used to teach a machine learning model. It is used to teach prediction models that use machine learning algorithms how to extract features that are relevant to specific business goals. For supervised ML models the training data is labelled; the data used to train unsupervised ML models is not. Training data is also known as a training set, training dataset or learning set.
The test set is a separate set of data used to test the model after training is complete.

Procedure 2:

1. Open Weka and click on Explorer.

2. Under the Preprocess tab click "Open file" and load the "segment-challenge" dataset.

3. Under the Classify tab select the J48 classifier under trees.

4. Select "Use training set" under test options.

5. Click the Start button and observe the generated results.

6. Select "Supplied test set" under test options.

7. Click the Set... button, then Open file, and choose the "segment-test.arff" file.

8. Click the Start button and compare the training and test results.

9. Stop the process.

Result:

Thus Data validation using Weka tool has been completed successfully.
Ex:No:3 Plan the architecture for a Real time application

AIM:

To plan the Web Services based Real time Data Warehouse Architecture

Procedure:

A web services-based real-time data warehouse architecture enables the integration of data from various sources in near real-time, using web services as the communication mechanism. Here's an overview of such an architecture:

Data Sources: These are the systems or applications where the raw data originates. They could include operational databases, external APIs, logs, etc.

Web Service Clients (WS Client): These components are responsible for extracting
data changes from the data sources using techniques such as Change Data Capture
(CDC) and sending them to the web service provider. They make use of web service
calls to transmit data.

Web Service Provider: The web service provider receives data from the clients and processes it for further integration into the real-time data warehouse. It decomposes the received data, performs the necessary transformations, generates SQL statements, and interacts with the data warehouse for insertion.

This is a web service that receives data from the WS Client and adds it to the Real-
Time Partition. It decomposes the received Data Transfer Object into data and
metadata. It then uses metadata to generate SQL via an SQL-Generator to insert the
data into RTDW log tables and executes the generated SQL on the RTDW database.
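The SQL-Generator step can be sketched roughly as follows. The table name, column names, and the `generate_insert` helper are all hypothetical illustrations; the architecture only says that metadata drives the SQL generation:

```python
def generate_insert(metadata, row):
    """Build an INSERT statement for an RTDW log table from metadata
    describing the target table and its columns."""
    cols = ", ".join(metadata["columns"])
    vals = ", ".join(repr(v) for v in row)   # naive literal rendering
    return f"INSERT INTO {metadata['table']} ({cols}) VALUES ({vals});"

# Hypothetical metadata for one log table and one captured row.
meta = {"table": "sales_log", "columns": ["order_id", "amount"]}
sql = generate_insert(meta, (42, 19.99))
```

In a real system the generator would use parameterized statements rather than string interpolation; the sketch only shows how metadata (table and column names) separates cleanly from the transported data.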

Metadata: Metadata describes the structure and characteristics of the data. In this context, it is used by the Web Service Provider to generate SQL for inserting data into RTDW log tables. In a web services-based architecture, metadata plays a crucial role in understanding data formats, schemas, and transformations. It is often managed centrally to ensure consistency across the system.

ETL (Extract, Transform, Load): ETL processes are employed to collect data
from various sources, transform it into a consistent format, and load it into the data
warehouse. In a real-time context, this process may involve continuous or near real-
time transformations to ensure that data is available for analysis without significant
delays.

Real-Time Partition: This is a section of the data warehouse dedicated to storing real-time or near real-time data. It may utilize techniques such as in-memory databases or specialized storage structures optimized for high-speed data ingestion and query processing. There are three stages:

 Putting the CDC data into the log table.
 Cleaning the CDC log data on demand.
 Aggregating the cleaned CDC data on demand.
Data Warehouse: The data warehouse stores both historical and real-time data. It
provides a unified repository for storing and querying data for analytical purposes.
In a web services-based architecture, the data warehouse may be accessed through
APIs exposed as web services.

Real-Time Data Integration: This component facilitates the integration of real-time data into the data warehouse. It ensures that data from various sources is combined seamlessly and made available for analysis in real-time or near real-time.

Query Interface: Users interact with the system through a query interface, which
could be a web-based dashboard, API endpoints, or other client applications. The
query interface allows users to retrieve and analyze data stored in the data
warehouse, including both historical and real-time data.

Web Services based Real time Data Warehouse Architecture


Overall, a web services-based real-time data warehouse architecture provides a
scalable and flexible framework for integrating and analyzing data from diverse
sources in real-time, enabling organizations to make data-driven decisions more
effectively.

Result:

Thus the web services-based real-time data warehouse architecture has been studied successfully.
