
DWDM LAB MANUAL

DATA WAREHOUSING & DATA MINING LABORATORY


Regulation – 20
Year / Semester: III / I
III Year – I Semester                L T P C: 0 0 3 1.5
DATA WAREHOUSING AND DATA MINING LAB

Course Objectives: The main objectives of the course are to


 Inculcate the conceptual, logical, and physical design of data warehouses, OLAP applications and
OLAP deployment
 Design a data warehouse or data mart to present information needed by management in a usable
form
 Emphasize hands-on experience working with real data sets
 Test real data sets using popular data mining tools such as WEKA and Python libraries
 Develop the ability to design various algorithms based on data mining tools

Course Outcomes: By the end of the course the student will be able to


 Design a data mart or data warehouse for any organization
 Extract knowledge using data mining techniques and enlist the various algorithms used in
information analysis
 Demonstrate the working of algorithms for data mining tasks such as association rule mining and
classification on realistic data
 Implement and analyze knowledge flow applications on data sets and apply suitable
visualization techniques to present analytical results

Software Requirements: WEKA Tool/Python/R-Tool/Rapid Tool/Oracle Data mining

List of Experiments:
1. Creation of a Data Warehouse.
 Build Data Warehouse/Data Mart (using open source tools like Pentaho Data Integration Tool,
Pentaho Business Analytics; or other data warehouse tools like Microsoft-SSIS, Informatica,
Business Objects, etc.,)
 Design multi-dimensional data models namely Star, Snowflake and Fact Constellation schemas for
any one enterprise (ex. Banking, Insurance, Finance, Healthcare, manufacturing, Automobiles,
sales etc).
 Write ETL scripts and implement using data warehouse tools.
 Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot

2. Explore the machine learning tool "WEKA"


 Explore WEKA Data Mining/Machine Learning Toolkit.
 Downloading and/or installation of WEKA data mining toolkit.
 Understand the features of WEKA toolkit such as Explorer, Knowledge Flow interface,
Experimenter, command-line interface.
 Navigate the options available in the WEKA (ex. Select attributes panel, Preprocess panel, Classify
panel, Cluster panel, Associate panel and Visualize panel)
 Study the ARFF file format. Explore the available data sets in WEKA. Load a data set (ex. Weather
dataset, Iris dataset, etc.)
 Load each dataset and observe the following:
1. List the attribute names and their types
2. Number of records in each dataset
3. Identify the class attribute (if any)
4. Plot Histogram
5. Determine the number of records for each class.
6. Visualize the data in various dimensions

3. Perform data preprocessing tasks and Demonstrate performing association rule mining on data sets
 Explore various options available in Weka for preprocessing data and apply Unsupervised filters
like Discretization, Resample filter, etc. on each dataset
 Load the weather.nominal, Iris and Glass datasets into Weka and run the Apriori
algorithm with different support and confidence values.
 Study the rules generated. Apply different discretization filters on numerical attributes and run the
Apriori association rule algorithm. Study the rules generated.
 Derive interesting insights and observe the effect of discretization in the rule generation process.

4. Demonstrate performing classification on data sets


 Load each dataset into Weka and run the ID3 and J48 classification algorithms. Study the classifier
output. Compute entropy values and the Kappa statistic.
 Extract if-then rules from the decision tree generated by the classifier, Observe the confusion
matrix.
 Load each dataset into Weka and perform Naïve-bayes classification and k-Nearest Neighbour
classification. Interpret the results obtained.
 Plot RoC Curves
 Compare classification results of ID3, J48, Naïve-Bayes and k-NN classifiers for each dataset, and
deduce which classifier is performing best and poor for each dataset and justify.

5. Demonstrate performing clustering of data sets


 Load each dataset into Weka and run simple k-means clustering algorithm with different values of
k (number of desired clusters).
 Study the clusters formed. Observe the sum of squared errors and centroids, and derive insights.
 Explore other clustering techniques available in Weka.
 Explore visualization features of Weka to visualize the clusters. Derive interesting insights and
explain.

6. Demonstrate knowledge flow application on data sets


 Develop a knowledge flow layout for finding strong association rules by using Apriori, FP Growth
algorithms
 Set up the knowledge flow to load an ARFF (batch mode) and perform a cross validation using J48
algorithm
 Demonstrate plotting multiple ROC curves in the same plot window by using j48 and Random
forest tree
7. Demonstrate ZeroR technique on Iris dataset (by using necessary preprocessing technique(s)) and
share your observations
8. Write a java program to prepare a simulated data set with unique instances.
9. Write a Python program to generate frequent item sets / association rules using Apriori algorithm
10. Write a program to calculate chi-square value using Python. Report your observation.
11. Write a program of Naive Bayesian classification using Python programming language.
12. Implement a Java program to perform Apriori algorithm
13. Write a program to cluster your choice of data using simple k-means algorithm using JDK
14. Write a program of cluster analysis using simple k-means algorithm Python programming language.
15. Write a program to compute/display dissimilarity matrix (for your own dataset containing at least four
instances with two attributes) using Python
16. Visualize the datasets using matplotlib in Python (Histogram, Box plot, Bar chart, Pie chart, etc.)

Data Warehousing
Experiments:

1. Build Data Warehouse and Explore WEKA

A. Build a Data Warehouse/Data Mart (using open source tools like Pentaho Data
Integration tool, Pentaho Business Analytics; or other data warehouse tools like
Microsoft- SSIS, Informatica, Business Objects, etc.).

(i). Identify source tables and populate sample data

In this task, we are going to use the MySQL Administrator and SQLyog Enterprise tools
for building and identifying tables in a database and for populating (filling) the sample data in
those tables. A data warehouse is constructed by integrating data from multiple
heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries and
decision making. We build a data warehouse by integrating all the tables in the database and
analyzing those data. The figure below shows the MySQL Administrator connection
establishment.

After successful login, it will open new window as shown below.



There are different options available in MySQL Administrator. Another tool, SQLyog
Enterprise, is used for building and identifying tables in a database after a successful
connection is established through MySQL Administrator. Below we can see the window of
SQLyog Enterprise.

In the left-side navigation, we can see the different databases and their related tables. Now we are
going to build tables and populate table data in the database through SQL queries. These tables in the
database can be used further for building the data warehouse.

In the above two windows, we created a database named "sample" and in that database we
created two tables named "user_details" and "hockey" through SQL queries.
Now, we are going to populate (fill) sample data through SQL queries in those two
created tables, as represented in the windows below.
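For completeness, the same step can also be scripted from Python instead of typing the queries by hand in SQLyog. The sketch below is a minimal, hedged example: it assumes the mysql-connector-python package is installed, and the credentials, column names and sample rows are placeholders that are not taken from the manual.

import mysql.connector  # pip install mysql-connector-python

# Placeholder credentials -- replace with your own MySQL settings.
conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="sample")
cur = conn.cursor()

# Create the user_details table used in this experiment (columns are hypothetical).
cur.execute("""
    CREATE TABLE IF NOT EXISTS user_details (
        user_id INT PRIMARY KEY,
        user_name VARCHAR(50),
        city VARCHAR(50)
    )
""")

# Populate a few sample rows.
rows = [(1, "Anil", "Hyderabad"), (2, "Bhavna", "Chennai"), (3, "Chitra", "Pune")]
cur.executemany(
    "INSERT INTO user_details (user_id, user_name, city) VALUES (%s, %s, %s)",
    rows)

conn.commit()
cur.close()
conn.close()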

Through MySQL Administrator & SQLyog, we can import databases from other sources (.XLS,
.CSV, .sql) and we can also export our databases as backups for further processing. We can connect
MySQL to other applications for data analysis & reporting.

(ii). Design multi-dimensional data models namely Star, snowflake and Fact constellation
schemas for any one enterprise (ex. Banking, Insurance, Finance, Healthcare,
Manufacturing, Automobile, etc.).

The multi-dimensional model was developed for implementing data warehouses and it provides both a
mechanism to store data and a way to perform business analysis. The primary components of a
dimensional model are dimensions and facts. There are different types of multi-dimensional
data models. They are:
1. Star Schema Model
2. Snow Flake Schema Model
3. Fact Constellation Model.

Now, we are going to design these multi-dimensional models for a Marketing
enterprise.
First, we need to build the tables in a database through SQLyog as shown below.

In the above window, the left side navigation bar shows a database named "sales_dw"
in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson,
dimstores, factproductsales) have been created.

After creating tables in the database, we are going to use a tool called "Microsoft
Visual Studio 2012 for Business Intelligence" for building multi-dimensional models.

In the above window, we see Microsoft Visual Studio before creating a project, in
which the right side navigation bar contains different options like Data Sources, Data Source Views,
Cubes, Dimensions, etc.

Through Data Sources, we can connect to our MySQL database named "sales_dw".
Then all the tables in that database are automatically retrieved into this tool for creating multi-
dimensional models.

Through data source views and cubes, we can see the retrieved tables as multi-dimensional
models. We also need to add dimensions through the Dimensions option. In general, multi-
dimensional models consist of dimension tables and fact tables.

Star Schema Model:

A star schema model is a join between a fact table and a number of dimension tables. Each
dimension table is joined to the fact table using a primary key to foreign key join, but the
dimension tables are not joined to each other. It is the simplest style of data warehouse schema.

The entity relationship diagram of this schema resembles a star, with points
radiating from a central table, as seen in the window implemented in Visual Studio below.

Snow Flake Schema:

It is slightly different from the star schema: the dimension tables of a star schema
are organized into a hierarchy by normalizing them.
The snowflake schema is represented by a centralized fact table which is connected to
multiple dimension tables. Snowflaking affects only dimension tables, not fact tables. We
developed a snowflake schema for the sales_dw database with the Visual Studio tool as shown below.

Fact Constellation Schema:


Fact Constellation is a set of fact tables that share some dimension tables. In this schema
there are two or more fact tables. We developed fact constellation in visual studio as shown
below. Fact tables are labelled in yellow color.

2. Write ETL scripts and implement using data warehouse tools

ETL (Extract-Transform-Load):
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers a
process of how the data are loaded from the source system to the data warehouse. Currently, the
ETL encompasses a cleaning step as a separate step. The sequence is then Extract-Clean-
Transform-Load. Let us briefly describe each step of the ETL process.
Process
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for
further processing. The main objective of the extract step is to retrieve all the required data from
the source system with as little resources as possible. The extract step should be designed in a
way that it does not negatively affect the source system in terms of performance, response time
or any kind of locking.
There are several ways to perform the extract:
 Update notification - if the source system is able to provide a notification that a record
has been changed and describe the change, this is the easiest way to get the data.
 Incremental extract - some systems may not be able to provide notification that an update
has occurred, but they are able to identify which records have been modified and provide
an extract of such records. During further ETL steps, the system needs to identify
changes and propagate it down. Note, that by using daily extract, we may not be able to
handle deleted records properly.
 Full extract - some systems are not able to identify which data has been changed at all, so
a full extract is the only way one can get the data out of the system. The full extract
requires keeping a copy of the last extract in the same format in order to be able to
identify changes. Full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is extremely important;
particularly for full extracts, the data volumes can be in the tens of gigabytes.
Clean:
The cleaning step is one of the most important as it ensures the quality of the data in the data
warehouse. Cleaning should perform basic data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
 Convert null values into standardized Not Available/Not Provided value
 Convert phone numbers, ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
 Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
Transform:
The transform step applies a set of rules to transform the data from the source to the target. This
includes converting any measured data to the same dimension (i.e. conformed dimension) using
the same units so that they can later be joined. The transformation step also requires joining data
from several sources, generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.

Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as
little resources as possible. The target of the Load process is often a database. In order to make
the load process efficient, it is helpful to disable any constraints and indexes before the load and
enable them again only after the load completes. Referential integrity needs to be maintained
by the ETL tool to ensure consistency.
Managing ETL Process:
The ETL process seems quite straightforward. As with every application, there is a possibility
that the ETL process fails. This can be caused by missing extracts from one of the systems,
missing values in one of the reference tables, or simply a connection or power outage. Therefore,
it is necessary to design the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart, at least, some of the phases independently from the others. For
example, if the transformation step fails, it should not be necessary to restart the Extract step. We
can ensure this by implementing proper staging. Staging means that the data is simply dumped to
the location (called the Staging Area) so that it can then be read by the next processing phase.
The staging area is also used during the ETL process to store intermediate results of processing.
The staging area should be accessed by the ETL process only. It should never be available to
anyone else, particularly not to end users, as it is not intended for data presentation and may
contain incomplete or in-the-middle-of-processing data.
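To make the Extract-Clean-Transform-Load sequence concrete, here is a minimal Python sketch using pandas and SQLite. The file name sales.csv, the column names and the cleaning rules are hypothetical stand-ins chosen only to mirror the steps described above.

import sqlite3
import pandas as pd

# Extract: read the source data (hypothetical CSV file with columns sex, zip, amount).
df = pd.read_csv("sales.csv")

# Clean: unify identifiers and standardize missing values.
df["sex"] = df["sex"].map({"M": "Male", "F": "Female"}).fillna("Unknown")
df["zip"] = df["zip"].astype(str).str.zfill(5)

# Transform: derive a new calculated value and an aggregate.
df["amount_usd"] = df["amount"] * 1.0   # unit-conversion placeholder
summary = df.groupby("sex", as_index=False)["amount_usd"].sum()

# Load: write the cleaned detail and the aggregate into warehouse tables.
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("fact_sales", conn, if_exists="replace", index=False)
    summary.to_sql("agg_sales_by_sex", conn, if_exists="replace", index=False)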

ETL Tool Implementation:


When you are about to use an ETL tool, there is a fundamental decision to be made: will the
company build its own data transformation tool or will it use an existing tool?
Building your own data transformation tool (usually a set of shell scripts) is the preferred
approach for a small number of data sources which reside in storage of the same type. The reason
for that is that the effort to implement the necessary transformations is small due to the similar data
structures and common system architecture. Also, this approach saves licensing costs and there is
no need to train the staff in a new tool. This approach, however, is dangerous from the TCO
(total cost of ownership) point of view. If the transformations become more sophisticated over time
or there is a need to integrate other systems, the complexity of such an ETL system grows but its
manageability drops significantly. Similarly, the implementation of your own tool often
resembles re-inventing the wheel.
There are many ready-to-use ETL tools on the market. The main benefit of using off-the-shelf
ETL tools is the fact that they are optimized for the ETL process by providing connectors to
common data sources like databases, flat files, mainframe systems, xml, etc. They provide a
means to implement data transformations easily and consistently across various data sources.
This includes filtering, reformatting, sorting, joining, merging, aggregation and other operations
ready to use. The tools also support transformation scheduling, version control, monitoring and
unified metadata management. Some of the ETL tools are even integrated with BI tools.
Some of the Well Known ETL Tools:
The most well-known commercial tools are Ab Initio, IBM InfoSphere
DataStage, Informatica, Oracle Data Integrator, and SAP Data Integrator.
Several open source ETL tools are OpenRefine, Apatar, CloverETL, Pentaho and Talend.

Among the above tools, we are going to use the OpenRefine 2.8 ETL tool on different sample datasets
for extraction, data cleaning, transformation and loading.

(iv). Perform various OLAP operations such as slice, dice, roll up, drill down and pivot.

OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data.
Here is the list of OLAP operations:
 Roll-up (Drill-up)
 Drill-down
 Slice and dice
 Pivot (rotate)

Roll-up (Drill-up):
Roll-up performs aggregation on a data cube in any of the following ways
 By climbing up a concept hierarchy for a dimension
 By dimension reduction
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of
city to the level of country.
 The data is grouped by country rather than by city.
 When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down:
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways
 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension.
 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the level
of month.
 When drill-down is performed, one or more dimensions from the data cube are added.
 It navigates the data from less detailed data to highly detailed data.

Slice:
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube.

Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube.

Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data.
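Before turning to the Excel procedure, the same four operations can be sketched on a small pandas DataFrame. The dimension and measure names below are invented purely for illustration.

import pandas as pd

# A tiny sales cube: dimensions (country, city, year, quarter), measure (sales).
cube = pd.DataFrame({
    "country": ["India", "India", "India", "USA", "USA", "USA"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Chicago", "Chicago", "NewYork"],
    "year":    [2009, 2010, 2010, 2009, 2010, 2010],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "sales":   [100, 120, 90, 200, 210, 150],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = cube.groupby(["country", "year"])["sales"].sum()

# Drill-down: descend the time hierarchy from year to quarter.
drilldown = cube.groupby(["country", "year", "quarter"])["sales"].sum()

# Slice: fix one dimension (year = 2010) to obtain a sub-cube.
slice_2010 = cube[cube["year"] == 2010]

# Dice: select on two or more dimensions at once.
dice = cube[(cube["year"].isin([2009, 2010])) & (cube["country"] == "India")]

# Pivot: rotate the axes to get years as columns and countries as rows.
pivot = cube.pivot_table(index="country", columns="year",
                         values="sales", aggfunc="sum")
print(rollup, drilldown, slice_2010, dice, pivot, sep="\n\n")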
Now, we are practically implementing all these OLAP Operations using Microsoft Excel.

Procedure for OLAP Operations:

1. Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
2. The Existing Connections window will open; there the "Browse for more" option should be
clicked to import a .cub extension file for performing OLAP operations. As a sample, we took the
music.cub file.

3. As shown in the above window, select "PivotTable Report" and click "OK".

4. We now have all the music.cub data for analyzing the different OLAP operations. First, we performed
the drill-down operation as shown below.

In the above window, we selected the year '2008' in the 'Electronic' category; then
the Drill-Down option is automatically enabled in the top navigation options. We click on the
'Drill-Down' option, and the window below is displayed.

5. Now we are going to perform the roll-up (drill-up) operation. In the above window we selected the
month of January; then the Drill-up option is automatically enabled at the top. We click on the Drill-up
option, and the window below is displayed.

6. The next OLAP operation, slicing, is performed by inserting a slicer as shown in the top navigation
options.

While inserting slicers for the slicing operation, we select only 2 dimensions (e.g.
CategoryName & Year) with one measure (e.g. Sum of Sales). After inserting a slicer and
adding a filter (CategoryName: AVANT ROCK & BIG BAND; Year: 2009 & 2010), we get the
table shown below.

7. The dicing operation is similar to the slicing operation. Here we select 3 dimensions
(CategoryName, Year, RegionCode) and 2 measures (Sum of Quantity, Sum of Sales) through the
'Insert Slicer' option, and after that we add a filter for CategoryName, Year & RegionCode as
shown below.

8. Finally, the pivot (rotate) OLAP operation is performed by swapping rows (Order Date-Year)
and columns (Values-Sum of Quantity & Sum of Sales) through the bottom-right navigation bar
as shown below.

After swapping (rotating), we get the result represented below, with a pie chart for the
Classical category and year-wise data.

(v). Explore visualization features of the tool for analysis like identifying trends etc.
There are different visualization features for analyzing the data for trend analysis in
data warehouses. Some of the popular visualizations are:

1. Column Charts
2. Line Charts
3. Pie Chart
4. Bar Graphs
5. Area Graphs
6. X & Y Scatter Graphs
7. Stock Graphs
8. Surface Charts
9. Radar Graphs
10. Treemap
11. Sunburst
12. Histogram
13. Box & Whisker
14. Waterfall
15. Combo Graphs
16. Geo Map
17. Heat Grid
18. Interactive Report
19. Stacked Column
20. Stacked Bar
21. Scatter Area

These types of visualizations can be used for analyzing data for trend analysis. Some of
the tools for data visualization are Microsoft Excel, Tableau, Pentaho Business Analytics Online,
etc. In practice, the different visualization features are tested with different sample datasets.
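A few of these chart types can also be produced programmatically, which mirrors experiment 16 in the list. The short matplotlib sketch below uses made-up quarterly sales figures only as an illustration.

import matplotlib.pyplot as plt

categories = ["Q1", "Q2", "Q3", "Q4"]          # hypothetical quarters
sales = [250, 300, 180, 420]                   # hypothetical sales figures

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].bar(categories, sales)                 # column/bar chart
axes[0].set_title("Bar chart")

axes[1].plot(categories, sales, marker="o")    # line chart (trend)
axes[1].set_title("Line chart")

axes[2].pie(sales, labels=categories, autopct="%1.0f%%")  # pie chart
axes[2].set_title("Pie chart")

plt.tight_layout()
plt.show()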

In the window below, we used the 3D column charts of Microsoft Excel for analyzing data
in the data warehouse.

The window below represents data visualization through the Pentaho Business Analytics
online tool (http://www.pentaho.com/hosted-demo) for a sample dataset.

B. Explore WEKA Data Mining/Machine Learning Toolkit

(i). Downloading and/or installation of WEKA data mining toolkit

Procedure:
1. Go to the Weka website, http://www.cs.waikato.ac.nz/ml/weka/, and download the software.
On the left-hand side, click on the link that says download.
2. Select the appropriate link corresponding to the version of the software based on your
operating system and whether or not you already have Java VM running on your machine (if you
don't know what Java VM is, then you probably don't).
3. The link will forward you to a site where you can download the software from a mirror site.
Save the self-extracting executable to disk and then double click on it to install Weka. Answer
yes or next to the questions during the installation.
4. Click yes to accept the Java agreement if necessary. After you install the program Weka
should appear on your start menu under Programs (if you are using Windows).
5. Running Weka from the start menu select Programs, then Weka. You will see the Weka GUI
Chooser. Select Explorer. The Weka Explorer will then launch.

(ii). Understand the features of WEKA toolkit such as Explorer, Knowledge Flow
interface, Experimenter, command-line interface.

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching
Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document
interface") appearance, then this is provided by an alternative launcher called "Main"
(class weka.gui.Main).
The GUI Chooser consists of four buttons—one for each of the four major Weka applications—
and four menus.

The buttons can be used to start the following applications:


Explorer- An environment for exploring data with WEKA
a) Click on the "Explorer" button to bring up the Explorer window.
b) Make sure the "Preprocess" tab is highlighted.
c) Open a new file by clicking on "Open file" and choosing a file with the ".arff"
extension from the "data" directory.
d) Attributes appear in the window below.
e) Click on the attributes to see the visualization on the right.
f) Click "Visualize all" to see them all.

Experimenter- An environment for performing experiments and conducting statistical tests


between learning schemes.
a) Experimenter is for comparing results.
b) Under the "Set up" tab click "New".
c) Click on "Add New" under the "Data" frame. Choose a couple of arff format files from the
"Data" directory, one at a time.
d) Click on "Add New" under the "Algorithm" frame. Choose several algorithms, one at a time,
by clicking "OK" in the window and "Add New".
e) Under the "Run" tab click "Start".

f) Wait for WEKA to finish.


g) Under the "Analyse" tab click on "Experiment" to see results.

Knowledge Flow- This environment supports essentially the same functions as the Explorer but
with a drag-and-drop interface. One advantage is that it supports incremental learning.
Simple CLI - Provides a simple command-line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line interface.
(iii). Navigate the options available in the WEKA (ex. Select attributes panel, Preprocess
panel, classify panel, Cluster panel, Associate panel and Visualize panel)

When the Explorer is first started only the first tab is active; the others are greyed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting to explore
the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status box,
the log button, and the Weka bird) stays visible regardless of which section you are in.

1. Preprocessing

Loading Data:
The first four buttons at the top of the preprocess section enable you to load data into WEKA:

1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local
filesystem.
2. Open URL .... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB .....Reads data from a database. (Note that to make this work you might have to edit
the file in weka/experiment/DatabaseUtils.props.)
4. Generate. ... Enables you to generate artificial data from a variety of Data Generators.
Using the Open file ... button you can read files in a variety of formats:
WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files
typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names
extension, and serialized Instances objects a .bsi extension.
2. Classification:

Selecting a Classifier
At the top of the Classify section is the Classifier box. This box has a text field that gives the
name of the currently selected classifier and its options. Clicking on the text box with the left
mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you
can use to configure the options of the current classifier. With a right click (or Alt+Shift+left
click) you can once again copy the setup string to the clipboard or display the properties in a
GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers
that are available in WEKA.
Test Options

The result of applying the chosen classifier will be tested according to the options that are set
by clicking in the Test options box. There are four test modes:
1. Use training set: The classifier is evaluated on how well it predicts the class of the instances
it was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of
instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to
choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the number of
folds that are entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of
the data which is held out for testing. The amount of data held out depends on the value entered
in the % field.
3. Clustering:

Cluster Modes:
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first
three options are the same as for classification: Use training set, Supplied test set and Percentage
split.
4. Associating:

Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and
configured in the same way as the clusterers, filters, and classifiers in the other panels.

5. Selecting Attributes:

Searching and Evaluating


Attribute selection involves searching through all possible combinations of attributes in the data to
find which subset of attributes works best for prediction. To do this, two objects must be set up: an
attribute evaluator and a search method. The evaluator determines what method is used to assign a
worth to each subset of attributes. The search method determines what style of search is performed.

6. Visualizing:

WEKA's visualization section allows you to visualize 2D plots of the current relation.
(iv). Study the arff file format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list
of instances sharing a set of attributes. ARFF files were developed by the Machine Learning
Project at the Department of Computer Science of The University of Waikato for use with
the Weka machine learning software.

Overview

ARFF files have two distinct sections. The first section is the Header information, which is
followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset looks like
this:

% 1. Title: Iris Plants Database


%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%

@RELATION iris

@ATTRIBUTE sepallength NUMERIC


@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
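Outside of Weka, an ARFF file can also be inspected from Python. The sketch below assumes SciPy and pandas are installed and that an iris.arff file (for example, copied from the Weka data directory) is present in the working directory.

from scipy.io import arff
import pandas as pd

# loadarff returns the records plus the metadata (attribute names and types).
data, meta = arff.loadarff("iris.arff")
df = pd.DataFrame(data)

print(meta)                         # attribute names and their types
print(len(df), "records")           # number of instances
print(df["class"].value_counts())   # number of records per class label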

(v). Explore the available data sets in WEKA

There are 23 different datasets available in Weka (C:\Program Files\Weka-3-6\) by default for
testing purposes. All the datasets are available in .arff format. Those datasets are listed below.

(vi). Load a data set (ex. Weather dataset, Iris dataset, etc.)

Procedure:
1. Open the Weka tool and select the Explorer option.
2. A new window will open which consists of different options (Preprocess, Associate, etc.).
3. In the Preprocess tab, click the "Open file" option.
4. Go to C:\Program Files\Weka-3-6\data to find the different existing .arff datasets.
5. Click on any dataset to load the data; the data will then be displayed as shown below.

(vii). Load each dataset and observe the following:



Here we have taken the IRIS.arff dataset as a sample for observing all of the items below.

i. List the attribute names and their types

There are 5 attributes, with their data types, in the above loaded dataset (IRIS.arff):
sepallength – Numeric
sepalwidth – Numeric
petallength – Numeric
petalwidth – Numeric
class – Nominal
ii. Number of records in each dataset

There are a total of 150 records (instances) in the dataset (IRIS.arff).

iii. Identify the class attribute (if any)

There is one class attribute which consists of 3 labels. They are:


1. Iris-setosa
2. Iris-versicolor
3. Iris-virginica

iv. Plot Histogram



v. Determine the number of records for each class.

There is one class attribute (150 records) which consists of 3 labels. They are shown below
1. Iris-setosa - 50 records
2. Iris-versicolor – 50 records
3. Iris-virginica – 50 records

vi. Visualize the data in various dimensions



2. Perform data preprocessing tasks and Demonstrate performing association rule


mining on data sets

A. Explore various options available in Weka for preprocessing data and apply unsupervised filters
(like Discretize, Resample, etc.) on each dataset.
Procedure:
1. To preprocess the data, first select the dataset (IRIS.arff).
2. Select the Filter option, apply the Resample filter and see the results below.

3. Select another filter option, apply the Discretize filter and see the results below.

Likewise, we can apply different filters for preprocessing the data and see the results in
different dimensions.
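The same two preprocessing ideas, resampling and discretization, can be approximated in Python as a rough counterpart of the Weka filters. The sampling fraction and bin count below are arbitrary choices, and the data comes from scikit-learn's bundled Iris set rather than the ARFF file.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

iris = load_iris(as_frame=True)
df = iris.frame                      # four numeric attributes + target

# Resample: draw a random sample of the instances (with replacement),
# similar in spirit to Weka's Resample filter.
resampled = df.sample(frac=1.0, replace=True, random_state=1)

# Discretize: bin each numeric attribute into 3 equal-width intervals,
# similar in spirit to Weka's unsupervised Discretize filter.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = pd.DataFrame(disc.fit_transform(df[iris.feature_names]),
                      columns=iris.feature_names)

print(resampled.head())
print(binned.head())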

B. Load each dataset into Weka and run the Apriori algorithm with different support and
confidence values. Study the rules generated.
Procedure (a Python sketch of the same frequent-itemset workflow follows these steps):
1. Load the dataset (Breast-Cancer.arff) into the Weka tool.
2. Go to the Associate option; in the left-hand navigation bar we can see different association
algorithms.
3. Select the Apriori algorithm and click on the Start option.
4. Below we can see the rules generated with different support and confidence values for the
selected dataset.
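As a programmatic counterpart to the Weka run above (and to experiment 9 in the list), here is a minimal sketch using the third-party mlxtend library on an invented transaction list; the support and confidence thresholds are arbitrary.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical market-basket transactions.
transactions = [["bread", "milk"],
                ["bread", "butter", "jam"],
                ["milk", "butter"],
                ["bread", "milk", "butter"],
                ["bread", "milk", "jam"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets with minimum support 0.4, then rules with confidence >= 0.7.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])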

C. Apply different discretization filters on numerical attributes and run the Apriori
association rule algorithm. Study the rules generated. Derive interesting insights and
observe the effect of discretization in the rule generation process.

Procedure:
1. Load the dataset (Breast-Cancer.arff) into the Weka tool, select the Discretize filter and apply it.
2. Go to the Associate option; in the left-hand navigation bar we can see different association
algorithms.
3. Select the Apriori algorithm and click on the Start option.
4. Below we can see the rules generated with different support and confidence values for the
selected dataset.

3. Demonstrate performing classification on data sets

A. Load each dataset into Weka and run the ID3 and J48 classification algorithms. Study the
classifier output. Compute entropy values and the Kappa statistic.
Procedure for ID3:
1. Load the dataset (Contact-lenses.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the trees section.
3. Select the ID3 algorithm, in More options select the output entropy evaluation
measures, and click on the Start option.
4. Then we will get the classifier output, entropy values and Kappa statistic as represented below.
A Python approximation of these measures is sketched after the test-option notes below.

5. In the above screenshot, we can run classifiers with different test options (Cross-validation,
Use Training Set, Percentage Split, Supplied Test set).
The result of applying the chosen classifier will be tested according to the options that are set
by clicking in the Test options box. There are four test modes:
A. Use training set: The classifier is evaluated on how well it predicts the class of the instances
it was trained on.
B. Supplied test set: The classifier is evaluated on how well it predicts the class of a set
of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you
to choose the file to test on.
C. Cross-validation: The classifier is evaluated by cross-validation, using the number of
folds that are entered in the Folds text field.
D. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of
the data which is held out for testing. The amount of data held out depends on the value entered
in the % field.
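For comparison with the Weka output, the following hedged sketch uses scikit-learn: a decision tree with the entropy criterion stands in for ID3/J48 (it is not an exact re-implementation of either), cohen_kappa_score reproduces the Kappa statistic, and the bundled Iris data is used because the contact-lenses set does not ship with scikit-learn.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import cohen_kappa_score, accuracy_score

X, y = load_iris(return_X_y=True)

# Entropy-based decision tree (an approximation of ID3/J48 behaviour).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
pred = tree.predict(X)

# If-then style rules extracted from the tree, plus accuracy and Kappa on the training set.
print(export_text(tree, feature_names=load_iris().feature_names))
print("Training accuracy:", accuracy_score(y, pred))
print("Kappa statistic :", cohen_kappa_score(y, pred))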

Procedure for J48:


1. Load the dataset (Contact-lenses.arff) into weka tool
2. Go to classify option & in left-hand navigation bar we can see different classification
algorithms under tree section.
3. In which we selected J48 algorithm, in more options select the output entropy
evaluationmeasures& click on start option.
4. Then we will get classifier output, entropy values & Kappa Statistic as represented below.
5. In the below screenshot, we can run classifiers with different test options (Cross-validation,
Use Training Set, Percentage Split, Supplied Test set).

B. Extract if-then rules from the decision tree generated by the classifier, Observe the
confusion matrix and derive Accuracy, F-measure, TPrate, FPrate, Precision and Recall
values. Apply cross-validation strategy with various fold levels and compare the accuracy
results.
Procedure:
1. Load the dataset (Iris-2D.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the rules section.
3. Select the JRip (if-then rules) algorithm and click on the Start option with the "Use training set"
test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.

Using Cross-Validation Strategy with 10 folds:


Here, we enabled cross-validation test option with 10 folds & clicked start button as represented
below.

Using Cross-Validation Strategy with 20 folds:


Here, we enabled cross-validation test option with 20 folds & clicked start button as represented
below.

Comparing the above results of cross-validation with 10 folds and 20 folds, we observe that the
error rate is lower with 20 folds (97.3% correctly classified) than with 10 folds (94.6% correctly
classified).
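The same fold comparison can be reproduced in Python. The sketch below runs 10-fold and 20-fold cross-validation on the bundled Iris data with a decision tree; the exact percentages will differ from the Weka/JRip figures quoted above.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Repeat the evaluation with different numbers of folds and compare mean accuracy.
for folds in (10, 20):
    scores = cross_val_score(clf, X, y, cv=folds)
    print(f"{folds}-fold CV mean accuracy: {scores.mean():.3f}")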

C. Load each dataset into Weka and perform Naive-bayes classification and k-Nearest
Neighbour classification. Interpret the results obtained.

Procedure for Naïve-Bayes:


1. Load the dataset (Iris-2D.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the bayes section.
3. Select the NaiveBayes algorithm and click on the Start option with the "Use training set"
test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.

Procedure for K-Nearest Neighbour (IBK):


1. Load the dataset (Iris-2D.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the lazy section.
3. Select the k-Nearest Neighbour (IBk) algorithm and click on the Start option with the "Use
training set" test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.
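A scikit-learn counterpart of the two classifiers is sketched below on the bundled Iris data (standing in for Iris-2D.arff). GaussianNB plays the role of Naïve Bayes for numeric attributes and KNeighborsClassifier the role of IBk, evaluated on the training set to mirror the Weka option used above.

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

for name, model in [("Naive Bayes", GaussianNB()),
                    ("k-NN (k=1)", KNeighborsClassifier(n_neighbors=1))]:
    model.fit(X, y)
    pred = model.predict(X)          # evaluation on the training set
    print(name)
    print(classification_report(y, pred, target_names=load_iris().target_names))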

D. Plot RoC Curves

Procedure:
1. Load the dataset (Iris-2D.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the bayes section.
3. Select the NaiveBayes algorithm and click on the Start option with the "Use training set"
test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix.
5. For plotting ROC curves, we need to right click on "bayes.NaiveBayes" in the result list for more
options, in which we select "Visualize threshold curve" and go to any class (Iris-setosa,
Iris-versicolor, Iris-virginica) as shown in the snapshot below.
6. After selecting a class, the ROC (Receiver Operating Characteristic) curve plot will be displayed,
which has the False Positive (FP) rate on the X-axis and the True Positive (TP) rate on the Y-axis.
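Outside Weka, an equivalent one-class-versus-rest ROC curve can be drawn with scikit-learn and matplotlib. The sketch below picks the Iris-virginica class (class index 2 in the bundled data) and uses the Naïve Bayes class probabilities as scores.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, auc

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# Score of the positive class (index 2 = Iris-virginica) versus the rest.
scores = model.predict_proba(X)[:, 2]
fpr, tpr, _ = roc_curve((y == 2).astype(int), scores)

plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")     # chance line
plt.xlabel("False Positive rate")
plt.ylabel("True Positive rate")
plt.title("ROC curve for Iris-virginica (Naive Bayes)")
plt.legend()
plt.show()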

E. Compare classification results of ID3, J48, Naïve-Bayes and k-NN classifiers for each
dataset, and deduce which classifier is performing best and poor for each dataset and
justify.

Procedure for ID3:

1. Load the dataset (Contact-lenses.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the trees section.
3. Select the ID3 algorithm and click on the Start option with the "Use training set" test option
enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.

Procedure for J48:


1. Load the dataset (Contact-lenses.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the trees section.
3. Select the J48 algorithm and click on the Start option with the "Use training set" test option
enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.

Procedure for Naïve-Bayes:


1. Load the dataset (Contact-lenses.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the bayes section.
3. Select the NaiveBayes algorithm and click on the Start option with the "Use training set"
test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.

Procedure for K-Nearest Neighbour (IBK):


1. Load the dataset (Contact-lenses.arff) into the Weka tool.
2. Go to the Classify option; in the left-hand navigation bar we can see different classification
algorithms under the lazy section.
3. Select the k-Nearest Neighbour (IBk) algorithm and click on the Start option with the "Use
training set" test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate, Precision
and Recall values, and the Confusion Matrix as represented below.

By observing the results of all these algorithms (ID3, k-NN, J48 & Naïve Bayes), we conclude
that:

ID3 algorithm's accuracy & performance is the best.
J48 algorithm's accuracy & performance is the poorest.

4. Demonstrate performing clustering on data sets

A. Load each dataset into Weka and run simple k-means clustering algorithm with
different values of k (number of desired clusters). Study the clusters formed. Observe the
sum of squared errors and centroids, and derive insights.

Procedure:
1. Load the dataset (Iris.arff) into the Weka tool.
2. Go to the Cluster option; in the left-hand navigation bar we can see the different clustering
algorithms.
3. Select the SimpleKMeans algorithm and click on the Start option with the "Use training
set" cluster mode enabled.
4. Then we will get the sum of squared errors, centroids, number of iterations and clustered instances
as represented below.

5. If we right click on SimpleKMeans in the result list, we will get more options, in which "Visualize
cluster assignments" should be selected to get the cluster visualization as shown below.
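A Python counterpart of the same experiment is sketched below. scikit-learn's KMeans exposes the within-cluster sum of squared errors as inertia_ and the centroids as cluster_centers_, so the run can be repeated for several values of k on the bundled Iris data.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Run k-means for different numbers of desired clusters and inspect SSE and centroids.
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  SSE={km.inertia_:.2f}")
    print("centroids:\n", km.cluster_centers_)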

B. Explore other clustering techniques available in Weka.

Clustering:

Selecting a Clusterer:
By now you will be familiar with the process of selecting and configuring objects. Clicking on
the clustering scheme listed in the Clusterer box at the top of the window brings up a
GenericObjectEditor dialog with which to choose a new clustering scheme.
Cluster Modes:
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first
three options are the same as for classification: Use training set, supplied test set and Percentage
split, now the data is assigned to clusters instead of trying to predict a specific class. The fourth
mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a
pre-assigned class in the data. The drop-down box below this option selects the class, just as in
the Classify panel. An additional option in the Cluster mode box, the Store clusters for
visualization tick box, determines whether or not it will be possible to visualize the clusters once
training is complete. When dealing with datasets that are so large that memory becomes a
problem it may be helpful to disable this option.
Ignoring Attributes:
Often, some attributes in the data should be ignored when clustering. The Ignore attributes button
brings up a small window that allows you to select which attributes are ignored. Clicking on an
attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive
attributes, and holding down CTRL toggles individual attributes on and off. To cancel the
selection, back out with the Cancel button. To activate it, click the Select button. The next time
clustering is invoked, the selected attributes are ignored.

There are 12 clustering algorithms available in the Weka tool. They are shown below.

Through visualize cluster assignments, we can clearly see the clusters in graphical visualization.

C. Explore visualization features of Weka to visualize the clusters. Derive interesting


insights and explain.
 If we right click on SimpleKMeans, we will get more options, in which "Visualize cluster
assignments" should be selected to get the cluster visualization as shown below.
 In that cluster visualization we have different features to explore by changing the
X-axis, Y-axis, Color, Jitter and Select Instance (Rectangle, Polygon & Polyline) options to get
different sets of cluster outputs.

 As shown in the above screenshot, all the dataset (Iris.arff) tuples are represented on the X-axis, and
in a similar way they are represented on the Y-axis. For each cluster, the color is different. In
the above figure, there are two clusters, which are represented in blue & red colors.
 In the Select Instance option we can select different shapes for choosing the clustered area; as shown
in the screenshot below, the rectangle shape is selected.

 Through this visualization feature we can observe different clustering outputs for a dataset by
changing the X-axis, Y-axis, Color & Jitter options.
