MCSL 223 (English)
This lab course has been designed to provide you hands-on experience in
Computer Networks and Data Mining. You have studied the theoretical aspects
of Computer Networks and Data Mining in the MCS-218 and MCS-221 courses
respectively.
The course on Data Communication and Computer Networks (MCS-218)
explained how a computer system/device communicates with other computers
through wired or wireless channels. It also explained various network
topologies and protocols. This course provides you opportunities to implement
and test those topologies and protocols in a simulated environment. Here, you
will also be able to test various networking devices under different loads and
configurations, helping you to understand the challenges in the actual operation
of these devices. Much open-source code is available online for network
simulation, which you can use for reference. The network architectures and
topologies can be implemented and tested using Network Simulator-3 (NS-3), an
open-source software platform.
Data mining is the process of analyzing a data set to find hidden patterns. Once
data is collected in the data warehouse, the data mining process begins and
involves everything from cleaning the data of incomplete records to creating
visualizations of the findings. Data mining is usually associated with the analysis of
the large data sets present in the fields of big data, machine learning and artificial
intelligence. The process looks for patterns, anomalies and associations in the
data with the goal of extracting value. The Waikato Environment for Knowledge
Analysis (WEKA) suite is used to demonstrate most of the examples. It is a
suite of machine learning tools written in Java. A collection of visualization
tools for predictive modelling helps you build your data models and test them,
observing the model performances graphically. In the Data Mining Lab, enough
illustrative examples are given to help you work through the session-wise
practical problems. Use trial/demo versions of any other tools only for
exploration.
A session-wise list of practical problems is given at the end of each section.
Please carefully go through the guidelines for documenting them in the practical
records.
This course consists of 2 sections and is organized as follows:
Section 1 comprises 10 sessions, wherein session-wise practical problems
are given along with some examples. Attempt all the problems given in Section
1.8.
Section 2 comprises 10 sessions, wherein session-wise practical problems
are given along with some illustrative examples. Attempt all the problems given
in Section 2.11.
Happy programming!! We wish you an eventful and interesting journey to the
third semester computer lab.
SECTION 1 COMPUTER NETWORKS LAB
Structure
1.0 Introduction
1.1 Objectives
1.2 General Guidelines
1.3 Introduction to NS-3
1.4 Basic Features of NS-3
1.5 Installing NS-3
1.6 Examples
1.7 Terms and Concepts
1.8 List of Lab Assignments – Session wise
1.9 Web References
1.0 INTRODUCTION
This is the lab course wherein you will have hands-on experience. You
have studied the supporting course material (MCS-218 Data Communication
and Computer Networks). In this section, computer network programming
using an open-source platform, Network Simulator-3 (NS-3), is presented
illustratively. A list of programming problems is provided at the end for
each session. Some example codes are also provided at the end for your
reference (Appendix). Please go through the general guidelines and the program
documentation guidelines carefully.
1.1 OBJECTIVES
After completing this lab course, you will be able to:
• Explore and understand the features available in the NS-3 software
• Create wired and wireless connections
• Test and verify various routing/communication protocols
• Test and verify user-defined routing protocols
• Perform analysis and evaluation of various available protocols
1.5 INSTALLING NS-3
The following packages are prerequisites for installing NS-3:
Prerequisite      Package/version
C++ compiler      clang++ or g++ (g++ version 8 or greater)
Python            python3 version >= 3.6
CMake             cmake version >= 3.10
Build system      make, ninja, xcodebuild (XCode)
Git               any recent version (to access ns-3 from GitLab.com)
tar               any recent version (to unpack an ns-3 release)
bunzip2           any recent version (to uncompress an ns-3 release)
Following are the steps involved in the installation of NS-3:
1. Download and unzip the tar file from the official website.
2. Run the build.py script as
./build.py --enable-examples --enable-tests
3. Check the lists of modules built and modules not built.
4. After a successful build, you should see different files and directories,
such as the source directory src, the execution script waf and one script
directory scratch. It should also contain the examples and tutorial files.
(A detailed description of the installation of NS-3 on Windows is provided at
https://www.nsnam.org/wiki/Installation#Windows.)
You will first see the netanim animator being built, then pybindgen and
finally NS-3 itself. Now NS-3 is installed and ready to be used; you can run a few
examples on the system and try to learn how it works. Before
running an actual program, you should run the test programs to verify the
environment. To do so, run the ./test.py --no-build command and
see whether the system passes all the tests. Once it passes all tests, you are ready to run
your own code.
1.6 EXAMPLES
Let us now try to run an example on the freshly installed system. Create a
file with any name and save it with the extension .cc, or just go to the
examples/tutorials directory and look at the first.cc example file.
To run the first example, copy the first.cc file to the scratch directory and run the
following command in the NS-3 terminal (cygwin) from the parent directory:
./waf --run first
If nothing goes wrong, the program is compiled and run.
Argument Passing: NS-3 also supports command-line options and can pass
arguments to a program via the command line. The following example with the
second.cc program shows the argument value 3 being given to the nCsma variable:
./waf --run "second --nCsma=3"
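Inside the program, such a parameter is declared through the CommandLine facility, exactly as CODE-1 in the Appendix does with cmd.AddValue. A minimal sketch (the variable name nCsma and its description follow the pattern of the tutorial's second.cc; treat the details as an assumption):

#include "ns3/core-module.h"

using namespace ns3;

int main (int argc, char *argv[])
{
  uint32_t nCsma = 3; // default value, overridable from the command line

  CommandLine cmd (__FILE__);
  cmd.AddValue ("nCsma", "Number of extra CSMA nodes/devices", nCsma);
  cmd.Parse (argc, argv); // e.g. --nCsma=3 overwrites the default

  // ... build the topology using nCsma ...
  return 0;
}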
Compilation Errors: NS-3 by default treats warnings as errors (the -Werror
option is enabled), which is good practice for professionals but a little more work
for beginners. You can disable this option by editing waf-tools/cflags.py and
changing the line
self.warnings_flags = [['-Wall'], ['-Werror'], ['-Wextra']]
to the following line:
self.warnings_flags = [['-Wall'], ['-Wextra']]
Then you need to run the following commands:
./waf configure
./waf build
Once you are aware of a few of the basic commands, you can look at the examples
given in the NS-3 documentation. The link to the documentation is given
below. A sample program for tracing packets is provided at the end of this section.
https://www.nsnam.org/doxygen/tcp-bbr-example_8cc_source.html
https://www.nsnam.org/docs/release/3.35/tutorial/html/conceptual-overview.html#key-abstractions

Appendix
Sample programs available online for different applications using NS-3
CODE-1: Source- https://www.nsnam.org/doxygen/traceroute-example_8cc_source.html
/* -*- Mode:C++; c-file-style:"gnu"; indent-tabs-mode:nil; -*- */
/*
 * Copyright (c) 2019 Ritsumeikan University, Shiga, Japan
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation;
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 *
 * Author: Alberto Gallegos Ramonet <[email protected]>
 *
 * TraceRoute application example using AODV routing protocol.
 */

#include "ns3/aodv-module.h"
#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/internet-module.h"
#include "ns3/mobility-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/wifi-module.h"
#include "ns3/v4traceroute-helper.h"

#include <iostream>
#include <cmath>

using namespace ns3;

class TracerouteExample
{
public:
  TracerouteExample ();
  bool Configure (int argc, char **argv);
  void Run ();
  void Report (std::ostream & os);

private:
  // parameters
  uint32_t size;
  double step;
  double totalTime;
  bool pcap;
  bool printRoutes;

  NodeContainer nodes;
  NetDeviceContainer devices;
  Ipv4InterfaceContainer interfaces;

private:
  void CreateNodes ();
  void CreateDevices ();
  void InstallInternetStack ();
  void InstallApplications ();
};

int main (int argc, char **argv)
{
  TracerouteExample test;
  if (!test.Configure (argc, argv))
    {
      NS_FATAL_ERROR ("Configuration failed. Aborted.");
    }
  test.Run ();
  test.Report (std::cout);
  return 0;
}

//-----------------------------------------------------------------------------
TracerouteExample::TracerouteExample ()
  : size (10),
    step (50),
    totalTime (100),
    pcap (false),
    printRoutes (false)
{
}

bool
TracerouteExample::Configure (int argc, char **argv)
{
  // Enable AODV logs by default. Comment this if too noisy
  // LogComponentEnable("AodvRoutingProtocol", LOG_LEVEL_ALL);
  SeedManager::SetSeed (12345);
  CommandLine cmd (__FILE__);

  cmd.AddValue ("pcap", "Write PCAP traces.", pcap);
  cmd.AddValue ("printRoutes", "Print routing table dumps.", printRoutes);
  cmd.AddValue ("size", "Number of nodes.", size);
  cmd.AddValue ("time", "Simulation time, s.", totalTime);
  cmd.AddValue ("step", "Grid step, m", step);

  cmd.Parse (argc, argv);
  return true;
}

void
TracerouteExample::Run ()
{
  CreateNodes ();
  CreateDevices ();
  InstallInternetStack ();
  InstallApplications ();

  std::cout << "Starting simulation for " << totalTime << " s ...\n";

  Simulator::Stop (Seconds (totalTime));
  Simulator::Run ();
  Simulator::Destroy ();
}

void
TracerouteExample::Report (std::ostream &)
{
}

void
TracerouteExample::CreateNodes ()
{
  std::cout << "Creating " << (unsigned)size << " nodes " << step << " m apart.\n";
  nodes.Create (size);
  // Name nodes
  for (uint32_t i = 0; i < size; ++i)
    {
      std::ostringstream os;
      os << "node-" << i;
      Names::Add (os.str (), nodes.Get (i));
    }
  // Create static grid
  MobilityHelper mobility;
  mobility.SetPositionAllocator ("ns3::GridPositionAllocator",
                                 "MinX", DoubleValue (0.0),
                                 "MinY", DoubleValue (0.0),
                                 "DeltaX", DoubleValue (step),
                                 "DeltaY", DoubleValue (0),
                                 "GridWidth", UintegerValue (size),
                                 "LayoutType", StringValue ("RowFirst"));
  mobility.SetMobilityModel ("ns3::ConstantPositionMobilityModel");
  mobility.Install (nodes);
}

void
TracerouteExample::CreateDevices ()
{
  WifiMacHelper wifiMac;
  wifiMac.SetType ("ns3::AdhocWifiMac");
  YansWifiPhyHelper wifiPhy;
  YansWifiChannelHelper wifiChannel = YansWifiChannelHelper::Default ();
  wifiPhy.SetChannel (wifiChannel.Create ());
  WifiHelper wifi;
  wifi.SetRemoteStationManager ("ns3::ConstantRateWifiManager",
                                "DataMode", StringValue ("OfdmRate6Mbps"),
                                "RtsCtsThreshold", UintegerValue (0));
  devices = wifi.Install (wifiPhy, wifiMac, nodes);

  if (pcap)
    {
      wifiPhy.EnablePcapAll (std::string ("aodv"));
    }
}

void
TracerouteExample::InstallInternetStack ()
{
  AodvHelper aodv;
  // you can configure AODV attributes here using aodv.Set (name, value)
  InternetStackHelper stack;
  stack.SetRoutingHelper (aodv); // has effect on the next Install ()
  stack.Install (nodes);
  Ipv4AddressHelper address;
  address.SetBase ("10.0.0.0", "255.0.0.0");
  interfaces = address.Assign (devices);

  if (printRoutes)
    {
      Ptr<OutputStreamWrapper> routingStream =
        Create<OutputStreamWrapper> ("aodv.routes", std::ios::out);
      aodv.PrintRoutingTableAllAt (Seconds (8), routingStream);
    }
}

void
TracerouteExample::InstallApplications ()
{
  V4TraceRouteHelper traceroute (Ipv4Address ("10.0.0.10")); // size - 1
  traceroute.SetAttribute ("Verbose", BooleanValue (true));
  ApplicationContainer p = traceroute.Install (nodes.Get (0));

  // Used when we wish to dump the traceroute results into a file
  // Ptr<OutputStreamWrapper> printstrm =
  //   Create<OutputStreamWrapper> ("mytrace", std::ios::out);
  // traceroute.PrintTraceRouteAt (nodes.Get (0), printstrm);

  p.Start (Seconds (0));
  p.Stop (Seconds (totalTime) - Seconds (0.001));
}
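To try CODE-1 yourself, one way (an assumption about where you save it, following the workflow of Section 1.6) is to copy it into the scratch directory as scratch/traceroute-example.cc and run it through waf, overriding the command-line parameters the program declares:
./waf --run "traceroute-example --size=10 --printRoutes=true"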
CODE-2: Source- https://www.nsnam.org/doxygen/dhcp-example_8cc_source.html

/* -*- Mode:C++; c-file-style:"gnu"; indent-tabs-mode:nil; -*- */
/*
 * Copyright (c) 2011 UPB
 * Copyright (c) 2017 NITK Surathkal
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation;
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 *
 * Author: Radu Lupu <[email protected]>
 *         Ankit Deepak <[email protected]>
 *         Deepti Rajagopal <[email protected]>
 */

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/internet-apps-module.h"
#include "ns3/csma-module.h"
#include "ns3/internet-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/applications-module.h"

using namespace ns3;

NS_LOG_COMPONENT_DEFINE ("DhcpExample");

int
main (int argc, char *argv[])
{
  CommandLine cmd (__FILE__);
  bool verbose = false;
  bool tracing = false;
  cmd.AddValue ("verbose", "turn on the logs", verbose);
  cmd.AddValue ("tracing", "turn on the tracing", tracing);
  cmd.Parse (argc, argv);

  // GlobalValue::Bind ("ChecksumEnabled", BooleanValue (true));

  if (verbose)
    {
      LogComponentEnable ("DhcpServer", LOG_LEVEL_ALL);
      LogComponentEnable ("DhcpClient", LOG_LEVEL_ALL);
      LogComponentEnable ("UdpEchoServerApplication", LOG_LEVEL_INFO);
      LogComponentEnable ("UdpEchoClientApplication", LOG_LEVEL_INFO);
    }

  Time stopTime = Seconds (20);

  NS_LOG_INFO ("Create nodes.");
  NodeContainer nodes;
  NodeContainer router;
  nodes.Create (3);
  router.Create (2);

  NodeContainer net (nodes, router);

  NS_LOG_INFO ("Create channels.");
  CsmaHelper csma;
  csma.SetChannelAttribute ("DataRate", StringValue ("5Mbps"));
  csma.SetChannelAttribute ("Delay", StringValue ("2ms"));
  csma.SetDeviceAttribute ("Mtu", UintegerValue (1500));
  NetDeviceContainer devNet = csma.Install (net);

  NodeContainer p2pNodes;
  p2pNodes.Add (net.Get (4));
  p2pNodes.Create (1);

  PointToPointHelper pointToPoint;
  pointToPoint.SetDeviceAttribute ("DataRate", StringValue ("5Mbps"));
  pointToPoint.SetChannelAttribute ("Delay", StringValue ("2ms"));

  NetDeviceContainer p2pDevices;
  p2pDevices = pointToPoint.Install (p2pNodes);

  InternetStackHelper tcpip;
  tcpip.Install (nodes);
  tcpip.Install (router);
  tcpip.Install (p2pNodes.Get (1));

  Ipv4AddressHelper address;
  address.SetBase ("172.30.1.0", "255.255.255.0");
  Ipv4InterfaceContainer p2pInterfaces;
  p2pInterfaces = address.Assign (p2pDevices);

  // manually add a routing entry because we don't want to add a dynamic routing
  Ipv4StaticRoutingHelper ipv4RoutingHelper;
  Ptr<Ipv4> ipv4Ptr = p2pNodes.Get (1)->GetObject<Ipv4> ();
  Ptr<Ipv4StaticRouting> staticRoutingA =
    ipv4RoutingHelper.GetStaticRouting (ipv4Ptr);
  staticRoutingA->AddNetworkRouteTo (Ipv4Address ("172.30.0.0"), Ipv4Mask ("/24"),
                                     Ipv4Address ("172.30.1.1"), 1);

  NS_LOG_INFO ("Setup the IP addresses and create DHCP applications.");
  DhcpHelper dhcpHelper;

  // The router must have a fixed IP.
  Ipv4InterfaceContainer fixedNodes =
    dhcpHelper.InstallFixedAddress (devNet.Get (4), Ipv4Address ("172.30.0.17"),
                                    Ipv4Mask ("/24"));
  // Not really necessary, IP forwarding is enabled by default in IPv4.
  fixedNodes.Get (0).first->SetAttribute ("IpForward", BooleanValue (true));

  // DHCP server
  ApplicationContainer dhcpServerApp =
    dhcpHelper.InstallDhcpServer (devNet.Get (3), Ipv4Address ("172.30.0.12"),
                                  Ipv4Address ("172.30.0.0"), Ipv4Mask ("/24"),
                                  Ipv4Address ("172.30.0.10"), Ipv4Address ("172.30.0.15"),
                                  Ipv4Address ("172.30.0.17"));

  // This is just to show how it can be done.
  DynamicCast<DhcpServer> (dhcpServerApp.Get (0))->AddStaticDhcpEntry
    (devNet.Get (2)->GetAddress (), Ipv4Address ("172.30.0.14"));

  dhcpServerApp.Start (Seconds (0.0));
  dhcpServerApp.Stop (stopTime);

  // DHCP clients
  NetDeviceContainer dhcpClientNetDevs;
  dhcpClientNetDevs.Add (devNet.Get (0));
  dhcpClientNetDevs.Add (devNet.Get (1));
  dhcpClientNetDevs.Add (devNet.Get (2));

  ApplicationContainer dhcpClients =
    dhcpHelper.InstallDhcpClient (dhcpClientNetDevs);
  dhcpClients.Start (Seconds (1.0));
  dhcpClients.Stop (stopTime);

  UdpEchoServerHelper echoServer (9);
  ApplicationContainer serverApps = echoServer.Install (p2pNodes.Get (1));
  serverApps.Start (Seconds (0.0));
  serverApps.Stop (stopTime);

  UdpEchoClientHelper echoClient (p2pInterfaces.GetAddress (1), 9);
  echoClient.SetAttribute ("MaxPackets", UintegerValue (100));
  echoClient.SetAttribute ("Interval", TimeValue (Seconds (1.0)));
  echoClient.SetAttribute ("PacketSize", UintegerValue (1024));

  ApplicationContainer clientApps = echoClient.Install (nodes.Get (1));
  clientApps.Start (Seconds (10.0));
  clientApps.Stop (stopTime);

  Simulator::Stop (stopTime + Seconds (10.0));

  if (tracing)
    {
      csma.EnablePcapAll ("dhcp-csma");
      pointToPoint.EnablePcapAll ("dhcp-p2p");
    }

  NS_LOG_INFO ("Run Simulation.");
  Simulator::Run ();
  Simulator::Destroy ();
  NS_LOG_INFO ("Done.");
}
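CODE-2 declares the verbose and tracing options shown above, so (assuming the program is built under the name dhcp-example, as it is when the ns-3 examples are enabled) a run with logging and pcap tracing switched on would look like:
./waf --run "dhcp-example --verbose=true --tracing=true"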
CODE-3: Source- https://www.nsnam.org/doxygen/csma-bridge_8cc_source.html

/* -*- Mode:C++; c-file-style:"gnu"; indent-tabs-mode:nil; -*- */
/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation;
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

// Network topology
//
//        n0     n1
//        |      |
//       ----------
//       | Switch |
//       ----------
//        |      |
//        n2     n3
//
// - CBR/UDP flows from n0 to n1 and from n3 to n0
// - DropTail queues
// - Tracing of queues and packet receptions to file "csma-bridge.tr"

#include <iostream>
#include <fstream>

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/applications-module.h"
#include "ns3/bridge-module.h"
#include "ns3/csma-module.h"
#include "ns3/internet-module.h"

using namespace ns3;

NS_LOG_COMPONENT_DEFINE ("CsmaBridgeExample");

int
main (int argc, char *argv[])
{
  //
  // Users may find it convenient to turn on explicit debugging
  // for selected modules; the below lines suggest how to do this
  //
#if 0
  LogComponentEnable ("CsmaBridgeExample", LOG_LEVEL_INFO);
#endif

  //
  // Allow the user to override any of the defaults and the above Bind() at
  // run-time, via command-line arguments
  //
  CommandLine cmd (__FILE__);
  cmd.Parse (argc, argv);

  //
  // Explicitly create the nodes required by the topology (shown above).
  //
  NS_LOG_INFO ("Create nodes.");
  NodeContainer terminals;
  terminals.Create (4);

  NodeContainer csmaSwitch;
  csmaSwitch.Create (1);

  NS_LOG_INFO ("Build Topology");
  CsmaHelper csma;
  csma.SetChannelAttribute ("DataRate", DataRateValue (5000000));
  csma.SetChannelAttribute ("Delay", TimeValue (MilliSeconds (2)));

  // Create the csma links, from each terminal to the switch
  NetDeviceContainer terminalDevices;
  NetDeviceContainer switchDevices;
  for (int i = 0; i < 4; i++)
    {
      NetDeviceContainer link =
        csma.Install (NodeContainer (terminals.Get (i), csmaSwitch));
      terminalDevices.Add (link.Get (0));
      switchDevices.Add (link.Get (1));
    }

  // Create the bridge netdevice, which will do the packet switching
  Ptr<Node> switchNode = csmaSwitch.Get (0);
  BridgeHelper bridge;
  bridge.Install (switchNode, switchDevices);

  // Add internet stack to the terminals
  InternetStackHelper internet;
  internet.Install (terminals);

  //
  // We've got the "hardware" in place. Now we need to add IP addresses.
  //
  NS_LOG_INFO ("Assign IP Addresses.");
  Ipv4AddressHelper ipv4;
  ipv4.SetBase ("10.1.1.0", "255.255.255.0");
  ipv4.Assign (terminalDevices);

  //
  // Create an OnOff application to send UDP datagrams from node zero to node 1.
  //
  NS_LOG_INFO ("Create Applications.");
  uint16_t port = 9; // Discard port (RFC 863)

  OnOffHelper onoff ("ns3::UdpSocketFactory",
                     Address (InetSocketAddress (Ipv4Address ("10.1.1.2"), port)));
  onoff.SetConstantRate (DataRate ("500kb/s"));

  ApplicationContainer app = onoff.Install (terminals.Get (0));
  // Start the application
  app.Start (Seconds (1.0));
  app.Stop (Seconds (10.0));

  // Create an optional packet sink to receive these packets
  PacketSinkHelper sink ("ns3::UdpSocketFactory",
                         Address (InetSocketAddress (Ipv4Address::GetAny (), port)));
  app = sink.Install (terminals.Get (1));
  app.Start (Seconds (0.0));

  //
  // Create a similar flow from n3 to n0, starting at time 1.1 seconds
  //
  onoff.SetAttribute ("Remote",
                      AddressValue (InetSocketAddress (Ipv4Address ("10.1.1.1"), port)));
  app = onoff.Install (terminals.Get (3));
  app.Start (Seconds (1.1));
  app.Stop (Seconds (10.0));

  app = sink.Install (terminals.Get (0));
  app.Start (Seconds (0.0));

  NS_LOG_INFO ("Configure Tracing.");
  //
  // Configure tracing of all enqueue, dequeue, and NetDevice receive events.
  // Trace output will be sent to the file "csma-bridge.tr"
  //
  AsciiTraceHelper ascii;
  csma.EnableAsciiAll (ascii.CreateFileStream ("csma-bridge.tr"));

  //
  // Also configure some tcpdump traces; each interface will be traced.
  // The output files will be named csma-bridge-<nodeId>-<interfaceId>.pcap
  // and can be read by the "tcpdump -r" command.
  //
  csma.EnablePcapAll ("csma-bridge", false);

  //
  // Now, do the actual simulation.
  //
  NS_LOG_INFO ("Run Simulation.");
  Simulator::Run ();
  Simulator::Destroy ();
  NS_LOG_INFO ("Done.");
}
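Assuming CODE-3 is built under the name csma-bridge, running it produces the ASCII trace file csma-bridge.tr and, via the pcap tracing configured at the end, per-interface .pcap files that can be opened with tcpdump or Wireshark:
./waf --run csma-bridge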
SECTION 2 DATA MINING LAB
Structure
2.0 Introduction
2.1 Objectives
2.2 Introduction to WEKA
2.3 Latest Version and Downloads
2.4 Data Sets
2.5 Installation of WEKA
2.6 Features of WEKA Explorer
2.7 Data Preprocessing
2.8 Association Rule Mining
2.9 Classification
2.10 Clustering
2.11 Practical Sessions
2.12 Summary
2.13 Further Readings
2.14 Website References
2.15 Online Lab Resources
2.0 INTRODUCTION
This is the lab course wherein you will get hands-on experience. You
have studied the course material (MCS-221 Data Warehousing and Data
Mining). Along with the examples discussed in this section, a separate
session-wise list of lab exercises to be performed is given towards the end, in
Section 2.11. Please go through the general guidelines and the program
documentation guidelines carefully.
2.1 OBJECTIVES
After going through this practical course, you will be able to:
• Understand how to handle data mining tasks using a data mining toolkit (such as the open-source WEKA)
• Understand the various kinds of algorithms available in WEKA
• Understand the data sets and data preprocessing
• Demonstrate classification, clustering, etc. on large data sets
• Demonstrate the working of algorithms for data mining tasks such as association rule mining, classification, clustering and regression
• Exercise the data mining techniques with varied input values for different parameters
2.4 DATASETS
Below are some sample WEKA data sets, available in .arff format:
• airline.arff
• breast-cancer.arff
• contact-lens.arff
• cpu.arff
• cpu.with-vendor.arff
• credit-g.arff
• diabetes.arff
• glass.arff
• hypothyroid.arff
• ionosphere.arff
• iris.2D.arff
• iris.arff
• labor.arff
• ReutersCorn-train.arff
• ReutersCorn-test.arff
• ReutersGrain-train.arff
• ReutersGrain-test.arff
• segment-challenge.arff
• segment-test.arff
• soybean.arff
• supermarket.arff
• unbalanced.arff
• vote.arff
• weather.numeric.arff
• weather.nominal.arff
Once you download and install WEKA on your system, the WEKA datasets
can be explored in the "C:\Program Files\Weka-3-8\data" folder. The
datasets are in .arff format.
Miscellaneous collections of datasets can be downloaded:
• A jarfile containing 37 classification problems originally obtained from the
UCI repository of machine learning datasets, available at http://archive.ics.uci.edu/ml/index.php.
• A jarfile containing 37 regression problems obtained from various sources
(datasets-numeric.jar), available at https://sourceforge.net/projects/weka/files/datasets/datasets-numeric/datasets-numeric.jar/download?use_mirror=netix.
• A jarfile containing 6 agricultural datasets obtained from agricultural
researchers in New Zealand (agridatasets.jar, 31,200 Bytes).
• A jarfile containing 30 regression datasets collected by Professor Luis Torgo
(regression-datasets.jar, 10,090,266 Bytes).
• A gzip'ed tar containing UCI ML datasets at http://archive.ics.uci.edu/ml/index.php
and UCI KDD datasets at https://kdd.ics.uci.edu/.
• A gzip'ed tar containing StatLib datasets at https://sourceforge.net/projects/
weka/files/datasets/UCI%20and%20StatLib/statlib-20050214.tar.gz/
download?use_mirror=kumisystems
• A gzip'ed tar containing ordinal, real-world datasets donated by Professor
Arie Ben David at https://sourceforge.net/projects/weka/files/datasets/
regression-datasets/datasets-arie_ben_david.tar.gz/download?use_
mirror=pilotfiber
• A zip file containing 19 multi-class (1-of-n) text datasets donated
by Dr George Forman available at https://sourceforge.net/projects/
weka/files/datasets/text-datasets/19MclassTextWc.zip/download?use_
mirror=netactuate&download= (19MclassTextWc.zip, 14,084,828 Bytes).
• A bzip'ed tar file containing the Reuters21578 dataset split into separate files
according to the ModApte split (reuters21578-ModApte.tar.bz2, 81,745,032 Bytes).
• A zip file containing 41 drug design datasets formed using the Adriana.Code
software donated by Dr Mehmet Fatih Amasyali at https://sourceforge.net/
projects/weka/files/datasets/Drug%20design%20datasets/Drug-datasets.
zip/download?use_mirror=netix&use_mirror=internode.
• A zip file containing 80 artificial datasets generated from the Friedman
function donated by Dr. M. Fatih Amasyali (Yildiz Technical University)
(Friedman-datasets.zip, 5,802,204 Bytes).
• A zip file containing a new, image-based version of the classic iris data,
with 50 images for each of the three species of iris. The images have size
600x600. Please see the ARFF file for further information (iris_reloaded.zip,
92,267,000 Bytes).
After expanding into a directory using your jar utility (or an archive program
that handles tar-archives/zip files in the case of the gzip'ed tars/zip files),
these datasets may be used with WEKA.
4. According to your requirements, select the components to be installed. Full
component installation is recommended. Click on Next.
2.6.1 Dataset
A dataset is made up of instances, each of which represents an object; in a
marketing database, for example, instances would represent customers and
products. Datasets are described by attributes. A dataset corresponds to the
data tuples in a database. The attributes of a dataset can be nominal, numeric,
or string. In WEKA, a dataset is represented by the weka.core.Instances class.
Representation of a dataset with 5 examples:
@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes
Attribute and its Types
An attribute is a data field representing a characteristic of a data object. For
example, in a customer database, the attributes would be customer_id,
customer_email, customer_address, etc. Attributes are of different types:
(i) Nominal Attributes: Attributes which relate to a name and have predefined
values, such as color or weather. These attributes are called categorical attributes.
They do not have any order, and their values are also called enumerations.
@attribute outlook {sunny, overcast, rainy}: declaration of a nominal attribute.
(ii) Binary Attributes: These attributes represent only the values 0 and 1. They
are nominal attributes with only 2 categories. These attributes are also called
Boolean.
(iii) Ordinal Attributes: Attributes which preserve some order or ranking
amongst them are ordinal attributes. Successive values cannot be predicted;
only the order is maintained. Examples: size, grade, etc.
(iv) Numeric Attributes: Attributes representing measurable quantities are
numeric attributes. These are represented by real numbers or integers. Examples:
temperature, humidity.
@attribute humidity real: declaration of a numeric attribute
(v) String Attributes: These attributes represent a list of characters, written
in double quotes.
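Putting these pieces together, a complete ARFF file for the five instances shown in Section 2.6.1 might look as follows (a sketch: the relation name and the attribute order, chosen to match the @data lines above, are assumptions):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute temperature real
@attribute humidity real
@attribute play {yes, no}

@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes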
2.6.6 Clustering
WEKA uses the Cluster tab to find similarities in the dataset. Based on
clustering, the user can find out the attributes useful for analysis and ignore
other attributes. The available algorithms for clustering in WEKA include
k-means (SimpleKMeans), EM, Cobweb, X-means, and FarthestFirst.
2.6.7 Association
The default algorithm available in WEKA for finding association rules is
Apriori.
2.6.9 Visualization
WEKA supports 2D representation of data, 3D visualizations with rotation,
and 1D representation of single attributes. It has the "Jitter" option for nominal
attributes and "hidden" data points.
2.7 DATA PREPROCESSING
weka.filters.supervised.attribute.Discretize
Synopsis
An instance filter that discretizes a range of numeric attributes in the dataset
into nominal attributes. Discretization is performed by Fayyad & Irani's MDL
method (the default).
Options
attributeIndices -- Specify range of attributes to act on. This is a comma
separated list of attribute indices, with "first" and "last" valid values. Specify an
inclusive range with "-". Example: "first-3,5,6-10,last".
invertSelection -- Set attribute selection mode. If false, only selected (numeric)
attributes in the range will be discretized; if true, only non-selected attributes
will be discretized.
makeBinary -- Make resulting attributes binary.
useBetterEncoding -- Uses a more efficient split point encoding.
useKononenko -- Use Kononenko's MDL criterion. If set to false uses the
Fayyad & Irani criterion.
Select the outlook attribute based on class temperature to visualize below:
Select the temperature attribute based on class temperature to visualize below:
Select the play attribute based on class temperature to visualize below:
weka.filters.supervised.instance.Resample
Synopsis
Produces a random subsample of a dataset using either sampling with replacement
or without replacement. The original dataset must fit entirely in memory. The
number of instances in the generated dataset may be specified. The dataset must
have a nominal class attribute. If not, use the unsupervised version. The filter
can be made to maintain the class distribution in the subsample, or to bias the
class distribution toward a uniform distribution. When used in batch mode (i.e.
in the Filtered Classifier), subsequent batches are NOT resampled.
Options
randomSeed -- Sets the random number seed for subsampling.
biasToUniformClass -- Whether to use bias towards a uniform class. A value of
0 leaves the class distribution as-is, a value of 1 ensures the class distribution is
uniform in the output data.
debug -- If set to true, filter may output additional info to the console.
noReplacement -- Disables the replacement of instances.
doNotCheckCapabilities -- If set, filters capabilities are not checked before filter
is built (Use with caution to reduce runtime).
sampleSizePercent -- The subsample size as a percentage of the original set.
invertSelection -- Inverts the selection (only if instances are drawn WITHOUT
replacement).
Select the outlook attribute based on class outlook (Nom) to visualize below:
Select the humidity attribute based on class outlook (Nom) to visualize below:
Select the windy attribute based on class outlook (Nom) to visualize below:
Select the play attribute based on class outlook (Nom) to visualize below:
2.7.3 Attribute Selection in WEKA
Attribute selection, or variable subset selection, is the process of selecting a
subset of relevant features. Four algorithms are available for this, as follows:
• CfsSubsetEval algorithm
• Information gain algorithm
• Correlation attribute evaluator
• Gain ratio attribute evaluator
Start → weka → select Explorer → select dataset (Buys Computers)
2.7.3.1 CfsSubsetEval Algorithm
weka.attributeSelection.CfsSubsetEval
It evaluates the worth of a subset of attributes by considering the individual
predictive ability of each feature along with the degree of redundancy between
the features. Subsets of features that are highly correlated with the class while
having low inter-correlation are preferred.
Options
numThreads: The number of threads to use, which should be >= size of thread
pool.
Debug: Output debugging info
MissingSeparate: Treat missing as a separate value. Otherwise, counts
for missing values are distributed across other values in proportion to their
frequency.
PoolSize: The size of the thread pool, for example, the number of cores in the
CPU.
DoNotCheckCapabilities: If set, evaluator capabilities are not checked before
evaluator is built (Use with caution to reduce runtime).
PreComputeCorrelationMatrix: Precompute the full correlation matrix at
the outset, rather than computing correlations lazily (as needed) during the
search. Use this in conjunction with parallel processing in order to speed up a
backward search.
LocallyPredictive: Identify locally predictive attributes. Iteratively adds
attributes with the highest correlation with the class as long as there is not
already an attribute in the subset that has a higher correlation with the attribute
in question.
Attribute selection CfsSubsetEval Output
=== Run information ===
Evaluator: weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search: weka.attributeSelection.BestFirst -D 1 -N 5
Relation: dessision.arff
Instances: 14
Attributes: 6
Rid
Age
Income
student
Credit-Rating
Class: Buys-computers
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 19
Merit of best subset found: 0.247
Attribute Subset Evaluator (supervised, Class (nominal): 6 Class: Buys-computers):
CFS Subset Evaluator
Including locally predictive attributes
Selected attributes: 2, 4 : 2
Age
student
2.7.3.2 Information Gain Algorithm
weka.attributeSelection.InfoGainAttributeEval
Attribute selection output
=== Run information ===
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: dessision.arff
Instances: 14
Attributes: 6
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 6 Class: Buys-computers):
Information Gain Ranking Filter
Ranked attributes:
0.2467 2 Age
0.1518 4 student
0.0481 5 Credit-Rating
0.0292 3 Income
0 1 Rid
Selected attributes: 2, 4, 5, 3, 1 : 5
2.7.3.3 Correlation Attribute Evaluated Algorithm
weka.attributeSelection.CorrelationAttributeEval
CorrelationAttributeEval
Evaluates the worth of an attribute by measuring the correlation (Pearson's)
between it and the class.
Nominal attributes are considered on a value by value basis by treating each
value as an indicator. An overall correlation for a nominal attribute is arrived at
via a weighted average.
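For numeric attributes this is the standard Pearson product-moment correlation; written out (a standard formula, not WEKA-specific), for attribute values $x_i$ and class values $y_i$ over $n$ instances:

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$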
Options
OutputDetailedInfo: Output per-value correlation for nominal attributes.
DoNotCheckCapabilities: If set, evaluator capabilities are not checked before
the evaluator is built (use with caution to reduce runtime).
2.7.3.4 Gain Ratio Attribute Evaluated Algorithm
weka.attributeSelection.GainRatioAttributeEval
Evaluates the worth of an attribute by measuring the gain ratio with respect to
the class:
GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute).
Options
MissingMerge: Distribute counts for missing values. Counts are distributed
across other values in proportion to their frequency. Otherwise, missing is
treated as a separate value.
DoNotCheckCapabilities: If set, evaluator capabilities are not checked before
the evaluator is built (use with caution to reduce runtime).
Attribute selection gain ratio attribute algorithm output
=== Run information ===
Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: dessision.arff
Instances: 14
Attributes: 6
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 6 Class: Buys-computers):
Gain Ratio feature evaluator
Ranked attributes:
0.1564 2 Age
0.1518 4 student
0.0488 5 Credit-Rating
0.0188 3 Income
0 1 Rid
Selected attributes: 2, 4, 5, 3, 1 : 5
2.8 ASSOCIATION RULE MINING
Options of WEKA's Apriori associator:
car: If enabled, class association rules are mined instead of (general) association
rules.
classIndex: Index of the class attribute. If set to -1, the last attribute is taken as
the class attribute.
delta: Iteratively decrease support by this factor. Reduces support until min
support is reached or the required number of rules has been generated.
LowerBoundMinSupport: Lower bound for minimum support.
MetricType: Set the type of metric by which to rank rules. Confidence is the
proportion of the examples covered by the premise that are also covered by
the consequence (Class association rules can only be mined using confidence).
Lift is confidence divided by the proportion of all examples that are covered by
the consequence. This is a measure of the importance of the association that
is independent of support. Leverage is the proportion of additional examples
covered by both the premise and consequence above those expected if the
premise and consequence were independent of each other. The total number of
examples that this represents is presented in brackets following the leverage.
Conviction is another measure of departure from independence (see the formulas after this options list).
MinMetric: Minimum metric score. Consider only rules with scores higher
than this value.
NumRules: Number of rules to find.
OutputItemSets: If enabled the item sets are output as well.
RemoveAllMissingCols: Remove columns with all missing values.
SignificanceLevel: Significance level. Significance test (confidence metric
only).
UpperBoundMinSupport: Upper bound for minimum support. Start iteratively
decreasing minimum support from this value.
Verbose: If enabled, the algorithm will be run in verbose mode.
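For reference, the four metrics described under MetricType can be written compactly. For a rule X ==> Y mined from N examples, with supp(·) denoting the number of examples covered (standard definitions matching the descriptions above):

$$ \mathrm{conf}(X \Rightarrow Y)=\frac{\mathrm{supp}(X\cup Y)}{\mathrm{supp}(X)},\qquad \mathrm{lift}(X \Rightarrow Y)=\frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)/N} $$

$$ \mathrm{leverage}(X \Rightarrow Y)=\frac{\mathrm{supp}(X\cup Y)}{N}-\frac{\mathrm{supp}(X)}{N}\cdot\frac{\mathrm{supp}(Y)}{N},\qquad \mathrm{conviction}(X \Rightarrow Y)=\frac{1-\mathrm{supp}(Y)/N}{1-\mathrm{conf}(X \Rightarrow Y)} $$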
How to open the Apriori algorithm in WEKA:
Start → weka → select Explorer → Open file (weather.nominal) → select the
Associate tab → choose the algorithm (Apriori) → click on Start.
(a) Using numRules = 10 with the confidence metric
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6
Best rules found:
1. outlook=overcast 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
8. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2 conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 conf:(1)
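Reading these rules against the 14-instance weather data makes the confidence values concrete. Rule 4, for example, says that 3 instances satisfy the premise outlook=sunny and play=no, and all 3 of them also satisfy humidity=high, so confidence = 3/3 = 1. The same arithmetic explains the lift figures reported further below: a rule with confidence 1 whose consequence play=no covers 5 of the 14 instances has lift = 1 / (5/14) = 2.8.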
(b) Using numRules = 10 with the Lift metric
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 3 -C 1.1 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
=== Associator model (full training set) ===
Minimum support: 0.25 (3 instances)
Minimum metric <conviction>: 1.1
Number of cycles performed: 15
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 26
Size of set of large itemsets L(3): 4
Best rules found:
1. temperature=cool 4 ==> humidity=normal 4 conf:(1) lift:(2) lev:(0.14) [2]
<conv:(2)>
2. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1) lift:(2.8) lev:(0.14)
[1] <conv:(1.93)>
3. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1) lift:(2) lev:(0.11)
[1] <conv:(1.5)>
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1) lift:(2)
lev:(0.11) [1] <conv:(1.5)>
5. outlook=overcast 4 ==> play=yes 4 conf:(1) lift:(1.56) lev:(0.1) [1]
<conv:(1.43)>
6. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1) lift:(1.56)
lev:(0.1) [1] <conv:(1.43)>
7. play=no 5 ==> outlook=sunny humidity=high 3 conf:(0.6) lift:(2.8) lev:(0.14)
[1] <conv:(1.31)>
8. humidity=high play=no 4 ==> outlook=sunny 3 conf:(0.75) lift:(2.1)
lev:(0.11) [1] <conv:(1.29)>
9. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1) lift:(1.75) lev:(0.09)
[1] <conv:(1.29)>
10. humidity=normal 7 ==> play=yes 6 conf:(0.86) lift:(1.33) lev:(0.11) [1]
<conv:(1.25)>
2.9 CLASSIFICATION
Classification is essentially the distribution of data among various classes
defined on a data set. Classification algorithms learn this form of distribution
from a given training set and then try to classify correctly test data for which
the class is not specified. The values that specify these classes in the dataset
are given label names and are used to determine the class of data presented
during testing.
Logistic Regression
The algorithm learns a coefficient for each input value, and these are linearly
combined into a regression function and transformed using a logistic (s-shaped)
function. Logistic regression is a fast and simple technique, but it can be very
effective on some problems.
Logistic regression itself only supports binary classification problems, although
the WEKA implementation has been adapted to support multi-class classification
problems.
Choose the logistic regression algorithm:
1. Click the “Choose” button and select “Logistic” under the “functions”
group.
2. Click on the name of the algorithm to review the algorithm configuration.
The algorithm can run for a fixed number of iterations (maxIts), but by default will
run until it is estimated that the algorithm has converged. The implementation
uses a ridge estimator which is a type of regularization. This method seeks
to simplify the model during training by minimizing the coefficients learned
by the model. The ridge parameter defines how much pressure to put on the
algorithm to reduce the size of the coefficients. Setting this to 0 will turn off this
regularization.
1. Click “OK” to close the algorithm configuration.
2. Click the “Start” button to run the algorithm on the Ionosphere dataset.
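The same experiment can also be run from the command line. A sketch, assuming weka.jar is on your classpath and the dataset path is adjusted to your installation:
java weka.classifiers.functions.Logistic -t "C:\Program Files\Weka-3-8\data\ionosphere.arff"
Here -t names the training file; with no separate test file, WEKA reports cross-validation results for the classifier.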
You can see that, with the default configuration, logistic regression achieves
an accuracy of 88%.
The depth of the tree is defined automatically, but a depth can be specified in
the maxDepth attribute.
You can also choose to turn off pruning by setting the noPruning parameter to
True, although this may result in worse performance. The minNum parameter
defines the minimum number of instances supported by the tree in a leaf node
when constructing the tree from the training data.
1. Click “OK” to close the algorithm configuration.
2. Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that, with the default configuration, the decision tree algorithm
achieves an accuracy of 89%.
Another, more advanced decision tree algorithm that you can use is the C4.5
algorithm, called J48 in WEKA. You can review a visualization of a decision
tree prepared on the entire training data set by right-clicking on the "Result list"
and clicking "Visualize Tree".
2.10 CLUSTERING
K-Means Algorithm Using WEKA Explorer
Let us see how to implement the K-means algorithm for clustering using WEKA
Explorer.
Cluster Analysis
Clustering algorithms are unsupervised learning algorithms used to create
groups of data with similar characteristics. They aggregate objects with
similarities into groups and subgroups, thus leading to the partitioning of
datasets. Cluster analysis is the process of partitioning datasets into subsets.
These subsets are called clusters, and the set of clusters is called a clustering.
Cluster analysis is used in many applications such as image recognition,
pattern recognition, web search and security, and in business intelligence such
as the grouping of customers with similar likings.
K-Means Clustering
K-means clustering is the simplest clustering algorithm. In the K-means
algorithm, the dataset is partitioned into K clusters. An objective function is
used to measure the quality of the partitions, so that similar objects end up in
one cluster and dissimilar objects in other groups.
In this method, a cluster is represented by its centroid. The centroid is taken as
the center of the cluster and is calculated as the mean value of the points within
the cluster. The quality of the clustering is then found by measuring the
Euclidean distance between each point and its cluster center. This distance
should be minimized.
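In symbols, K-means seeks cluster assignments and centroids that minimize the within-cluster sum of squared distances (a standard formulation, independent of WEKA):

$$ \min_{C_1,\ldots,C_K}\; \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i $$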
How Does the K-Means Clustering Algorithm Work?
Step #1: Choose a value of K, where K is the number of clusters.
Step #2: Iterate over each point and assign it to the cluster having the nearest
center. When every element has been assigned, compute the centroid of each
cluster.
Step #3: Iterate over every element of the dataset and calculate the Euclidean
distance between the point and the centroid of every cluster. If any point is in
a cluster whose center is not the nearest to it, reassign that point to the nearest
cluster; after doing this for all points in the dataset, calculate the centroid of
each cluster again.
Step #4: Repeat Step #3 until no new assignment takes place between two
consecutive iterations. (A compact sketch of these steps in code follows.)
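The four steps above translate almost directly into code. The following is a minimal, self-contained C++ sketch on 2-D points (illustrative only: it is not WEKA's SimpleKMeans implementation, and the sample data and the seeding of centroids from the first K points are assumptions made for the example):

#include <cstddef>
#include <iostream>
#include <vector>

struct Point { double x, y; };

// Squared Euclidean distance; the square root is not needed for comparisons.
double SquaredDistance (const Point &a, const Point &b)
{
  double dx = a.x - b.x, dy = a.y - b.y;
  return dx * dx + dy * dy;
}

int main ()
{
  std::vector<Point> data = { {1, 1}, {1.5, 2}, {3, 4}, {5, 7},
                              {3.5, 5}, {4.5, 5}, {3.5, 4.5} };
  const std::size_t k = 2;                       // Step 1: choose K
  // Seed centroids with the first k points (one simple, assumed choice).
  std::vector<Point> centroids (data.begin (), data.begin () + k);
  std::vector<std::size_t> assignment (data.size (), 0);

  bool changed = true;
  while (changed)       // Step 4: stop when no assignment changes
    {
      changed = false;
      // Steps 2-3: assign every point to its nearest centroid
      for (std::size_t i = 0; i < data.size (); ++i)
        {
          std::size_t best = 0;
          for (std::size_t c = 1; c < k; ++c)
            if (SquaredDistance (data[i], centroids[c]) <
                SquaredDistance (data[i], centroids[best]))
              best = c;
          if (best != assignment[i]) { assignment[i] = best; changed = true; }
        }
      // Recompute each centroid as the mean of its cluster's points
      for (std::size_t c = 0; c < k; ++c)
        {
          Point sum = {0, 0};
          std::size_t n = 0;
          for (std::size_t i = 0; i < data.size (); ++i)
            if (assignment[i] == c) { sum.x += data[i].x; sum.y += data[i].y; ++n; }
          if (n > 0) centroids[c] = { sum.x / n, sum.y / n };
        }
    }

  for (std::size_t i = 0; i < data.size (); ++i)
    std::cout << "(" << data[i].x << ", " << data[i].y << ") -> cluster "
              << assignment[i] << "\n";
  return 0;
}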
K-means Clustering Implementation Using WEKA
The steps for implementation using WEKA are as follows:
1) Open WEKA Explorer and click on Open File in the Preprocess tab. Choose
dataset “vote.arff”.
2) Go to the "Cluster" tab and click on the "Choose" button. Select the
clustering method as "SimpleKMeans".
4) Click on Start in the left panel. The algorithm displays its results on the white
screen. Let us analyze the run information:
• Scheme, Relation, Instances, and Attributes describe the properties of the
dataset and the clustering method used. In this case, the vote.arff dataset
has 435 instances and 13 attributes.
• With the K-means clusterer, the number of iterations is 5.
• The sum of squared errors is 1098.0. This error will reduce with an
increase in the number of clusters.
• The final clusters with centroids are represented in the form of a table.
In our case, the centroids of the clusters are 168.0, 47.0, 37.0, 122.0, 33.0
and 28.0.
• Clustered instances represent the number and percentage of total
instances falling in each cluster.
Output
K-means clustering is a simple cluster analysis method. The number of clusters
can be set using the settings tab. The centroid of each cluster is calculated as
the mean of all points within the cluster. With an increase in the number of
clusters, the sum of squared errors is reduced. The objects within a cluster
exhibit similar characteristics and properties. The clusters represent the class
labels.
2.12 SUMMARY
Data mining (also known as knowledge discovery from databases) is the
process of extracting hidden, previously unknown and potentially useful
information from databases. The extracted information can be analyzed for
future planning and development.
Data mining steps in the knowledge discovery process are as follows:
• Data Cleaning- The removal of noise and inconsistent data.
• Data Integration - The combination of multiple sources of data.
• Data Selection - The data relevant for analysis is retrieved from the
database.
• Data Transformation - The consolidation and transformation of data
into forms appropriate for mining.
• Data Mining - The use of intelligent methods to extract patterns from
data.
• Pattern Evaluation - Identification of patterns that are interesting.
• Knowledge Presentation - Visualization and knowledge representation
techniques are used to present the extracted or mined knowledge to
the end user.
In this Data Mining lab course you were given exposure to working with WEKA.
WEKA is a collection of machine learning algorithms for data mining tasks.
It contains tools for data preparation, classification, regression, clustering,
association rule mining, and visualization. The software is named after the
weka, a flightless bird with an inquisitive nature found only on the islands of
New Zealand. WEKA is open-source software issued under the GNU General
Public License. The video links for the course are available under Online Lab
Resources.
9. Mei Yu Yuan (2016) Data Mining and Machine Learning: WEKA
Technology and Practice, Tsinghua University Press (in Chinese).
10. Jürgen Cleve, Uwe Lämmel (2016) Data Mining, De Gruyter (in German).
11. Eric Rochester (2015) Clojure Data Analysis Cookbook - Second Edition,
Packt Publishing.
12. Boštjan Kaluža (2013) Instant Weka How-to, Packt Publishing.
13. Hongbo Du (2010) Data Mining Techniques and Applications, Cengage
Learning.