DWBI Lab Manual 2023-24 Final

Data Warehousing LAB MANUAL
Semester: V
Batch: 2022-26

Prepared By: Dr. Ashish K Sharma/Prof. Umashankar Pandey

Verified By:
Mr. Jagannath Thawait

SCHOOL OF ENGINEERING
Department of Computer Science and Engineering
O P JINDAL UNIVERSITY
O P JINDAL KNOWLEDGE PARK, PUNJIPATRA,
RAIGARH-496109

COURSE OBJECTIVE:
The objective is to explain to students the core information assurance (IA)
principles and to introduce the key components of cyber security: network
architecture, security tools and hardening techniques. This course combines the
disciplines of technology, business, law and organizational behaviour.

COURSE OUTCOMES:
1. Design a data mart or data warehouse for any organization, and learn to
perform data mining tasks using a data mining toolkit.
2. Extract knowledge using data mining.
3. Adapt to new data mining tools.
4. Explore recent trends in data mining such as web mining and
spatial-temporal mining.

CO-PO & PSO Correlation

Course Name: Data Mining Lab


Course Outcomes | Program Outcomes (1 2 3 4 5) | PSOs (1 2 3)
CO1: 1 3
CO2: 2 1 1 2
CO3: 2 1 1 1 2
CO4: 1 2 1 1 1

Note: 1 = Low, 2 = Moderate, 3 = High


Prerequisites
1. The students should have basic knowledge of Python and Artificial
Intelligence.
2. The students should be familiar with the concepts of different types of
classifiers in data mining.
3. The students must have knowledge of data visualization in Python.

List of Experiments:
1. Explore the WEKA Data Mining/Machine Learning Toolkit.
2. Create an Employee Table with the help of the Data Mining Tool WEKA.
3. Apply pre-processing techniques to the training data set of the Weather
Table.
4. Find association rules for banking data using the Apriori algorithm.
5. Demonstrate the classification rule process on the dataset weather.arff
using the J48 algorithm.
6. Demonstrate the Linear Regression algorithm using WEKA.
7. Write a procedure for clustering customer data using the Simple KMeans
algorithm.
8. Extract if-then rules from the decision tree generated by a classifier.
Observe the confusion matrix and derive Accuracy, F-measure, TP rate,
FP rate, Precision and Recall values. Apply a cross-validation strategy
with various fold levels and compare the accuracy results.
9. Load each dataset into Weka and perform Naïve Bayes classification and
k-Nearest Neighbor classification. Interpret the results obtained.
10. Explore the visualization features of WEKA to visualize the clusters.
Derive interesting insights and explain.
11. Load each dataset into Weka and perform k-Nearest Neighbor
classification. Interpret the results obtained.

Experiment 1:

Aim of the Experiment:


Explore WEKA Data Mining/Machine Learning Toolkit.
Description:
Installation steps for WEKA, a data mining tool:
1. Download the software as per your requirements from the link given below.
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
2. Java is required for the installation of WEKA, so if you already have Java on
your machine then download only WEKA; otherwise download the version bundled with the JVM.
3. Then open the file location and double-click on the file.

4. Click next

5. Click I Agree.

6. Make any changes to the settings as required and click Next. 'Full' and 'Associate files'
are the recommended settings.

7. Change to your desired installation location.

8. If you want a shortcut then check the box and click Install.

9. The installation will start; wait a while, it will finish within a minute.

10. After the installation is complete, click on Next.

11. That's all; click on Finish and start WEKA.

This is the GUI you get when WEKA starts. You have four options: Explorer, Experimenter,
KnowledgeFlow and Simple CLI.

The buttons can be used to start the following applications:


Explorer: An environment for exploring data with WEKA (the rest of this documentation deals
with this application in more detail).

Experimenter: An environment for performing experiments and conducting statistical tests
between learning schemes.

Knowledge Flow: This environment supports essentially the same functions as the Explorer but
with a drag-and-drop interface. One advantage is that it supports incremental learning.

Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA
commands for operating systems that do not provide their own command-line interface.

Download all the datasets from the following website:


https://waikato.github.io/weka-wiki/datasets/

Interview/Viva-Voce Questions:
1. What is the WEKA tool?
2. What is data mining?
3. What does ARFF stand for?
4. What are the different applications of WEKA?
5. What is the role of the Explorer in WEKA?

ARFF File Format
An ARFF (= Attribute-Relation File Format) file is an ASCII text file that describes a list of
instances sharing a set of attributes.

ARFF files are not the only format one can load; any file that can be converted with
Weka's "core converters" can be loaded as well. The following formats are currently supported:

 ARFF (+ compressed)
 C4.5
 CSV
 libsvm
 binary serialized instances
 XRFF (+ compressed)

Overview

ARFF files have two distinct sections. The first section is the Header information, which is
followed by the Data information. The Header of the ARFF file contains the name of the relation, a
list of the attributes (the columns in the data), and their types.

CSV (Comma-Separated Values) files:


A comma-separated values file is a delimited text file that uses a comma to separate values. Each
line of the file is a data record. Each record consists of one or more fields, separated by commas.
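As a quick illustration, a CSV file such as the student.csv built later in this manual can be parsed with Python's standard csv module (the records here are hypothetical):

```python
import csv
import io

# Hypothetical CSV content, in the style of the student.csv assignment.
raw = "name,age,marks\nasha,20,78\nravi,21,82\n"

# csv.reader splits each line (data record) into fields at the commas.
rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]
```

Here `header` holds the column names from the first line and `records` holds one list of fields per data record.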

Experiment 2:
Create an Employee Table (dataset) with the help of the Data Mining Tool WEKA.
Description:
We need to create an Employee Table with training data set which includes attributes
like name, id, salary, experience, gender, phone number.

Procedure:
1) Open Start -> Programs -> Accessories -> Notepad
2) Type the following training data set with the help of Notepad for Employee Table.
@relation employee
@attribute name {x,y,z,a,b}
@attribute id numeric
@attribute salary {low,medium,high}
@attribute exp numeric
@attribute gender {male,female}
@attribute phone numeric
@data
x,101,low,2,male,250311
y,102,high,3,female,251665
z,103,medium,1,male,240238

a,104,low,5,female,200200
b,105,high,2,male,240240
3) After that, save the file in the .arff file format.
4) Minimize the arff file and then open Start -> Programs -> weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows employee table on weka.
Output:
Training Data Set -> Employee Table
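The same training file can also be produced without Notepad. A minimal Python sketch that assembles the ARFF text from step 2 (writing the string out to employee.arff is left as an exercise):

```python
# Header mirroring the @relation/@attribute declarations typed in step 2.
header = (
    "@relation employee\n"
    "@attribute name {x,y,z,a,b}\n"
    "@attribute id numeric\n"
    "@attribute salary {low,medium,high}\n"
    "@attribute exp numeric\n"
    "@attribute gender {male,female}\n"
    "@attribute phone numeric\n"
    "@data\n"
)
rows = [
    ("x", 101, "low", 2, "male", 250311),
    ("y", 102, "high", 3, "female", 251665),
    ("z", 103, "medium", 1, "male", 240238),
]
# Each @data line is the comma-joined string form of one instance.
arff = header + "\n".join(",".join(str(v) for v in row) for row in rows)
```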

Assignment
1. Load the iris dataset using Weka.
2. Load the labor dataset using Weka.
3. Create a student.csv file and load the following data:

Save the above file with the student.csv extension.
Load the above CSV file using WEKA.

Interview/Viva-Voce Questions:
1. What does the @relation tag do?
2. What does the @attribute tag do?
3. What is the .arff file format? What does ARFF stand for?
4. What does the @data tag do?
5. What are CSV files?

Pre-processing techniques
The data that is collected from the field contains many unwanted things that
lead to wrong analysis. For example, the data may contain null fields, it may
contain columns that are irrelevant to the current analysis, and so on. Thus, the
data must be preprocessed to meet the requirements of the type of analysis you
are seeking. This is done in the preprocessing module.
To demonstrate the available features of preprocessing, we will use the Weather
database that is provided in the installation. Using the Open file option under
the Preprocess tab, select the weathernominal.arff file.

Experiment 3:

Aim of the Experiment:


Apply Pre-Processing techniques to the training data set of the Weather Table.

Description:
Real-world databases are highly susceptible to noise, missing values, and inconsistency
due to their huge size, so the data is pre-processed to improve its quality and fill in
missing values; this also improves efficiency.
There are 3 pre-processing techniques:
1) Add
2) Remove
3) Normalization
Procedure:
1) Open Start -> Programs -> Accessories -> Notepad
2) Type the following training data set with the help of Notepad for the Weather
Table.
@relation weather
@attribute outlook {sunny,rainy,overcast}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true,false}
@attribute play {yes,no}
@data
sunny,85.0,85.0,false,no
overcast,80.0,90.0,true,no
sunny,83.0,86.0,false,yes
rainy,70.0,86.0,false,yes
rainy,68.0,80.0,false,yes
rainy,65.0,70.0,true,no
overcast,64.0,65.0,false,yes
sunny,72.0,95.0,true,no
sunny,69.0,70.0,false,yes
rainy,75.0,80.0,false,yes
3) After that, save the file in the .arff file format.
4) Minimize the arff file and then open Start -> Programs -> weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff
file
8) Click on edit button which shows weather table on weka.

Add ->Pre-Processing Technique:
Procedure:
1) Start -> Programs -> Weka-3-4 -> Weka-3-4
2) Click on explorer.
3) Click on open file.
4) Select Weather.arff file and click on open.
5) Click on Choose button and select the Filters option.
6) In Filters, we have Supervised and Unsupervised data.
7) Click on Unsupervised data.
8) Select the attribute Add.
9) A new window is opened.
10) In that we enter attribute index, type, data format, nominal label
values for Climate.
11) Click on OK.
12) Press the Apply button, then a new attribute is added to the
Weather Table.
13) Save the file.
14) Click on the Edit button, it shows a new Weather Table on Weka

Remove -> Pre-Processing Technique:
Procedure:
1) Start -> Programs -> Weka-3-4
2) Click on explorer.
3) Click on open file.
4) Select Weather.arff file and click on open.
5) Click on Choose button and select the Filters option.
6) In Filters, we have Supervised and Unsupervised data.
7) Click on Unsupervised data.
8) Select the attribute Remove.
9) Select the attributes windy, play to Remove.
10) Click Remove button and then Save.
11) Click on the Edit button, it shows a new Weather Table on Weka.

Weather Table after removing attributes WINDY, PLAY:

Normalize -> Pre-Processing Technique:
Procedure:
1) Start -> Programs -> Weka-3-4
2) Click on explorer.
3) Click on open file.
4) Select Weather.arff file and click on open.
5) Click on Choose button and select the Filters option.
6) In Filters, we have Supervised and Unsupervised data.
7) Click on Unsupervised data.
8) Select the attribute Normalize.
9) Select the attributes temperature and humidity to Normalize.
10) Click on Apply button and then Save.
11) Click on the Edit button, it shows a new Weather Table with normalized
values on Weka.
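For reference, WEKA's unsupervised Normalize filter rescales each numeric attribute into [0, 1] by min-max scaling. A short Python sketch of that computation on the temperature column typed above:

```python
# Min-max scaling: x -> (x - min) / (max - min), per numeric attribute.
temps = [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0]
lo, hi = min(temps), max(temps)
normalized = [(t - lo) / (hi - lo) for t in temps]
# The largest value maps to 1.0 and the smallest to 0.0.
```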

Weather Table after Normalizing TEMPERATURE and HUMIDITY:

Result:
This program has been successfully executed.

Assignment
1. Apply pre-processing techniques to the training data set of the Employee
Table.
2. Explore various options in Weka for preprocessing data and apply them to each
dataset, e.g. credit-g, Soybean, Vote, Iris, Contact-lenses.
3. Write the steps to handle missing values in a dataset.

Interview/Viva-Voce Questions:
1. Why do we pre-process the data?
2. What are the steps involved in data pre-processing?
3. Define a virtual data warehouse.
4. What are data mining techniques?
5. Explain supervised and unsupervised data.

Association Rule:
An association rule has two parts, an antecedent (if) and a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found in
combination with the antecedent.
Association rules are created by analyzing data for frequent if/then patterns and
using the criteria support and confidence to identify the most important
relationships. Support is an indication of how frequently the items appear in the
database. Confidence indicates the number of times the if/then statements have
been found to be true.
In data mining, association rules are useful for analyzing and predicting customer
behavior. They play an important part in shopping basket data analysis, product
clustering, catalog design and store layout.
Support and Confidence values:

Support count: The support count of an itemset X, denoted by X.count, in a data set T is the
number of transactions in T that contain X. Assume T has n transactions.
Then, for a rule X → Y:

support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count

For example, for a rule A → C:

support = support({A ∪ C})
confidence = support({A ∪ C}) / support({A})
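The two formulas can be verified with a few lines of Python over a toy transaction set (the items are hypothetical; each transaction is modelled as a set):

```python
# Toy market-basket transactions (hypothetical items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
n = len(transactions)

def count(itemset):
    # Support count: number of transactions containing every item in itemset.
    return sum(1 for t in transactions if itemset <= t)

A, C = {"bread"}, {"milk"}
support = count(A | C) / n            # (X ∪ Y).count / n
confidence = count(A | C) / count(A)  # (X ∪ Y).count / X.count
```

For the rule {bread} → {milk}, two of the four transactions contain both items, so the support is 0.5, and of the three transactions containing bread, two also contain milk, so the confidence is 2/3.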
Experiment 4:

Aim of the Experiment:


Finding association rules for banking data using the Apriori algorithm.

Description:
In data mining, association rule learning is a popular and well researched method
for discovering interesting relations between variables in large databases. It can
be described as analyzing and presenting strong rules discovered in databases
using different measures of interestingness. In market basket analysis
association rules are used and they are also employed in many application areas
including Web usage mining, intrusion detection and bioinformatics.

Creation of Banking Table:

Procedure:

1. Open Start -> Programs -> Accessories -> Notepad

2. Type the following training data set with the help of Notepad for Banking
Table.

@relation bank
@attribute cust {male,female}
@attribute accno
{0101,0102,0103,0104,0105,0106,0107,0108,0109,0110,0111,0112,0113,0114,0115}
@attribute bankname {sbi,hdfc,sbh,ab,rbi}
@attribute location {hyd,jmd,antp,pdtr,kdp}
@attribute deposit {yes,no}

@data
male,0101,sbi,hyd,yes
female,0102,hdfc,jmd,no
male,0103,sbh,antp,yes
male,0104,ab,pdtr,yes
female,0105,sbi,jmd,no
male,0106,ab,hyd,yes
female,0107,rbi,jmd,yes
female,0108,hdfc,kdp,no
male,0109,sbh,kdp,yes
male,0110,ab,jmd,no
female,0111,rbi,kdp,yes
male,0112,sbi,jmd,yes
female,0113,rbi,antp,no
male,0114,hdfc,pdtr,yes
female,0115,sbh,pdtr,no

3) After that, save the file in the .arff file format.


4) Minimize the arff file and then open Start -> Programs -> weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff
file
8) Click on edit button which shows banking table on weka.

Training Data Set -> Banking Table

Procedure for Association rules

1) Open Start -> Programs -> Weka-3-4 -> Weka-3-4


2) Open explorer.
3) Click on open file and select bank.arff
4) Select Associate option on the top of the Menu bar.
5) Select Choose button and then click on Apriori Algorithm.
6) Click on Start button and output will be displayed on the right side of the
window.
Output
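Conceptually, the first two passes of the Apriori run above look like the following pure-Python sketch (the transactions and the minimum support count are hypothetical):

```python
from itertools import combinations

# Hypothetical banking-style transactions, one set of items per customer row.
transactions = [
    {"sbi", "hyd", "yes"},
    {"hdfc", "jmd", "no"},
    {"sbi", "jmd", "yes"},
    {"sbi", "hyd", "yes"},
]
min_count = 2  # minimum support count for an itemset to be "frequent"

def frequent(candidates):
    # Keep candidates contained in at least min_count transactions.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count}

# Pass 1: frequent 1-itemsets.
L1 = frequent({frozenset([item]) for t in transactions for item in t})
# Pass 2: candidate 2-itemsets come only from frequent 1-itemsets
# (the Apriori property: every subset of a frequent itemset is frequent).
L2 = frequent({a | b for a, b in combinations(L1, 2) if len(a | b) == 2})
```

Restricting candidates to combinations of already-frequent itemsets is what lets Apriori prune the search space at each pass.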

Assignment
1. Apply different discretization filters on numerical attributes and run the Apriori
association rule algorithm. Study the rules generated. Derive interesting insights
and observe the effect of discretization in the rule generation process, e.g. on
datasets like Vote, Soybean, Supermarket and Iris.

2. Finding Association Rules for Buying data.

Interview/Viva-Voce Questions:
1. What are the data mining functionalities?
2. Define support and confidence.
3. What is the difference between a dependent data warehouse and an independent data warehouse?
4. What is the Apriori association rule algorithm?

Knowledge Flow
The Knowledge Flow provides an alternative to the Explorer as a graphical front
end to WEKA’s core algorithms.

The Knowledge Flow presents a data-flow inspired interface to WEKA. The user can
select WEKA components from a palette, place them on a layout canvas and
connect them together in order to form a knowledge flow for processing and
analyzing data. At present, all of WEKA’s classifiers, filters, clusterers,
associators, loaders and savers are available in the Knowledge Flow along with
some extra tools.

Experiment 5:
Aim:

Demonstration of the classification rule process on dataset weather.arff using the J48
algorithm

Theory:

Cross-validation, sometimes called rotation estimation, is a technique for


assessing how the results of a statistical analysis will generalize to an
independent data set. It is mainly used in settings where the goal is prediction,
and one wants to estimate how accurately a predictive model will perform in
practice. One round of cross-validation involves partitioning a sample of data
into complementary subsets, performing the analysis on one subset (called the
training set), and validating the analysis on the other subset (called the
validation set or testing set).
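The partitioning step described above can be sketched in Python as a simple index split (WEKA's cross-validation additionally randomizes and stratifies the folds):

```python
# Split n instance indices into k folds; each fold is the test set once,
# with the remaining indices forming the training set.
def kfold_indices(n, k):
    folds = [list(range(n))[i::k] for i in range(k)]
    return [([j for j in range(n) if j not in fold], fold) for fold in folds]

# 14 instances (the size of the weather dataset) split into 7 folds of 2.
splits = kfold_indices(14, 7)
```

Each of the 7 (train, test) pairs uses 12 instances for training and 2 for validation, so every instance is tested exactly once.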
Decision tree learning, used in data mining and machine learning, uses a
decision tree as a predictive model which maps observations about an item to
conclusions about the item's target value. In these tree structures, leaves
represent classifications and branches represent conjunctions of features that
lead to those classifications. In decision analysis, a decision tree can be used to
visually and explicitly represent decisions and decision making. In data mining,
a decision tree describes data but not decisions; rather, the resulting
classification tree can be an input for decision making. This page deals with
decision trees in data mining.

Creation of Weather Table:


Procedure:

1) Open Start -> Programs -> Accessories -> Notepad


2) Type the following training data set with the help of Notepad for Weather Table.
@relation weather
@attribute outlook {sunny, rainy, overcast}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes

rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

3) After that, save the file in the .arff file format.

4) Minimize the arff file and then open Start -> Programs -> weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows weather table on weka.

Training Data Set -> Weather Table

Procedure:
1) Start -> Programs -> Weka 3.4

2) Open Knowledge Flow.


3) Select Data Source tab & choose Arff Loader.
4) Place Arff Loader component on the layout area by clicking on that
component.

5) Specify an Arff file to load by right clicking on Arff Loader icon, and then a
pop-up menu will appear. In that select Configure & browse to the location of
weather.arff
6) Click on the Evaluation tab & choose Class Assigner & place it on the layout.
7) Now connect the Arff Loader to the Class Assigner by right clicking on Arff
Loader, and then select Data Set option, now a link will be established.
8) Right click on Class Assigner & choose Configure option, and then a new
window will appear & specify a class to our data.
9) Select Evaluation tab & select Cross-Validation Fold Maker & place it on the
layout.
10) Now connect the Class Assigner to the Cross-Validation Fold Maker.
11) Select Classifiers tab & select J48 component & place it on the layout.
12) Now connect Cross-Validation Fold Maker to J48 twice; first choose
Training Data Set option and then Test Data Set option.
13) Select Evaluation Tab & select Classifier Performance Evaluator
component & place it on the layout.
14) Connect J48 to Classifier Performance Evaluator component by right
clicking on J48 & selecting Batch Classifier.
15) Select Visualization tab & select Text Viewer component & place it on the
layout.
16) Connect Text Viewer to Classifier Performance Evaluator by right clicking
on Text Viewer & by selecting Text option.
17) Start the flow of execution by selecting Start Loading from Arff Loader.
18) For viewing result, right click on Text Viewer & select the Show Results,
and then the result will be displayed on the new window.

Output:

Assignment

1. Load each dataset into Weka and run the ID3 and J48 classification algorithms,
study the classifier output, and compute entropy values and the Kappa statistic.
2. Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can
you classify correctly? (This is also called testing on the training set.) Why do you
think you cannot get 100% training accuracy?
If we use the above model trained on the complete dataset and classify credit as
good/bad for each of the examples in that dataset, we cannot get 100% training
accuracy; only 85.5% of the examples are classified correctly.
3. Sometimes the cost of rejecting an applicant who actually has good credit
(Case 1) might be higher than accepting an applicant who has bad credit
(Case 2). Instead of counting the misclassifications equally in both cases, give a
higher cost to the first case (say cost 5) and a lower cost to the second case. You can
do this by using a cost matrix in WEKA. Train your decision tree again and report
the decision tree and cross-validation results. Are they significantly different
from the results obtained in problem 9 (using equal cost)?

Result:
The program has been successfully executed.
Interview/Viva-Voce Questions:
1. What is the Decision Tree algorithm?
2. What is the benefit of the Decision Tree algorithm?
3. What are rule antecedent and rule consequent?
4. What is an if-then rule?

Experiment 6
Aim: Demonstration of Linear Regression Algorithm using weka
Demonstration:
Regression is a process of finding the correlations between dependent and
independent variables. The task of a regression algorithm is to find the mapping
function that maps the input variable (x) to the continuous output variable (y).
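For a single predictor this mapping is the least-squares line y = b0 + b1*x. A self-contained Python sketch on hypothetical (x, y) pairs:

```python
# Hypothetical training points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx

def predict(x):
    # The fitted mapping from input x to continuous output y.
    return b0 + b1 * x
```

WEKA's LinearRegression classifier fits the same kind of model, extended to several predictors and with attribute selection options.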

Procedure.
1. Start -> Programs -> Weka 3.4
2. Then select Explorer
3. Select any of the dataset by selecting from open file option
4. Then goto classify tab and choose Linear Regression classifier
5. Set cross Validation Folds to 10
6. Then click start
Assignment:
1. Load the Car_salesman_commision dataset, calculate its future commission and perform the
following operations:
a. Explain why the predicted commission class value in the Classify tab is equal to the
Mean value in the Preprocess tab.
b. What was the Correlation coefficient value displayed in the Run information of the
Classify tab?
c. What was the StDev value displayed in the Preprocess tab?
d. What does the Correlation coefficient value represent in the dataset?
The Correlation coefficient value represents how strongly the predicted and actual
values of the dataset are correlated.
e. What does StDev mean, and what does it represent?

Viva voce/Interview Questions


1. Regression operates upon ____ values.
2. What are dependent and independent variables?
3. What is Linear Regression?

Experiment 7
Aim:

Write a procedure for Clustering Customer data using Simple KMeans Algorithm.

Description:

K-means clustering is a simple unsupervised learning algorithm. In it, the 'n' data
objects are grouped into a total of 'k' clusters, with each observation belonging to
the cluster with the closest mean. It defines 'k' centres, one for each cluster
(each centre can be thought of as the centre point of its cluster). The clusters are
separated by a large distance.
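The two alternating steps of SimpleKMeans (assign each object to its nearest centre, then recompute each centre as the mean of its cluster) can be sketched in one dimension with hypothetical points and starting centroids:

```python
# Hypothetical 1-D points forming two obvious groups.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]  # initial guesses

for _ in range(10):  # iterate assignment + update (sketch: fixed iterations)
    clusters = [[] for _ in centroids]
    for p in points:
        # Assignment step: nearest centroid by absolute distance.
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid becomes its cluster mean
    # (assumes no cluster ever empties, which holds for this data).
    centroids = [sum(c) / len(c) for c in clusters]
```

For this data the centroids converge to roughly 1.0 and 8.0, the means of the two groups.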

Procedure:
1. Open Start -> Programs -> Accessories -> Notepad
2. Type the following training data set with the help of Notepad for the Customer Table.
@relation customer
@attribute name {x,y,z,u,v,l,w,q,r,n}
@attribute age {youth,middle,senior}
@attribute income {high,medium,low}
@attribute class {A,B}
@data
x,youth,high,A
y,youth,low,B
z,middle,high,A
u,middle,low,B
v,senior,high,A
l,senior,low,B
w,youth,high,A
q,youth,low,B
r,middle,high,A
n,senior,high,A
3) After that, save the file in the .arff file format.
4) Minimize the arff file and then open Start -> Programs -> weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff
file
8) Click on edit button which shows the customer table on weka.
Training Data Set -> Customer Table

Procedure:
1) Click Start -> Programs -> Weka 3.4
2) Click on Explorer.
3) Click on open file & then select Customer.arff file.
4) Click on Cluster menu. In this there are different algorithms are there.
5) Click on Choose button and then select SimpleKMeans algorithm.
6) Click on Start button and then output will be displayed on the screen.
Output:

Assignment:
1. Study the clusters formed. Observe the sum of squared errors and centroids,
and derive insights.

2. Apply Iris datasets using k-means clustering algorithm and select number of
clusters and select classes to clusters evaluation and check Number of instances,
Number of attributes, Number of Iterations and Incorrectly clustered instances.
And visualize their cluster.

Interview/Viva-Voce Questions:
1. What is clustering?
2. What do you mean by k-means clustering?
3. What are different clustering techniques?
4. How do you increase the number of clusters in WEKA?

Experiment 8

Aim:

Extract if-then rules from the decision tree generated by the classifier. Observe the
confusion matrix and derive Accuracy, F-measure, TP rate, FP rate, Precision and
Recall values. Apply a cross-validation strategy with various fold levels and
compare the accuracy results.

Description:
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buys_computer, which indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a
class.

The benefits of having a decision tree are as follows −


 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
IF-THEN Rules:
Rule-based classifiers make use of a set of IF-THEN rules for classification. We
can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1:
R1: IF age = youth AND student = yes
THEN buys_computer = yes

Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests, and these
tests are logically ANDed.
 The consequent part consists of the class prediction.

We can also write rule R1 as follows:


R1: (age = youth) ∧ (student = yes) → (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules
from a decision tree.
Points to remember −
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm


The Sequential Covering Algorithm can be used to extract IF-THEN rules from the
training data. We do not need to generate a decision tree first. In this
algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the
general strategy, the rules are learned one at a time. Each time a rule is
learned, the tuples covered by the rule are removed and the process continues for
the rest of the tuples.
Note − Decision tree induction can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a
rule.

The following is the sequential learning algorithm, where rules are learned for
one class at a time. When learning a rule for a class Ci, we want the rule to
cover all the tuples from class Ci only and no tuple from any other class.

Algorithm: Sequential Covering

Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: a set of IF-THEN rules.
Method:
Rule_set = { };  // initial set of rules learned is empty
for each class c do
  repeat
    Rule = Learn_One_Rule(D, Att_vals, c);
    remove tuples covered by Rule from D;
  until termination condition;
  Rule_set = Rule_set + Rule;  // add the new rule to the rule set
end for
return Rule_set;
Rule Pruning
The rule is pruned due to the following reasons −

 The assessment of quality is made on the original set of training data. The
rule may perform well on training data but less well on subsequent data.
That is why rule pruning is required.
 The rule is pruned by removing a conjunct. Rule R is pruned if the pruned
version of R has greater quality than what was assessed on an
independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule
R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R,
respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence,
if the FOIL_Prune value is higher for the pruned version of R, then we prune R.

Steps to run decision tree algorithms in WEKA:

1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the iris data set and open the file.
8. Click on the Classify tab, choose the decision table algorithm and select
the cross-validation test option with folds value 10.
9. Click on the Start button.
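The measures listed in the aim all follow from the four cells of a 2x2 confusion matrix. A Python sketch with hypothetical counts:

```python
# Hypothetical confusion matrix cells (rows = actual, columns = predicted).
tp, fn = 40, 10  # actual positives predicted positive / negative
fp, tn = 5, 45   # actual negatives predicted positive / negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
tp_rate = tp / (tp + fn)        # also called recall or sensitivity
fp_rate = fp / (fp + tn)
precision = tp / (tp + fp)
recall = tp_rate
f_measure = 2 * precision * recall / (precision + recall)
```

These are the same quantities WEKA reports in the "Detailed Accuracy By Class" section of the classifier output.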

OUTPUT

Assignment

Interview/Viva-Voce Questions:
1. What is a rule antecedent?
2. What is a rule consequent?

3. What is a decision tree?

Experiment 9:
Aim:
Load each dataset into Weka and perform Naïve Bayes classification and
k-Nearest Neighbor classification. Interpret the results obtained.
Description:
Naive Bayes classifier assumes that the presence (or absence) of a particular feature
of a class is unrelated to the presence (or absence) of any other feature. For example,
a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.
Even though these features depend on the existence of the other features, a naive
Bayes classifier considers all of these properties to independently contribute to the
probability that this fruit is an apple.
An advantage of the naive Bayes classifier is that it requires only a small amount of
training data to estimate the parameters (means and variances of the variables)
necessary for classification. Because the variables are assumed independent, only the
variances of the variables for each class need to be determined, and not the entire
covariance matrix.
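Because of that independence assumption, classification reduces to multiplying a class prior by one conditional probability per feature. A Python sketch on a tiny hypothetical nominal dataset (no smoothing, which a real implementation would add):

```python
# Hypothetical nominal rows: (outlook, humidity, class).
data = [
    ("sunny", "high", "no"),
    ("sunny", "high", "no"),
    ("overcast", "high", "yes"),
    ("rainy", "normal", "yes"),
    ("rainy", "normal", "yes"),
]

def score(cls, outlook, humidity):
    rows = [r for r in data if r[2] == cls]
    p = len(rows) / len(data)                             # prior P(class)
    p *= sum(r[0] == outlook for r in rows) / len(rows)   # P(outlook | class)
    p *= sum(r[1] == humidity for r in rows) / len(rows)  # P(humidity | class)
    return p

scores = {c: score(c, "sunny", "high") for c in ("yes", "no")}
predicted = max(scores, key=scores.get)  # class with the highest score
```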
Procedure:
Steps to run the Naïve Bayes classification algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in C drive.
6. Select and open the data folder.
7. Choose the iris data set and open the file.
8. Click on the Classify tab, choose the NaiveBayes algorithm (under bayes), and select the 'Use training set' test option.
9. Click on the Start button.
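The procedure above has a compact programmatic equivalent. This is a minimal sketch, assuming scikit-learn is installed; `GaussianNB` plays the role of Weka's NaiveBayes, and the 'Use training set' option is mimicked by scoring on the same data the model was fitted on:

```python
# Naive Bayes on the iris data set, mirroring the Weka steps above.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)    # estimates per-class means and variances
accuracy = model.score(X, y)      # evaluate on the training set itself
print(f"Training-set accuracy: {accuracy:.3f}")
```

Note that only per-class means and variances are estimated, reflecting the independence assumption discussed in the description.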

Output:

Plot ROC (Receiver Operating Characteristic) curves.
Steps to plot ROC curves:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Visualize button.
4. Right-click inside the plot area.
5. Select and click on the Polyline option.
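An ROC curve can also be computed directly. The sketch below assumes scikit-learn is installed and binarises iris (class 2 versus the rest) so that a single ROC curve is well defined:

```python
# Computing ROC points and the AUC programmatically, as a complement to
# Weka's visualization. One-vs-rest binarisation makes the task binary.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, auc

X, y = load_iris(return_X_y=True)
y_binary = (y == 2).astype(int)                  # class 2 vs. rest
scores = GaussianNB().fit(X, y_binary).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y_binary, scores)
print(f"AUC: {auc(fpr, tpr):.3f}")
```

Each (fpr, tpr) pair corresponds to one probability threshold; plotting tpr against fpr yields the curve Weka draws.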

Assignment:
1. Load the soybean and iris datasets, perform Naïve Bayes classification, examine the
confusion matrix (true/false positives and negatives), and apply ROC analysis.
2. Load another dataset, perform Naïve Bayes classification, and carry out ROC analysis
on it.

Interview/Viva-Voce Questions:
1. What is a Naïve Bayes classifier?
2. Can you choose a classifier based on the size of the training set?
3. What are the advantages of Naïve Bayes classifiers?

Experiment 10:
Aim:
Explore visualization features of weka to visualize the clusters. Derive interesting
insights and explain.

Description:
WEKA’s visualization allows you to view a 2-D plot of the current working
relation. Visualization is very useful in practice; it helps to determine the difficulty of
the learning problem. WEKA can visualize single attributes (1-D) and pairs of
attributes (2-D), and rotate 3-D visualizations (Xgobi-style). WEKA also has a “Jitter” option
to deal with nominal attributes and to detect “hidden” data points.
Access to visualization from the classifier, cluster, and attribute selection panels is
available from a popup menu. Click the right mouse button over an entry in the result
list to bring up the menu. You will be presented with options for viewing or saving the
text output and, depending on the scheme, further options for visualizing errors,
clusters, trees, etc.
To open the visualization screen, click the ‘Visualize’ tab.

Procedure:
Select a square that corresponds to the attributes you would like to visualize. For
example, let’s choose ‘outlook’ for the X-axis and ‘play’ for the Y-axis. Click anywhere
inside the square that has ‘play’ on the left and ‘outlook’ at the top.

Changing the View:


In the visualization window, beneath the X-axis selector there is a drop-down list, ‘Colour’, for
choosing the colour scheme. This allows you to choose the colour of points based on the attribute
selected. Below the plot area, there is a legend that describes what values the colours correspond to.
In our example, red represents ‘no’, while blue represents ‘yes’. For better visibility you should
change the colour of the label ‘yes’. Left-click on ‘yes’ in the ‘Class colour’ box and select a lighter
colour from the colour palette.
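Outside Weka, the same kind of class-coloured 2-D attribute plot can be sketched in Python. This minimal example assumes matplotlib and scikit-learn are installed and uses the iris data in place of the weather relation:

```python
# A 2-D attribute plot coloured by class, similar to Weka's Visualize tab.
import matplotlib
matplotlib.use("Agg")                      # headless backend; no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data[:, 0], iris.data[:, 2]    # sepal length vs. petal length
for cls, colour in zip(range(3), ("red", "green", "blue")):
    mask = iris.target == cls
    plt.scatter(x[mask], y[mask], c=colour, label=iris.target_names[cls])
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[2])
plt.legend()                               # legend mirrors Weka's class colours
plt.savefig("iris_scatter.png")
```

The legend here plays the same role as Weka's 'Class colour' box: it maps each colour back to a class value.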

Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A
special case is the ‘UserClassifier’, which lets you build your own classifier by
interactively selecting instances. Below the Y-axis there is a drop-down list that
allows you to choose a selection method. A group of points on the graph can be
selected in four ways:

1. Select Instance. Click on an individual data point. This brings up a window listing
the attributes of the point. If more than one point appears at the same location, more than one
set of attributes is shown.

2. Rectangle. You can select a group of points by dragging a rectangle around them.

3. Polygon. You can select several points by building a free-form polygon. Left-click on the
graph to add vertices to the polygon and right-click to complete it.

4. Polyline. To distinguish the points on one side from the ones on the other, you can build a
polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.

Assignment:
1. Write a procedure for Visualization for Weather Table.
2. Write a procedure for Visualization of Banking Table.

Viva/Interview Questions


1. Why is visualization required?
2. What are the various data visualization techniques?
3. What do you mean by data transformation?

Experiment 11
Aim
Load each dataset into Weka, perform k-Nearest Neighbour classification, and interpret
the results obtained.

Description
K-Nearest Neighbour (KNN) is a supervised machine learning algorithm that can be used
for various instance-based classification problems. The instances themselves represent
the knowledge, and classification can be performed directly in the instance space.
KNN captures the idea of similarity, also known as distance, proximity, or
closeness, by calculating the distance between points. The algorithm
stores all the available data and classifies a new data point based on its similarity
to the stored points. In KNN, the parameter ‘K’ refers to the number of nearest
neighbours included in the majority vote for classification.
Procedure:
In Weka, K-NN is implemented as the IBk classifier.
Steps to run the K-NN classification algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in C drive.
6. Select and open the data folder.
7. Choose the iris data set and open the file.
8. Click on the Classify tab and choose the IBk classifier (under lazy).
9. Click on Start to obtain the accuracy rate.
10. To change the configuration, click on the IBk classifier and set the KNN value.

Compare the results for the different configuration values.
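The majority-vote idea behind IBk can be written out from scratch in a few lines. This is a teaching sketch only, with made-up 2-D points; Weka's IBk (or scikit-learn's KNeighborsClassifier) is the production equivalent:

```python
# A tiny k-nearest-neighbour classifier illustrating the majority vote
# described above.
from collections import Counter
import math

def knn_classify(train, new_point, k=3):
    """train: list of (features, label) pairs. Returns the majority label
    among the k training points closest to new_point (Euclidean distance)."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two class labels.
train = [((1.0, 1.0), "setosa"), ((1.2, 0.8), "setosa"),
         ((5.0, 4.5), "virginica"), ((5.2, 4.8), "virginica"),
         ((4.9, 4.4), "virginica")]
print(knn_classify(train, (1.1, 0.9)))   # nearest neighbours are setosa
```

Changing `k` changes how many neighbours vote, which is exactly the KNN value set in IBk's configuration.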

Assignment
1. Capture the results for changing values of K, the distance weighting, and the
nearest-neighbour search algorithm.

Viva/Interview Questions
1. What is K-nearest neighbor algorithm?
2. What are the issues regarding classification and prediction?
3. What is decision tree classifier?
4. What is multimedia data mining?
