
Practical 1:- Build Data Warehouse/Data Mart (using open source tools like Pentaho Data Integration Tool, Pentaho Business Analytics).

Create the Data Warehouse

So now we are going to create the 3 dimension tables and 1 fact table in the data warehouse:
DimDate, DimCustomer, DimVan and FactHire. We are going to populate the 3 dimensions
but we’ll leave the fact table empty. The purpose of this article is to show how to populate the
fact table using SSIS.

First I’ll show you how it looks when it’s done:

Date Dimension:-

Customer Dimension:
Van Dimension:

This is the script to create and populate those dimension and fact tables:

-- Create the data warehouse
create database TopHireDW
go

use TopHireDW
go

-- Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
  drop table DimDate
go

create table DimDate
( DateKey int not null primary key,
  [Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go

-- Populate Date Dimension
truncate table DimDate
go

declare @i int, @Date date, @StartDate date, @EndDate date,
  @DateKey int, @DateString varchar(10), @Year varchar(4),
  @Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate
begin
  set @DateString = convert(varchar(10), @Date, 20)
  set @DateKey = convert(int, replace(@DateString,'-',''))
  set @Year = left(@DateString, 4)
  set @Month = left(@DateString, 7)
  insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
  values (@DateKey, @Year, @Month, @Date, @DateString)
  set @Date = dateadd(d, 1, @Date)
end
go

select * from DimDate

-- Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
  drop table DimCustomer
go

create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
  CustomerId varchar(20) not null,
  CustomerName varchar(30), DateOfBirth date, Town varchar(50),
  TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
  DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer

select * from DimCustomer

-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
  drop table DimVan
go

create table DimVan
( VanKey int not null identity(1,1) primary key,
  RegNo varchar(10) not null,
  Make varchar(30), Model varchar(30), [Year] varchar(4),
  Colour varchar(20), CC int, Class varchar(10)
)
go

insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go

select * from DimVan

-- Create Hire fact table
if exists (select * from sys.tables where name = 'FactHire')
  drop table FactHire

create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
  HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
  HireId varchar(10) not null, --Degenerate Dimension
  NoOfDays int, VanHire money, SatNavHire money, Insurance money, DamageWaiver money,
  TotalBill money
)
go

select * from FactHire

B. Explore WEKA Data Mining/Machine Learning Toolkit

B. (i) Downloading and/or installation of WEKA data mining toolkit.

Ans: Install steps for the WEKA data mining tool

1. Download the software as per your requirements from the link given below:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
2. Java is required to install WEKA, so if you already have Java on your machine
then download only WEKA; otherwise download the version bundled with the JVM.
3. Then open the file location and double-click on the file.
4. Click Next.
5. Click I Agree.
6. Click I Agree.

7. Adjust the settings as required and click Next. Full and Associate files are the
recommended settings.

8. Change to your desired installation location.

9. If you want a shortcut then check the box and click Install.

10. The installation will start; wait for a while, it will finish within a minute.

11. After the installation completes, click Next.

12. That's all. Click Finish and start mining. Best of luck.

This is the GUI you get when WEKA is started. You have four options: Explorer,
Experimenter, KnowledgeFlow and Simple CLI.
Practical 2:- Perform data preprocessing tasks and demonstrate performing association rule mining
on data sets.

A. Explore various options in Weka for preprocessing data and apply filters (like Discretization
filters, Resample filter, etc.) on each dataset.
Ans:

Preprocess Tab

1. Loading Data

The first four buttons at the top of the preprocess section enable you to load data
into WEKA:

1. Open file.... Brings up a dialog box allowing you to browse for the data file on the
local file system.

2. Open URL.....Asks for a Uniform Resource Locator address for where the data is stored.

3. Open DB.....Reads data from a database. (Note that to make this work you might have to
edit the
file in weka/experiment/DatabaseUtils.props.)

4. Generate.....Enables you to generate artificial data from a variety of Data Generators.


Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV
format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV
files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi
extension.
Current Relation: Once some data has been loaded, the Preprocess panel shows a
variety of information. The Current relation box (the “current relation” is the
currently loaded data, which can be interpreted as a single relational table in database
terminology) has three entries:

1. Relation. The name of the relation, as given in the file it was loaded from. Filters
(described below) modify the name of a relation.

2. Instances. The number of instances (data points/records) in the data.

3. Attributes. The number of attributes (features) in the data.


Working With Attributes

Below the Current relation box is a box titled Attributes. There are four buttons,
and beneath them is a list of the attributes in the current relation.

The list has three columns:

1. No. A number that identifies the attribute in the order in which it is specified in the data file.
2. Selection tick boxes. These allow you to select which attributes are present in the relation.
3. Name. The name of the attribute, as it was declared in the data file. When you click on
different rows in the list of attributes, the fields change in the box to the right titled
Selected attribute.
This box displays the characteristics of the currently highlighted attribute in the list:

1. Name. The name of the attribute, the same as that given in the attribute list.

2. Type. The type of attribute, most commonly Nominal or Numeric.

3. Missing. The number (and percentage) of instances in the data for which this attribute
is missing (unspecified).
4. Distinct. The number of different values that the data contains for this attribute.

5. Unique. The number (and percentage) of instances in the data having a value for this
attribute that no other instances have.

Below these statistics is a list showing more information about the values stored in this
attribute, which differ depending on its type. If the attribute is nominal, the list consists
of each possible value for the attribute along with the number of instances that have
that value. If the attribute is numeric, the list gives four statistics describing the
distribution of values in the data— the minimum, maximum, mean and standard
deviation. And below these statistics there is a coloured histogram, colour-coded
according to the attribute chosen as the Class using the box above the histogram. (This
box will bring up a drop-down list of available selections when clicked.) Note that
only nominal Class attributes will result in a colour-coding. Finally, after pressing the
Visualize All button, histograms for all the attributes in the data are shown in a
separate window.
Returning to the attribute list, to begin with all the tick boxes are unticked.

They can be toggled on/off by clicking on them individually. The four buttons
above can also be used to change the selection:

1. All. All boxes are ticked.

2. None. All boxes are cleared (unticked).
3. Invert. Boxes that are ticked become unticked and vice versa.
4. Pattern. Enables the user to select attributes based on a Perl 5 regular expression.

Once the desired attributes have been selected, they can be removed by clicking the
Remove button below the list of attributes. Note that this can be undone by clicking the
Undo button, which is located next to the Edit button in the top-right corner of the
Preprocess panel.
Working with Filters:-

The preprocess section allows filters to be defined that transform the data in
various ways. The Filter box is used to set up the filters that are required. At the
left of the Filter box is a Choose button. By clicking this button it is possible to
select one of the filters in WEKA. Once a filter has been selected, its name and
options are shown in the field next to the Choose button.

The GenericObjectEditor Dialog Box


The GenericObjectEditor dialog box lets you configure a filter. The
same kind of dialog box is used to configure other objects, such as classifiers
and clusterers

(see below). The fields in the window reflect the available options.

Right-clicking (or Alt+Shift+Left-Click) on such a field will bring up a popup menu, listing
the following options:

1. Show properties... has the same effect as left-clicking on the field, i.e., a
dialog appears allowing you to alter the settings.

2. Copy configuration to clipboard copies the currently displayed configuration


string to the system’s clipboard and therefore can be used anywhere else in WEKA or in
the console. This is rather handy if you have to setup complicated, nested schemes.
3. Enter configuration... is the “receiving” end for configurations that got copied to
the clipboard earlier on. In this dialog you can enter a class name followed by options (if
the class supports these). This also allows you to transfer a filter setting from the
Preprocess panel to a Filtered Classifier used in the Classify panel.
Applying Filters
Once you have selected and configured a filter, you can apply it to the data by
pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The
Preprocess panel will then show the transformed data. The change can be undone by
pressing the Undo button. You can also use the Edit...button to modify your data
manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess
panel saves the current version of the relation in file formats that can represent the
relation, allowing it to be kept for future use.

 Steps to run preprocessing in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose labor data set and open file.
8. Choose filter button and select the Unsupervised Discretize option and apply.

Dataset: labor.arff

The following screenshot shows the effect of discretization
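The same discretization can also be applied programmatically through WEKA's Java API. The following is a minimal sketch, not the only way to do it; the file path, bin count and class name are illustrative assumptions.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset (path is illustrative; adjust to your WEKA data folder)
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/labor.arff");

        // Unsupervised equal-width discretization of all numeric attributes
        Discretize discretize = new Discretize();
        discretize.setBins(10);               // number of bins (10 is the default)
        discretize.setInputFormat(data);      // must be called before useFilter
        Instances discretized = Filter.useFilter(data, discretize);

        // Inspect the transformed relation, as the Preprocess panel would show it
        System.out.println(discretized.toSummaryString());
    }
}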

A. Load each dataset into Weka and run the Apriori algorithm with different support
and confidence values. Study the rules generated.

Ans:
Steps to run the Apriori algorithm in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose Weather data set and open file.
8. Click on Associate tab and Choose Apriori algorithm
9. Click on start button.

Output: === Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Best rules found:

1. outlook=overcast 4 ==> play=yes 4 conf:(1)


2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
8. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2 conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 conf:(1)
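The same run can be reproduced outside the GUI through WEKA's Java API, and re-run with other support and confidence settings; a minimal sketch follows (the file path, class name and parameter values shown are illustrative assumptions).

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the nominal weather data (path is illustrative)
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/weather.nominal.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);                 // -N: report the 10 best rules
        apriori.setMinMetric(0.9);               // -C: minimum confidence
        apriori.setLowerBoundMinSupport(0.1);    // -M: lower bound for minimum support
        apriori.setDelta(0.05);                  // -D: step by which support is decreased
        apriori.buildAssociations(data);

        // Prints the same rule list shown in the Associate tab
        System.out.println(apriori);
    }
}

Changing setMinMetric and setLowerBoundMinSupport and re-running shows how stricter thresholds prune the rule list, which is the point of the exercise.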

Association Rule:-

An association rule has two parts, an antecedent (if) and a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found in
combination with the antecedent.

Association rules are created by analyzing data for frequent if/then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear in the database. Confidence indicates the
number of times the if/then statements have been found to be true.

In data mining, association rules are useful for analyzing and predicting customer
behavior. They play an important part in shopping basket data analysis, product
clustering, catalog design and store layout.

Support and Confidence values:


 Support count: The support count of an itemset X, denoted by X.count, in a data
set T is the number of transactions in T that contain X. Assume T has n
transactions. Then, for a rule X ==> Y:

support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count

Equivalently, for a rule A ==> C:

support = support({A ∪ C})

confidence = support({A ∪ C}) / support({A})

For example, rule 1 above (outlook=overcast ==> play=yes) is covered by 4 of the 14
instances, so its support is 4/14 ≈ 0.29; all 4 of those instances have play=yes, so its
confidence is 4/4 = 1.

B. Apply different discretization filters on numerical attributes and run the Apriori
association rule algorithm. Study the rules generated. Derive interesting insights and
observe the effect of discretization in the rule generation process.

Ans: Steps to run the Apriori algorithm in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose Weather data set and open file.
8. Choose filter button and select the Unsupervised Discretize option and apply
9. Click on Associate tab and Choose Apriori algorithm
10. Click on start button.
Output: === Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5 outlook temperature humidity windy play

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:
1. outlook=overcast 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
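When the loaded dataset contains numeric attributes (for example the numeric weather file rather than weather.symbolic), the discretization filter can be chained in front of Apriori in code as well. A minimal sketch under that assumption; the file name (which varies between WEKA versions), path and class name are illustrative.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeThenApriori {
    public static void main(String[] args) throws Exception {
        // The numeric weather data has numeric temperature/humidity (path is illustrative)
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/weather.numeric.arff");

        // Apriori needs nominal attributes, so discretize the numeric ones first
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances nominalData = Filter.useFilter(data, discretize);

        Apriori apriori = new Apriori();        // defaults: -N 10 -C 0.9 -M 0.1
        apriori.buildAssociations(nominalData);
        System.out.println(apriori);
    }
}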
Practical 3:- Demonstrate performing classification on data sets.

Classification Tab

Selecting a Classifier

At the top of the classify section is the Classifier box. This box has a text field that
gives the name of the currently selected classifier, and its options. Clicking on the text
box with the left mouse button brings up a GenericObjectEditor dialog box, just the same
as for filters, that you can use to configure the options of the current classifier.
Test Options

The result of applying the chosen classifier will be tested according to the options
that are set by clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the
instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set
of instances loaded from a file. Clicking the Set... button brings up a dialog allowing
you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of
folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held out depends
on the value entered in the % field.

Classifier Evaluation Options:

1. Output model. The classification model on the full training set is output so that
it can be viewed, visualized, etc. This option is selected by default.

2. Output per-class stats. The precision/recall and true/false statistics for each class
are output. This option is also selected by default.

3. Output entropy evaluation measures. Entropy evaluation measures are included in


the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier’s predictions is
included in the output. This option is selected by default.

5. Store predictions for visualization. The classifier’s predictions are remembered so


that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output.

Note that in the case of a cross-validation the instance numbers do not correspond to the
location in the data!

7. Output additional attributes. If additional attributes need to be output alongside the


predictions, e.g., an ID attribute for tracking misclassifications, then the index of this
attribute can be specified here. The usual Weka ranges are supported,“first” and “last”
are therefore valid indices as well (example: “first-3,6,8,12-last”).

8. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix.


The Set... button allows you to specify the cost matrix used.

9. Random seed for xval / % Split. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.

10. Preserve order for % Split. This suppresses the randomization of the data before
splitting into train and test set.

11. Output source code. If the classifier can output the built model as Java source
code, you can specify the class name here. The code will be printed in the “Classifier
output” area.

The Class Attribute


The classifiers in WEKA are designed to be trained to predict a single ‘class’
attribute, which is the target for prediction. Some classifiers can only learn nominal
classes; others can only learn numeric classes (regression problems); still others can
learn both. By default, the class is taken to be the last attribute in the data. If you
want to train a classifier to predict a different attribute, click on the box below the
Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier

Once the classifier, test options and class have all been set, the learning process is started by clicking
on the Start button. While the classifier is busy being trained, the little bird moves around. You can
stop the training process at any time by clicking on the Stop button. When training is complete,
several things happen.
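For reference, the same select-classifier, set-test-option, start cycle can be scripted with WEKA's Java API. A minimal sketch that builds J48 on the iris data and evaluates it with the "Use training set" option; the file path and class name are illustrative assumptions.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        J48 tree = new J48();                            // default options: -C 0.25 -M 2
        tree.buildClassifier(data);

        // "Use training set" test option
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);

        System.out.println(tree);                        // the pruned tree
        System.out.println(eval.toSummaryString());      // accuracy, kappa, error measures
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}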

A. Load each dataset into Weka and run the ID3 and J48 classification algorithms, study the
classifier output. Compute entropy values, Kappa statistic.
Ans:

 Steps to run the ID3 and J48 classification algorithms in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose iris data set and open file.
8. Click on classify tab and Choose J48 algorithm and select use training set test option.
9. Click on start button.
10. Click on classify tab and Choose ID3 algorithm and select use training set test
option.
11. Click on start button.

Output:
=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :  5

Size of the tree :  9

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances         147               98      %
Incorrectly Classified Instances         3                2      %
Kappa statistic                          0.97
K&B Relative Info Score              14376.1925 %
K&B Information Score                  227.8573 bits      1.519  bits/instance
Class complexity | order 0             237.7444 bits      1.585  bits/instance
Class complexity | scheme               16.7179 bits      0.1115 bits/instance
Complexity improvement (Sf)            221.0265 bits      1.4735 bits/instance
Mean absolute error                      0.0233
Root mean squared error                  0.108
Relative absolute error                  5.2482 %
Root relative squared error             22.9089 %
Total Number of Instances              150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


1 0 1 1 1 1 Iris-setosa
0.98 0.02 0.961 0.98 0.97 0.99 Iris-versicolor
0.96 0.01 0.98 0.96 0.97 0.99 Iris-virginica
Weighted Avg. 0.98 0.01 0.98 0.98 0.98 0.993

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

The Classifier Output Text

The text in the Classifier output area has scroll bars allowing you to
browse the results. Clicking with the left mouse button into the text area, while
holding Alt and Shift, brings up a dialog that enables you to save the displayed
output
in a variety of formats (currently, BMP, EPS, JPEG and PNG). Of
course, you can also resize the Explorer window to get a larger display
area.

The output is split into several sections:

1. Run information. A list of information giving the learning scheme options,


relation name, instances, attributes and test mode that were involved in the process

2. Classifier model (full training set). A textual representation of the classification model
that was produced on the full training data.

3. The results of the chosen test mode are broken down thus.

4. Summary. A list of statistics summarizing how accurately the classifier was able to
predict the true class of the instances under the chosen test mode.

5. Detailed Accuracy By Class. A more detailed per-class break down of the


classifier’s
prediction accuracy.

6. Confusion Matrix. Shows how many instances have been assigned to each class.
Elements show the number of test examples whose actual class is the row and whose
predicted class is the column.

7. Source code (optional). This section lists the Java source code if one
chooses "Output source code" in the "More options" dialog.

B. Extract if-then rules from the decision tree generated by the classifier. Observe the confusion
matrix and derive Accuracy, F-measure, TP rate, FP rate, Precision and Recall values.
Apply the cross-validation strategy with various fold levels and compare the accuracy results.

Ans:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer that indicates whether a customer at a
company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each
leaf node represents a class
The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.

C IF-THEN Rules:
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following from −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age=youth AND student=yes


THEN buy_computer=yes

Points to remember −

 The IF part of the rule is called the rule antecedent or precondition.

 The THEN part of the rule is called rule consequent.

 The antecedent part, the condition, consists of one or more attribute tests and these
tests are logically ANDed.

 The consequent part consists of class prediction.

Note − We can also write rule R1 as follows:

R1: (age = youth) ^ (student = yes) => (buys_computer = yes)


If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −

 One rule is created for each path from the root to the leaf node.

 To form a rule antecedent, each splitting criterion is logically ANDed.

 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm


Sequential Covering Algorithm can be used to extract IF-THEN rules form the training
data. We do not require to generate a decision tree first. In this algorithm, each rule for a
given class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the
general strategy the rules are learned one at a time. Each time a rule is learned, the
tuples covered by the rule are removed and the process continues for the remaining
tuples. This is because the path to each leaf in a decision tree corresponds to a rule.

Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: a set of IF-THEN rules.

Method:
    Rule_set = { };   // initial set of rules learned is empty

    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove tuples covered by Rule from D;
        until termination condition;
        Rule_set = Rule_set + Rule;   // add a new rule to the rule set
    end for

    return Rule_set;
Rule Pruning
A rule is pruned for the following reasons −
 The assessment of quality is made on the original set of training data. The rule
may perform well on the training data but less well on subsequent data. That is why
rule pruning is required.

 The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version
of R has greater quality than what was assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.

Note − This value will increase with the accuracy of R on the pruning set.
Hence, if the FOIL_Prune value is higher for the pruned version of R, then we
prune R.
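WEKA ships a sequential-covering rule learner of its own, JRip (its implementation of RIPPER), which produces IF-THEN rules directly without building a tree first. A minimal sketch; the file path and class name are illustrative assumptions.

import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleExtractionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        JRip ripper = new JRip();          // sequential covering (RIPPER) rule learner
        ripper.buildClassifier(data);

        // Prints the learned IF-THEN rules, one per line
        System.out.println(ripper);
    }
}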

 Steps to run decision tree algorithms in WEKA

1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose iris data set and open file.
8. Click on classify tab and Choose decision table algorithm and select cross-validation
folds value 10 test option.
9. Click on start button.
Output:
=== Run information ===

Scheme: weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Decision Table:

Number of training instances: 150
Number of Rules: 3
Non matches covered by Majority class.

Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 12
Merit of best subset found: 96

Evaluation (for feature selection): CV (leave one out)
Feature set: 4,5
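To compare accuracy across different fold levels programmatically, the cross-validation can be repeated in a loop. A minimal sketch; the fold values, file path and class name are illustrative assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Compare accuracy for several fold levels
        for (int folds : new int[]{5, 10, 15}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, folds, new Random(1));
            System.out.printf("%2d-fold CV accuracy: %.2f %%  (kappa %.4f)%n",
                    folds, eval.pctCorrect(), eval.kappa());
        }
    }
}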

C. Load each dataset into Weka and perform Naïve Bayes classification and k-Nearest
Neighbor classification. Interpret the results obtained.
Ans:

 Steps to run the Naïve Bayes and k-nearest neighbor classification algorithms in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose iris data set and open file.
8. Click on classify tab and Choose Naïve Bayes algorithm and select use
training set test option.
9. Click on start button.
10.Click on classify tab and Choose k-nearest neighbor and select use
training set test option.
11.Click on start button.

Output: Naïve Bayes


=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

                         Class
Attribute      Iris-setosa  Iris-versicolor  Iris-virginica
                    (0.33)           (0.33)          (0.33)
===============================================================
sepallength
mean 4.9913 5.9379 6.5795
std. dev. 0.355 0.5042 0.6353
weight sum 50 50 50
precision 0.1059 0.1059 0.1059

sepalwidth
mean 3.4015 2.7687 2.9629
std. dev. 0.3925 0.3038 0.3088
weight sum 50 50 50
precision 0.1091 0.1091 0.1091

petallength

mean 1.4694 4.2452 5.5516


std. dev. 0.1782 0.4712 0.5529
weight sum 50 50 50
precision 0.1405 0.1405 0.1405
petalwidth
mean 0.2743 1.3097 2.0343
std. dev. 0.1096 0.1915 0.2646
weight sum 50 50 50
precision 0.1143 0.1143 0.1143
Time taken to build model: 0 seconds

=== Evaluation on training set ===

=== Summary ===


Correctly Classified Instances 144 96 %
Incorrectly Classified Instances 6 4 %
Kappa statistic 0.94
Mean absolute error 0.0324
Root mean squared error 0.1495
Relative absolute error 7.2883 %
Root relative squared error 31.7089 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


1 0 1 1 1 1 Iris-setosa
0.96 0.04 0.923 0.96 0.941 0.993 Iris-versicolor
0.92 0.02 0.958 0.92 0.939 0.993 Iris-virginica
Weighted Avg. 0.96 0.02 0.96 0.96 0.96 0.995

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 48 2 | b = Iris-versicolor
0 4 46 | c = Iris-virginica.
Output: KNN (IBk)

=== Run information ===

Scheme: weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

IB1 instance-based classifier


using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 150 100 %


Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.0085
Root mean squared error 0.0091
Relative absolute error 1.9219 %
Root relative squared error 1.9335 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


1 0 1 1 1 1 Iris-setosa
1 0 1 1 1 1 Iris-versicolor
1 0 1 1 1 1 Iris-virginica
Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

  a  b  c   <-- classified as
50 0 0 | a = Iris-setosa

0 50 0 | b = Iris-versicolor
0 0 50 | c = Iris-virginica
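Both classifiers can also be built and compared with a few lines of the WEKA Java API, mirroring the two GUI runs above. A minimal sketch; the file path and class name are illustrative assumptions.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesVsKnnDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new NaiveBayes(), new IBk(1) };  // IBk(1) = 1-NN
        for (Classifier c : classifiers) {
            c.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(c, data);     // "use training set" test option
            System.out.println(c.getClass().getSimpleName()
                    + " training-set accuracy: " + eval.pctCorrect() + " %");
        }
    }
}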

C. Plot ROC Curves.


Ans: Steps to plot ROC curves.

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Visualize button.
4. Right-click on the plot.
5. Select and click on the polyline option button.
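The data behind an ROC curve can also be obtained from an Evaluation object: the area under the curve for each class, and the threshold-curve points (FP rate vs. TP rate) that can be plotted or saved. A minimal sketch, assuming Naïve Bayes with 10-fold cross-validation on the iris data; the classifier choice, file path and class name are illustrative assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Area under the ROC curve for each class value
        for (int i = 0; i < data.numClasses(); i++) {
            System.out.println(data.classAttribute().value(i)
                    + " AUC = " + eval.areaUnderROC(i));
        }

        // Threshold curve instances (FP rate / TP rate columns) for class index 0;
        // these points can be plotted or saved to an ARFF file
        Instances curve = new ThresholdCurve().getCurve(eval.predictions(), 0);
        System.out.println(curve.numInstances() + " points on the ROC curve");
    }
}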

C. Compare classification results of ID3, J48, Naïve Bayes and k-NN classifiers for
each dataset, deduce which classifier is performing best and which is performing
poorly for each dataset, and justify.

Ans

 Steps to run the ID3 and J48 classification algorithms in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose iris data set and open file.
8. Click on classify tab and Choose J48 algorithm and select use training set test
option.

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :  5

Size of the tree :  9

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 147 98 %


Incorrectly Classified Instances 3 2 %
Kappa statistic 0.97
Mean absolute error 0.0233
Root mean squared error 0.108
Relative absolute error 5.2482 %
Root relative squared error 22.9089 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


1 0 1 1 1 1 Iris-setosa
0.98 0.02 0.961 0.98 0.97 0.99 Iris-versicolor
0.96 0.01 0.98 0.96 0.97 0.99 Iris-virginica
Weighted Avg. 0.98 0.01 0.98 0.98 0.98 0.993

=== Confusion Matrix ===

  a  b  c   <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
Naïve-bayes:

=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

                         Class
Attribute      Iris-setosa  Iris-versicolor  Iris-virginica
                    (0.33)           (0.33)          (0.33)
===============================================================
sepallength
mean 4.9913 5.9379 6.5795

std. dev. 0.355 0.5042 0.6353


weight sum 50 50 50
precision 0.1059 0.1059 0.1059
sepalwidth
mean 3.4015 2.7687 2.9629
std. dev. 0.3925 0.3038 0.3088
weight sum 50 50 50
precision 0.1091 0.1091 0.1091

petallength
mean 1.4694 4.2452 5.5516
std. dev. 0.1782 0.4712 0.5529
weight sum 50 50 50
precision 0.1405 0.1405 0.1405

petalwidth
mean 0.2743 1.3097 2.0343
std. dev. 0.1096 0.1915 0.2646
weight sum 50 50 50
precision 0.1143 0.1143 0.1143

Time taken to build model: 0 seconds

=== Evaluation on training set ===

=== Summary ===


Correctly Classified Instances 144 96 %
Incorrectly Classified Instances 6 4 %
Kappa statistic 0.94
Mean absolute error 0.0324
Root mean squared error 0.1495
Relative absolute error 7.2883 %
Root relative squared error 31.7089 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 Iris-setosa
0.96 0.04 0.923 0.96 0.941 0.993 Iris-versicolor
0.92 0.02 0.958 0.92 0.939 0.993 Iris-virginica
Weighted Avg. 0.96 0.02 0.96 0.96 0.96 0.995

=== Confusion Matrix ===

  a  b  c   <-- classified as
50 0 0 | a = Iris-setosa
0 48 2 | b = Iris-versicolor
0 4 46 | c = Iris-virginica

K-Nearest Neighbor (IBK):

=== Run information ===

Scheme: weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: iris
Instances: 150
Attributes: 5 sepallength sepalwidth petallength petalwidth class
Test mode: evaluate on training data

=== Classifier model (full training set) ===


IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds


=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances 150 100 %


Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.0085
Root mean squared error 0.0091
Relative absolute error 1.9219 %
Root relative squared error 1.9335 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 Iris-setosa
1 0 1 1 1 1 Iris-versicolor
1 0 1 1 1 1 Iris-virginica
Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica

Comparing the three runs on the iris data, k-NN (IBk with k=1) gives the highest training-set
accuracy (100%), J48 gives 98%, and Naïve Bayes the lowest (96%). Note, however, that these are
training-set estimates: a 1-nearest-neighbour classifier trivially classifies its own training
instances correctly, so cross-validation should be used for a fair comparison of the classifiers.
Practical 4:- Demonstrate performing clustering on data sets.

Clustering Tab

Selecting a Clusterer
By now you will be familiar with the process of selecting and configuring objects.
Clicking on the clustering scheme listed in the Clusterer box at the top of the window
brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.
Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate
the results. The first three options are the same as for classification: Use training set,
Supplied test set and Percentage split (Section 5.3.1)—except that now the data is
assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes
to clusters evaluation, compares how well the chosen clusters match up with a pre-
assigned class in the data. The drop-down box below this option selects the class, just as
in the Classify panel.
An additional option in the Cluster mode box, the Store clusters for visualization
tick box, determines whether or not it will be possible to visualize the clusters once
training is complete. When dealing with datasets that are so large that memory becomes a
problem it may be helpful to disable this option.
Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore
attributes button brings up a small window that allows you to select which attributes are
ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT
key selects a range of consecutive attributes, and holding down CTRL toggles individual
attributes on and off. To cancel the selection, back out with the Cancel button. To activate
it, click the Select button. The next time clustering is invoked, the selected attributes are
ignored.

Working with Filters


The Filtered Clusterer meta-clusterer offers the user the possibility to apply filters
directly before the clusterer is learned. This approach eliminates the manual application of
a filter in the Preprocess panel, since the data gets processed on the fly. Useful if one
needs to try out different filter setups.

Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text
area and a result list. These all behave just like their classification counterparts. Right-
clicking an entry in the result list brings up a similar menu, except that it shows only two
visualization options: Visualize cluster assignments and Visualize tree. The latter is
grayed out when it is not applicable.

A. Load each dataset into Weka and run the simple k-means clustering algorithm with
different values of k (number of desired clusters). Study the clusters formed. Observe
the sum of squared errors and centroids, and derive insights.

Ans:

 Steps to run the k-means clustering algorithm in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose iris data set and open file.
8. Click on cluster tab and Choose k-mean and select use training set test option.
9. Click on start button.

Output:

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: iris
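The effect of different values of k on the within-cluster sum of squared errors and on the centroids can also be examined through the API. A minimal sketch that ignores the class attribute and tries several k values; the path, seed, k values and class name are illustrative assumptions.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/iris.arff");

        // Ignore the class attribute (the last attribute), as the Ignore attributes
        // button would do in the Cluster panel
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances numericData = Filter.useFilter(data, remove);

        // Try several values of k and compare the within-cluster sum of squared errors
        for (int k : new int[]{2, 3, 4, 5}) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(10);
            km.buildClusterer(numericData);
            System.out.println("k = " + k
                    + "  sum of squared errors = " + km.getSquaredError()
                    + "\ncentroids:\n" + km.getClusterCentroids());
        }
    }
}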
C. Explore visualization features of Weka to visualize the clusters. Derive interesting insights
and explain.

Ans: Visualize Features

WEKA’s visualization allows you to visualize a 2-D plot of the current working relation.
Visualization is very useful in practice, it helps to determine difficulty of the learning problem.
WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), rotate 3-d visualizations
(Xgobi-style). WEKA has “Jitter” option to deal with nominal attributes and to detect “hidden”
data points.

Access to visualization from the classifier, cluster and attribute selection panels is available
from a popup menu. Click the right mouse button over an entry in the result list to bring up
the menu. You will be presented with options for viewing or saving the text output and,
depending on the scheme, further options for visualizing errors, clusters, trees, etc.
To open the Visualization screen, click the 'Visualize' tab.
Select a square that corresponds to the attributes you would like to visualize. For example, let's
choose 'outlook' for the X axis and 'play' for the Y axis. Click anywhere inside the square that
corresponds to 'play' on the left and 'outlook' at the top.

Changing the View:

In the visualization window, beneath the X-axis selector there is a drop-down list,

'Colour', for choosing the color scheme. This allows you to choose the color of points based on
the attribute selected. Below the plot area, there is a legend that describes what values the colors
correspond to. In your example, red represents 'no', while blue represents 'yes'. For better
visibility you should change the color of the label 'yes'. Left-click on 'yes' in the 'Class colour'
box and select a lighter color from the color palette.

Selecting Instances

Sometimes it is helpful to select a subset of the data using visualization tool. A special
case is the ‘UserClassifier’, which lets you to build your own classifier by interactively
selecting instances. Below the Y – axis there is a drop-down list that allows you to choose a
selection method. A group of points on the graph can be selected in four ways [2]:

1. Select Instance. Click on an individual data point. It brings up a window listing
the attributes of the point. If more than one point appears at the same location, more than
one set of attributes will be shown.
2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
3. Polygon. You can select several points by building a free-form polygon. Left-click
on the graph to add vertices to the polygon and right-click to complete it.
4. Polyline. To distinguish the points on one side from those on the other, you can build
a polyline. Left-click on the graph to add vertices to the polyline and right-click to
finish.
Practical 5:- Demonstrate performing regression on data sets.

Regression:

Regression is a data mining function that predicts a number. Age, weight, distance,
temperature, income, or sales could all be predicted using regression techniques. For
example, a regression model could be used to predict children's height, given their age,
weight, and other factors.
A regression task begins with a data set in which the target values are known. For
example, a regression model that predicts children's height could be developed based on
observed data for many children over a period of time. The data might track age, height,
weight, developmental milestones, family history, and so on. Height would be the target,
the other attributes would be the predictors, and the data for each child would constitute
a case.
Common Applications of Regression

Regression modeling has many applications in trend analysis, business planning,


marketing, financial forecasting, time series prediction, biomedical and drug response
modeling, and environmental modeling.

How Does Regression Work?


You do not need to understand the mathematics used in regression analysis to develop
quality regression models for data mining. However, it is helpful to understand a few
basic concepts.
The goal of regression analysis is to determine the values of parameters for a function
that cause the function to best fit a set of data observations that you provide. The
following equation expresses these relationships in symbols. It shows that regression is
the process of estimating the value of a continuous target (y) as a function (F) of one or
more predictors (x1 , x2 , ..., xn), a set of parameters (θ1 , θ2 , ..., θn), and a measure of
error (e).
y = F(x,θ) + e
The process of training a regression model involves finding the best parameter values for
the function that minimize a measure of the error, for example, the sum of squared
errors.
There are different families of regression functions and different ways of measuring the error.
Linear Regression

The simplest form of regression to visualize is linear regression with a single predictor.
A linear regression technique can be used if the relationship between x and y can be
approximated with a straight line, as shown in Figure 4-1.
Figure 4-1: Linear Relationship Between x and y

In a linear regression scenario with a single predictor (y = θ 2x + θ1), the regression


parameters (also called coefficients) are:

The slope of the line (θ2) — the angle between a data point and the
regression line and
The y intercept (θ1) — the point where x crosses the y axis (x = 0)

Nonlinear Regression

Often the relationship between x and y cannot be approximated with a straight line. In
this case, a nonlinear regression technique may be used. Alternatively, the data could be
preprocessed to make the relationship linear.

In Figure 4-2, x and y have a nonlinear relationship. Oracle Data Mining supports
nonlinear regression via the gaussian kernel of SVM. (See "Kernel-Based Learning".)

Figure 4-2: Nonlinear Relationship Between x and y

Multivariate Regression

Multivariate regression refers to regression with multiple predictors (x1, x2, ..., xn). For purposes
of illustration, Figure 4-1 and Figure 4-2 show regression with a single predictor. Multivariate
regression is also referred to as multiple regression.

Regression Algorithms

Oracle Data Mining provides the following algorithms for regression:

 Generalized Linear Models

Generalized Linear Models (GLM) is a popular statistical technique for linear modeling.
Oracle Data Mining implements GLM for regression and classification. See Chapter 12,
"Generalized Linear Models"

 Support Vector Machines

Support Vector Machines (SVM) is a powerful, state-of-the-art algorithm for linear and
nonlinear regression. Oracle Data Mining implements SVM for regression and other
mining functions. See Chapter 18, "Support Vector Machines"

Both GLM and SVM, as implemented by Oracle Data Mining, are particularly suited for mining
data that includes many predictors (wide data).

Testing a Regression Model

The Root Mean Squared Error and the Mean Absolute Error are statistics for evaluating the
overall quality of a regression model. Different statistics may also be available depending on the
regression methods used by the algorithm.

Root Mean Squared Error

The Root Mean Squared Error (RMSE) is the square root of the average squared distance of a
data point from the fitted line; in symbols, RMSE = SQRT(AVG((predicted_value - actual_value)^2)).

This SQL expression calculates the RMSE.

SQRT(AVG((predicted_value - actual_value) * (predicted_value - actual_value)))

Mean Absolute Error

The Mean Absolute Error (MAE) is the average of the absolute value of the residuals. The MAE
is very similar to the RMSE but is less sensitive to large errors; in symbols,
MAE = AVG(ABS(predicted_value - actual_value)).

A. Load each dataset into Weka and build a Linear Regression model. Study the model
obtained. Use the training set option. Interpret the regression model and derive patterns and
conclusions from the regression results.

Ans:
 Steps to run the Linear Regression algorithm in WEKA
1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose labor data set and open file.
8. Click on Classify tab and Click the Choose button then expand the functions
branch.
9. Select the LinearRegression leaf and select use training set test option.
10. Click on start button.
Output:

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: labor-neg-data

Instances: 57

Attributes: 17 duration

wage-increase-first-year

wage-increase-second-year

wage-increase-third-year

cost-of-living-adjustment

working-hours

pension

standby-pay

shift-differential

education-allowance

statutory-holidays

vacation

longterm-disability-assistance

contribution-to-dental-plan

bereavement-assistance

contribution-to-health-plan

class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===


Linear Regression Model

duration =

0.4689 * cost-of-living-adjustment=tc,tcf +

0.6523 * pension=none,empl_contr +

1.0321 * bereavement-assistance=yes +

0.3904 * contribution-to-health-plan=full +

0.2765

Time taken to build model: 0 seconds

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.1967

Mean absolute error 0.6499

Root mean squared error 0.777

Relative absolute error 111.6598 %

Root relative squared error 108.8152 %

Total Number of Instances 56

Ignored Class Unknown Instances 1
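The same model can be reproduced in code. A minimal sketch that sets 'duration' as the (numeric) class attribute to match the model above and reports 10-fold cross-validation error measures; the file path and class name are illustrative assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LinearRegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/labor.arff");
        data.setClassIndex(0);                    // predict 'duration' (a numeric attribute)

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);                   // the fitted regression equation

        // 10-fold cross-validation, as in the run above
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(lr, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}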


B. Use options cross-validation and percentage split and repeat running the Linear
Regression Model. Observe the results and derive meaningful results.

Ans: Steps to run the Linear Regression algorithm in WEKA


1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose labor data set and open file.
8. Click on Classify tab and Click the Choose button then expand the
functions branch.
9. Select the LinearRegression leaf and select test options cross-validation.
10. Click on start button.
11. Select the LinearRegression leaf and select test options percentage split.
12. Click on start button.
Output: cross-validation

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8


Relation: labor-neg-data
Instances: 57
Attributes: 17
            duration
            wage-increase-first-year
            wage-increase-second-year
            wage-increase-third-year
            cost-of-living-adjustment
            working-hours
            pension
            standby-pay
            shift-differential
            education-allowance
            statutory-holidays
            vacation
            longterm-disability-assistance
            contribution-to-dental-plan
            bereavement-assistance
            contribution-to-health-plan
            class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

duration =

      0.4689 * cost-of-living-adjustment=tc,tcf +
      0.6523 * pension=none,empl_contr +
      1.0321 * bereavement-assistance=yes +
      0.3904 * contribution-to-health-plan=full +
      0.2765

Time taken to build model: 0.02 seconds

=== Cross-validation ===


=== Summary ===

Correlation coefficient 0.1967


Mean absolute error 0.6499
Root mean squared error 0.777
Relative absolute error 111.6598 %
Root relative squared error 108.8152 %
Total Number of Instances 56
Ignored Class Unknown Instances 1

Output: percentage split

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: labor-neg-data
Instances: 57
Attributes: 17
            duration
            wage-increase-first-year
            wage-increase-second-year
            wage-increase-third-year
            cost-of-living-adjustment
            working-hours
            pension
            standby-pay
            shift-differential
            education-allowance
            statutory-holidays
            vacation
            longterm-disability-assistance
            contribution-to-dental-plan
            bereavement-assistance
            contribution-to-health-plan
            class
Test mode: split 66.0% train, remainder test

=== Classifier model (full training set) ===

Linear Regression Model

duration =

      0.4689 * cost-of-living-adjustment=tc,tcf +
      0.6523 * pension=none,empl_contr +
      1.0321 * bereavement-assistance=yes +
      0.3904 * contribution-to-health-plan=full +
      0.2765
Time taken to build model: 0.02 seconds
=== Evaluation on test split ===
=== Summary ===
Correlation coefficient 0.243
Mean absolute error 0.783
Root mean squared error 0.9496
Relative absolute error 106.8823 %
Root relative squared error 114.13 %
Total Number of Instances 19
C. Explore the simple linear regression technique that only looks at one variable.

Ans: Steps to run the Simple Linear Regression algorithm in WEKA


1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose labor data set and open file.
8. Click on Classify tab and Click the Choose button then expand the functions
branch.
9. Select the SimpleLinearRegression leaf and select test options cross-validation.
10. Click on start button.
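SimpleLinearRegression fits a line on the single attribute that yields the lowest squared error. Since it accepts only numeric predictors without missing values, the sketch below uses the all-numeric cpu.arff dataset that ships with WEKA rather than the labor data; this substitution, the file path and the class name are assumptions for illustration.

import weka.classifiers.functions.SimpleLinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SimpleLinearRegressionDemo {
    public static void main(String[] args) throws Exception {
        // cpu.arff: all-numeric CPU performance data shipped with WEKA (path illustrative)
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);   // predict the 'class' attribute

        SimpleLinearRegression slr = new SimpleLinearRegression();
        slr.buildClassifier(data);

        // Prints the single predictor chosen and the fitted line
        System.out.println(slr);
    }
}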
Practical 6:- Sample Programs using German Credit Data.

Task 1: Credit Risk Assessment

Description: The business of banks is making loans. Assessing the credit worthiness of
an applicant is of crucial importance. You have to develop a system to help a loan
officer decide whether the credit of a customer is good or bad. A bank's business
rules regarding loans must consider two opposing factors. On the one hand, a bank
wants to make as many loans as possible.

Interest on these loans is the bank's profit source. On the other hand, a bank cannot
afford to make too many bad loans. Too many bad loans could lead to the collapse of
the bank. The bank's loan policy must involve a compromise: not too strict and not
too lenient.

To do the assignment, you first and foremost need some knowledge about the world of credit.
You can acquire such knowledge in a number of ways.

1. Knowledge engineering: Find a loan officer who is willing to talk. Interview her and
try to represent her knowledge in a number of ways.

2. Books: Find some training manuals for loan officers or perhaps a suitable textbook
on finance. Translate this knowledge from text from to production rule form.

3. Common sense: Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the credit worthiness of a loan applicant.

4. Case histories: Find records of actual cases where competent loan officers
correctly judged when to approve a loan application and when not to.

The German Credit Data

Actual historical credit data is not always easy to come by because of
confidentiality rules. Here is one such data set, consisting of 1000 actual cases collected
in Germany.
In spite of the fact that the data is German, you should probably make use of it for
this assignment (unless you really can consult a real loan officer!).
There are 20 attributes used in judging a loan applicant (i.e., 7 numerical
attributes and 13 categorical or nominal attributes). The goal is to classify the
applicant into one of two categories, good or bad.
The attributes present in the German credit data are:

1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amount
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class
Tasks (Turn in your answers to the following tasks)

Task 1. List all the categorical (or nominal) attributes and the real valued
attributes separately.

Ans) Steps for identifying categorical attributes

1. Double click on credit-g.arff file.


2. Select all categorical attributes.
3. Click on invert.
4. Then we get all real valued attributes selected
5. Click on remove
6. Click on visualize all.

Steps for identifying real valued attributes

1. Double click on credit-g.arff file.
2. Select all real valued attributes.
3. Click on invert.
4. Then we get all categorical attributes selected.
5. Click on remove.
6. Click on visualize all.

The following are the Categorical (or Nominal) attributes:

1. Checking_Status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_Magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker

The following are the Numerical attributes:

1. Duration
2. Credit_amount
3. Installment_Commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents

2. What attributes do you think might be crucial in making the credit


assessment? Come up with some simple rules in plain English using your
selected attributes.

Ans) The following attributes may be crucial in making the credit assessment:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude

For example, simple rules in plain English could be: IF credit_amount is large AND savings_status is low THEN credit is bad; IF the applicant has a stable job AND few existing_credits THEN credit is good.

3. One type of model that you can create is a decision tree. Train a decision tree using the complete data set as the training data. Report the model obtained after training.

Ans) Steps to model decision tree.

1. Double click on credit-g.arff file.


2. Consider all the 21 attributes for making decision tree.
3. Click on classify tab.
4. Click on choose button.
5. Expand tree folder and select J48
6. Click on use training set in test options.
7. Click on start button.
8. Right click on result list and choose the visualize tree to get decision tree.
We created a decision tree by using the J48 technique with the complete dataset as the training data. The following model was obtained after training.

Output:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: german_credit
Instances: 1000
Attributes: 21
Checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree

Number of Leaves: 103
Size of the tree: 140

Time taken to build model: 0.08 seconds

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 93.3 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.956 0.38 0.854 0.956 0.902 0.857 good
0.62 0.044 0.857 0.62 0.72 0.857 bad
Weighted Avg. 0.855 0.279 0.855 0.855 0.847 0.857

=== Confusion Matrix ===

a b <-- classified as
669 31 | a = good
114 186 | b = bad
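For reference, the same training run can be reproduced with WEKA's Java API. This is only a sketch, assuming weka.jar is on the classpath and credit-g.arff is in the working directory.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48TrainingSetDemo {
    public static void main(String[] args) throws Exception {
        // Load the German credit data and mark the last attribute (class) as the target.
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Build J48 with its default options (-C 0.25 -M 2, as in the run above).
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree); // prints the pruned tree

        // Evaluate on the training data itself ("Use training set" in the GUI).
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}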


4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

Ans) Steps followed are:

1. Double click on credit-g.arff file.


2. Click on classify tab.
3. Click on choose button.
4. Expand tree folder and select J48
5. Click on use training set in test options.
6. Click on start button.
7. On right side we find confusion matrix
8. Note the correctly classified instances.
Output:

If we use our above model trained on the complete dataset and classify credit as good/bad for each of the examples in that dataset, we cannot get 100% training accuracy; only 85.5% of the examples are classified correctly. The pruned J48 tree does not fit the training data perfectly because pruning (and the minimum of 2 instances per leaf) trades training accuracy for generalization, and some training instances with very similar attribute values carry different class labels.

5. Is testing on the training set as you did above a good idea? Why or why not?
Ans) It is not a good idea, because the model is evaluated on the same data it was trained on; the resulting accuracy is optimistic and tells us little about performance on unseen applicants.

6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does accuracy increase/decrease? Why?

Ans) steps followed are:


1. Double click on credit-g.arff file.
2. Click on classify tab.
3. Click on choose button.
4. Expand tree folder and select J48
5. Click on cross validations in test options.
6. Select folds as 10
7. Click on start
8. Change the folds to 5
9. Again click on start
10. Change the folds with 2
11. Click on start.
12. Right click on blue bar under result list and go to visualize tree

Output:

Cross-validation definition: the data is split into k folds of roughly equal size; the classifier is trained on k-1 folds and tested on the remaining fold, and this is repeated k times so that every instance is used for testing exactly once. In WEKA, the classifier is evaluated by cross-validation using the number of folds entered in the folds text field.
In the Classify tab, select the cross-validation option, set the fold size to 10 and press Start; then repeat with fold sizes 5 and 2.

i) Fold Size-10
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Coverage of cases (0.95 level) 92.8 %
Mean rel. region size (0.95 level) 91.7 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.84 0.61 0.763 0.84 0.799 0.639 good
0.39 0.16 0.511 0.39 0.442 0.639 bad
Weighted Avg. 0.705 0.475 0.687 0.705 0.692 0.639

=== Confusion Matrix ===

a b <-- classified as
588 112 | a = good
183 117 | b = bad

ii) Fold Size-5


=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 733 73.3 %


Incorrectly Classified Instances 267 26.7 %
Kappa statistic 0.3264
Mean absolute error 0.3293
Root mean squared error 0.4579
Relative absolute error 78.3705 %
Root relative squared error 99.914 %

Coverage of cases (0.95 level) 94.7 %


Mean rel. region size (0.95 level) 93 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.851 0.543 0.785 0.851 0.817 0.685 good
0.457 0.149 0.568 0.457 0.506 0.685 bad
Weighted Avg. 0.733 0.425 0.72 0.733 0.724 0.685
=== Confusion Matrix ===
a b <-- classified as
596 104 | a = good
163 137 | b = bad

iii) Fold Size-2


=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 721 72.1 %


Incorrectly Classified Instances 279 27.9 %
Kappa statistic 0.2443
Mean absolute error 0.3407
Root mean squared error 0.4669
Relative absolute error 81.0491 %
Root relative squared error 101.8806 %
Coverage of cases (0.95 level) 92.8 %
Mean rel. region size (0.95 level) 91.3 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0.891 0.677 0.755 0.891 0.817 0.662 good
0.323 0.109 0.561 0.323 0.41 0.662 bad
Weighted Avg. 0.721 0.506 0.696 0.721 0.695 0.662

=== Confusion Matrix ===

a b <-- classified as
624 76 | a = good
203 97 | b = bad
Note: With this observation we have seen that cross-validation accuracy is lower than the training-set accuracy (85.5%); among the fold sizes tried, accuracy is highest with 5 folds (73.3%) and lowest with 10 folds (70.5%). The decrease relative to the training set is expected, because each fold is tested on data the tree has not seen during training.
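The fold-size comparison can also be scripted. The sketch below (same assumptions as before: weka.jar on the classpath, credit-g.arff in the working directory) runs J48 under 10-, 5- and 2-fold cross-validation and prints only the accuracy for each fold size.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Repeat the experiment with the three fold sizes used in the GUI.
        for (int folds : new int[] {10, 5, 2}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, folds, new Random(1));
            System.out.printf("%2d folds: %.1f %% correct%n", folds, eval.pctCorrect());
        }
    }
}

The exact percentages depend on the random seed used to shuffle the data, so they may differ slightly from the GUI output above.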
7. Check to see if the data shows a bias against "foreign workers" or "personal-status".

One way to do this is to remove these attributes from the data set and see if the decision tree created in those cases is significantly different from the full dataset case which you have already done. Did removing these attributes have any significant effect? Discuss.

Ans) steps followed are:


1. Double click on credit-g.arff file.
2. Click on classify tab.
3. Click on choose button.
4. Expand tree folder and select J48
5. Click on cross validations in test options.
6. Select folds as 10
7. Click on start
8. Click on visualization
9. Now click on preprocessor tab
10. Select 9th and 20th attribute
11. Click on remove button
12. Goto classify tab
13. Choose J48 tree
14. Select cross validation with 10 folds
15. Click on start button
16. Right click on blue bar under the result list and go to visualize tree.

Output:

We use the Preprocess tab in the WEKA GUI Explorer to remove the attributes "foreign_worker" and "personal_status" one by one. In the Classify tab, select the Use training set option and press Start. When these attributes are removed from the dataset, we can compare the resulting accuracy with that of the full dataset.

i) If foreign_worker is removed

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances 859 85.9 %
Incorrectly Classified Instances 141 14.1 %
Kappa statistic 0.6377
Mean absolute error 0.2233
Root mean squared error 0.3341
Relative absolute error 53.1347 %
Root relative squared error 72.9074 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 91.9 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate
0.954 0.363
0.637 0.046
Weighted Avg. 0.859 0.268

=== Confusion Matrix ===

a b <-- classified as
668 32 | a = good
109 191 | b = bad

ii) If personal_status is removed (only a fragment of this output was captured):

Mean rel. region size (0.95 level) 91.7 %

Note: With this observation we have seen that when the foreign_worker attribute is removed from the dataset, the training-set accuracy changes only slightly (from 85.5% to 85.9%), so removing it does not significantly harm classification.
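Removing the two attributes can likewise be done programmatically with the Remove filter. This is a sketch only; the attribute positions (9 = personal_status, 20 = foreign_worker) are taken from the attribute list earlier in this practical and are 1-based, as the filter expects.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();

        // Drop personal_status (9) and foreign_worker (20); indices are 1-based.
        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        // Train and evaluate J48 on the reduced training data.
        J48 tree = new J48();
        tree.buildClassifier(reduced);
        Evaluation eval = new Evaluation(reduced);
        eval.evaluateModel(tree, reduced);
        System.out.println(eval.toSummaryString());
    }
}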

8. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)

Ans) steps followed are:


1. Double click on credit-g.arff file.
2. Select 2,3,5,7,10,17,21 and tick the check boxes.
3. Click on invert
4. Click on remove
5. Click on classify tab
6. Choose trees and then the J48 algorithm.
7. Select cross validation folds as 2
8. Click on start.

To reload the previously removed attributes, press the Undo option in the Preprocess tab. We then use the Preprocess tab in the WEKA GUI Explorer to remove the 21st attribute (class). In the Classify tab, select the Use training set option and press Start; when these attributes are removed from the dataset, we can see the change in accuracy compared with the full data set.

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 963 96.3 %


Incorrectly Classified Instances 37 3.7 %

=== Confusion Matrix ===

a b <-- classified as
963 0 | a = yes
37 0 | b = no

Note: With this observation we have seen that when the 3rd attribute is removed from the dataset, the accuracy (83%) decreases, so this attribute is important for classification. When the 2nd and 10th attributes are removed from the dataset, the accuracy (84%) stays the same, so we can remove either of them. When the 7th and 17th attributes are removed from the dataset, the accuracy (85%) stays the same, so we can remove either of them. If we remove the 5th and 21st attributes the accuracy increases, so these attributes may not be needed for the classification.
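Keeping only a chosen subset of attributes can be expressed with the same Remove filter by inverting the selection. A minimal sketch under the same assumptions, keeping attributes 2, 3, 5, 7, 10, 17 and the class attribute 21:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepSubsetDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();

        // Invert the selection so that the listed attributes are kept and the rest removed.
        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21");
        keep.setInvertSelection(true);
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        subset.setClassIndex(subset.numAttributes() - 1);

        // 2-fold cross-validation, as in the steps above.
        Evaluation eval = new Evaluation(subset);
        eval.crossValidateModel(new J48(), subset, 2, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}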

9. Sometimes, the cost of rejecting an applicant who actually has good credit might be higher than accepting an applicant who has bad credit. Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case, by using a cost matrix in WEKA. Train your decision tree and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6?

Ans) steps followed are:


1. Double click on credit-g.arff file.
2. Click on classify tab.
3. Click on choose button.
4. Expand tree folder and select J48
5. Click on start
6. Note down the accuracy values
7. Now click on credit arff file
8. Click on attributes 2,3,5,7,10,17,21
9. Click on invert
10. Click on classify tab
11. Choose J48 algorithm
12. Select Cross validation fold as 2
13. Click on start and note down the accuracy values.
14. Again make cross validation folds as 10 and note down the accuracy values.
15. Again make cross validation folds as 20 and note down the accuracy values.

OUTPUT:

In the WEKA GUI Explorer, select the Classify tab and the Use training set option. Press the Choose button and select J48 as the decision tree technique. Then press the More options button to open the classifier evaluation options window, select cost-sensitive evaluation, and press the Set button to open the Cost Matrix Editor. Change the number of classes to 2 and press the Resize button to get a 2x2 cost matrix. Change the value at location (0,1) to 5; the modified cost matrix is as follows.

0.0 5.0
1.0 0.0
Then close the cost matrix editor, then press ok button. Then press start button.
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances 855 85.5 %


Incorrectly Classified Instances 145 14.5 %

=== Confusion Matrix ===

a b <-- classified as
669 31 | a = good
114 186 | b = bad
Note: With this observation we have seen that, of the 700 good customers, 669 are classified as good and 31 are misclassified as bad; of the 300 bad customers, 186 are classified as bad and 114 are misclassified as good.
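The cost matrix step can also be reproduced through the API. The sketch below builds a 2x2 cost matrix with cost 5 in cell (0,1), exactly as set in the Cost Matrix Editor above, and runs a cost-sensitive evaluation of J48 on the training set. CostMatrix.setCell is assumed to be the cell setter available in recent WEKA releases.

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveEvaluationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Cell (0,1) = 5 makes classifying a "good" case as "bad" (rejecting a good applicant)
        // five times more costly than the opposite error (setCell assumed per recent WEKA versions).
        CostMatrix cost = new CostMatrix(2);
        cost.setCell(0, 1, 5.0);
        cost.setCell(1, 0, 1.0);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Cost-sensitive evaluation on the training set, mirroring the GUI option.
        Evaluation eval = new Evaluation(data, cost);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}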

10. Do you think it is a good idea to prefer simple decision trees instead of having long, complex decision trees? How does the complexity of a decision tree relate to the bias of the model?

Ans)
Steps followed are:
1. Click on credit-g.arff file.
2. Select all attributes.
3. Click on classify tab.
4. Click on choose and select the J48 algorithm.
5. Select cross-validation folds with 2.
6. Click on start.

It is a good idea to prefer simple decision trees instead of long, complex ones. A simpler (more heavily pruned) tree has higher bias but lower variance, while a very complex tree fits the training data closely (low bias) but tends to overfit and generalize poorly.

11. You can make your decision trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for training your decision trees using cross-validation and report the decision trees you obtain. Also report your accuracy using the pruned model. Does your accuracy increase?

Ans)

Steps followed are:
1. Click on credit-g.arff file.
2. Select all attributes.
3. Click on classify tab.
4. Click on choose and select the REP (reduced error pruning) tree algorithm.
5. Select cross-validation.
6. Click on start.
7. Note down the results.

Reduced error pruning holds back part of the training data as a pruning set; starting from the leaves, each subtree is replaced by a leaf whenever that replacement does not increase the error on the pruning set, which keeps the tree small and less prone to overfitting.

We can make our decision tree simpler by pruning the nodes. In the WEKA GUI Explorer, select the Classify tab and the Use training set option. Press the Choose button and select J48 as the decision tree technique. Click on the "J48 -C 0.25 -M 2" text beside the Choose button to open the Generic Object Editor, set the reducedErrorPruning property to True, press OK, and then press the Start button (see the API sketch below).
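Reduced error pruning can be switched on directly on the J48 object, which is equivalent to setting the reducedErrorPruning property to True in the Generic Object Editor. A minimal sketch under the same data assumptions:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruningDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // J48 with reduced error pruning; part of the training data is held back
        // internally as the pruning set.
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString("\n=== 10-fold CV, reduced error pruning ===\n", false));
    }
}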

12. How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this data set? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

Ans)
Steps for analyzing the decision tree:
1. Click on credit-g.arff file.
2. Select all attributes.
3. Click on classify tab.
4. Click on choose and select the J48 algorithm.
5. Select cross-validation folds with 2.
6. Click on start.
7. Note down the accuracy value.
8. Again go to the choose tab and select PART.
9. Select cross-validation folds with 2.
10. Click on start.
11. Note down the accuracy value.
12. Again go to the choose tab and select OneR.
13. Select cross-validation folds with 2.
14. Click on start.
15. Note down the accuracy value.

Converting the decision tree into a set of rules is as follows:

Rule 1: If age = youth AND student = yes THEN buys_computer = yes
Rule 2: If age = youth AND student = no THEN buys_computer = no
Rule 3: If age = middle_aged THEN buys_computer = yes
Rule 4: If age = senior AND credit_rating = excellent THEN buys_computer = yes
Rule 5: If age = senior AND credit_rating = fair THEN buys_computer = no

In the WEKA GUI Explorer, select the Classify tab and the Use training set option. There also exist different classifiers that output the model in the form of rules; such classifiers in WEKA are PART and OneR. Go to Choose, select Rules, select PART and press the Start button.

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances 897 89.7 %
Incorrectly Classified Instances 103 10.3 %

=== Confusion Matrix ===

a b <-- classified as
653 47 | a = good
56 244 | b = bad

Then go to Choose and select Rules in that select OneR and press start Button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 742 74.2 %
Incorrectly Classified Instances 258 25.8 %

=== Confusion Matrix ===

a b <-- classified as
642 58 | a = good
200 100 | b = bad
Then go to Choose and select Trees in that select J48 and press start Button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
=== Confusion Matrix ===

a b <-- classified as
669 31 | a = good
114 186 | b = bad
Note: With this observation we have seen the performance of the classifiers; ranked by training-set accuracy they are:
1. PART (89.7%)
2. J48 (85.5%)
3. OneR (74.2%)
(A programmatic comparison is sketched below.)
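The ranking can be reproduced with a short loop over the three classifiers, evaluating each on the training set. Again this is only a sketch under the same assumptions about weka.jar and credit-g.arff.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new PART(), new J48(), new OneR() };
        for (Classifier c : classifiers) {
            c.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(c, data);
            System.out.printf("%-5s %.1f %% correct on the training set%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}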
Task 2: Hospital Management System

A data warehouse consists of dimension tables and a fact table. REMEMBER the following.

Dimension

The dimension object (dimension) has:
- a name
- attributes (levels), with a primary key
- hierarchies

One time dimension is a must.

About levels and hierarchies

Dimension objects (dimensions) consist of a set of levels and a set of hierarchies defined over those levels. The levels represent levels of aggregation. Hierarchies describe parent-child relationships among a set of levels.

For example, a typical calendar dimension could contain five levels. Two hierarchies can be defined on these levels:

H1: YearL > QuarterL > MonthL > DayL
H2: YearL > WeekL > DayL

The hierarchies are described from parent to child, so that Year is the parent of Quarter, Quarter is the parent of Month, and so forth.

About unique key constraints

When you create a definition for a hierarchy, Warehouse Builder creates an identifier key for each level of the hierarchy and a unique key constraint on the lowest level (base level).

Design a hospital management system data warehouse (TARGET) consisting of the dimensions PATIENT, MEDICINE, SUPPLIER and TIME, where the measures are NO_UNITS and UNIT_PRICE.

Assume the relational database (SOURCE) table schemas are as follows:

TIME (day, month, year)
PATIENT (patient_name, age, address, etc.)
MEDICINE (medicine_brand_name, drug_name, supplier, no_units, unit_price, etc.)
SUPPLIER (supplier_name, medicine_brand_name, address, etc.)

If each dimension has 6 levels, decide the levels and hierarchies; assume suitable level names.

Design the hospital management system data warehouse using all schemas. Give an example 4-D cube with assumed names.


PRACTICAL FILE
DATA WAREHOUSING AND MINING LAB
(24CSE573)

M.TECH
(I YEAR – I SEM)
(2024-25)

DEPARTMENT OF
COMPUTER SCIENCE AND
ENGINEERING

Submitted To Submitted By

Mr. Rakesh Arya Vaishali Negi

(Assistant Professor) (24MTCSE0008)


DATA WAREHOUSING AND MINING LAB- INDEX

S.No Experiment Name

1 Build Data Warehouse and explore WEKA.

2 Perform data preprocessing tasks and Demonstrate performing association rule mining on data sets.

3 Demonstrate performing classification on data sets.

4 Demonstrate performing clustering on data sets.

5 Demonstrate performing Regression on data sets.

6 Task 1: Credit Risk Assessment. Sample Programs using German Credit Data.

7 Task 2: Sample Programs using Hospital Management System.
