
DM Lab Task-1 Expr's-1

The document introduces the Weka data mining software. It discusses launching Weka and describes the four main applications: Explorer, Experimenter, Knowledge Flow, and Simple CLI. It then provides detailed explanations of the Explorer interface, including preprocessing data, classification, clustering, association rule learning, attribute selection, and visualization. It also summarizes the functionality of the Experimenter for performing experiments and statistical tests.

Uploaded by

Prakash Naik


INTRODUCTION TO WEKA

LAUNCHING WEKA:

WEKA: Waikato Environment for Knowledge Analysis


The WEKA GUI Chooser provides a starting point for launching WEKA’S main GUI
applications and supporting tools. The GUI Chooser consists of four buttons—one for each of the four
major Weka applications—and four menus.
The buttons can be used to start the following applications:
1. Explorer: An environment for exploring data with WEKA.
2. Experimenter: An environment for performing experiments and conducting statistical
tests between learning schemes.
3. Knowledge Flow: Supports essentially the same functions as the Explorer but with a
drag-and-drop interface. One advantage is that it supports incremental learning.
4. Simple CLI: Provides a simple command-line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line interface.

The menu bar consists of four sections: Program, Tools, Visualization, and Help.
EXPLORER:
It is a user interface which contains a group of tabs just below the title bar. The tabs are as
follows:
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
6. Visualize
The bottom of the window contains the status box, the Log button, and the WEKA bird icon.

1. PREPROCESSING:
LOADING DATA
The first four buttons at the top of the Preprocess section enable you to load data into WEKA:
1. Open file: It shows a dialog box allowing you to browse for the data file on the local file system.
2. Open URL: Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB: Reads data from a database.
4. Generate: It is used to generate artificial data from a variety of Data Generators.

Using the Open file button we can read files in a variety of formats, such as WEKA's ARFF format
and CSV format. Typically ARFF files have the .arff extension and CSV files the .csv extension.
THE CURRENT RELATION
The Current relation box describes the currently loaded data, which is interpreted as a single
relational table in database terminology. It has three entries:
1. Relation: The name of the relation, as given in the file from which it was loaded.
Applying filters modifies this name.
2. Instances: The number of instances (data points/records) in the data.
3. Attributes: The number of attributes (features) in the data.

ATTRIBUTES
It is located below the Current relation box and contains four buttons:
1) All is used to tick all boxes. 2) None is used to clear all boxes. 3) Invert is used to make ticked
boxes unticked and vice versa. 4) Pattern is used to select attributes matching a regular expression.
E.g. a.* selects all the attributes that begin with a.
SELECTED ATTRIBUTE:
It is located beside the current relation box which contains the following:
1. Name: It specifies the name of the attribute i.e. same as in the attribute list.
2. Type: It specifies the type of attribute, most commonly Nominal or Numeric.
3. Missing: It provides the number of instances in the data for which this attribute's value is missing.
4. Distinct: It provides the number of different values that the data contains for an attribute.
5. Unique: it provides the number of instances in the data having a value for an attribute that no other
instances have.
FILTERS
By clicking the Choose button at the left of the Filter box, it is possible to select one of the filters
in WEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose
button; clicking this field with the left mouse button opens a GenericObjectEditor dialog box,
which is used to configure the filter.

2. CLASSIFICATION
The Classify tab has a text box that gives the name of the currently selected classifier and its
options. Clicking it with the left mouse button opens a GenericObjectEditor dialog box, the same
as for filters, which is used to configure the current classifier's options.
TEST OPTIONS
The result of applying the chosen classifier will be tested according to the options that are set by
clicking in the Test options box. There are four test modes:
1. Use training set.
2. Supplied test set.
3. Cross-validation.
4. Percentage split.

Once the classifier, test options and class have all been set, the learning process is started by
clicking on the Start button. We can stop the training process at any time by clicking on the Stop button.
The Classifier output area to the right of the display is filled with text describing the results of
training and testing.
After training several classifiers, the Result List will contain several entries using which we can
move over various results that have been generated. By pressing Delete we can remove a selected entry
from the results.
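The Percentage split test mode, for example, holds out part of the shuffled data for testing. A minimal Python sketch of the idea (not Weka code; the 66% default and 1000-instance count are illustrative assumptions):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle the data, then split it into train and test portions."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = len(data) * train_pct // 100
    return data[:cut], data[cut:]

# Illustrative: 1000 instances with a 66% train split.
train, test = percentage_split(range(1000))
print(len(train), len(test))  # 660 340
```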

3. CLUSTERING

By clicking the text box beside the choose button in the Clusterer box, it shows a dialog box used
to choose a new clustering scheme.
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first
three options are the same as in classification: Use training set, Supplied test set, and Percentage
split. The fourth option is Classes to clusters evaluation.

An additional option in the Cluster mode box, Store clusters for visualization, determines
whether or not it will be possible to visualize the clusters once training is complete.
Ignore Attributes: When clustering, some attributes in the data should be ignored. This button
shows a small window that allows you to select which attributes are ignored.
4. ASSOCIATING
It contains schemes for learning association rules, and the learners are chosen and configured in the
same way as the clusterer, filters, and classifiers in the other panels.
5. SELECTING ATTRIBUTES

Attribute selection involves searching through all possible combinations of attributes in the
data to find which subset of attributes works best for prediction. To do this, two objects must be set up:
an attribute evaluator and a search method. The evaluator determines what method is used to assign a
worth to each subset of attributes. The search method determines what style of search is performed.
The Attribute Selection Mode box has two options:
1. Use full training set: The worth of the attribute subset is determined using the full set of
training data.
2. Cross-validation: The worth of the attribute subset is determined by a process of cross-
validation. The Fold and Seed fields set the number of folds to use and the random seed used when
shuffling the data.
6. VISUALIZING

WEKA’s visualization section allows you to visualize 2D plots of the current relation.
EXPERIMENTER: The Weak Experiment Environment enables the user to create, run, modify,
and analyses experiments in a more convenient manner. It can also be run from the command line using
the Simple CLI.
New Experiment:
After clicking New, default parameters for an experiment are defined.
We can configure the experiment in two different modes: 1) Simple and 2) Advanced.

Result Destination:
By default, an ARFF file is the destination for the results output, but we can also choose a CSV
file as the destination. The advantage of ARFF or CSV files is that they can be created
without any additional classes. The drawback is that an interrupted experiment cannot be resumed.
Experiment type:
The user can choose between the following three different types:
1. Cross-validation: it is a default type and it performs stratified cross-validation with the given
number of folds.
2. Train/Test Percentage Split (data randomized): it splits a dataset according to the given percentage
into a train and a test file after the order of the data has been randomized and stratified.
3. Train/Test Percentage Split (order preserved): since it is impossible to specify an explicit
train/test file pair, one can abuse this type to un-merge a previously merged train and test file into
the two original files.
Additionally, one can choose between Classification and Regression, depending on the datasets and
classifiers one uses.
Data Sets:
One can add dataset files either with an absolute path or with a relative path.
Iteration control:
1. Number of repetitions: In order to get statistically meaningful results, the default number of iterations
is 10.
2. Data sets first/Algorithms first: As soon as one has more than one dataset and algorithm, it can be
useful to switch from datasets being iterated over first to algorithms.
Algorithms: New algorithms can be added via the “Add New” button. Opening this dialog for the first
time, ZeroR is presented.
By clicking on the Choose button one can choose another classifier which is as shown in the
below diagram:
The “Filter...” button enables us to highlight classifiers that can handle certain attribute
and class types. With the “Remove filter” button one can clear this highlighting.

With the Load options... and Save options... buttons one can load and save the setup of a selected
classifier from and to XML.
Running an Experiment:
To run the current experiment, click the Run tab at the top of the Experiment Environment
window. The current experiment performs 10 runs of 10-fold stratified cross-validation.

If the experiment was defined correctly, three messages will be displayed in the Log panel.
Advanced
Defining an experiment: When the Experimenter is started in advanced mode, the Setup tab is
displayed. Now click New to initialize an experiment.

To define the dataset to be processed by a scheme, first select Use relative paths in the
Datasets panel of the Setup tab and then click on Add new... button.
Saving Results of the experiment

To identify a dataset to which the results are to be sent, click on the Instances- Result Listener
entry in the Destination panel, which opens a dialog box with a label named as “output file”.
Now give the name of the output file and click on OK button. The dataset name is now
displayed in the Datasets panel of the Setup tab. This is as shown in the following figure:

Now we can run the experiment by clicking the Run tab at the top of the experiment
environment window. The current experiment performs 10 randomized train and test runs.
To change from random train and test experiments to cross-validation experiments, click on the
Result generator entry.
Using analysis tab in experiment environment window one can analyze the results of experiments using
experiment analyzer.

KNOWLEDGE FLOW:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to WEKA’s
core algorithms. It is represented as shown in the following figure. The Knowledge Flow presents a
data-flow inspired interface to WEKA.
The Knowledge Flow offers the following features:
1. Intuitive data flow style layout.
2. Process data in batches or incrementally.
3. Process multiple batches or streams in parallel (each separate flow executes in its own thread).
4. Chain filters together.
5. View models produced by classifiers for each fold in a cross validation.
6. Visualize performance of incremental classifiers during processing
7. Plug-in facility for allowing easy addition of new components to the Knowledge Flow.

Components:

1. Data Sources: All WEKA loaders are available.


2. Data Sinks: All WEKA savers are available.
3. Filters: All WEKA’s filters are available.
4. Classifiers: All WEKA classifiers are available.
5. Clusterers: All WEKA clusterers are available.
6. Evaluation: It contains different kinds of techniques like TrainingSetMaker, TestSetMaker,
CrossValidationFoldMaker, TrainTestSplitMaker, ClassAssigner, ClassValuePicker,
ClassifierPerformanceEvaluator, IncrementalClassifierEvaluator, ClustererPerformanceEvaluator, and
PredictionAppender.
7. Visualization: It contains different models like DataVisualizer, ScatterPlotMatrix,
AttributeSummarizer, ModelPerformanceChart, TextViewer, GraphViewer, and StripChart.
Plug-in Facility:
The Knowledge Flow offers the ability to easily add new components via a plug-in mechanism.
SIMPLE CLI:
The Simple CLI provides full access to all Weka classes like classifiers, filters, clusterers, etc., but
without the hassle of the CLASSPATH.
It offers a simple Weka shell with separated command line and output.
The simple command line interface is represented as shown in the following figure:

The following commands are available in the Simple CLI:


1. java <classname> [<args>]: invokes a java class with the given arguments (if any)
2. break: stops the current thread (e.g., a running classifier) in a friendly manner
3. kill: stops the current thread in an unfriendly fashion
4. cls: clears the output area
5. exit: exits the Simple CLI
6. help [<command>]: provides information about the given command, or an overview of all
available commands if no command is specified as an argument
The only way to invoke a Weka class is to prefix the class name with “java”.

This command tells the Simple CLI to load a class and execute it with any given parameters.
For example: java weka.classifiers.trees.J48 -t c:/temp/iris.arff which results in the
following output.

Using the Simple CLI we can also perform command redirection using the operator “>”. For
example: java weka.classifiers.trees.J48 -t test.arff > j48.txt
TASK-1
CREDIT RISK ASSESSMENT

 The business of banks is making loans. Assessing the creditworthiness of an applicant is
of crucial importance. We have to develop a system to help a loan officer decide whether
the credit of a customer is good or bad. A bank’s business rules regarding loans must
consider two opposing factors. On the one hand, a bank wants to make as many loans as
possible, since interest on these loans is the bank's profit source. On the other hand, a bank
cannot afford to make too many bad loans; too many bad loans could lead to the collapse of the bank.
The bank’s loan policy must therefore involve a compromise: not too strict, and not too lenient.
 Credit risk is an investor's risk of loss arising from a borrower who does not make
payments as promised. Such an event is called a default. Other terms for credit risk are
default risk and counterparty risk.
 Credit risk is most simply defined as the potential that a bank borrower or counterparty
will fail to meet its obligations in accordance with agreed terms.
 The goal of credit risk management is to maximize a bank's risk-adjusted rate of return by
maintaining credit risk exposure within acceptable parameters.
 Banks need to manage the credit risk inherent in the entire portfolio as well as the risk in
individual credits or transactions.
 Banks should also consider the relationships between credit risk and other risks.
 The effective management of credit risk is a critical component of a comprehensive
approach to risk management and essential to the long-term success of any banking
organization.
 A good credit assessment means you should be able to qualify, within the limits of your
income, for most loans.
EXPERIMENT 1

THE GERMAN CREDIT DATA

1.1 List all the categorical (or nominal) attributes and the real-valued attributes
separately.
Description:

• Categorical (or nominal) Attributes: A categorical variable is one that has two
or more categories, but there is no intrinsic ordering to the categories.
– e.g. gender, color
• Real-valued Attributes: A real-valued attribute has no categories but takes a set of
continuous values.
– e.g. salary, age, height, weight

Procedure: How to Find Categorical and Real Valued / Numeric Attribute

• Open Weka Explorer -> Open and Load the sample data -> Click on Edit
• Check all the attribute name and list all of them.

The German credit dataset classifies people, described by a set of attributes, as good or bad credit
risks:
 1000 examples
 20 attributes (both nominal and numeric)
 2 classes (good / bad)

Categorical attributes (or Qualitative or Nominal attributes):


1. Status of existing checking account
2. Credit history
3. Purpose
4. Savings accounts or bonds
5. Present employment since
6. Personal status and sex
7. Other debtors / guarantors
8. Property
9. Other installment plans
10. Housing
11. Job
12. Telephone
13. Foreign worker
Quantitative (or Numerical or Real valued) attributes

1. Duration in month
2. Credit amount
3. Installment rate in percentage of disposable income
4. Present residence since
5. Age in years
6. Number of existing credits at this bank
7. Number of people being liable to provide maintenance for

Attribute Name                                              Type of Attribute

Status of existing checking account                         qualitative
Duration in month                                           numerical
Credit history                                              qualitative
Purpose                                                     qualitative
Credit amount                                               numerical
Savings account/bonds                                       qualitative
Present employment since                                    qualitative
Installment rate in percentage of disposable income         numerical
Personal status and sex                                     qualitative
Other debtors / guarantors                                  qualitative
Present residence since                                     numerical
Property                                                    qualitative
Age in years                                                numerical
Other installment plans                                     qualitative
Housing                                                     qualitative
Number of existing credits at this bank                     numerical
Job                                                         qualitative
Number of people being liable to provide maintenance for    numerical
Telephone                                                   qualitative
Foreign worker                                              qualitative


EXPERIMENT 2
1.2 What attributes do you think might be crucial in making the credit assessment?
Come up with some simple rules in plain English using your selected attributes.

Categorical Attributes:
Attribute 1:Status of checking account: A11: ... < 0 DM A12: 0 <= ... < 200 DM A13: ...
>= 200 DM A14: no checking account.

Attribute 3: Credit history A30: no credits taken A31: all credits paid back duly A32:
existing credits paid duly till now A33: delay in paying off A34: critical account
Attribute 4:Purpose A40: car (new) A41: car (used) A42: furniture/equipment A43 :
radio/television A44: domestic appliances A45: repairs A46: education A47 : (vacation -does
not exist?) A48 : retraining A49 : business A410 : others
Attribute 6:Savings account/bonds A61 : ... < 100 DM A62 : 100 <= ... < 500 DM A63 :
500 <= ... < 1000 DM A64 : .. >= 1000 DM A65 : unknown/ no savings account
Attribute 7:Present employment since A71 : unemployed A72 : ... < 1 year A73 : 1 <= ...
< 4 years A74 : 4 <= ... < 7 years A75 : .. >= 7 years
Attribute 9:Personal status and sex A91 : male : divorced/separated A92 : female :
divorced/separated/married A93 : male : single A94 : male : married/widowed A95 :
female : single

Attribute 10:Other debtors / guarantors A101 : none A102 : co-applicant A103 : guarantor
Attribute 12:Property A121: real estate A122: if not A121: building society savings agreement/
life insurance A123: if not A121/A122: car or other, not in attribute 6 A124: unknown / no property
Attribute 14:Other installment plans A141: bank A142: stores A143: none
Attribute 15:Housing A151: rent A152: own A153: for free
Attribute 17:Job A171 : unemployed/ unskilled - non-resident A172 : unskilled - resident A173 :
skilled employee / official A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 19:Telephone A191 : none A192 : yes, registered under the customer's name
Attribute 20:foreign worker A201 : yes A202 : no
Real Valued Attributes:
Attribute 2: Duration in month
Attribute 5: Credit amount
Attribute 8: Installment rate in percentage of disposable income
Attribute 11: Present residence since
Attribute 13: Age in years
Attribute 16: Number of existing credits at this bank
Attribute 18: Number of people being liable to provide maintenance for .
Procedure:
• To find Crucial Attributes

• Open Weka Explorer -> Open and Load the sample data -> Select Attribute
• Attribute Evaluator Category -> Choose InfoGainAttributeEvaluator -> Click on start ->
Check Attribute Selection Attribute -> Check Ranked Attribute

Formulating Rules:
After finding the crucial attributes, we can define simple rules based on them by inspecting
the training data and combining the crucial attributes.
EX: IF purpose of loan= Education AND duration =15 months: Good
IF purpose of loan =Car AND duration=50 months: Bad
IF checking status=no checking and existing credit history= paid and purpose =education: good
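Rules like these can be expressed directly as code. A minimal Python sketch (not Weka code; the attribute keys, threshold values, and the `assess` function are invented for illustration):

```python
def assess(applicant):
    """Toy rule set: combine crucial attributes into a good/bad label."""
    if applicant.get("purpose") == "education" and applicant.get("duration", 0) <= 15:
        return "good"
    if applicant.get("purpose") == "car" and applicant.get("duration", 0) >= 50:
        return "bad"
    if (applicant.get("checking_status") == "no checking"
            and applicant.get("credit_history") == "paid"
            and applicant.get("purpose") == "education"):
        return "good"
    return "unknown"  # no rule fired; fall back to a default or a classifier

print(assess({"purpose": "education", "duration": 12}))  # good
```

A real decision tree learner such as J48 induces rules of exactly this shape automatically from the training data.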

The Significant Attributes are as follows:


Ranked attributes:
0.094739 1 checking status
0.043618 3 credit history
0.0329 2 duration
0.028115 6 savings status
0.024894 4 purpose
0.018709 5 credit amount
0.016985 12 property magnitude
0.013102 7 employment
0.012753 15 housing
0.011278 13 age
0.008875 14 other payment plans
0.006811 9 personal status
0.005823 20 foreign worker
0.004797 10 other parties
0.001337 17 job
0.000964 19 own telephone
0 18 num dependents
0 8 installment commitment
0 11 residence since
0 16 existing credits
Selected attributes:1,3,2,6,4,5,12,7,15,13,14,9,20,10,17,19,18,8,11,16: 20
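The ranking above comes from the information gain of each attribute with respect to the class. A minimal Python sketch of that computation (the toy status/label data below is invented for illustration, not taken from the credit dataset):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the value-weighted entropy after splitting on the attribute."""
    n = len(labels)
    after_split = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        after_split += len(subset) / n * entropy(subset)
    return entropy(labels) - after_split

# Toy data: a nominal attribute vs. a good/bad class.
status = ["none", "none", "low", "low", "high", "high"]
label  = ["good", "good", "bad", "good", "good", "good"]
print(round(info_gain(status, label), 3))  # 0.317
```

Attributes with higher gain (here, checking status at 0.095) reduce class uncertainty the most, which is why they top the ranked list.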
1. Checking Status.

2. Credit history. To know how the applicant has repaid previous credits.

3. Duration in months. If a loan has been sanctioned, it should be repaid within some
period.
Ex. Education loan: after completing education and getting a job, he/she has
to start the repayment within the scheduled tenure.

4. Savings Account /bonds


5. Purpose. This attribute will specify the reason for the Loan. The Loan can be for a Car,
Furniture, Education, or Business etc.
6. Job

7. Property

8. Age

9. Present Employment since

Example Rules:
1. If Credit History = A31 and Savings account = A64 and Job = A174 and Property = A121
then the customer is “good”.
2. If Credit History = A33 and Savings account = A61 and Job = A171 and Property = A121
then the customer is “bad”.
EXPERIMENT 3
1.3. One type of model that you can create is a Decision Tree - train a Decision Tree
using the complete dataset as the training data. Report the model obtained after
training.

Description:
• A Decision Tree: A decision tree is a tree-like structure in which each non-leaf node
represents a test on an attribute and each leaf node represents a class label. It classifies a given
tuple by comparing its attribute values against these tests, starting from the root node and moving
towards a leaf node.

Procedure:
• Open Weka Explorer -> Open and Load the sample data -> Select Classify

• Choose Classifier -> Select Algorithm J48 -> Select Use Training Set -> Click on Start

• Right Click on Result List -> Select Visualize Tree


=== Summary ===

Correctly Classified Instances 855 85.5 %


Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.956 0.380 0.854 0.956 0.902 0.640 0.857 0.905 good
0.620 0.044 0.857 0.620 0.720 0.640 0.857 0.783 bad
Weighted Avg. 0.855 0.279 0.855 0.855 0.847 0.640 0.857 0.869

=== Confusion Matrix ===

a b <-- classified as
669 31 | a = good
114 186 | b = bad
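The summary figures above follow directly from the confusion matrix: the diagonal cells are the correctly classified instances. A quick check in Python (a generic sketch, not Weka output):

```python
def accuracy(cm):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

cm = [[669, 31],    # actual good: 669 classified good, 31 classified bad
      [114, 186]]   # actual bad: 114 classified good, 186 classified bad
print(accuracy(cm))  # 0.855, i.e. 855 of 1000 instances correctly classified
```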
EXPERIMENT 4
1.4.Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set) Why do you think
you cannot get 100 % training accuracy?

Predictive Accuracy Evaluation:


The main methods of predictive accuracy evaluations are:
 Resubstitution (N ; N)

 Holdout (2N/3 ; N/3)

 x-fold cross-validation (N-N/x ; N/x)

 Leave-one-out (N-1 ; 1)

Where N is the number of records (instances) in the dataset, and (a ; b) denotes the sizes of the
(training set ; test set).
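With N = 1000 (the size of the German credit data), these conventions give the following (train ; test) sizes, sketched in Python using integer division as an approximation:

```python
N = 1000  # number of instances in the German credit dataset

methods = {
    "Resubstitution": (N, N),
    "Holdout": (2 * N // 3, N - 2 * N // 3),
    "10-fold cross-validation": (N - N // 10, N // 10),
    "Leave-one-out": (N - 1, 1),
}
for name, (train, test) in methods.items():
    print(f"{name}: train {train}, test {test}")
```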

Training and Testing:

REMEMBER: we must know the classification (class attribute values) of all instances (records) used in the test
procedure.
 Success: instance (record) class is predicted correctly

 Error: instance class is predicted incorrectly

 Error rate: the percentage of errors made over the whole set of instances (records) used for testing

 Predictive Accuracy: the percentage of correctly classified data in the testing data set.

SOLUTION:
According to the rules, for maximum accuracy, we should take 2/3 of the dataset as the training set and the
remaining 1/3 as the test set. But in the above model we have taken the complete dataset as the training set,
which results in only 85.5% of examples being correctly classified; the remaining 14.5% are classified incorrectly.
Hence, we cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes
are also analyzed and trained on.
EXPERIMENT 5

1.5 Is testing on the training set as you did above a good idea? Why not?

SOLUTION:
Testing on Training Set (2N/3 ; N/3):
According to the rules, for maximum accuracy, we should take 2/3 of the dataset as the training
set and the remaining 1/3 as the test set. But in the above model we have taken the complete dataset
as the training set, which results in only 85.5% accuracy.
This happens because unnecessary attributes, which do not play a crucial role in credit risk
assessment, are also analyzed and trained on; this increases complexity and ultimately lowers accuracy.
If part of the dataset is used as a training set and the remainder as a test set, the results are more
accurate and the computation time is reduced.
This is why we prefer not to take the complete dataset as the training set. In such cases it is
better to go with cross-validation.
X-fold cross-validation (N-N/x; N/x):
Cross-validation is used to prevent the overlap of the test sets.
First step: split the data into x disjoint subsets of equal size.
Second step: use each subset in turn for testing, and the remainder for training (repeated cross-
validation).
The resulting rules (where applicable) are the union of the rules from all folds.
The error (predictive accuracy) estimates are averaged to yield an overall error (predictive
accuracy) estimate. The standard is 10-fold cross-validation.
Why 10?
Extensive experiments have shown that this is the best choice to get an accurate estimate, and
there is also some theoretical evidence for this.
EXPERIMENT 6
1.6 One approach for solving the problem encountered in the previous question is using
cross-validation? Describe what cross-validation is briefly. Train a Decision Tree again using
cross-validation and report your results. Does your accuracy increase/decrease? Why?

Cross validation:-

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive
subsets or folds D1, D2, D3, . . ., Dk, each of approximately equal size. Training and testing
is performed k times. In iteration i, partition Di is reserved as the test set and the remaining
partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3,
. . ., Dk collectively serve as the training set to obtain the first model, which is tested on D1;
the second model is trained on subsets D1, D3, . . ., Dk and tested on D2; and so on.
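The fold rotation described above can be sketched in a few lines of Python (a generic illustration, not Weka's implementation):

```python
def k_fold_splits(data, k):
    """Partition data into k roughly equal folds; yield (train, test) pairs,
    using each fold in turn as the test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    # every instance is used exactly once across train and test
    assert sorted(train + test) == data
print(len(list(k_fold_splits(data, 5))))  # 5 train/test pairs
```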

SOLUTION:
Step-1: Select Explorer from the Weka GUI Chooser, then click on the Open file option in the Preprocess tab.
Step-2: Select the credit-g.arff file and click on the Open option.
Step-3: Select the Cross-validation option from the Test options.
Step-4: Select the J48 tree from the Choose option and click on the Start button.
Step-5: Here we can observe the decreased accuracy rate with cross-validation.
=== Summary ===

Correctly Classified Instances 705 70.5 %


Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.840 0.610 0.763 0.840 0.799 0.251 0.639 0.746 good
0.390 0.160 0.511 0.390 0.442 0.251 0.639 0.449 bad
Weighted Avg. 0.705 0.475 0.687 0.705 0.692 0.251 0.639 0.657

=== Confusion Matrix ===

a b <-- classified as
588 112 | a = good
183 117 | b = bad
EXPERIMENT 7

1.7 Check to see if the data shows a bias against "foreign workers" (attribute 20), or
"personal-status"(attribute 9). One way to do this (Perhaps rather simple minded) is to
remove these attributes from the dataset and see if the decision tree created in those cases is
significantly different from the full dataset case which you have already done. To remove an
attribute you can use the reprocess tab in WEKA's GUI Explorer. Did removing these
attributes have any significant effect? Discuss.

SOLUTION:
This increase in accuracy occurs because these two attributes are not very important in training and
analysis; by removing them, the training time is reduced to some extent, which results in an increase in
the accuracy.
The decision tree created from the full dataset is very large compared to the decision tree we have
trained now. This is the main difference between the two decision trees.
Step-1: Select Explorer from the Weka GUI Chooser, then click on the Open file option in the Preprocess tab.
Step-2: After opening the file, select the class label to classify the data.
Step-3: Select the foreign_worker and personal_status attributes from the list of attributes.
Step-4: Now click the Remove button to remove these attributes.
Step-5: Now, in the Classify tab, select the Use training set option, select the J48 tree, and click on
the Start button. We can observe the improvement in accuracy, as shown below.
=== Summary ===

Correctly Classified Instances 861 86.1 %


Incorrectly Classified Instances 139 13.9 %
Kappa statistic 0.6458
Mean absolute error 0.2194
Root mean squared error 0.3312
Relative absolute error 52.208 %
Root relative squared error 72.2688 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.950 0.347 0.865 0.950 0.905 0.656 0.869 0.912 good
0.653 0.050 0.848 0.653 0.738 0.656 0.869 0.805 bad
Weighted Avg. 0.861 0.258 0.860 0.861 0.855 0.656 0.869 0.880

=== Confusion Matrix ===


a b <-- classified as
665 35 | a = good
104 196 | b = bad
** The difference we observed is that the accuracy improved.
Calculations:
=== Confusion Matrix ===
a b <-- classified as
665 35 | a = good
104 196 | b = bad
Number of Leaves : 97
Size of the tree : 139
Time taken to build model: 0.05 seconds
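The Kappa statistic in the summary above can be reproduced from this confusion matrix; it compares the observed accuracy against the accuracy expected by chance agreement. A quick Python check (a generic sketch of the standard formula, not Weka code):

```python
def kappa(cm):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    # chance agreement: product of matching row and column marginals
    expected = sum(sum(cm[i]) * sum(row[i] for row in cm)
                   for i in range(len(cm))) / (n * n)
    return (observed - expected) / (1 - expected)

cm = [[665, 35],
      [104, 196]]
print(round(kappa(cm), 4))  # 0.6458, matching the summary above
```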

EXPERIMENT 8
1.8 Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes 2, 3,
5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had
removed two attributes in problem 7. Remember to reload the ARFF data file to get all the
attributes back before you start selecting the ones you want.)

Select attributes 2, 3, 5, 7, 10, 17 and 21, click Invert to select the remaining attributes, and then click Remove.
Here the accuracy is decreased.

Select random attributes and then check the accuracy.


After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over
attributes and visualize them.
After removing these 14 attributes, the accuracy decreases to 76.4%, so we can further try
random combinations of attributes to increase the accuracy.
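The Explorer only lets you try one attribute subset at a time, but a short sketch shows how many candidate subsets of the suggested attributes there are to work through (illustrative Python; the attribute numbers are the ones from the question):

```python
from itertools import combinations

# Candidate attributes suggested in the question; 21 is the class
# attribute and must always be kept.
candidates = [2, 3, 5, 7, 10, 17]

# Every non-empty subset of the candidates, each together with the
# class attribute, is one configuration to evaluate in the Explorer.
subsets = [list(combo) + [21]
           for size in range(1, len(candidates) + 1)
           for combo in combinations(candidates, size)]

print(len(subsets))  # 2^6 - 1 = 63 configurations
```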
EXPERIMENT 9

1.9 Sometimes, the cost of rejecting an applicant who actually has a good credit(case 1)
might be higher than accepting an applicant who has bad credit (case 2).Instead of counting
the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and
lower cost to the second case. You can do this by using a cost matrix in Weka. Train your
Decision Tree again and report the Decision Tree and cross-validation results. Are they
significantly different from results obtained in problem 6 (using equal cost)?

In problem 6, we used equal costs to train the decision tree. Here, we consider two cases with
different costs: cost 5 in case 1 and cost 2 in case 2. When we train the decision tree with these
costs, we observe that the accuracy is almost equal to that of the decision tree obtained in
problem 6.
              Case 1 (cost 5)   Case 2 (cost 2)
Total cost         3820              1705
Average cost       3.82              1.705

This cost factor does not appear in problem 6, since equal costs were used there. This is the major
difference between the results of problem 6 and problem 9.

The cost matrices we used here:


Case 1: 5 1
1 5

Case 2: 2 1
1 2
1. Select the Classify tab.
2. Select More options... under Test options.
3. Tick Cost-sensitive evaluation and click Set....
4. Set the number of classes to 2.
5. Click Resize to obtain the cost matrix.
6. Change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. The confusion matrix will then be generated, and you can see the difference
between the good and bad classes.
8. Check whether the accuracy changes.
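Weka's cost-sensitive evaluation arrives at the total cost by weighting each cell of the confusion matrix with the corresponding cell of the cost matrix. The sketch below illustrates the calculation using the confusion matrix from problem 7 and an assumed cost matrix (cost 5 for case 1, zero on the diagonal); the numbers are for illustration only and do not reproduce the 3820/1705 figures above:

```python
# Rows of the confusion matrix: actual good, actual bad.
# Columns: predicted good, predicted bad.
confusion = [[665, 35],
             [104, 196]]

# Assumed cost matrix: rejecting a good applicant (case 1) costs 5,
# accepting a bad applicant costs 1, correct decisions cost 0.
cost = [[0, 5],
        [1, 0]]

total_cost = sum(confusion[i][j] * cost[i][j]
                 for i in range(2) for j in range(2))
n = sum(sum(row) for row in confusion)
average_cost = total_cost / n

print(total_cost, average_cost)  # 279 and 0.279 for these matrices
```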
EXPERIMENT 10

1.10 Do you think it is a good idea to prefer simple decision trees instead of having long
complex decision trees? How does the complexity of a Decision Tree relate to the bias of
the model?

When we use long, complex decision trees, the tree contains many unnecessary attributes and
fits noise in the training data. Such a tree has low bias but high variance, so it overfits and its
accuracy on unseen data suffers.

This problem can be reduced by using a simple decision tree. With fewer attributes the bias of
the model is somewhat higher, but the variance is lower, so the results on new data are usually
more accurate.

So it is a good idea to prefer simple decision trees instead of long, complex trees.
 Open any existing ARFF file, e.g. labour.arff.
 In the Preprocess tab, select All to select all the attributes.
 Go to the Classify tab and use the training set with the J48 algorithm.
1. To generate the decision tree, right-click on the result list and select the Visualize tree
option; the decision tree will be displayed.
2. Right-click on the J48 algorithm to get the Generic Object Editor window.
3. In this window, set the unpruned option to true.
4. Then press OK and then Start. We find that the tree becomes more complex when it is not
pruned.
Visualize tree

The tree has become more complex.


EXPERIMENT 11

1.11 You can make your Decision Trees simpler by pruning the nodes. One approach is to
use Reduced Error Pruning - Explain this idea briefly. Try reduced error pruning for
training your Decision Trees using cross-validation (you can do this in Weka) and report the
Decision Tree you obtain? Also, report your accuracy using the pruned model. Does your
accuracy increase?

Reduced-error pruning: -

The idea of using a separate pruning set for pruning—which is applicable to decision trees as well
as rule sets—is called reduced-error pruning. The variant described previously prunes a rule
immediately after it has been grown and is called incremental reduced-error pruning.

Another possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding
individual tests.

However, this method is much slower. Of course, there are many different ways to assess the worth
of a rule based on the pruning set. A simple measure is to consider how well the rule would do at
discriminating the predicted class from other classes if it were the only rule in the theory, operating
under the closed world assumption.

If it gets p instances right out of the t instances that it covers, and there are P instances of this class
out of a total of T instances altogether, then it gets p positive instances right. The instances that it
does not cover include N - n negative ones, where n = t - p is the number of negative instances
that the rule covers and N = T - P is the total number of negative instances.

Thus, the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the
pruning set, has been used to evaluate the success of a rule when using reduced-error pruning.
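A worked example of this ratio, with illustrative numbers (not taken from the lab data):

```python
# Reduced-error pruning success ratio [p + (N - n)] / T.
p, t = 8, 10    # the rule covers t instances, p of them correctly
P, T = 40, 100  # P instances of the predicted class out of T in total

n = t - p       # negative instances that the rule covers
N = T - P       # total negative instances
success_ratio = (p + (N - n)) / T

print(success_ratio)  # (8 + 60 - 2) / 100 = 0.66
```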

1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this window, set the reducedErrorPruning option to true (leave unpruned as false,
since an unpruned tree cannot be pruned).
3. Then press OK and then Start.
4. We find that the accuracy has increased by selecting the reduced-error pruning
option.
EXPERIMENT 12

1.12 (Extra Credit): How can you convert a Decision Trees into "if-then-else rules". Make
up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules.
There also exist different classifiers that output the model in the form of rules - one such
classifier in Weka is rules. PART, train this model and report the set of rules obtained.
Sometimes just one attribute can be good enough in making the decision, yes, just one ! Can
you predict what attribute that might be in this dataset? OneR classifier uses a single
attribute to make decisions (it chooses the attribute based on minimum error). Report the
rule obtained by training a one R classifier. Rank the performance of j48, PART and oneR.

In Weka, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE"
rules.

Converting decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:
PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules: 4
Yes, sometimes just one attribute can be good enough in making the decision.
In this dataset (Weather), Single attribute for making the decision is “outlook”
outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)
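The OneR idea can be re-implemented in a few lines on the same weather.nominal data; this is an illustrative sketch of the algorithm, not Weka's actual code:

```python
from collections import Counter, defaultdict

# The classic 14-instance weather.nominal dataset:
# (outlook, temperature, humidity, windy, play)
data = [
    ("sunny","hot","high","FALSE","no"),    ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"),("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"),("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"),("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"),("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"),("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"),("rainy","mild","high","TRUE","no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def one_r(data, attributes):
    """For each attribute, predict the majority class per value;
    keep the attribute whose rule makes the fewest errors."""
    best = None
    for i, name in enumerate(attributes):
        counts = defaultdict(Counter)        # attribute value -> class counts
        for row in data:
            counts[row[i]][row[-1]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        correct = sum(c[rule[v]] for v, c in counts.items())
        if best is None or correct > best[2]:  # ties keep the earlier attribute
            best = (name, rule, correct)
    return best

name, rule, correct = one_r(data, attributes)
print(name, rule, correct)
```

With these 14 instances the best single-attribute rule is the outlook rule with 10/14 correct, matching the OneR output above (humidity also scores 10/14; the sketch keeps the first attribute on ties).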
With respect to time, the OneR classifier ranks first, J48 second and PART third.

            J48    PART   OneR
TIME (sec)  0.12   0.14   0.04
RANK        II     III    I

But if you consider accuracy, the J48 classifier ranks first, PART second and OneR last.

              J48    PART   OneR
ACCURACY (%)  70.5   70.2   66.8

1. Open the existing file weather.nominal.arff.
2. Select All.
3. Go to the Classify tab.
4. Click Start.
Here the accuracy is 100%
The tree can be read as "if-then-else" rules:
If outlook=overcast then play=yes
If outlook=sunny and humidity=high
Then play = no
Else play = yes
If outlook=rainy and windy=true
Then play = no
Else play = yes
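The same rules can be written directly as a function; a small Python sketch (windy is taken as a boolean here):

```python
# If-then-else rules read off the weather decision tree above.
def play(outlook, humidity, windy):
    if outlook == "overcast":          # overcast always plays
        return "yes"
    if outlook == "sunny":             # sunny: decided by humidity
        return "no" if humidity == "high" else "yes"
    return "no" if windy else "yes"    # rainy: decided by wind

print(play("sunny", "high", False))  # no
```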

To extract the rules:


1. Go to Choose, then click on rules and select PART.
2. Click OK and then Start.
3. Similarly for the OneR algorithm.
