0% found this document useful (0 votes)

304 views34 pages

Data Mining Lab Manual

This document provides instructions for a data mining lab manual on credit risk assessment using the German credit data. It includes 12 subtasks: 1) List categorical and real-valued attributes, 2) Propose simple rules for credit assessment, 3) Train and report a decision tree model, 4) Calculate training accuracy, 5) Discuss testing on the training set, 6) Describe cross-validation and train/report results, 7) Check for bias and remove attributes, 8) Try different attribute combinations, 9) Adjust costs and retrain/report results, 10) Discuss decision tree complexity, 11) Explain reduced error pruning and apply, 12) Convert a decision tree to rules and train/report a rule model

Uploaded by

Keerthana Sudarshan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

304 views34 pages

Data Mining Lab Manual

Uploaded by

Keerthana Sudarshan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 34

DATAMINING LAB MANUAL

Data Mining Lab

Credit Risk Assessment

Description: The business of banks is making loans. Assessing the credit worthiness
of an applicant is of crucial importance. You have to develop a system to help a loan
officer decide whether the credit of a customer is good, or bad. A banks business
rules regarding loans must consider two opposing factors. On the one hand, a bank
wants to make as many loans as possible. Interest on these loans is the bans profit
source. On the other hand, a bank cannot afford to make too many bad loans. Too
many bad loans could lead to the collapse of the bank. The banks loan policy must
involve a compromise not too strict, and not too lenient.

To do the assignment, you first and foremost need some knowledge about the world
of credit . You can acquire such knowledge in a number of ways.

1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and
try to represent her knowledge in the form of production rules.

2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on
finance. Translate this knowledge from text form to production rule form.

3. Common sense. Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the credit worthiness of a loan applicant.

4. Case histories. Find records of actual cases where competent loan officers correctly
judged when not to, approve a loan application.

The German Credit Data :

Actual historical credit data is not always easy to come by because of confidentiality
rules. Here is one such dataset ( original) Excel spreadsheet version of the German credit
data (download from web).

In spite of the fact that the data is German, you should probably make use of it for this
assignment, (Unless you really can consult a real loan officer !)

A few notes on the German dataset:

DM stands for Deutsche Mark, the unit of currency, worth about 90 cents
Canadian (but looks and acts like a quarter).

Owns_telephone. German phone rates are much higher than in Canada so fewer
people own telephones.

Page 1
DATAMINING LAB MANUAL

Foreign_worker. There are millions of these in Germany (many from Turkey). It

is very hard to get German citizenship if you were not born of German parents.

There are 20 attributes used in judging a loan applicant. The goal is the claify the
applicant into one of two categories, good or bad.

Subtasks : (Turn in your answers to the following tasks)

S.No. Task Description

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
2. What attributes do you think might be crucial in making the credit assessment? Come up
with some simple rules in plain English using your selected attributes.

3. One type of model that you can create is a Decision Tree - train a Decision Tree using the
complete dataset as the training data. Report the model obtained after training.
4. Suppose you use your above model trained on the complete dataset, and classify credit
good/bad for each of the examples in the dataset. What % of examples can you classify
correctly ? (This is also called testing on the training set) Why do you think you cannot get
100 % training accuracy ?
5. Is testing on the training set as you did above a good idea ?
Why or Why not ?
6. One approach for solving the problem encountered in the previous question is using cross-
validation ? Describe what is cross-validation briefly. Train a Decistion Tree again using
cross-validation and report your results. Does your accuracy increase/decrease ? Why ? (10
marks)
7. Check to see if the data shows a bias against "foreign workers" (attribute 20),or "personal-
status" (attribute 9). One way to do this (perhaps rather simple minded) is to remove these
attributes from the dataset and see if the decision tree created in those cases is
significantly different from the full
dataset case which you have already done. To remove an attribute you can use the
preprocess tab in Weka's GUI Explorer. Did removing these attributes have any significant
effect? Discuss.
8. Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes 2, 3,
5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations. (You had
removed two attributes in
problem 7. Remember to reload the arff data file to get all the attributes initially before you
start selecting the ones you want.)
9. Sometimes, the cost of rejecting an applicant who actually has a good credit (case 1) might
be higher than accepting an applicant who has bad credit (case 2). Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and
lower cost to the second case. You can do this by using a cost matrix in Weka. Train your
Decision Tree
again and report the Decision Tree and cross-validation results. Are they significantly
different from results obtained in problem 6 (using equal cost)?

10. Do you think it is a good idea to prefer simple decision trees instead of having long complex
decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use
Page 2
DATAMINING LAB MANUAL

Reduced Error Pruning - Explain this idea briefly. Try reduced error pruning for training your
Decision Trees using cross-validation (you can do this in Weka) and report the Decision Tree
you obtain ? Also,
report your accuracy using the pruned model. Does your accuracy increase ?
12. (Extra Credit): How can you convert a Decision Trees into "if-then-else rules". Make up your
own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also
exist different classifiers that output the model in the form of rules - one such classifier in
Weka is rules.PART, train this model and report the set of rules obtained. Sometimes just
one attribute can
be good enough in making the decision, yes, just one ! Can you predict what attribute that
might be in this dataset ? OneR classifier uses a single attribute to make decisions (it
chooses the attribute based on minimum error). Report the rule obtained by training a one
R classifier. Rank the
performance of j48, PART and oneR.

Laboratory Manual For Data Mining

EXPERIMENT-1

Aim: To list all the categorical(or nominal) attributes and the real valued attributes using Weka
mining tool.

Tools/ Apparatus: Weka mining tool..

Procedure:

1) Open the Weka GUI Chooser.

2) Select EXPLORER present in Applications.

3) Select Preprocess Tab.

4) Go to OPEN file and browse the file that is already stored in the system bank.csv.

5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.

SampleOutput:

Page 3
DATAMINING LAB MANUAL

EXPERIMENT-2

Aim: To identify the rules with some of the important attributes by a) manually and b) Using Weka .

Tools/ Apparatus: Weka mining tool..

Theory:

Association rule mining is defined as: Let be a set of n binary attributes called items. Let be a set of
transactions called the database. Each transaction in D has a unique transaction ID and contains a
subset of the items in I. A rule is defined as an implication of the form X=>Y where X,Y C I and X
Y= . The sets of items (for short itemsets) X and Y are called antecedent (left hand side or LHS) and
consequent (righthandside or RHS) of the rule respectively.

To illustrate the concepts, we use a small example from the supermarket domain.

The set of items is I = {milk,bread,butter,beer} and a small database containing the items (1 codes
presence and 0 absence of an item in a transaction) is shown in the table to the right. An example rule
for the supermarket could be meaning that if milk and bread is bought, customers also buy butter.

Note: this example is extremely small. In practical applications, a rule needs a support of several
hundred transactions before it can be considered statistically significant, and datasets often contain
thousands or millions of transactions.

To select interesting rules from the set of all possible rules, constraints on various measures of
significance and interest can be used. The bestknown constraints are minimum thresholds on support
and confidence. The support supp(X) of an itemset X is defined as the proportion of transactions in the

Page 4
DATAMINING LAB MANUAL

data set which contain the itemset. In the example database, the itemset {milk,bread} has a support of
2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).

The confidence of a rule is defined . For example, the rule has a confidence of 0.2 / 0.4 = 0.5 in the
database, which means that for 50% of the transactions containing milk and bread the rule is correct.
Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the
RHS of the rule in transactions under the condition that these transactions also contain the LHS .

ALGORITHM:

Association rule mining is to find out association rules that satisfy the predefined minimum support
and confidence from a given database. The problem is usually decomposed into two subproblems.
One is to find those itemsets whose occurrences exceed a predefined threshold in the database those
itemsets are called frequent or large itemsets. The second problem is to generate association rules
from those large itemsets with the constraints of minimal confidence.

Suppose one of the large itemsets is Lk, Lk = {I1, I2, , Ik}, association rules with this itemsets are
generated in the following way: the first rule is {I1, I2, , Ik1} and {Ik}, by checking the confidence
this rule can be determined as interesting or not. Then other rule are generated by deleting the last
items in the antecedent and inserting it to the consequent, further the confidences of the new rules are
checked to determine the interestingness of them. Those processes iterated until the antecedent
becomes empty. Since the second subproblem is quite straight forward, most of the researches focus
on the first subproblem. The Apriori algorithm finds the frequent sets L In Database D.

Find frequent set Lk 1.

Join Step.

o Ck is generated by joining Lk 1with itself

Prune Step.

o Any (k 1) itemset that is not frequent cannot be a subset of a

frequent k itemset, hence should be removed.

Where (Ck: Candidate itemset of size k)

(Lk: frequent itemset of size k)

Apriori Pseudocode

Apriori (T,)

L<{ Large 1itemsets that appear in more than transactions }

K<2

while L(k1)

Page 5
DATAMINING LAB MANUAL

C(k)<Generate( Lk 1)

for transactions t T

C(t)Subset(Ck,t)

for candidates c C(t)

count[c]<count[ c]+1

L(k)<{ c C(k)| count[c]

K<K+ 1

return L(k) k

Procedure:

1) Given the Bank database for mining.

2) Select EXPLORER in WEKA GUI Chooser.

3) Load Bank.csv in Weka by Open file in Preprocess tab.

4) Select only Nominal values.

5) Go to Associate Tab.

6) Select Apriori algorithm from Choose button present in Associator

weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1

7) Select Start button

8) now we can see the sample rules.

Sample output:

Page 6
DATAMINING LAB MANUAL

Page 7
DATAMINING LAB MANUAL

EXPERIMENT-3

Aim: To create a Decision tree by training data set using Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Theory:

Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the data. For
example, a classification model could be used to identify loan applicants as low, medium, or high
credit risks.

A classification task begins with a data set in which the class assignments are known. For example, a
classification model that predicts credit risk could be developed based on observed data for many loan
applicants over a period of time.

In addition to the historical credit rating, the data might track employment history, home ownership or
rental, years of residence, number and type of investments, and so on. Credit rating would be the
target, the other attributes would be the predictors, and the data for each customer would constitute a
case.

Classifications are discrete and do not imply order. Continuous, floating point values would indicate a
numerical, rather than a categorical, target. A predictive model with a numerical target uses a
regression algorithm, not a classification algorithm.

The simplest type of classification problem is binary classification. In binary classification, the target
attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass
targets have more than two values: for example, low, medium, high, or unknown credit rating.

In the model build (training) process, a classification algorithm finds relationships between the values
of the predictors and the values of the target. Different classification algorithms use different
techniques for finding relationships. These relationships are summarized in a model, which can then
be applied to a different data set in which the class assignments are unknown.

Classification models are tested by comparing the predicted values to known target values in a set of
test data. The historical data for a classification project is typically divided into two data sets: one for
building the model the other for testing the model.

Scoring a classification model results in class assignments and probabilities for each case. For
example, a model that classifies customers as low, medium, or high value would also predict the
probability of each classification for each customer.

Classification has many applications in customer segmentation, business modeling, marketing, credit
analysis, and biomedical and drug response modeling.

Different Classification Algorithms

Page 8
DATAMINING LAB MANUAL

Oracle Data Mining provides the following algorithms for classification:

Decision Tree

Decision trees automatically generate rules, which are conditional statements that reveal the logic
used to build the tree.

Naive Bayes

Naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the frequency
of values and combinations of values in the historical data.

Procedure:

1) Open Weka GUI Chooser.

2) Select EXPLORER present in Applications.

3) Select Preprocess Tab.

4) Go to OPEN file and browse the file that is already stored in the system bank.csv.

5) Go to Classify tab.

6) Here the c4.5 algorithm has been chosen which is entitled as j48 in Java and can be selected by
clicking the button choose

7) and select tree j48

9) Select Test options Use training set

10) if need select attribute.

11) Click Start .

12)now we can see the output details in the Classifier output.

13) right click on the result list and select visualize tree option .

Sample output:

Page 9
DATAMINING LAB MANUAL

Page 10
DATAMINING LAB MANUAL

The decision tree constructed by using the implemented C4.5 algorithm

Page 11
DATAMINING LAB MANUAL

EXPERIMENT-4

Aim: To find the percentage of examples that are classified correctly by using the above created
decision tree model? ie.. Testing on the training set.

Tools/ Apparatus: Weka mining tool..

Theory:

Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to
be an apple if it is red, round, and about 4" in diameter. Even though these features depend on the
existence of the other features, a naive Bayes classifier considers all of these properties to
independently contribute to the probability that this fruit is an apple.

An advantage of the naive Bayes classifier is that it requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification. Because
independent variables are assumed, only the variances of the variables for each class need to be
determined and not the entirecovariance matrix The naive Bayes probabilistic model :

The probability model for a classifier is a conditional model

P(C|F1 .................Fn) over a dependent class variable C with a small number of outcomes or classes,
conditional on several feature variables F1 through Fn. The problem is that if the number of features
n is large or when a feature can take on a large number of values, then basing such a model on
probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes' theorem, we write P(C|F1...............Fn)=[{p(C)p(F1..................Fn|C)}/p(F1,........Fn)]

In plain English the above equation can be written as

Posterior= [(prior *likehood)/evidence]

In practice we are only interested in the numerator of that fraction, since the denominator does not
depend on C and the values of the features Fi are given, so that the denominator is effectively
constant. The numerator is equivalent to the joint probability model p(C,F1........Fn) which can be
rewritten as follows, using repeated applications of the definition of conditional probability:

p(C,F1........Fn) =p(C) p(F1............Fn|C) =p(C)p(F1|C) p(F2.........Fn|C,F1,F2)

=p(C)p(F1|C) p(F2|C,F1)p(F3.........Fn|C,F1,F2)

= p(C)p(F1|C) p(F2|C,F1)p(F3.........Fn|C,F1,F2)......p(Fn|C,F1,F2,F3.........Fn1)

Page 12
DATAMINING LAB MANUAL

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is
conditionally independent of every other feature Fj for ji .

This means that p(Fi|C,Fj)=p(Fi|C)

and so the joint model can be expressed as p(C,F1,.......Fn)=p(C)p(F1|C)p(F2|C)...........

=p(C) p(Fi|C)

This means that under the above independence assumptions, the conditional distribution over the class
variable C can be expressed like this:

p(C|F1..........Fn)= p(C) p(Fi|C)

where Z is a scaling factor dependent only on F1.........Fn, i.e., a constant if the values of the feature
variables are known.

Models of this form are much more manageable, since they factor into a so called class prior p(C) and
independent probability distributions p(Fi|C). If there are k classes and if a model for eachp(Fi|C=c)
can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k 1) + n r
k parameters. In practice, often k = 2 (binary classification) and r = 1 (Bernoulli variables as features)
are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the
number of binary features used for prediction

P(h/D)= P(D/h) P(h) P(D)

P(h) : Prior probability of hypothesis h

P(D) : Prior probability of training data D

P(h/D) : Probability of h given D

P(D/h) : Probability of D given h

Nave Bayes Classifier : Derivation

D : Set of tuples

Each Tuple is an n dimensional attribute vector

X : (x1,x2,x3,. xn)

Let there me m Classes : C1,C2,C3Cm

NB classifier predicts X belongs to Class Ci iff

P (Ci/X) > P(Cj/X) for 1<= j <= m , j <> i

Maximum Posteriori Hypothesis

Page 13
DATAMINING LAB MANUAL

P(Ci/X) = P(X/Ci) P(Ci) / P(X)

Maximize P(X/Ci) P(Ci) as P(X) is constant

Nave Bayes Classifier : Derivation

With many attributes, it is computationally expensive to evaluate P(X/Ci)

Nave Assumption of class conditional independence

P(X/Ci) = n P( xk/ Ci)

k=1

P(X/Ci) = P(x1/Ci) * P(x2/Ci) ** P(xn/ Ci)

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) Go to Classify tab.

7) Choose Classifier Tree

8) Select NBTree i.e., Navie Baysiean tree.

9) Select Test options Use training set

10) if need select attribute.

11) now Start weka.

12)now we can see the output details in the Classifier output.

Sample output:

=== Evaluation on training set ===

=== Summary ===

Page 14
DATAMINING LAB MANUAL

Correctly Classified Instances 554 92.3333 %

Incorrectly Classified Instances 46 7.6667 %

Kappa statistic 0.845

Mean absolute error 0.1389

Root mean squared error 0.2636

Relative absolute error 27.9979 %

Root relative squared error 52.9137 %

Total Number of Instances 600

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.894 0.052 0.935 0.894 0.914 0.936 YES

0.948 0.106 0.914 0.948 0.931 0.936 NO

Weighted Avg. 0.923 0.081 0.924 0.923 0.923 0.936

=== Confusion Matrix ===

a b <-- classified as

245 29 | a = YES

17 309 | b = NO

Page 15
DATAMINING LAB MANUAL

EXPERIMENT-5

Aim: To Is testing a good idea.

Tools/ Apparatus: Weka Mining tool

Procedure:

1) In Test options, select the Supplied test set radio button

2) click Set

3) Choose the file which contains records that were not in the training set we used to create the
model.

4) click Start(WEKA will run this test data set through the model we already created. )

5) Compare the output results with that of the 4th experiment

Sample output:

This can be experienced by the different problem solutions while doing practice.

The important numbers to focus on here are the numbers next to the "Correctly Classified Instances"
(92.3 percent) and the "Incorrectly Classified Instances" (7.6 percent). Other important numbers are in
the "ROC Area" column, in the first row (the 0.936) Finally, in the "Confusion Matrix," it shows the
number of false positives and false negatives. The false positives are 29, and the false negatives are 17
in this matrix.

Based on our accuracy rate of 92.3 percent, we say that upon initial analysis, this is a good model.

One final step to validating our classification tree, which is to run our test set through the model and
ensure that accuracy of the model

Comparing the "Correctly Classified Instances" from this test set with the "Correctly Classified
Instances" from the training set, we see the accuracy of the model , which indicates that the model
will not break down with unknown data, or when future data is applied to it.

EXPERIMENT-6

Aim: To create a Decision tree by cross validation training data set using Weka mining tool.

Page 16
DATAMINING LAB MANUAL

Tools/ Apparatus: Weka mining tool..

Theory:

Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive
model which maps observations about an item to conclusions about the item's target value In these
tree structures, leaves represent classifications and branches represent conjunctions of features that
lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree describes data but not
decisions rather the resulting classification tree can be an input for decision making. This page deals
with decision trees in data mining.

Decision tree learning is a common method used in data mining. The goal is to create a model that
predicts the value of a target variable based on several input variables. Each interior node corresponds
to one of the input variables there are edges to children for each of the possible values of that input
variable. Each leaf represents a value of the target variable given the values of the input variables
represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This
process is repeated on each derived subset in a recursive manner called recursive partitioning. The
recursion is completed when the subset at a node all has the same value of the target variable, or when
splitting no longer adds value to the predictions.

In data mining, trees can be described also as the combination of mathematical and computational
techniques to aid the description, categorisation and generalization of a given set of data.

Data comes in records of the form:

(x, y) = (x1, x2, x3..., xk, y)

The dependent variable, Y, is the target variable that we are trying to understand, classify or
generalise. The vector x is comprised of the input variables, x1, x2, x3 etc., that are used for that task.

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) Go to Classify tab.

7) Choose Classifier Tree

Page 17
DATAMINING LAB MANUAL

8) Select J48

9) Select Test options Cross-validation.

10) Set Folds Ex:10

11) if need select attribute.

12) now Start weka.

13)now we can see the output details in the Classifier output.

14)Compare the output results with that of the 4th experiment

15) check whether the accuracy increased or decreased?

Sample output:

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 539 89.8333 %

Page 18
DATAMINING LAB MANUAL

Incorrectly Classified Instances 61 10.1667 %

Kappa statistic 0.7942

Mean absolute error 0.167

Root mean squared error 0.305

Relative absolute error 33.6511 %

Root relative squared error 61.2344 %

Total Number of Instances 600

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.861 0.071 0.911 0.861 0.886 0.883 YES

0.929 0.139 0.889 0.929 0.909 0.883 NO

Weighted Avg. 0.898 0.108 0.899 0.898 0.898 0.883

=== Confusion Matrix ===

a b <-- classified as

236 38 | a = YES

23 303 | b = NO

Page 19
DATAMINING LAB MANUAL

EXPERIMENT-7

Aim: Delete one attribute from GUI Explorer and see the effect using Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with list
available filters.

7) Select weka.filters.unsupervised.attribute.Remove

8) Next, click on text box immediately to the right of the "Choose" button

9) In the resulting dialog box enter the index of the attribute to be filtered out (Make sure that the
"invertSelection" option is set to false )

10) Then click "OK" . Now, in the filter box you will see "Remove -R 1"

11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and
create a new working relation

12) To save the new working relation as an ARFF file, click on save button in the top panel.

13) Go to OPEN file and browse the file that is newly saved (attribute deleted file)

14) Go to Classify tab.

15) Choose Classifier Tree

16) Select j48 tree

17) Select Test options Use training set

18) if need select attribute.

19) now Start weka.

20)now we can see the output details in the Classifier output.

21) right click on the result list and select visualize tree option .
Page 20
DATAMINING LAB MANUAL

22) Compare the output results with that of the 4th experiment

23) check whether the accuracy increased or decreased?

24)check whether removing these attributes have any significant effect.

Sample output:

Page 21
DATAMINING LAB MANUAL

Page 22
DATAMINING LAB MANUAL

EXPERIMENT-8

Aim: Select some attributes from GUI Explorer and perform classification and see the effect using
Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) select some of the attributes from attributes list which are to be removed. With this step only the
attributes necessary for classification are left in the attributes panel.

7) The go to Classify tab.

8) Choose Classifier Tree

9) Select j48

10) Select Test options Use training set

11) if need select attribute.

12) now Start weka.

13)now we can see the output details in the Classifier output.

14) right click on the result list and select visualize tree option .

15)Compare the output results with that of the 4th experiment

16) check whether the accuracy increased or decreased?

17)check whether removing these attributes have any significant effect.

Sample output:

Page 23
DATAMINING LAB MANUAL

Page 24
DATAMINING LAB MANUAL

EXPERIMENT-9

Aim: To create a Decision tree by cross validation training data set by changing the cost matrix in
Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) Go to Classify tab.

7) Choose Classifier Tree

8) Select j48

9) Select Test options Training set.

10)Click on more options.

11)Select cost sensitive evaluation and click on set button

12)Set the matrix values and click on resize. Then close the window.

13)Click Ok

14)Click start.

15) we can see the output details in the Classifier output

16) Select Test options Cross-validation.

17) Set Folds Ex:10

18) if need select attribute.

19) now Start weka.

20)now we can see the output details in the Classifier output.

21)Compare results of 15th and 20th steps.

Page 25
DATAMINING LAB MANUAL

22)Compare the results with that of experiment 6.

Sample output:

Page 26
DATAMINING LAB MANUAL

EXPERIMENT-10

Aim: Is small rule better or long rule check the bias,by training data set using Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Procedure:

When we consider long complex decision trees, we will have many unnecessary attributes in the tree
which results in increase of the bias of the model. Because of this, the accuracy of the model can also
effect.

This problem can be reduced by considering simple decision tree. The attributes will be less and it
decreases the bias of the model. Due to this the result will be more accurate.

So it is a good idea to prefer simple decision trees instead of long complex trees.

1. Open any existing ARFF file.

2. In preprocess tab, select ALL to select all the attributes.

Page 27
DATAMINING LAB MANUAL

3. Go to classify tab and then use traning set with J48 algorithm.

4. To generate the decision tree, right click on the result list and select visualize tree option, by which
the decision tree will be generated.

5. Right click on J48 algorithm to get Generic Object Editor window.

6. In this, make the unpruned option as true .

7. Then press OK and then start. we find the tree will become more complex if not pruned.

8. The tree has become more complex.

EXPERIMENT-11

Aim: To create a Decision tree by using Prune mode and Reduced error Pruning and show accuracy
for cross validation trained data set using Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Theory :

Reduced-error pruning

Each node of the (over-fit) tree is examined for pruning

Page 28
DATAMINING LAB MANUAL

A node is pruned (removed) only if the resulting pruned tree

performs no worse than the original over the validation set

Pruning a node consists of

Removing the sub-tree rooted at the pruned node

Making the pruned node a leaf node

Assigning the pruned node the most common classification of the training instances attached to that
node

Pruning nodes iteratively

Always select a node whose removal most increases the DT accuracy over the validation set

Stop when further pruning decreases the DT accuracy over the validation set

IF (Children=yes) (income=>30000)

THEN (car=Yes)

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) select some of the attributes from attributes list

7) Go to Classify tab.

8) Choose Classifier Tree

9) Select NBTree i.e., Navie Baysiean tree.

10) Select Test options Use training set

11) right click on the text box besides choose button ,select show properties

12) now change unprone mode false to true.

13) change the reduced error pruning % as needed.

14) if need select attribute.

Page 29
DATAMINING LAB MANUAL

15) now Start weka.

16)now we can see the output details in the Classifier output.

17) right click on the result list and select visualize tree option .

Sample output:

EXPERIMENT-12

Page 30
DATAMINING LAB MANUAL

Aim: To compare OneR classifier which uses single attribute and rule with J48 and PART classifiers,
by training data set using Weka mining tool.

Tools/ Apparatus: Weka mining tool..

Procedure:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) select some of the attributes from attributes list

7) Go to Classify tab.

8) Choose Classifier TreesRules

9) Select J48 .

10) Select Test options Use training set

11) if need select attribute.

12) now Start weka.

13)now we can see the output details in the Classifier output.

14) right click on the result list and select visualize tree option .

(or)

java weka.classifiers.trees.J48 -t c:\temp\bank.arff

Procedure for OneR:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

Page 31
DATAMINING LAB MANUAL

6) select some of the attributes from attributes list

7) Go to Classify tab.

8) Choose Classifier Rules

9) Select OneR .

10) Select Test options Use training set

11) if need select attribute.

12) now Start weka.

13)now we can see the output details in the Classifier output.

Procedure for PART:

1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.

3) Select EXPLORER present in Applications.

4) Select Preprocess Tab.

5) Go to OPEN file and browse the file that is already stored in the system bank.csv.

6) select some of the attributes from attributes list

7) Go to Classify tab.

8) Choose Classifier Rules

9) Select PART .

10) Select Test options Use training set

11) if need select attribute.

12) now Start weka.

13)now we can see the output details in the Classifier output.

Attribute relevance with respect to the class relevant attribute (science)

IF accounting=1 THEN class=A (Error=0, Coverage = 7 instance)

Page 32
DATAMINING LAB MANUAL

IF accounting=0 THEN class=B (Error=4/13, Coverage = 13 instances)

Sample output:

J48

java weka.classifiers.trees.J48 -t c:/temp/bank.arff

One R

Page 33
DATAMINING LAB MANUAL

PART

Page 34

U1 - Data Mining Task Primitives
No ratings yet
U1 - Data Mining Task Primitives
4 pages
Unit 4 Data Science
No ratings yet
Unit 4 Data Science
21 pages
Data Science M-1 Notes
No ratings yet
Data Science M-1 Notes
34 pages
Management Project Report - Business Strategies For Sustainable Development in Various Industries in India
No ratings yet
Management Project Report - Business Strategies For Sustainable Development in Various Industries in India
32 pages
N Mines and Minerals Development and Regulation Act 1957 Vanisharma8814703822 Maimsacin 20250211 210939 1 109
No ratings yet
N Mines and Minerals Development and Regulation Act 1957 Vanisharma8814703822 Maimsacin 20250211 210939 1 109
109 pages
Data Mining Lab Manual
33% (3)
Data Mining Lab Manual
44 pages
Abstract Nouns Worksheet 3 PDF
100% (2)
Abstract Nouns Worksheet 3 PDF
1 page
Abstract Nouns Worksheet 3 PDF
100% (2)
Abstract Nouns Worksheet 3 PDF
1 page
2019 Annual Report Report of The Ghana Chamber of Mines
100% (1)
2019 Annual Report Report of The Ghana Chamber of Mines
156 pages
Contemporary World
33% (3)
Contemporary World
39 pages
Unit-3: Non-Linear Data Structure
No ratings yet
Unit-3: Non-Linear Data Structure
23 pages
Acc Chapter 2 Notes
100% (1)
Acc Chapter 2 Notes
15 pages
Unit II Visualizing Using Matplotlib
No ratings yet
Unit II Visualizing Using Matplotlib
24 pages
FDS Lab Manual
No ratings yet
FDS Lab Manual
48 pages
Proceedings Rev A English - Print Res
100% (2)
Proceedings Rev A English - Print Res
130 pages
Comprehensive Company Databse
No ratings yet
Comprehensive Company Databse
62 pages
CBSE Class 5 GK Practice Worksheet (6) Word Puzzle
No ratings yet
CBSE Class 5 GK Practice Worksheet (6) Word Puzzle
3 pages
Tableau Lab Manual
No ratings yet
Tableau Lab Manual
6 pages
Day 5 Supervised Technique-Decision Tree For Classification PDF
100% (1)
Day 5 Supervised Technique-Decision Tree For Classification PDF
58 pages
ML Unit4
No ratings yet
ML Unit4
41 pages
Mines Act 1952
No ratings yet
Mines Act 1952
38 pages
Dev Answer Key
100% (1)
Dev Answer Key
17 pages
Kaltim Prima Coal
100% (3)
Kaltim Prima Coal
41 pages
Unit I - Data Science
No ratings yet
Unit I - Data Science
161 pages
Unit V
No ratings yet
Unit V
67 pages
Topics For Exemplification Essay
100% (3)
Topics For Exemplification Essay
3 pages
Nptel Swayam DWDM Slides
No ratings yet
Nptel Swayam DWDM Slides
406 pages
ML Unit 2
No ratings yet
ML Unit 2
25 pages
Naïve Bayes Classifier Algorithm
No ratings yet
Naïve Bayes Classifier Algorithm
10 pages
Data Science PPT PD41
100% (1)
Data Science PPT PD41
8 pages
A Comparison of BHP Billiton Mineral Escondida Flotation Concentrators
100% (1)
A Comparison of BHP Billiton Mineral Escondida Flotation Concentrators
22 pages
Eimco Elecon Initiating Coverage 04072016
No ratings yet
Eimco Elecon Initiating Coverage 04072016
19 pages
Lecture 01 05.08.2024 AI-ML Introduction
No ratings yet
Lecture 01 05.08.2024 AI-ML Introduction
46 pages
MC4411 Project Work - Format
No ratings yet
MC4411 Project Work - Format
65 pages
Prof. Chandan Singhavi
No ratings yet
Prof. Chandan Singhavi
86 pages
Weka Lab Record Experiments
No ratings yet
Weka Lab Record Experiments
21 pages
FDS Unit 1
No ratings yet
FDS Unit 1
21 pages
PT PPA - Materi Webinar Nasional APKPI Series 7
No ratings yet
PT PPA - Materi Webinar Nasional APKPI Series 7
13 pages
Numpy - Tutorial - Ipynb - Colaboratory
No ratings yet
Numpy - Tutorial - Ipynb - Colaboratory
9 pages
Minerp Planning 1 2
No ratings yet
Minerp Planning 1 2
4 pages
Training Project (Repaired)
No ratings yet
Training Project (Repaired)
83 pages
Simile PDF
No ratings yet
Simile PDF
13 pages
Chandigarh Group of Colleges College of Engineering Landran, Mohali
No ratings yet
Chandigarh Group of Colleges College of Engineering Landran, Mohali
47 pages
Practical 5: Introduction To Weka For Classfication
100% (1)
Practical 5: Introduction To Weka For Classfication
4 pages
The Classic TF-IDF Vector Space Model
No ratings yet
The Classic TF-IDF Vector Space Model
15 pages
R Language
No ratings yet
R Language
59 pages
Unit 5 - Data Mining - WWW - Rgpvnotes.in
No ratings yet
Unit 5 - Data Mining - WWW - Rgpvnotes.in
15 pages
Mathematics For Machine Learning-I
No ratings yet
Mathematics For Machine Learning-I
10 pages
Overview of Parallel Coordinates, Visualizing Neural Network and Visualization of Trees
No ratings yet
Overview of Parallel Coordinates, Visualizing Neural Network and Visualization of Trees
9 pages
Facets of Data
No ratings yet
Facets of Data
6 pages
Lecture Notes: Introduction To Data Science and Big Data
No ratings yet
Lecture Notes: Introduction To Data Science and Big Data
5 pages
Seminar Report Machine Learning
No ratings yet
Seminar Report Machine Learning
20 pages
Victory Nickel Minago Project Feasibility Study 2009 Executive Summary PDF
No ratings yet
Victory Nickel Minago Project Feasibility Study 2009 Executive Summary PDF
40 pages
Data Warehousing and Mining
No ratings yet
Data Warehousing and Mining
2 pages
Data Mining Unit 1
No ratings yet
Data Mining Unit 1
91 pages
DWDM LAB Final Manualtest
No ratings yet
DWDM LAB Final Manualtest
134 pages
All About Vitamins & Minerals - Precision Nutrition PDF
No ratings yet
All About Vitamins & Minerals - Precision Nutrition PDF
31 pages
02 DataCategorization
No ratings yet
02 DataCategorization
41 pages
Mininet Python API Reference Manual: Generated by Doxygen 1.8.3.1
No ratings yet
Mininet Python API Reference Manual: Generated by Doxygen 1.8.3.1
75 pages
Deep Learning and CNNFYTGS5101-Guoyangxie
No ratings yet
Deep Learning and CNNFYTGS5101-Guoyangxie
42 pages
Visualization Errors
No ratings yet
Visualization Errors
34 pages
Data-Mining-Lab-Manual Cs 703b
No ratings yet
Data-Mining-Lab-Manual Cs 703b
41 pages
3.1 What Is Data Warehouse?: Unit Iii
No ratings yet
3.1 What Is Data Warehouse?: Unit Iii
33 pages
Data Warehousing Full
No ratings yet
Data Warehousing Full
41 pages
Vale Day 2023
No ratings yet
Vale Day 2023
95 pages
Lecture 4 Data Structure Linked List
No ratings yet
Lecture 4 Data Structure Linked List
30 pages
CH 6
No ratings yet
CH 6
72 pages
Data Mining and Business Intelligence Lab Manual
No ratings yet
Data Mining and Business Intelligence Lab Manual
52 pages
Pds Fall2003
No ratings yet
Pds Fall2003
52 pages
Unit - 3
No ratings yet
Unit - 3
42 pages
Introducing China S Dense Medium
No ratings yet
Introducing China S Dense Medium
9 pages
A Survey of Network-Based Intrusion Detection Data Sets
No ratings yet
A Survey of Network-Based Intrusion Detection Data Sets
17 pages
ML Unit Ii
No ratings yet
ML Unit Ii
30 pages
Minetruck MT54 - Epiroc
No ratings yet
Minetruck MT54 - Epiroc
4 pages
Machine Learning Guide Line
No ratings yet
Machine Learning Guide Line
10 pages
Appendix Weka
No ratings yet
Appendix Weka
17 pages
Design and Implementation of Power Estimation Technique For Digital Circuits IJERTV3IS041503
No ratings yet
Design and Implementation of Power Estimation Technique For Digital Circuits IJERTV3IS041503
10 pages
2013 11 13 Inca One Presentation
No ratings yet
2013 11 13 Inca One Presentation
17 pages
Figure of Speech
No ratings yet
Figure of Speech
18 pages
DF Date Du Cal Who Whoami WC Head Tail BC: LAB: 2 Explore The Commands The Given Below Basic Commands
No ratings yet
DF Date Du Cal Who Whoami WC Head Tail BC: LAB: 2 Explore The Commands The Given Below Basic Commands
9 pages
Machine Learning Assignment
No ratings yet
Machine Learning Assignment
5 pages
Data Preprocessing in Python - Handling Missing Data
No ratings yet
Data Preprocessing in Python - Handling Missing Data
8 pages
The Secret Life of Us, Writing History From Below in The Miners' Strike 1984 - 85
No ratings yet
The Secret Life of Us, Writing History From Below in The Miners' Strike 1984 - 85
23 pages
Coal Seam Modelling and Mine Plalnning Using Results of A 3D Seis
No ratings yet
Coal Seam Modelling and Mine Plalnning Using Results of A 3D Seis
16 pages
Mc9280 Data Mining and Data Warehousing
No ratings yet
Mc9280 Data Mining and Data Warehousing
1 page
Ali Tamaddoni Jahromi, Mehrad Moeini, Issar Akbari, Aram Akbarzadeh
No ratings yet
Ali Tamaddoni Jahromi, Mehrad Moeini, Issar Akbari, Aram Akbarzadeh
11 pages
Pattern Recognition
No ratings yet
Pattern Recognition
3 pages
Self Assessment Tool Risk Control
No ratings yet
Self Assessment Tool Risk Control
17 pages
Development of Road Header Roof Bolting Module
No ratings yet
Development of Road Header Roof Bolting Module
7 pages
Taonga Simari MRM
No ratings yet
Taonga Simari MRM
7 pages
Geochemical Analysis of Iron Ore - SGS
No ratings yet
Geochemical Analysis of Iron Ore - SGS
2 pages
Market Basket Analysis and Advanced Data Mining: Professor Amit Basu
No ratings yet
Market Basket Analysis and Advanced Data Mining: Professor Amit Basu
24 pages
Weka Tutorial
No ratings yet
Weka Tutorial
2 pages
High-Pressure Grinding 2MB PDF
No ratings yet
High-Pressure Grinding 2MB PDF
0 pages
Artikel 1303
No ratings yet
Artikel 1303
4 pages
CBSE Class 5 GK Practice Worksheet
No ratings yet
CBSE Class 5 GK Practice Worksheet
1 page
CBSE Class 5 GK Practice Worksheet
No ratings yet
CBSE Class 5 GK Practice Worksheet
1 page
Rock Phosphate
No ratings yet
Rock Phosphate
9 pages
Lesson Plan: Data Warehousing and Data Mining
No ratings yet
Lesson Plan: Data Warehousing and Data Mining
1 page
Optimizing Hadoop for MapReduce
From Everand
Optimizing Hadoop for MapReduce
Khaled Tannir
No ratings yet
Building Websites with VB.NET and DotNetNuke 4
From Everand
Building Websites with VB.NET and DotNetNuke 4
Daniel N. Egan
1/5 (1)
TIBCO Software The Ultimate Step-By-Step Guide
From Everand
TIBCO Software The Ultimate Step-By-Step Guide
Gerardus Blokdyk
No ratings yet