data mining file
So now we are going to create the 3 dimension tables and 1 fact table in the data warehouse:
DimDate, DimCustomer, DimVan and FactHire. We are going to populate the 3 dimensions
but we’ll leave the fact table empty. The purpose of this article is to show how to populate the
fact table using SSIS.
Date Dimension:-
Customer Dimension:
Van Dimension:
This is the script to create and populate those dimension and fact tables:
use TopHireDW
go
HireBase.dbo.Customer
7. Make the necessary changes to the settings as per your requirements and click Next. Full and Associate files are the recommended settings.
10. The installation will start; wait a while and it will finish within a minute.
A. Explore various options in Weka for preprocessing data and apply filters (like the Discretization filter, Resample filter, etc.) on each dataset.
Ans:
Preprocess Tab
1. Loading Data
The first four buttons at the top of the preprocess section enable you to load data
into WEKA:
1. Open file... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB... Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)
1. Relation. The name of the relation, as given in the file it was loaded from. Filters
(described below) modify the name of a relation.
Below the Current relation box is a box titled Attributes. There are four buttons,
and beneath them is a list of the attributes in the current relation.
1. No. A number that identifies the attribute in the order in which they are specified in the data file.
2. Selection tick boxes. These allow you to select which attributes are present in the relation.
3. Name. The name of the attribute, as it was declared in the data file. When you click on
different rows in the list of attributes, the fields change in the box to the right titled
Selected attribute.
This box displays the characteristics of the currently highlighted attribute in the list:
1. Name. The name of the attribute, the same as that given in the attribute list.
3. Missing. The number (and percentage) of instances in the data for which this attribute
is missing (unspecified).
4. Distinct. The number of different values that the data contains for this attribute.
5. Unique. The number (and percentage) of instances in the data having a value for this
attribute that no other instances have.
Below these statistics is a list showing more information about the values stored in this
attribute, which differ depending on its type. If the attribute is nominal, the list consists
of each possible value for the attribute along with the number of instances that have
that value. If the attribute is numeric, the list gives four statistics describing the
distribution of values in the data— the minimum, maximum, mean and standard
deviation. And below these statistics there is a coloured histogram, colour-coded
according to the attribute chosen as the Class using the box above the histogram. (This
box will bring up a drop-down list of available selections when clicked.) Note that
only nominal Class attributes will result in a colour-coding. Finally, after pressing the
Visualize All button, histograms for all the attributes in the data are shown in a
separate window.
Returning to the attribute list, to begin with all the tick boxes are unticked.
They can be toggled on/off by clicking on them individually. The four buttons
above can also be used to change the selection:
Once the desired attributes have been selected, they can be removed by clicking the
Remove button below the list of attributes. Note that this can be undone by clicking the
Undo button, which is located next to the Edit button in the top-right corner of the
Preprocess panel.
Working with Filters:-
The preprocess section allows filters to be defined that transform the data in
various ways. The Filter box is used to set up the filters that are required. At the
left of the Filter box is a Choose button. By clicking this button it is possible to
select one of the filters in WEKA. Once a filter has been selected, its name and
options are shown in the field next to the Choose button.
Clicking on this field with the left mouse button brings up a GenericObjectEditor dialog box (see below). The fields in the window reflect the available options.
Right-clicking (or Alt+Shift+Left-Click) on such a field will bring up a popup menu, listing
the following options:
1. Show properties... has the same effect as left-clicking on the field, i.e., a
dialog appears allowing you to alter the settings.
A. Load each dataset into Weka and run the Apriori algorithm with different support and confidence values. Study the rules generated.
Ans:
Steps to run the Apriori algorithm in WEKA:
Attributes: temperature, humidity, windy, play

=== Associator model (full training set) ===

Apriori
=======
Association Rule:-
An association rule has two parts, an antecedent (if) and a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found in
combination with the antecedent.
Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support indicates how frequently the items appear in the database. Confidence indicates the number of times the if/then statements have been found to be true.
In data mining, association rules are useful for analyzing and predicting customer
behavior. They play an important part in shopping basket data analysis, product
clustering, catalog design and store layout.
confidence(X ==> Y) = (X ∪ Y).count / X.count
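Both measures can be computed directly from a transaction database. Here is a minimal sketch in Python, using a hypothetical toy database (the item names are made up purely for illustration):

```python
# Toy transaction database: each transaction is a set of items (hypothetical).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(X ∪ Y) / support(X) for the rule X ==> Y."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"bread", "milk"}, transactions))           # 0.5
print(round(confidence({"bread"}, {"milk"}, transactions), 3))  # 0.667
```

A rule such as "bread ==> milk" here has support 0.5 (two of four transactions contain both) and confidence 2/3 (of the three transactions containing bread, two also contain milk).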
B. Apply different discretization filters on numerical attributes and run the Apriori association rule algorithm. Study the rules generated, derive interesting insights, and observe the effect of discretization on the rule generation process.
rules found:
1. outlook=overcast 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
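The level-wise search behind such rules can be sketched as follows. This is a simplified illustration of the Apriori idea, not WEKA's implementation (it omits full Apriori's subset-based candidate pruning); the attribute=value items mirror the weather data above:

```python
# Weather-style toy data: each instance is a set of attribute=value items.
data = [
    {"outlook=overcast", "play=yes"},
    {"outlook=overcast", "play=yes"},
    {"outlook=sunny", "play=no"},
    {"outlook=sunny", "play=no"},
]

def frequent_itemsets(db, min_support):
    """Level-wise Apriori sketch: grow frequent itemsets one item at a time."""
    freq = {}
    k = 1
    current = list({frozenset([i]) for t in db for i in t})
    while current:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(c <= t for t in db) / len(db) for c in current}
        survivors = {c: s for c, s in counts.items() if s >= min_support}
        freq.update(survivors)
        # Candidate generation: union pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        current = list({a | b for a in survivors for b in survivors if len(a | b) == k})
    return freq

print(len(frequent_itemsets(data, min_support=0.5)))  # 6
```

With min_support = 0.5 this finds four frequent single items plus the two pairs {outlook=overcast, play=yes} and {outlook=sunny, play=no}, which is exactly the raw material from which rules like rule 1 above are then built.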
Practical 3:- Demonstrate performing classification on data sets.
Classification Tab
Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that
gives the name of the currently selected classifier, and its options. Clicking on the text
box with the left mouse button brings up a GenericObjectEditor dialog box, just the same
as for filters, that you can use to configure the options of the current classifier.
Test Options
The result of applying the chosen classifier will be tested according to the options
that are set by clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the
instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set
of instances loaded from a file. Clicking the Set... button brings up a dialog allowing
you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of
folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held out depends
on the value entered in the % field.
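For illustration, the percentage split and cross-validation test modes can be mimicked with plain index bookkeeping. This is a sketch of the idea, not WEKA's code; the seed parameter simply mirrors the random seed option described below:

```python
import random

def percentage_split(n, train_pct, seed=1):
    """Shuffle instance indices and hold out (100 - train_pct)% for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_pct / 100)
    return idx[:cut], idx[cut:]

def cross_validation_folds(n, folds, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for f in range(folds):
        test = idx[f::folds]            # every folds-th index forms one fold
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

train, test = percentage_split(150, train_pct=66)
print(len(train), len(test))  # 99 51
```

Each instance lands in exactly one test fold across the k folds, which is why cross-validation uses every instance for testing exactly once.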
1. Output model. The classification model on the full training set is output so that
it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class
are output. This option is also selected by default.
Note that in the case of a cross-validation the instance numbers do not correspond to the
location in the data!
9. Random seed for xval / % Split. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.
10. Preserve order for % Split. This suppresses the randomization of the data before
splitting into train and test set.
11. Output source code. If the classifier can output the built model as Java source
code, you can specify the class name here. The code will be printed in the “Classifier
output” area.
Training a Classifier
Once the classifier, test options and class have all been set, the learning process is started by clicking
on the Start button. While the classifier is busy being trained, the little bird moves around. You can
stop the training process at any time by clicking on the Stop button. When training is complete,
several things happen.
A. Load each dataset into Weka and run the ID3 and J48 classification algorithms, and study the classifier output. Compute entropy values and the Kappa statistic.
Ans:
Output:
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
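The entropy values and Kappa statistic asked for in this task can be computed from the class counts and the confusion matrix above. A sketch using the standard formulas (this is a hand computation, not WEKA output):

```python
import math

# Confusion matrix from the J48 run on iris (rows = actual, cols = predicted).
cm = [[50, 0, 0],
      [0, 49, 1],
      [0, 2, 48]]

def class_entropy(matrix):
    """Entropy (in bits) of the actual class distribution."""
    totals = [sum(row) for row in matrix]
    n = sum(totals)
    return -sum(t / n * math.log2(t / n) for t in totals if t)

def kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = sum(map(sum, matrix))
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    expected = sum(sum(matrix[i]) * sum(r[i] for r in matrix)
                   for i in range(len(matrix))) / (n * n)
    return (observed - expected) / (1 - expected)

print(round(class_entropy(cm), 4))  # 1.585
print(round(kappa(cm), 4))          # 0.97
```

The class entropy is log2(3) ≈ 1.585 bits because the three iris classes are equally likely (50 instances each); kappa of 0.97 confirms the near-perfect agreement visible in the matrix.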
The text in the Classifier output area has scroll bars allowing you to browse the results. Clicking with the left mouse button into the text area, while holding Alt and Shift, brings up a dialog that enables you to save the displayed output in a variety of formats (currently, BMP, EPS, JPEG and PNG). Of course, you can also resize the Explorer window to get a larger display area.
The output is split into several sections:
2. Classifier model (full training set). A textual representation of the classification model
that was produced on the full training data.
3. The results of the chosen test mode are broken down thus.
4. Summary. A list of statistics summarizing how accurately the classifier was able to
predict the true class of the instances under the chosen test mode.
6. Confusion Matrix. Shows how many instances have been assigned to each class.
Elements show the number of test examples whose actual class is the row and whose
predicted class is the column.
7. Source code (optional). This section lists the Java source code if one chose “Output source code” in the “More options” dialog.
B. Extract if-then rules from the decision tree generated by the classifier. Observe the confusion matrix and derive Accuracy, F-measure, TP rate, FP rate, Precision and Recall values. Apply a cross-validation strategy with various fold levels and compare the accuracy results.
Ans:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows −
IF-THEN Rules:
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following from −
IF condition THEN conclusion
Let us consider a rule R1,
Points to remember −
The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
One rule is created for each path from the root to the leaf node.
The leaf node holds the class prediction, forming the rule consequent.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the rest of the tuples. This is because the path to each leaf in a decision tree corresponds to a rule.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.

repeat
    Rule = Learn_One_Rule(D, Att_vals, c);
    remove tuples covered by Rule from D;
until termination condition;
The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples. FOIL is one of the simplest and most effective methods for rule pruning. For a given rule R,
Note − This value will increase with the accuracy of R on the pruning set.
Hence, if the FOIL_Prune value is higher for the pruned version of R, then we
prune R.
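The FOIL_Prune measure referred to above is usually defined (in standard data mining texts) as FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg are the numbers of positive and negative tuples covered by R on the pruning set. A sketch with hypothetical coverage counts:

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), evaluated on the pruning set,
    where pos/neg count the covered tuples R classifies correctly/incorrectly."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: pruning drops one positive but avoids six negatives.
original = foil_prune(pos=45, neg=10)
pruned = foil_prune(pos=44, neg=4)
print(pruned > original)  # True -> keep the pruned version of R
```

This matches the note above: FOIL_Prune rises with the rule's accuracy on the pruning set, so the pruned rule is kept whenever its value exceeds that of the original.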
Number of training instances: 150
Number of Rules: 3
Non matches covered by majority class.

Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 12
Merit of best subset found: 96
Evaluation (for feature selection): CV (leave one out)
Feature set: 4,5
C. Load each dataset into Weka and perform Naïve Bayes classification and k-Nearest Neighbour classification. Interpret the results obtained.
Ans:
Steps to run the Naïve Bayes and k-nearest neighbour classification algorithms in WEKA:
Scheme:       weka.classifiers.bayes.NaiveBayes
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data
                 Class
Attribute        Iris-setosa  Iris-versicolor  Iris-virginica
                   (0.33)        (0.33)           (0.33)
===============================================================
sepallength
mean 4.9913 5.9379 6.5795
std. dev. 0.355 0.5042 0.6353
weight sum 50 50 50
precision 0.1059 0.1059 0.1059
sepalwidth
mean 3.4015 2.7687 2.9629
std. dev. 0.3925 0.3038 0.3088
weight sum 50 50 50
precision 0.1091 0.1091 0.1091
petallength
  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
Output: KNN (IBK)
Scheme:       weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica
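IBk's perfect 150/150 result on the training set is expected: with k = 1, each training instance is its own nearest neighbour. A minimal sketch of what IBk computes (Euclidean distance, majority vote), using made-up measurements rather than the real iris data:

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs; Euclidean distance
    mirrors the LinearNNSearch/EuclideanDistance setup in the IBk output above."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical, simplified iris-like (petallength, petalwidth) measurements.
train = [((1.4, 0.2), "setosa"), ((1.5, 0.2), "setosa"),
         ((4.5, 1.5), "versicolor"), ((5.6, 2.1), "virginica")]
print(knn_predict(train, (1.3, 0.3), k=1))  # setosa
```

Evaluating such a classifier on its own training data with k = 1 gives every instance distance zero to itself, which explains the diagonal confusion matrix; a cross-validated estimate would be more informative.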
Ans:

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
Naïve Bayes:
Scheme:       weka.classifiers.bayes.NaiveBayes
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
petallength
mean 1.4694 4.2452 5.5516
std. dev. 0.1782 0.4712 0.5529
weight sum 50 50 50
precision 0.1405 0.1405 0.1405
petalwidth
mean 0.2743 1.3097 2.0343
std. dev. 0.1096 0.1915 0.2646
weight sum 50 50 50
precision 0.1143 0.1143 0.1143
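The per-class mean and standard deviation that NaiveBayes reports for each numeric attribute, and the Gaussian likelihood built from them, can be sketched as follows (this sketch uses the population standard deviation; WEKA's exact estimator may differ slightly):

```python
import math
from collections import defaultdict

def per_class_stats(values, labels):
    """Mean and standard deviation of one numeric attribute, per class."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    stats = {}
    for c, vs in groups.items():
        mean = sum(vs) / len(vs)
        var = sum((v - mean) ** 2 for v in vs) / len(vs)
        stats[c] = (mean, math.sqrt(var))
    return stats

def gaussian_likelihood(x, mean, std):
    """Normal density naive Bayes uses for numeric attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical petalwidth values for two instances of each class.
stats = per_class_stats([0.2, 0.2, 1.3, 1.4, 2.0, 2.1],
                        ["setosa", "setosa", "versicolor", "versicolor",
                         "virginica", "virginica"])
print(round(stats["versicolor"][0], 2))  # 1.35
```

At prediction time, naive Bayes multiplies one such likelihood per attribute by the class prior (the 0.33 values shown above) and picks the class with the largest product.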
=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica
Practical 4:- Demonstrate performing clustering on data sets.
Clustering Tab
Selecting a Clusterer
By now you will be familiar with the process of selecting and configuring objects.
Clicking on the clustering scheme listed in the Clusterer box at the top of the window
brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.
Cluster Modes
The Cluster mode box is used to choose what to cluster and how to evaluate
the results. The first three options are the same as for classification: Use training set,
Supplied test set and Percentage split (Section 5.3.1)—except that now the data is
assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes
to clusters evaluation, compares how well the chosen clusters match up with a pre-
assigned class in the data. The drop-down box below this option selects the class, just as
in the Classify panel.
An additional option in the Cluster mode box, the Store clusters for visualization
tick box, determines whether or not it will be possible to visualize the clusters once
training is complete. When dealing with datasets that are so large that memory becomes a
problem it may be helpful to disable this option.
Ignoring Attributes
Often, some attributes in the data should be ignored when clustering. The Ignore
attributes button brings up a small window that allows you to select which attributes are
ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT
key selects a range of consecutive attributes, and holding down CTRL toggles individual
attributes on and off. To cancel the selection, back out with the Cancel button. To activate
it, click the Select button. The next time clustering is invoked, the selected attributes are
ignored.
Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text
area and a result list. These all behave just like their classification counterparts. Right-
clicking an entry in the result list brings up a similar menu, except that it shows only two
visualization options: Visualize cluster assignments and Visualize tree. The latter is
grayed out when it is not applicable.
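Internally, a simple k-means clusterer alternates between assigning instances to the nearest centroid and moving each centroid to the mean of its cluster, then reports the within-cluster sum of squared errors. A minimal sketch (not WEKA's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means on tuples of numbers: assign each point to its nearest
    centroid, recompute centroids, and return (centroids, sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Move each centroid to the mean of its cluster (keep it if cluster is empty).
        centroids = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points)
    return centroids, sse

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, sse = kmeans(points, k=2)
print(sorted(centroids), sse)  # [(0.0, 0.5), (10.0, 10.5)] 1.0
```

Rerunning with different values of k shows the trade-off studied in task A below: a larger k always lowers the sum of squared errors, so the interesting question is where the improvement stops being worth the extra clusters.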
A. Load each dataset into Weka and run the simple k-means clustering algorithm with different values of k (the number of desired clusters). Study the clusters formed. Observe the sum of squared errors and centroids, and derive insights.
Ans:
Output:
WEKA’s visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d visualizations (Xgobi-style). WEKA has a “Jitter” option to deal with nominal attributes and to detect “hidden” data points.
Access to visualization from the Classifier, Cluster and Attribute Selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and --- depending on the scheme --- further options for visualizing errors, clusters, trees, etc.
To open Visualization screen, click ‘Visualize’ tab.
Select a square that corresponds to the attributes you would like to visualize. For example, let’s choose ‘outlook’ for the X axis and ‘play’ for the Y axis. Click anywhere inside the square that corresponds to ‘play’ on the left and ‘outlook’ at the top.
In the visualization window, beneath the X-axis selector there is a drop-down list, ‘Colour’, for choosing the colour scheme. This allows you to choose the colour of points based on the attribute selected. Below the plot area there is a legend that describes what values the colours correspond to. In our example, red represents ‘no’, while blue represents ‘yes’. For better visibility you can change the colour of the label ‘yes’: left-click on ‘yes’ in the ‘Class colour’ box and select a lighter colour from the colour palette.
Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the ‘UserClassifier’, which lets you build your own classifier by interactively selecting instances. Below the Y axis there is a drop-down list that allows you to choose a
selection method. A group of points on the graph can be selected in four ways [2]:
Select Instance. Clicking on an individual data point brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.
Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
Polygon. You can select several points by building a free-form polygon. Left-click
on the graph to add vertices to the polygon and right-click to complete it.
Polyline. To distinguish the points on one side from the ones on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.
Practical 5:- Demonstrate performing regression on data sets.
Regression:
Regression is a data mining function that predicts a number. Age, weight, distance,
temperature, income, or sales could all be predicted using regression techniques. For
example, a regression model could be used to predict children's height, given their age,
weight, and other factors.
A regression task begins with a data set in which the target values are known. For
example, a regression model that predicts children's height could be developed based on
observed data for many children over a period of time. The data might track age, height,
weight, developmental milestones, family history, and so on. Height would be the target,
the other attributes would be the predictors, and the data for each child would constitute
a case.
Common Applications of Regression
The simplest form of regression to visualize is linear regression with a single predictor.
A linear regression technique can be used if the relationship between x and y can be
approximated with a straight line, as shown in Figure 4-1.
Figure 4-1 Linear Relationship Between x and y
The slope of the line (θ2): the change in y for a unit change in x, and
The y intercept (θ1): the point where the line crosses the y axis (x = 0).
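For a single predictor, θ1 and θ2 can be fitted by ordinary least squares. A minimal sketch (the sample x and y values are made up so that the fit is exact):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = theta1 + theta2 * x
    (theta1 = intercept, theta2 = slope, as in Figure 4-1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope  # (theta1, theta2)

intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies on y = 1 + 2x
print(intercept, slope)  # 1.0 2.0
```

With real, noisy data the returned line minimizes the sum of squared vertical distances from the points to the line, which is exactly the quantity the RMSE statistic below summarizes.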
Nonlinear Regression
Often the relationship between x and y cannot be approximated with a straight line. In
this case, a nonlinear regression technique may be used. Alternatively, the data could be
preprocessed to make the relationship linear.
In Figure 4-2, x and y have a nonlinear relationship. Oracle Data Mining supports
nonlinear regression via the gaussian kernel of SVM. (See "Kernel-Based Learning".)
Multivariate Regression
Multivariate regression refers to regression with multiple predictors (x1, x2, ..., xn). For purposes of illustration, Figure 4-1 and Figure 4-2 show regression with a single predictor. Multivariate regression is also referred to as multiple regression.
Regression Algorithms
Generalized Linear Models (GLM) is a popular statistical technique for linear modeling.
Oracle Data Mining implements GLM for regression and classification. See Chapter 12,
"Generalized Linear Models"
Support Vector Machines (SVM) is a powerful, state-of-the-art algorithm for linear and
nonlinear regression. Oracle Data Mining implements SVM for regression and other
mining functions. See Chapter 18, "Support Vector Machines"
Both GLM and SVM, as implemented by Oracle Data Mining, are particularly suited for mining
data that includes many predictors (wide data).
The Root Mean Squared Error and the Mean Absolute Error are statistics for evaluating the
overall quality of a regression model. Different statistics may also be available depending on the
regression methods used by the algorithm.
The Root Mean Squared Error (RMSE) is the square root of the average squared distance of a data point from the fitted line. Figure 4-3 shows the formula for the RMSE.
The Mean Absolute Error (MAE) is the average of the absolute value of the residuals. The MAE
is very similar to the RMSE but is less sensitive to large errors. Figure 4-4 shows the formula for
the MAE.
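Both statistics are straightforward to compute from the residuals. A sketch of the formulas referenced by Figures 4-3 and 4-4, with made-up actual and predicted values:

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared residual (Figure 4-3)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean of the absolute residuals (Figure 4-4); less sensitive to large errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

y, y_hat = [3.0, 5.0, 7.0], [2.0, 5.0, 9.0]
print(round(rmse(y, y_hat), 3))  # 1.291
print(mae(y, y_hat))             # 1.0
```

The example shows why MAE is less sensitive to large errors: the residuals are 1, 0 and 2, and squaring inflates the contribution of the largest one, so RMSE (1.291) exceeds MAE (1.0).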
A. Load each dataset into Weka and build a Linear Regression model, using the training set option. Interpret the regression model and derive patterns and conclusions from the regression results.
Ans:
Steps to run the Linear Regression algorithm in WEKA:
1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose labor data set and open file.
8. Click on the Classify tab, click the Choose button and expand the functions branch.
9. Select the LinearRegression leaf and select the use training set test option.
10. Click on the start button.
Output:
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
duration =
0.4689 * cost-of-living-adjustment=tc,tcf +
0.6523 * pension=none,empl_contr +
1.0321 * bereavement-assistance=yes +
0.3904 * contribution-to-health-plan=full +
0.2765
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
Test mode: split 66.0% train, remainder test
Linear Regression Model

duration =
0.4689 * cost-of-living-adjustment=tc,tcf +
0.6523 * pension=none,empl_contr +
1.0321 * bereavement-assistance=yes +
0.3904 * contribution-to-health-plan=full +
0.2765
Time taken to build model: 0.02 seconds
=== Evaluation on test split ===
=== Summary ===
Correlation coefficient 0.243
Mean absolute error 0.783
Root mean squared error 0.9496
Relative absolute error 106.8823 %
Root relative squared error 114.13 %
Total Number of Instances 19
C. Explore simple linear regression techniques that look at only one variable.
Description: The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank’s business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank’s profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank’s loan policy must involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit.
You can acquire such knowledge in a number of ways.
1. Knowledge engineering: Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books: Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form.
3. Common sense: Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the credit worthiness of a loan applicant.
4. Case histories: Find records of actual cases where competent loan officers correctly judged when to approve a loan application and when not to.
1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amout
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class
Tasks (Turn in your answers to the following tasks)
Task 1. List all the categorical (or nominal) attributes and the real valued
attributes separately.
1. Double click on the credit-g.arff file.
2. Select all real valued attributes.
3. Click on invert.
4. Then we get all categorical attributes selected.
5. Click on remove.
6. Click on visualize all.
1. Checking_Status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_Magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker
1. Duration
2. Credit_amout
3. Installment_Commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents
Ans) The following attributes may be crucial in making the credit assessment.
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
3. One type of model that you can create is a decision tree. Train a decision tree using the complete data set as the training data. Report the model obtained after training.
Output:

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     german_credit
Instances:    1000
Attributes:   21
              Checking_status duration credit_history purpose credit_amount
              savings_status employment installment_commitment personal_status
              other_parties residence_since property_magnitude age
              other_payment_plans housing existing_credits job
              num_dependents own_telephone foreign_worker class
Test mode:    evaluate on training data
   a   b   <-- classified as
 669  31 | a = good
If we use the above model, trained on the complete dataset, to classify credit as good/bad for each of the examples in that dataset, we do not get 100% training accuracy; only 85.5% of the examples are classified correctly.
27. Is testing on the training set as you did above a good idea? Why or why not?
Ans) No, it is not a good idea: evaluating on the same data the model was trained on overestimates how well it will classify unseen applicants.
28. One approach for solving the problem encountered in the previous question is cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does the accuracy increase or decrease? Why?
Output:
i) Fold Size: 10

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Coverage of cases (0.95 level) 92.8 %
Mean rel. region size (0.95 level) 91.7 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
a b <-- classified as
588 112 | a = good
183 117 | b = bad
=== Confusion Matrix ===

   a   b   <-- classified as
 624  76 | a = good
 203  97 | b = bad
Note: With this observation, we see that accuracy increases when the fold size is 5 and decreases when the fold size is 10.
29. Check to see if the data shows a bias against “foreign workers” or “personal-status”. One way to do this is to remove these attributes from the data set and see if the decision tree created in those cases is significantly different from the full-dataset case, which you have already done. Did removing these attributes have any significant effect? Discuss.
Output:
Press the Start button. When these attributes are removed from the dataset, we can see a change in accuracy compared to the full data set.
i) If Foreign_worker is removed
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 859 85.9 %
Incorrectly Classified Instances 141 14.1 %
Kappa statistic 0.6377
Mean absolute error 0.2233
Root mean squared error 0.3341
Relative absolute error 53.1347 %
Root relative squared error 72.9074 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 91.9 %
Total Number of Instances 1000
               TP Rate   FP Rate
               0.954     0.363
               0.637     0.046
Weighted Avg.  0.859     0.268
=== Confusion Matrix ===

   a   b   <-- classified as
 668  32 | a = good
 109 191 | b = bad
ii) If Personal_status is removed
Mean rel. region size (0.95 level) 91.7 %
0.867
Note: With this observation we have seen, when “Foreign_worker “attribute is removed
2. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 and 21 (the class attribute). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)
Remember to reload the previously removed attributes: press the Undo option in the Preprocess tab. We use the Preprocess tab in Weka GUI Explorer to remove the 21st attribute (Class). In the Classify tab, select the Use training set option, then press the Start button. If these attributes are removed from the dataset, we can see a change in the accuracy compared to the full data set.
a b <-- classified as
963 0 | a = yes
37 0 | b = no
Note: With these observations we have seen that when the 3rd attribute is removed from the dataset, the accuracy (83 %) decreases, so this attribute is important for classification. When the 2nd and 10th attributes are removed, the accuracy (84 %) is the same, so we can remove either one of them. When the 7th and 17th attributes are removed, the accuracy (85 %) is the same, so we can remove either one of them. If we remove the 5th and 21st attributes the accuracy increases, so these attributes may not be needed for the classification.
17. Sometimes the cost of rejecting an applicant who actually has good credit might be higher than accepting an applicant who has bad credit. Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case, by using a cost matrix in Weka. Train your decision tree and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6?
OUTPUT:
In Weka GUI Explorer, select the Classify tab and choose the Use training set option. Press the Choose button and select J48 as the decision tree technique. Then press the More options button to open the classifier evaluation options window, and select Cost-sensitive evaluation; press the Set button to open the Cost Matrix Editor. Change the number of classes to 2 and press the Resize button to get a 2x2 cost matrix. Change the value at location (0,1) to 5; the modified cost matrix is as follows:
0.0 5.0
1.0 0.0
Then close the Cost Matrix Editor, press the OK button, and press the Start button.
=== Evaluation on training set ===
=== Summary ===
a b <-- classified as
669 31 | a = good
114 186 | b = bad
Note: With this observation we have seen that, of the 700 good customers, 669 are classified as good and 31 are misclassified as bad. Of the 300 bad customers, 186 are classified as bad and 114 are misclassified as good.
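Under this cost matrix, the model is judged by total misclassification cost rather than raw accuracy. A small illustrative computation (not Weka output) for the confusion matrix and cost matrix above:

```python
def total_cost(confusion, cost):
    """Total misclassification cost: sum over cells of count * cell cost."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(2) for j in range(2))

confusion = [[669, 31],   # actual good: 669 correct, 31 classified as bad
             [114, 186]]  # actual bad: 114 classified as good, 186 correct
cost = [[0.0, 5.0],       # rejecting a good applicant costs 5
        [1.0, 0.0]]       # accepting a bad applicant costs 1
print(total_cost(confusion, cost))  # 31*5 + 114*1 = 269.0
```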
18. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees? How does the complexity of a decision tree relate to the bias of the model?
Ans)
Steps followed are:
1) click on credit arff file
2) select all attributes
3) click on classify tab
4) click on choose and select J48 algorithm
5) select cross validation folds with 2
6) click on start
It is a good idea to prefer simple decision trees instead of complex ones: a deeper, more complex tree has lower bias but higher variance (it can overfit the training data), while a simpler tree has higher bias but tends to generalize better and is easier to interpret.
19. You can make your Decision Trees simpler by pruning the nodes. One approach is
to use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for
training your Decision Trees using cross validation and report the Decision Trees you
obtain. Also report your accuracy using the pruned model. Does your accuracy increase?
Ans)
We can make our decision tree simpler by pruning its nodes. Reduced error pruning holds out part of the training data as a validation set and, working bottom-up, replaces a subtree with its majority-class leaf whenever doing so does not reduce accuracy on the validation set. In Weka GUI Explorer, select the Classify tab and choose the Use training set option. Press the Choose button and select J48 as the decision tree technique. Beside the Choose button, click on the "J48 -c 0.25 -M 2" text to open the Generic Object Editor; in it, set the reducedErrorPruning property to True, then press OK and Start.
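The idea can be sketched outside Weka as well. Below is a minimal, illustrative reduced error pruning routine on a toy tree represented as nested dicts (an assumed representation for illustration; Weka's J48 implementation differs in detail):

```python
def classify(tree, x):
    """Walk the tree; any non-dict node is a class-label leaf."""
    if not isinstance(tree, dict):
        return tree
    return classify(tree["branches"][x[tree["attr"]]], x)

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def prune(tree, data):
    """Bottom-up: replace a subtree by its majority-class leaf whenever
    that does not reduce accuracy on the held-out validation data."""
    if not isinstance(tree, dict):
        return tree
    for v in list(tree["branches"]):
        subset = [(x, y) for x, y in data if x[tree["attr"]] == v]
        tree["branches"][v] = prune(tree["branches"][v], subset)
    if data and accuracy(tree["majority"], data) >= accuracy(tree, data):
        return tree["majority"]
    return tree

# Toy tree: the inner split on attribute 1 does not help on validation data.
tree = {"attr": 0, "majority": "good", "branches": {
    "A": "good",
    "B": {"attr": 1, "majority": "bad",
          "branches": {"x": "good", "y": "bad"}}}}
val = [({0: "A", 1: "x"}, "good"),
       ({0: "B", 1: "x"}, "bad"),
       ({0: "B", 1: "y"}, "bad")]
pruned = prune(tree, val)
print(pruned["branches"]["B"])  # "bad" -> inner subtree collapsed to a leaf
```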
12) How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in Weka is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough to make the decision, yes, just one! Can you predict what attribute that might be in this data set? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
Ans)
Steps for analyzing the Decision Tree:
1) click on credit arff file
2) select all attributes
3) click on classify tab
4) click on choose and select J48 algorithm
5) select cross validation folds
6) click on start
In Weka GUI Explorer, select the Classify tab and choose the Use training set option. There also exist different classifiers that output the model in the form of rules; such classifiers in Weka are "PART" and "OneR". Go to Choose, select Rules, pick PART, and press the Start button.
a b <-- classified as
653 47 | a = good
56 244 | b = bad
Then go to Choose and select Rules in that select OneR and press start Button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 742 74.2 %
Incorrectly Classified Instances 258 25.8 %
=== Confusion Matrix ===
a b <-- classified as
642 58 | a = good
200 100 | b = bad
Then go to Choose and select Trees in that select J48 and press start Button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
=== Confusion Matrix ===
a b <-- classified as
669 31 | a = good
114 186 | b = bad
Note: With these observations we have seen the performance of the classifiers; the ranking is as follows:
1. PART
2. J48
3. OneR
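The OneR idea used above is simple enough to sketch in a few lines: for every attribute, map each of its values to the majority class and count the training errors, then keep the single attribute with the fewest errors. This is an illustrative toy version, not Weka's implementation:

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (attribute index, value -> class rule) with the fewest errors."""
    best = None
    for a in range(len(rows[0])):
        groups = defaultdict(Counter)
        for x, y in zip(rows, labels):
            groups[x[a]][y] += 1
        # For each value of attribute a, predict its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in groups.items()}
        errors = sum(n for v, c in groups.items()
                     for y, n in c.items() if y != rule[v])
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]

# Toy data: attribute 0 predicts the label perfectly, attribute 1 does not.
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
attr, rule = one_r(rows, labels)
print(attr, rule)  # 0 {'sunny': 'no', 'rain': 'yes'}
```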
Task 2: Hospital Management System
A data warehouse consists of dimension tables and a fact table.
A dimension object (dimension) consists of a set of levels and a set of hierarchies defined over those levels. The levels represent levels of aggregation; hierarchies describe parent-child relationships among a set of levels.
For example, a typical calendar dimension could contain five levels. Two hierarchies can be defined on these levels:
H1: YearL > QuarterL > MonthL > DayL
H2: YearL > WeekL > DayL
The hierarchies are described from parent to child, so that Year is the parent of Quarter, Quarter is the parent of Month, and so forth.
When you create a definition for a hierarchy, Warehouse Builder creates an identifier key for each level of the hierarchy and a unique key constraint on the lowest level (the base level).
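For illustration, the H1 level values for one row of such a calendar dimension can be derived as follows (a sketch using the level names above as assumed keys, not Warehouse Builder output):

```python
from datetime import date

def calendar_levels(d):
    """Derive YearL > QuarterL > MonthL > DayL level values for one row."""
    return {
        "YearL": d.year,
        "QuarterL": (d.month - 1) // 3 + 1,
        "MonthL": d.month,
        "DayL": d.isoformat(),  # lowest (base) level: the unique key
    }

print(calendar_levels(date(2025, 11, 15)))
# {'YearL': 2025, 'QuarterL': 4, 'MonthL': 11, 'DayL': '2025-11-15'}
```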
SUPPLIER: (Supplier_name, medicine_brand_name, address, etc.)
If each dimension has 6 levels, decide the levels and hierarchies, and assume suitable level names.
Design the hospital management system data warehouse using all schemas, and give the example.
M.TECH (I Year – I Sem) (2024-25)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Submitted To    Submitted By