Assignment 3
Table of Contents
Input
Output
Preprocessing
Normalizer
Partitioning
Classifiers
SVM
Tree Ensemble
Conclusion
Data Mining
The Task
Following on from the previous assignment, the main focus of this assignment is to build
classifiers and choose the best one to predict the attribute “QUALIFIED” for a property data set.
There are a number of methods for this. The software called KNIME, which has a graphical
Input
There are three files for this assignment: a training data set, an unknown data set,
and a sample prediction data set. The training data set has the attribute “QUALIFIED”, but
the unknown data set does not. The last data set, sample prediction, is filled with random
For the assignment, KNIME will handle the training and unknown data sets to predict
Output
It is not mandatory, but once the predicted data is created, uploading it to Kaggle will score it
Preprocessing
Column Filter
has the same value for all rows. Therefore, a column filter is used to remove the column
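The idea of filtering out a column that holds the same value in every row can be sketched outside KNIME as follows. This is only an illustration, not the workflow used in the assignment: the column names `SOURCE` and `ROOMS`, the label values `Q` and `U`, and the helper `drop_constant_columns` are hypothetical.

```python
def drop_constant_columns(rows):
    """Return (kept_column_names, filtered_rows) with constant columns removed.

    `rows` is a list of dicts that all share the same keys.
    """
    if not rows:
        return [], []
    columns = list(rows[0].keys())
    # A column is constant when it takes exactly one distinct value.
    kept = [c for c in columns if len({row[c] for row in rows}) > 1]
    filtered = [{c: row[c] for c in kept} for row in rows]
    return kept, filtered


# Hypothetical sample rows: "SOURCE" is identical everywhere, so it is dropped.
sample = [
    {"SOURCE": "Residential", "ROOMS": 6, "QUALIFIED": "Q"},
    {"SOURCE": "Residential", "ROOMS": 8, "QUALIFIED": "U"},
    {"SOURCE": "Residential", "ROOMS": 5, "QUALIFIED": "Q"},
]
kept_columns, filtered_rows = drop_constant_columns(sample)
print(kept_columns)
```

A constant column carries no information that could separate one class from another, which is why removing it does not affect the classifiers.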
Missing Value
Missing values, which may disturb the prediction, will be removed.
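As a rough sketch of this step, the snippet below drops every row that contains a missing value. It assumes that "removed" means discarding the whole row and that missing entries appear as `None` or an empty string. Both are assumptions, and the function name `drop_rows_with_missing` and the sample values are hypothetical.

```python
def drop_rows_with_missing(rows, missing_markers=(None, "")):
    """Keep only rows in which no field holds a missing-value marker."""
    return [
        row for row in rows
        if not any(value in missing_markers for value in row.values())
    ]


# Hypothetical sample: the second and third rows each have one missing field.
raw_rows = [
    {"ROOMS": 6, "HEAT": "1"},
    {"ROOMS": None, "HEAT": "2"},
    {"ROOMS": 7, "HEAT": ""},
]
clean_rows = drop_rows_with_missing(raw_rows)
print(len(clean_rows))
```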
Number to String
There are attributes whose values are numbers but do not represent numeric data, such as “HEAT”,
Normalizer
Partitioning
The partitioning node separates the training data into two portions, split 70-30 with 70%
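A 70-30 split of this kind can be sketched in a few lines. The example assumes that rows are drawn at random and that the 70% portion is the one used for training. The source does not state the sampling mode, so both points are assumptions, and the function name `partition` and the seed are hypothetical.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into (training, holdout) portions."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder row ids stand in for the training data set.
train_rows, holdout_rows = partition(range(100))
print(len(train_rows), len(holdout_rows))
```

Holding back part of the labelled data makes it possible to measure accuracy on rows the classifier has not seen during training.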
Classifiers
Decision Trees
The decision tree nodes are used to train on the data and make predictions. This method
is most appropriate for categorical data. The accuracy is 83.074%. There are 1544
Random Forest
The pre-processed data was transmitted into the Random Forest learner, and default settings
KNN
The pre-processed data was transmitted into the KNN node. The “Number of Neighbors to
consider (K)” setting was changed from its original value of 3 to 5. The accuracy is 85.855%.
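To illustrate what the K setting controls, here is a minimal k-nearest-neighbours sketch with K = 5. It is not the KNIME node's implementation: it assumes Euclidean distance and a simple majority vote, and the points, the label values `Q` and `U`, and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical normalized features in [0, 1] with made-up labels.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (0.9, 1.0), (1.0, 0.9), (1.0, 1.0), (0.8, 0.9)]
labels = ["U", "U", "U", "Q", "Q", "Q", "Q"]

print(knn_predict(points, labels, (0.85, 0.95)))
print(knn_predict(points, labels, (0.1, 0.1)))
```

A larger K smooths the vote over more neighbours, which can reduce the influence of single noisy rows at the cost of blurring small clusters.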
SVM
The SVM learner did not complete the process after running for more than 24 hours; thus, no
Neural Networks
The pre-processed data was transmitted into the PNN Learner. The settings are default.
Tree Ensemble
The pre-processed data was transmitted into the Tree Ensemble Learner with default
Best Classifier
Result Summary
SVM:
Conclusion
Based on the result summary above, Tree Ensemble has the highest accuracy among
the classifiers tested. Thus, the Tree Ensemble method will be used to make the
prediction for the unknown data set. The prediction for the unknown data set was uploaded to Kaggle.