Assignment 3

this is the assignment file

Uploaded by

arun neupane
Assignment 3

Introduction to Data Analytics (University of Technology Sydney)

Downloaded by arun neupane ([email protected])
Introduction to Data Analytics

Assessment Task 3: Data mining in action

Table of Contents

Data Mining
    The Task
    Input
    Output
Preprocessing
    Column Filter
    Missing Value
    Number to String
    Normalizer
    Partitioning
Classifiers
    Decision Trees
    Random Forest
    K Nearest Neighbor (KNN)
    SVM
    Neural Networks
    Tree Ensemble
Best Classifier
    Result Summary
    Conclusion

Data Mining

The Task

Following on from the previous assignment, the main focus of this assignment is to build several classifiers and choose the best one to predict the attribute “QUALIFIED” for a property data set. A number of methods are available for this. KNIME, a tool with a graphical interface, was chosen so that the process can be shown visually.

Input

There are three files for this assignment: a training data set, an unknown data set, and a sample prediction data set. The training data set contains the attribute “QUALIFIED”, but the unknown data set does not. The sample prediction data set is filled with random values and only illustrates how a Kaggle submission works.

For the assignment, KNIME processes the training and unknown data sets to predict the attribute value for the unknown data set.

Output

Submission is not mandatory, but once the predicted data has been created, uploading it to Kaggle scores it and shows how effective the process is.

Preprocessing

Column Filter

Within the data set, the attribute “GIS_LAST_MOD_DTTM” (column number 37) has the same value in every row, so it carries no information for the prediction. A Column Filter node is therefore used to remove this column from the data set.
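Outside KNIME, the same idea can be sketched in a few lines of plain Python. This is only an illustration, not the KNIME node itself: the function name `drop_constant_columns` and the three sample rows (including the timestamp and label values) are invented for the example, and only the column names come from the report.

```python
def drop_constant_columns(rows):
    """Return copies of the rows without columns whose value never varies."""
    if not rows:
        return []
    # A column is constant when the set of its values has exactly one element.
    constant = {
        name for name in rows[0]
        if len({row[name] for row in rows}) == 1
    }
    return [{k: v for k, v in row.items() if k not in constant} for row in rows]


# Hypothetical sample rows; the values are made up for illustration.
sample = [
    {"AYB": 1910, "GIS_LAST_MOD_DTTM": "same-timestamp", "QUALIFIED": "Q"},
    {"AYB": 1955, "GIS_LAST_MOD_DTTM": "same-timestamp", "QUALIFIED": "U"},
    {"AYB": 2001, "GIS_LAST_MOD_DTTM": "same-timestamp", "QUALIFIED": "Q"},
]
filtered = drop_constant_columns(sample)
print(sorted(filtered[0]))
```

Detecting the constant column from the data, rather than hard-coding its name, means any other column with a single repeated value would be dropped as well.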

Missing Value

Missing values, which may disturb the prediction, are removed.
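The report does not say whether whole rows are dropped or values are replaced. As a rough sketch, assuming that every row containing a missing value is dropped, the step could look like this in Python (the helper name, the `missing` markers and the sample rows are all assumptions made for the example):

```python
def drop_incomplete_rows(rows, missing=(None, "")):
    """Keep only the rows in which no value counts as missing."""
    return [
        row for row in rows
        if not any(value in missing for value in row.values())
    ]


# Hypothetical rows: the second one lacks a HEAT value and is discarded.
sample = [
    {"HEAT": 1, "AYB": 1910},
    {"HEAT": None, "AYB": 1955},
    {"HEAT": 7, "AYB": 2001},
]
complete = drop_incomplete_rows(sample)
print(len(complete))
```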

Number to String

Some attributes hold numbers that are category codes rather than numeric quantities: “HEAT”, “STYLE”, “STRUCT”, “GRADE”, “CNDTN”, “EXTWALL”, “ROOF”, “INTWALL” and “USECODE”. These are treated as strings to improve the learners’ performance.
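A minimal Python sketch of this conversion is shown below. The column list is taken from the report; the function name `codes_to_strings` and the sample row are hypothetical.

```python
# Columns named in the report as holding category codes stored as numbers.
CODE_COLUMNS = [
    "HEAT", "STYLE", "STRUCT", "GRADE", "CNDTN",
    "EXTWALL", "ROOF", "INTWALL", "USECODE",
]


def codes_to_strings(rows, columns=CODE_COLUMNS):
    """Convert the listed columns to strings and leave the others untouched."""
    return [
        {k: (str(v) if k in columns else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical row: HEAT is a code, AYB stays numeric.
converted = codes_to_strings([{"HEAT": 13, "AYB": 1910}])
print(converted[0])
```

Once a code is a string, a learner treats it as a nominal category instead of assuming that a code of 13 is "larger" than a code of 1.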

Normalizer

The Normalizer node applies min-max normalization to the attribute “AYB”, rescaling its values to the range 0 to 1.
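Min-max normalization maps each value x to (x - min) / (max - min). The short Python sketch below illustrates the formula; the function name and the three sample values are made up, and returning 0.0 for a column with a single repeated value is a choice made here, not something the report specifies.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical AYB values.
scaled = min_max_normalize([1900, 1950, 2000])
print(scaled)  # [0.0, 0.5, 1.0]
```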

Partitioning

The Partitioning node splits the training data 70-30: 70% of the rows are used for training and the remaining 30% for testing.
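A 70-30 split can be sketched in Python as follows. The report does not state how the Partitioning node selects rows, so the random shuffle and the fixed `seed` are assumptions, as are the function name and the ten-row sample.

```python
import random


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical row ids give 7 training rows and 3 test rows.
train, test = partition(range(10))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so accuracies from different learners are measured on the same test rows.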

Classifiers

Decision Trees

The pre-processed data is learned and predicted with the Decision Tree nodes. Decision trees are well suited to categorical data. The accuracy is 83.074%, with 1544 misclassified rows.
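The two figures above are linked by accuracy = (total - wrong) / total. The report does not give the size of the test partition, but it can be back-calculated from the accuracy and the error count, as the following Python sketch shows. The row count it prints is an inference from the two reported numbers, not a value taken from the report.

```python
def accuracy_percent(wrong, total):
    """Accuracy as a percentage, given the misclassified and total row counts."""
    return 100.0 * (total - wrong) / total


WRONG = 1544          # misclassified rows, from the report
ACCURACY = 83.074     # accuracy in percent, from the report

# Inferred, not reported: the error rate is wrong / total, so total = wrong / error rate.
implied_total = round(WRONG / (1 - ACCURACY / 100))
print(implied_total)
print(round(accuracy_percent(WRONG, implied_total), 3))
```

Recomputing the accuracy from the implied row count reproduces the reported 83.074%, which is a quick consistency check on the two numbers.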

Random Forest

The pre-processed data is passed to the Random Forest Learner, using the default settings. The accuracy is 88.043%.

K Nearest Neighbor (KNN)

The pre-processed data is passed to the KNN node. The setting “Number of Neighbors to consider (K)” was changed from its original value of 3 to 5. The accuracy is 85.855%.
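To show what the K setting controls, here is a small k-nearest-neighbour classifier in Python. It is a sketch under assumptions rather than a description of the KNIME node: it uses Euclidean distance and an unweighted majority vote, and the function name, the two-dimensional points and the labels are invented for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of query by majority vote among its k nearest points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance (assumed)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical normalized points with made-up class labels.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
          (0.9, 0.8), (0.8, 0.9), (1.0, 1.0)]
labels = ["U", "U", "U", "U", "Q", "Q", "Q"]
print(knn_predict(points, labels, (0.15, 0.15), k=5))
```

A larger K smooths the decision over more neighbours, which reduces sensitivity to single noisy rows at the cost of blurring small clusters.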

SVM

The SVM Learner was left running for more than 24 hours but did not complete, so no results were obtained.

Neural Networks

The pre-processed data was passed to the PNN Learner, using the default settings. The accuracy is 87.01%.

Tree Ensemble

The pre-processed data is passed to the Tree Ensemble Learner with default settings, except for the partitioning, which is 90-10. The accuracy is 88.662%.

Best Classifier

Result Summary

The accuracy of each method is as follows:

Decision Tree: 83.074%
Random Forest: 88.043%
K Nearest Neighbor: 85.855%
SVM: no result (the learner did not finish)
Neural Networks: 87.01%
Tree Ensemble: 88.662% (with 90-10 partitioning)

Conclusion

Based on the result summary above, Tree Ensemble has the highest accuracy of the classifiers tested. The Tree Ensemble method is therefore used to make the prediction for the unknown data set. The resulting prediction was uploaded to Kaggle.
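The selection step can be written out as a short Python sketch. The accuracies are the ones reported in this document; storing the unfinished SVM run as `None` and skipping it is a choice made for the example.

```python
# Accuracies in percent, as reported above; None marks the run that never finished.
accuracies = {
    "Decision Tree": 83.074,
    "Random Forest": 88.043,
    "K Nearest Neighbor": 85.855,
    "SVM": None,
    "Neural Networks": 87.01,
    "Tree Ensemble": 88.662,
}

# Ignore classifiers without a result, then take the highest accuracy.
finished = {name: acc for name, acc in accuracies.items() if acc is not None}
best = max(finished, key=finished.get)
print(best, finished[best])
```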

The complete KNIME workflow:
