Naïve Bayes vs. Decision Trees vs. Neural Networks in the classification of training web pages

Repository record: Xhemali, Daniela, Chris J. Hinde, and R. G. Stone. 2019. "Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages". figshare. https://hdl.handle.net/2134/5394.

Publisher: © IJCSI. Licence: CC BY-NC-ND 4.0. Please cite the published version. This item was submitted to Loughborough's Institutional Repository (https://dspace.lboro.ac.uk/) by the author and is made available under the above Creative Commons Licence conditions.
IJCSI International Journal of Computer Science Issues, Vol. 4, No 1, 2009
stages of the classification process. Experiments have shown that our classifier exceeds expectations, achieving an impressive F-Measure value of over 97%.

2. Related Work

Many ideas have emerged over the years on how to achieve quality results from Web Classification systems, and there are several applicable approaches, such as Clustering, NB and Bayesian Networks, NNs, DTs, Support Vector Machines (SVMs) etc. We decided to concentrate only on NN, DT and NB classifiers, as they proved most closely applicable to our project. Despite the benefits of other approaches, our research is in collaboration with a small organisation, so we had to consider the organisation's hardware and software limitations before deciding on a classification technique. SVMs and Clustering would be too expensive and processor intensive for the organisation, so they were considered inappropriate for this project. The following discusses the pros and cons of NB, DTs and NNs, as well as related research works in each field.

2.1 Naïve Bayes Models

NB models are popular in machine learning applications, due to their simplicity in allowing each attribute to contribute towards the final decision equally and independently of the other attributes. This simplicity equates to computational efficiency, which makes NB techniques attractive and suitable for many domains.

However, the very thing that makes them popular is also the reason some researchers consider this approach to be weak. The conditional independence assumption is strong, and makes NB-based systems incapable of using two or more pieces of evidence together; however, used in appropriate domains, they offer quick training, fast data analysis and decision making, as well as straightforward interpretation of test results. Some research ([13], [26]) has tried to relax the conditional independence assumption by introducing latent variables into tree-shaped or hierarchical NB classifiers. However, a thorough analysis of a large number of training web pages has shown us that the features used in these pages can be independently examined to compute the category of each page. Thus, the domain for our research can readily be analysed using NB classifiers; nevertheless, in order to increase the system's accuracy, the classifier has been enhanced as described in section 3.

Enhancing the standard NB rule, or using it in collaboration with other techniques, has also been attempted by other researchers. Addin et al. in [1] coupled a NB classifier with K-Means clustering to simulate damage detection in engineering materials. NBTree in [24] induced a hybrid of NB and DTs by using the Bayes rule to construct the decision tree. Other research works ([5], [23]) have modified their NB classifiers to learn from positive and unlabeled examples, on the assumption that finding negative examples is very difficult in certain domains, particularly the medical industry. Finding negative examples for the training courses domain, however, is not at all difficult, so the above is not an issue for our research.

2.2 Decision Trees

Unlike NB classifiers, DT classifiers can cope with combinations of terms and can produce impressive results for some domains. However, training a DT classifier is quite complex, and the number of nodes created can get out of hand in some cases. According to [17], with six Boolean attributes there would be a need for 18,446,744,073,709,551,616 distinct nodes. Decision trees may be computationally expensive for certain domains; however, they make up for it by offering a genuine simplicity of interpreting models, and by helping to consider the most important factors in a dataset first, placing them at the top of the tree.

The researchers in [7], [12], [15] all used DTs to allow both the structure and the content of each web page to determine the category to which it belongs. All achieved an accuracy of under 85%. This idea is very similar to our work, as our classifier also analyses both structure and content. WebClass in [12] was designed to search geographically distributed groups of people who share common interests. WebClass modifies the standard decision tree approach by associating the tree root node with only the keywords found, depth-one nodes with descriptions and depth-two nodes with the hyperlinks found. The system, however, only achieved 73% accuracy. The second version of WebClass ([2]) implemented various classification models such as Bayes networks, DTs, K-Means clustering and SVMs in order to compare the findings of WebClassII. However, the findings showed that, for increasing feature set sizes, the overall recall fell to just 39.75%.

2.3 Neural Networks

NNs are powerful techniques for representing complex relationships between inputs and outputs. Based on the neural structure of the brain ([17]), NNs are complicated, and they can be enormous for certain domains, containing a large number of nodes and synapses. There is research that has managed to convert NNs into sets of rules in order to discover what the NN has learnt ([8], [21]); however, many other works still refer to NNs as a 'black box' approach ([18], [19]), due to the difficulty of understanding the decision-making process of the NN, which can lead to not knowing whether testing has succeeded.

AIRS in [4] used the knowledge acquired during the training of a NN to modify the user's query, making it possible for the adapted query to retrieve more documents than the original query. However, this process would sometimes give more importance to the knowledge 'learnt', changing the original query until it lost its initial keywords.

that are believed to be too common or too insignificant in distinguishing web pages from one another, otherwise known as stopwords, are also removed. Care is taken, however, to preserve the information extracted from certain Web structures such as the page TITLE, the LINK and META tag information. These are given higher weights than the rest of the text, as we believe that the information given by these structures is more closely related to the central theme of the web page.
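The stopword removal and tag weighting described above might be sketched as follows. This is a minimal illustration only: the stopword list, the tag weights and the sample page are hypothetical placeholders, not the actual values used in the paper.

```python
import re

# Hypothetical stopword list and tag weights -- the paper does not
# publish its exact lists, so these values are illustrative only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}
TAG_WEIGHTS = {"title": 3.0, "meta": 2.0, "link": 2.0, "body": 1.0}

def extract_features(tagged_text, max_features=100):
    """Build a weighted bag of features from (tag, text) pairs.

    Terms found in TITLE, META and LINK structures receive a higher
    weight than plain body text; stopwords are discarded.
    """
    weights = {}
    for tag, text in tagged_text:
        w = TAG_WEIGHTS.get(tag, 1.0)
        for term in re.findall(r"[a-z]+", text.lower()):
            if term in STOPWORDS:
                continue
            weights[term] = weights.get(term, 0.0) + w
    # Keep only the heaviest terms, mirroring the 100-feature cap
    # that each sampling unit is limited to (see section 4.2).
    top = sorted(weights.items(), key=lambda kv: -kv[1])[:max_features]
    return dict(top)

page = [("title", "Project Management Training Course"),
        ("body", "This course is an introduction to the basics of management.")]
features = extract_features(page)
```

Here "management" ends up weighted 4.0 (3.0 from the TITLE plus 1.0 from the body), while stopwords such as "the" never enter the feature set.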
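The independent-feature decision rule discussed in section 2.1 can be sketched as a minimal NB text classifier. The training examples and the Laplace smoothing scheme below are illustrative assumptions; this is the standard NB rule, not the authors' enhanced classifier.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Minimal NB classifier: each feature contributes to the category
    score independently of every other feature (the conditional
    independence assumption)."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, samples):
        for features, label in samples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def classify(self, features):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior, then one independent log-likelihood term per feature.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features:
                # Laplace smoothing keeps unseen features from zeroing the score.
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical training data for the relevant/irrelevant distinction.
nb = NaiveBayes()
nb.train([({"course", "training", "tutor"}, "relevant"),
          ({"course", "syllabus", "training"}, "relevant"),
          ({"shop", "basket", "checkout"}, "irrelevant")])
label = nb.classify({"training", "course"})
```

Because every feature is scored independently, training and classification are both a single pass over the features, which is the computational efficiency the paper attributes to NB techniques.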
test. The ROC curve is a comparison of two characteristics: TPR (true positive rate) and FPR (false positive rate). The TPR measures the number of relevant pages that were correctly identified:

TPR = TP / (TP + FN) (9)

The FPR measures the number of incorrect classifications of relevant pages out of all irrelevant test pages:

FPR = FP / (FP + TN) (10)

In the ROC space graph, the FPR and TPR values form the x and y axes respectively. Each prediction (FPR, TPR) represents one point in the ROC space. A diagonal line connects the points with coordinates (0, 0) and (1, 1). This is called the 'line of no-discrimination', and all the points along this line are considered to be completely random guesses. Points above the diagonal line indicate good classification results, whereas points below the line indicate wrong results. The best prediction (i.e. 100% sensitivity and 100% specificity), also known as 'perfect classification', would be at point (0, 1). Points closer to this coordinate show better classification results than other points in the ROC space.

4.2 Data Corpus

In this research, each web page is referred to as a sampling unit. Each sampling unit comprises a maximum of 100 features, which are selected after discarding much of the page content, as explained previously. The total number of unique features examined in the following experiments was 5217. The total number of sampling units used was 9436. These units were separated into two distinct sets: a training set and a test set.

The training set for the NB classifier consisted of 711 randomly selected, positive and negative examples (i.e. relevant and irrelevant sampling units). The test collection consisted of the data obtained from the remaining 8725 sampling units. The training set thus makes up under 10% of the entire data corpus. This was intentional, in order to really challenge the NB classifier. Compared to many other classification systems encountered, this is the smallest training set used.

4.3 Experimental Results

The first experiment carried out was to test our enhanced NB classifier against the standard Naïve Bayes algorithm, in order to determine whether or not the changes made to the original algorithm had enhanced the accuracy of the classifier. For this purpose, we stripped our system of the additional steps and executed both the standard and enhanced NB classifiers with the above training and test data. The results showed that the enhanced NB classifier was comfortably in the lead, by over 7% in both accuracy and F-Measure value.

In the second set of experiments, the sampling units analysed by the NB classifier were also run through a DT classifier and an NN classifier. The results were compared to determine which classifier is better at analysing attribute data from training web pages. The DT classifier is a 'C' program, based on the C4.5 algorithm in [16], written to evaluate data samples and find the main pattern(s) emerging from the data. For example, the DT classifier may conclude that all web pages containing a specific word are relevant. More complex data samples, however, may result in more complex configurations being found.

The NN classifier used is also a 'C' program, based on the work published in [8]-[11]. MATLAB's NN toolbox ([20]) could also have been used; however, in past experiments MATLAB managed approximately 2 training epochs in the same timeframe in which the 'C' NN classifier achieved approximately 60,000 epochs. We therefore abandoned MATLAB for the bespoke compiled NN system.

All three classifiers were initially trained with 105 sampling units and tested with a further 521 units, together comprising a total of 3900 unique features. The NB classifier achieved the highest accuracy (97.89%), precision (99.20%), recall (98.61%) and F-Measure (98.90%) values; however, the DT classifier achieved the fastest execution time. The NN classifier, created with 3900 inputs, 3900 midnodes and 1 output, came last in all metrics and in execution time.

For the most recent test, all classifiers were trained with 711 sampling units and then tested on the remaining 8725 sampling units. The NB and DT classifiers were adequately fast for exploitation and delivered good discriminations. The test results are shown in Table 2 and Table 3.

Table 2: Confusion Matrix for NB Classifier

                          PREDICTED
                          IRRELEVANT    RELEVANT
ACTUAL    IRRELEVANT      TN / 876      FP / 47
          RELEVANT        FN / 372      TP / 7430

Table 3: Confusion Matrix for DT Classifier
                          PREDICTED
                          IRRELEVANT    RELEVANT
ACTUAL    IRRELEVANT      TN / 794      FP / 129
          RELEVANT        FN / 320      TP / 7482

This result is further confirmed by the comparison of the two classifiers in the ROC space (Fig. 2), where it is shown that the result set from the NB classifier falls closer to the 'perfect classification' point than the result set from the DT classifier.

Table 5: ROC Space Results

Classifier       FPR        TPR
NB Classifier    0.05092    0.95232
DT Classifier    0.13976    0.95899

The vocabulary used in our experiments, consisting of 5217 features, was initially mapped onto a NN with 5217 inputs, one hidden layer with 5217 nodes and 1 output, in keeping with the standard configuration of a NN, where the number of midnodes is the same as the number of inputs. A fully connected network of this size would have over 27 million connections, each involving weight parameters to be learnt. Our attempt at creating such a network resulted in the NN program failing to allocate the needed memory and crashing.
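The ROC coordinates in Table 5 can be recomputed directly from the confusion matrices in Table 2 and Table 3 using equations (9) and (10); the short check below reproduces them to within rounding.

```python
# Recompute the Table 5 ROC coordinates from the confusion matrices
# in Table 2 (NB) and Table 3 (DT).
def tpr(tp, fn):
    # True positive rate, equation (9): TP / (TP + FN)
    return tp / (tp + fn)

def fpr(fp, tn):
    # False positive rate, equation (10): FP / (FP + TN)
    return fp / (fp + tn)

# NB classifier (Table 2): TN=876, FP=47, FN=372, TP=7430
nb_point = (fpr(47, 876), tpr(7430, 372))
# DT classifier (Table 3): TN=794, FP=129, FN=320, TP=7482
dt_point = (fpr(129, 794), tpr(7482, 320))
```

The NB point (about 0.0509, 0.9523) lies closer to the 'perfect classification' corner (0, 1) than the DT point (about 0.1398, 0.9590), matching the comparison drawn from Fig. 2.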
in accuracy, in comparison with the original Naïve Bayes algorithm.

for Innovative and Collaborative Engineering (CICE) for funding our work.