Naïve Bayes vs. Decision Trees vs. Neural Networks in the classification of training web pages

Repository record: Xhemali, Daniela, Chris J. Hinde, and R. G. Stone. 2019. "Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages". figshare. https://hdl.handle.net/2134/5394.

Publisher: © IJCSI. Licence: CC BY-NC-ND 4.0. Please cite the published version. This item was submitted to Loughborough's Institutional Repository (https://dspace.lboro.ac.uk/) by the author and is made available under the above Creative Commons Licence conditions.
IJCSI International Journal of Computer Science Issues, Vol. 4, No 1, 2009
stages of the classification process. Experiments have shown that our classifier exceeds expectations, achieving an impressive F-Measure value of over 97%.

2. Related Work

Many ideas have emerged over the years on how to achieve quality results from Web Classification systems, and there are several applicable approaches, such as Clustering, NB and Bayesian Networks, NNs, DTs, Support Vector Machines (SVMs) etc. We decided to concentrate only on NN, DT and NB classifiers, as they proved most closely applicable to our project. Despite the benefits of other approaches, our research is in collaboration with a small organisation, so we had to consider the organisation's hardware and software limitations before deciding on a classification technique. SVMs and Clustering would be too expensive and processor intensive for the organisation, so they were considered inappropriate for this project. The following discusses the pros and cons of NB, DTs and NNs, as well as related research works in each field.

2.1 Naïve Bayes Models

NB models are popular in machine learning applications, due to their simplicity in allowing each attribute to contribute towards the final decision equally and independently of the other attributes. This simplicity equates to computational efficiency, which makes NB techniques attractive and suitable for many domains.

However, the very thing that makes them popular is also the reason some researchers consider this approach to be weak. The conditional independence assumption is strong, and makes NB-based systems incapable of using two or more pieces of evidence together; however, used in appropriate domains, they offer quick training, fast data analysis and decision making, as well as straightforward interpretation of test results. Some research ([13], [26]) has tried to relax the conditional independence assumption by introducing latent variables into tree-shaped or hierarchical NB classifiers. However, a thorough analysis of a large number of training web pages has shown us that the features used in these pages can be independently examined to compute the category of each page. Thus, the domain for our research can readily be analysed using NB classifiers; nevertheless, in order to increase the system's accuracy, the classifier has been enhanced as described in section 3.

Enhancing the standard NB rule, or using it in collaboration with other techniques, has also been attempted by other researchers. Addin et al. in [1] coupled a NB classifier with K-Means clustering to simulate damage detection in engineering materials. NBTree in [24] induced a hybrid of NB and DTs by using the Bayes rule to construct the decision tree. Other research works ([5], [23]) have modified their NB classifiers to learn from positive and unlabeled examples, on the assumption that finding negative examples is very difficult in certain domains, particularly the medical industry. Finding negative examples for the training courses domain, however, is not at all difficult, so the above is not an issue for our research.

2.2 Decision Trees

Unlike NB classifiers, DT classifiers can cope with combinations of terms and can produce impressive results for some domains. However, training a DT classifier is quite complex, and the number of nodes created can get out of hand in some cases. According to [17], with six Boolean attributes there would be a need for 18,446,744,073,709,551,616 distinct nodes. Decision trees may be computationally expensive for certain domains; however, they make up for it by offering a genuine simplicity of interpreting models, and by helping to consider the most important factors in a dataset first, placing them at the top of the tree.

The researchers in [7], [12], [15] all used DTs to allow both the structure and the content of each web page to determine the category to which it belongs. All achieved an accuracy of under 85%. This idea is very similar to our work, as our classifier also analyses both structure and content. WebClass in [12] was designed to search geographically distributed groups of people who share common interests. WebClass modifies the standard decision tree approach by associating the tree root node with only the keywords found, depth-one nodes with descriptions and depth-two nodes with the hyperlinks found. The system, however, only achieved 73% accuracy. The second version of WebClass ([2]) implemented various classification models such as Bayes networks, DTs, K-Means clustering and SVMs in order to compare the findings of WebClassII. However, the findings showed that, for increasing feature set sizes, the overall recall fell to just 39.75%.

2.3 Neural Networks

NNs are powerful techniques for representing complex relationships between inputs and outputs. Based on the neural structure of the brain ([17]), NNs are complicated, and they can be enormous for certain domains, containing a large number of nodes and synapses. There is research that has managed to convert NNs into sets of rules in order to discover what the NN has learnt ([8], [21]); however, many other works still refer to NNs as a 'black box' approach ([18], [19]), due to the difficulty of understanding the decision-making process of the NN, which can lead to not knowing whether testing has succeeded.

AIRS in [4] used the knowledge acquired during the training of a NN to modify the user's query, making it possible for the adapted query to retrieve more documents than the original query. However, this process would sometimes give more importance to the knowledge 'learnt', changing the original query until it lost its initial keywords.

that are believed to be too common or too insignificant in distinguishing web pages from one another, otherwise known as stopwords, are also removed. Care is taken, however, to preserve the information extracted from certain Web structures such as the page TITLE, the LINK and META tag information. These are given higher weights than the rest of the text, as we believe that the information given by these structures is more closely related to the central theme of the web page.
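The stopword removal and tag weighting described above might be sketched as follows. This is a minimal illustration only: the stopword list, the tag weights and the sample page are hypothetical placeholders, not the actual values used in the paper.

```python
import re

# Hypothetical stopword list and tag weights -- the paper does not
# publish its exact lists, so these values are illustrative only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}
TAG_WEIGHTS = {"title": 3.0, "meta": 2.0, "link": 2.0, "body": 1.0}

def extract_features(tagged_text, max_features=100):
    """Build a weighted bag of features from (tag, text) pairs.

    Terms found in TITLE, META and LINK structures receive a higher
    weight than plain body text; stopwords are discarded.
    """
    weights = {}
    for tag, text in tagged_text:
        w = TAG_WEIGHTS.get(tag, 1.0)
        for term in re.findall(r"[a-z]+", text.lower()):
            if term in STOPWORDS:
                continue
            weights[term] = weights.get(term, 0.0) + w
    # Keep only the heaviest terms, mirroring the 100-feature cap
    # that each sampling unit is limited to (see section 4.2).
    top = sorted(weights.items(), key=lambda kv: -kv[1])[:max_features]
    return dict(top)

page = [("title", "Project Management Training Course"),
        ("body", "This course is an introduction to the basics of management.")]
features = extract_features(page)
```

Here "management" ends up weighted 4.0 (3.0 from the TITLE plus 1.0 from the body), while stopwords such as "the" never enter the feature set.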
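The independent-feature decision rule discussed in section 2.1 can be sketched as a minimal NB text classifier. The training examples and the Laplace smoothing scheme below are illustrative assumptions; this is the standard NB rule, not the authors' enhanced classifier.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Minimal NB classifier: each feature contributes to the category
    score independently of every other feature (the conditional
    independence assumption)."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, samples):
        for features, label in samples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def classify(self, features):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior, then one independent log-likelihood term per feature.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features:
                # Laplace smoothing keeps unseen features from zeroing the score.
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical training data for the relevant/irrelevant distinction.
nb = NaiveBayes()
nb.train([({"course", "training", "tutor"}, "relevant"),
          ({"course", "syllabus", "training"}, "relevant"),
          ({"shop", "basket", "checkout"}, "irrelevant")])
label = nb.classify({"training", "course"})
```

Because every feature is scored independently, training and classification are both a single pass over the features, which is the computational efficiency the paper attributes to NB techniques.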
test. The ROC curve is a comparison of two characteristics: TPR (true positive rate) and FPR (false positive rate). The TPR measures the number of relevant pages that were correctly identified:

TPR = TP / (TP + FN) (9)

The FPR measures the number of incorrect classifications of relevant pages out of all irrelevant test pages:

FPR = FP / (FP + TN) (10)

In the ROC space graph, the FPR and TPR values form the x and y axes respectively. Each prediction (FPR, TPR) represents one point in the ROC space. A diagonal line connects the points with coordinates (0, 0) and (1, 1). This is called the 'line of no-discrimination', and all the points along this line are considered to be completely random guesses. Points above the diagonal line indicate good classification results, whereas points below the line indicate wrong results. The best prediction (i.e. 100% sensitivity and 100% specificity), also known as 'perfect classification', would be at point (0, 1). Points closer to this coordinate show better classification results than other points in the ROC space.

4.2 Data Corpus

In this research, each web page is referred to as a sampling unit. Each sampling unit comprises a maximum of 100 features, which are selected after discarding much of the page content, as explained previously. The total number of unique features examined in the following experiments was 5217. The total number of sampling units used was 9436. These units were separated into two distinct sets: a training set and a test set.

The training set for the NB classifier consisted of 711 randomly selected, positive and negative examples (i.e. relevant and irrelevant sampling units). The test collection consisted of the data obtained from the remaining 8725 sampling units. The training set thus makes up under 10% of the entire data corpus. This was intentional, in order to really challenge the NB classifier. Compared to many other classification systems encountered, this is the smallest training set used.

4.3 Experimental Results

The first experiment carried out was to test our enhanced NB classifier against the standard Naïve Bayes algorithm, in order to determine whether or not the changes made to the original algorithm had enhanced the accuracy of the classifier. For this purpose, we stripped our system of the additional steps and executed both the standard and enhanced NB classifiers with the above training and test data. The results showed that the enhanced NB classifier was comfortably in the lead, by over 7% in both accuracy and F-Measure value.

In the second set of experiments, the sampling units analysed by the NB classifier were also run through a DT classifier and an NN classifier. The results were compared to determine which classifier is better at analysing attribute data from training web pages. The DT classifier is a 'C' program, based on the C4.5 algorithm in [16], written to evaluate data samples and find the main pattern(s) emerging from the data. For example, the DT classifier may conclude that all web pages containing a specific word are relevant. More complex data samples, however, may result in more complex configurations being found.

The NN classifier used is also a 'C' program, based on the work published in [8]-[11]. MATLAB's NN toolbox ([20]) could also have been used; however, in past experiments MATLAB managed approximately 2 training epochs in the same timeframe in which the 'C' NN classifier achieved approximately 60,000 epochs. We therefore abandoned MATLAB for the bespoke compiled NN system.

All three classifiers were initially trained with 105 sampling units and tested with a further 521 units, together comprising a total of 3900 unique features. The NB classifier achieved the highest accuracy (97.89%), precision (99.20%), recall (98.61%) and F-Measure (98.90%) values; however, the DT classifier achieved the fastest execution time. The NN classifier, created with 3900 inputs, 3900 midnodes and 1 output, came last in all metrics and in execution time.

For the most recent test, all classifiers were trained with 711 sampling units and then tested on the remaining 8725 sampling units. The NB and DT classifiers were adequately fast for exploitation and delivered good discriminations. The test results are shown in Table 2 and Table 3.

Table 2: Confusion Matrix for NB Classifier

                          PREDICTED
                          IRRELEVANT    RELEVANT
ACTUAL    IRRELEVANT      TN / 876      FP / 47
          RELEVANT        FN / 372      TP / 7430

Table 3: Confusion Matrix for DT Classifier
                          PREDICTED
                          IRRELEVANT    RELEVANT
ACTUAL    IRRELEVANT      TN / 794      FP / 129
          RELEVANT        FN / 320      TP / 7482

This result is further confirmed by the comparison of the two classifiers in the ROC space (Fig. 2), where it is shown that the result set from the NB classifier falls closer to the 'perfect classification' point than the result set from the DT classifier.

Table 5: ROC Space Results

Classifier       FPR        TPR
NB Classifier    0.05092    0.95232
DT Classifier    0.13976    0.95899

The vocabulary used in our experiments, consisting of 5217 features, was initially mapped onto a NN with 5217 inputs, one hidden layer with 5217 nodes and 1 output, in keeping with the standard configuration of a NN, where the number of midnodes is the same as the number of inputs. A fully connected network of this size would have over 27 million connections, each involving weight parameters to be learnt. Our attempt at creating such a network resulted in the NN program failing to allocate the needed memory and crashing.
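The ROC coordinates in Table 5 can be recomputed directly from the confusion matrices in Table 2 and Table 3 using equations (9) and (10); the short check below reproduces them to within rounding.

```python
# Recompute the Table 5 ROC coordinates from the confusion matrices
# in Table 2 (NB) and Table 3 (DT).
def tpr(tp, fn):
    # True positive rate, equation (9): TP / (TP + FN)
    return tp / (tp + fn)

def fpr(fp, tn):
    # False positive rate, equation (10): FP / (FP + TN)
    return fp / (fp + tn)

# NB classifier (Table 2): TN=876, FP=47, FN=372, TP=7430
nb_point = (fpr(47, 876), tpr(7430, 372))
# DT classifier (Table 3): TN=794, FP=129, FN=320, TP=7482
dt_point = (fpr(129, 794), tpr(7482, 320))
```

The NB point (about 0.0509, 0.9523) lies closer to the 'perfect classification' corner (0, 1) than the DT point (about 0.1398, 0.9590), matching the comparison drawn from Fig. 2.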
in accuracy, in comparison with the original Naïve Bayes algorithm.

for Innovative and Collaborative Engineering (CICE) for funding our work.