2. Application of Machine Learning
This paper discusses three algorithms, and there has been a considerable amount of work on each of them. In the paper 'Improved naive Bayes classification algorithm for traffic risk management' by Hong Chen, Songhua Hu, Rui Hua and Xiuju Zhao, it is mentioned that the naive Bayes classification algorithm is widely used in big data analysis and other fields because of its simple and fast algorithm structure. To address the shortcomings of the naive Bayes classification algorithm, the paper uses feature weighting and Laplace calibration to improve it, obtaining an improved naive Bayes classification algorithm. Through numerical simulation, it is found that when the sample size is large, the accuracy of the improved algorithm is more than 99% and very stable; when the number of sample attributes is less than 400 and the number of categories is less than 24, the accuracy is more than 95%. Through empirical research, it is found that the improved algorithm greatly improves the correct rate of discriminant analysis, from 49.5% to 92%. Through robustness analysis, the improved naive Bayes classification algorithm shows higher accuracy. [1]
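The alpha parameter of scikit-learn's MultinomialNB applies this kind of Laplace smoothing (alpha=1.0 is the classic Laplace estimate). The sketch below is only an illustration of the idea, not the authors' algorithm: the count data is synthetic and the per-feature weights are invented for the example, not the paper's weighting scheme.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Toy count data: 200 samples, 6 non-negative integer features, 2 classes.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[1, 2, 3, 1, 2, 3], size=(200, 6))
y = (X[:, :3].sum(axis=1) > X[:, 3:].sum(axis=1)).astype(int)

# Illustrative feature weights (not the paper's scheme): scale each
# column before fitting so some features count more than others.
weights = np.array([1.0, 1.5, 2.0, 1.0, 1.5, 2.0])
Xw = X * weights

X_train, X_test, y_train, y_test = train_test_split(Xw, y, random_state=0)

# alpha=1.0 is Laplace smoothing: it keeps zero-count feature/class
# combinations from producing zero probabilities.
clf = MultinomialNB(alpha=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))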
In the paper 'KNN Model-Based Approach in Classification' by Gongde Guo, Hui Wang, David Bell, Yaxin Bi and Kieran Greer, it is mentioned that k-nearest-neighbours (kNN) is a simple but effective method for classification. The major drawbacks of kNN are (1) its low efficiency, since being a lazy learning method prohibits it in many applications such as dynamic web mining for a large repository, and (2) its dependency on the selection of a 'good value' for k. In this paper, the authors propose a novel kNN-type method for classification that aims to overcome these shortcomings. Their method constructs a kNN model for the data, which replaces the data itself as the basis of classification. The value of k is automatically determined, varies for different data, and is optimal in terms of classification accuracy. The construction of the model reduces the dependency on k and makes classification faster. Experiments were carried out on public datasets from the UCI machine learning repository to test the method. The experimental results show that the kNN-based model compares well with C5.0 and kNN in terms of classification accuracy, but is more efficient than standard kNN. [2]
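The paper's model construction is not reproduced here, but the dependency on a 'good value' of k that it targets can be illustrated with a plain cross-validated grid search over k, a common stand-in for hand-picking k. This is a minimal scikit-learn sketch on the built-in iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search candidate k values with 5-fold cross-validation instead of
# fixing k by hand.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
)
search.fit(X_train, y_train)
print("best k:", search.best_params_["n_neighbors"])
print("test accuracy:", search.score(X_test, y_test))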
In 'Random Forests and Decision Trees' by Jehad Ali, Rehanullah Khan, Nasir Ahmad and Imran Maqsood, the authors compare the classification results of two models, Random Forest and J48, for classifying twenty versatile datasets. They took 20 datasets available from the UCI repository, with instance counts varying from 148 to 20,000. The classification parameters consist of correctly classified instances, incorrectly classified instances, F-measure, precision, accuracy and recall. They discuss the pros and cons of using these models for large and small datasets. The classification results show that Random Forest gives better results for the same number of attributes on large datasets, i.e. those with a greater number of instances, while J48 is handy with small datasets (fewer instances). The results from the breast cancer datasets show that when the number of instances increased from 286 to 699, the percentage of correctly classified instances for Random Forest increased from 69.23% to 96.13%; that is, for datasets with the same number of attributes but more instances, Random Forest accuracy increased. [3]
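A minimal scikit-learn sketch of the same kind of comparison is given below. It uses the library's bundled Wisconsin diagnostic breast cancer data (569 instances, related to but not one of the paper's exact datasets), and scikit-learn's CART-based DecisionTreeClassifier stands in for WEKA's J48 (C4.5).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled Wisconsin diagnostic breast cancer data (569 instances).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# DecisionTreeClassifier is CART-based, used here as a stand-in for J48.
for name, model in [
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))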
In 'Do we need hundreds of classifiers to solve real world classification problems?' by Amorim, D.G., Barro, S., Cernadas, E. and Delgado, M.F., the authors evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian methods, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multivariate adaptive regression splines and other methods). They use 121 datasets from the UCI database to study classifier behaviour independently of the dataset collection. The winners are the random forest (RF) versions implemented in R (and accessed via caret) and the SVM with Gaussian kernel implemented in C using LibSVM. [4]
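scikit-learn's SVC class is itself a LibSVM wrapper, so the study's runner-up can be sketched as below. This is a minimal example on the built-in wine data, with illustrative C and gamma settings rather than tuned values:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="rbf" is the Gaussian kernel singled out in the study.
# Feature scaling matters for RBF-kernel SVMs, hence the pipeline.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))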
In 'Trends in extreme learning machines: a review' by Huang, G., Huang, G., Song, S. and You, K., the authors aim to report the current state of theoretical research and practical advances on the extreme learning machine (ELM). Apart from classification and regression, ELM has recently been extended to clustering, feature selection, representational learning and many other learning tasks. Due to its remarkable efficiency, simplicity and impressive generalization performance, ELM has been applied in a variety of domains, such as biomedical engineering, computer vision, system identification, and control and robotics. [5]
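The basic single-hidden-layer ELM that the review builds on is simple enough to sketch in NumPy: the hidden-layer weights are drawn at random and never trained, and only the output weights are obtained, via a single least-squares solve. This is a minimal sketch on the iris data, not the review's code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_hidden = 50

# Random, fixed hidden layer: these weights are never updated.
W = rng.normal(size=(X.shape[1], n_hidden))
b = rng.normal(size=n_hidden)

def hidden(A):
    return np.tanh(A @ W + b)

# One-hot targets; output weights come from one least-squares solve.
T = np.eye(3)[y_train]
beta, *_ = np.linalg.lstsq(hidden(X_train), T, rcond=None)

pred = hidden(X_test) @ beta
print("test accuracy:", (pred.argmax(axis=1) == y_test).mean())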
3. Methodology
The methodology followed consisted of 9 main steps. The coding was done in the Python programming language and was executed using Anaconda. The algorithm of the program was as follows:
STEP 1: START
STEP 2: We import the scikit-learn (sklearn) package
STEP 3: We split the data frame into categorical and numeric features to perform
preprocessing, as sketched below
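A minimal pandas sketch of such a split is shown below; the example frame and its columns are hypothetical, since the project's actual dataset is not reproduced here:

import pandas as pd

# Hypothetical example frame standing in for the project's dataset.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000.0, 52000.0, 61000.0],
    "gender": ["F", "M", "F"],
    "city": ["Pune", "Delhi", "Mumbai"],
})

# Split columns by dtype so each group can be preprocessed separately
# (e.g. scaling for numeric features, encoding for categorical ones).
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")
print(list(numeric_df.columns), list(categorical_df.columns))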
4. Results
The train score tells us how well the model generalized or fitted the training data. If the model fits too closely to data with a lot of variance, this causes over-fitting: the model curves a lot to fit the training data and generalizes very poorly, which leads to a poor test score.

The test score is computed once our model is ready. Before this step we have not touched this data set, so it represents a real-life scenario. The higher the score, the better the model generalizes.
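For illustration, the gap between the two scores can be seen by fitting an unpruned decision tree, which tends to memorize its training data. This is a minimal scikit-learn sketch, not this project's exact pipeline:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree can fit the training data perfectly (train score 1.0)
# while scoring noticeably lower on held-out data: over-fitting.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))
print("test score:", model.score(X_test, y_test))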
[Chart: Naive Bayes train and test scores]
[Chart: Decision Tree train and test scores]
It is observed from the above charts that, out of the three machine learning algorithms, Decision Tree scored the highest.

5. Conclusion
From the findings of this project, it can be concluded that Decision Tree is the most accurate algorithm, followed by Naïve Bayes and KNN. Both the train and test scores of the Decision Tree model were the highest of the three.
Machine learning has great uses in the future. It is used in internet search engines, email filters to sort out spam, websites that make personalised recommendations, banking software that detects unusual transactions, and many apps on our phones, such as voice recognition. It is efficient because it easily identifies trends and patterns.
References
1. Hong Chen, Songhua Hu, Rui Hua and Xiuju Zhao, 'Improved naive Bayes classification algorithm for traffic risk management' [1]
2. Gongde Guo, Hui Wang, David Bell, Yaxin Bi and Kieran Greer, 'KNN Model-Based Approach in Classification' [2]
3. Jehad Ali, Rehanullah Khan, Nasir Ahmad and Imran Maqsood, 'Random Forests and Decision Trees' [3]
4. Amorim, D.G., Barro, S., Cernadas, E. and Delgado, M.F., 'Do we need hundreds of classifiers to solve real world classification problems?', Journal of Machine Learning Research, 2014 [4]
5. Huang, G., Huang, G., Song, S. and You, K., 'Trends in extreme learning machines: a review', Neural Networks, 2015 [5]