Data Mining, Machine Learning and Big Data Analytics

International Transaction of Electrical and Computer Engineers System, 2017, Vol. 4, No. 2, 55-61
Available online at http://pubs.sciepub.com/iteces/4/2/2
©Science and Education Publishing
DOI:10.12691/iteces-4-2-2
Lidong Wang*

Department of Engineering Technology, Mississippi Valley State University, Itta Bena, MS, USA
*Corresponding author: [email protected]
Abstract This paper analyzes deep learning and traditional data mining and machine learning methods; compares the advantages and disadvantages of the traditional methods; and introduces enterprise needs, systems and data, IT challenges, and Big Data in an extended service infrastructure. The feasibility and challenges of applying deep learning and traditional data mining and machine learning methods to Big Data analytics are also analyzed and presented.
Keywords: big data, Big Data analytics, data mining, machine learning, deep learning, information technology,
data engineering
Cite This Article: Lidong Wang, “Data Mining, Machine Learning and Big Data Analytics.” International
Transaction of Electrical and Computer Engineers System, vol. 4, no. 2 (2017): 55-61. doi: 10.12691/iteces-4-2-2.
… min-max normalization, z-score normalization, and normalization by decimal scaling.

The purposes of this paper are to 1) analyze deep learning and traditional data mining and machine learning methods (including k-means, k-nearest neighbor, support vector machines, decision trees, logistic regression, Naive Bayes, neural networks, bagging, boosting, and random forests); 2) compare the advantages and disadvantages of the traditional methods; 3) introduce enterprise needs, systems and data, IT challenges, and Big Data in an extended service infrastructure; and 4) discuss the feasibility and challenges of the applications of deep learning and traditional data mining and machine learning methods in Big Data analytics.

2. Some Methods in Data Mining and Machine Learning

…

k-NN involves assigning an object the class of its nearest neighbor or of the majority of its nearest neighbors. Specifically, k-NN classification finds the k training instances that are closest to the unseen instance and takes the most commonly occurring classification for these k instances. Several key issues affect the performance of k-NN. One is the choice of k: if k is too small, the result can be sensitive to noise points; on the other hand, if k is too large, the neighborhood may include too many points from other classes. An estimate of the best value for k can be obtained by cross-validation. Given enough samples, larger values of k are more resistant to noise [12,13]. The k-NN algorithm for classification is a very simple 'instance-based' learning algorithm. Despite its simplicity, it can offer very good performance on some problems [3]. Important properties of the k-NN algorithm are [11]: 1) it is simple to implement and use; 2) it needs a lot of space to store all objects.
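To make the choice of k concrete, the following is a minimal sketch (assuming scikit-learn and a synthetic dataset; the candidate values of k are illustrative) of k-NN classification with k selected by cross-validation:

```python
# k-NN classification with k chosen by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Small k is sensitive to noise points; large k pulls in points from other
# classes. Cross-validation estimates the best trade-off.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9, 15)}
best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, accuracy = {scores[best_k]:.3f}")
```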
… partitioning trees and conditional inference trees are more efficient for representing certain classes of functions; particularly for those involved in visual recognition, they can represent more complex functions with less "hardware". SVMs and kernel methods are not deep. Classification trees are not deep either, because there is no hierarchy of features. Deep learning involves non-convex loss functions, and deep supervised learning is non-convex [20]. Deep learning has potential in dealing with big data, although there are challenges. Some methods have been proposed for using unlabeled data in deep neural network-based architectures. These methods either perform a greedy layer-wise pre-training of weights using unlabeled data alone followed by supervised fine-tuning, or learn unsupervised encodings at multiple levels of the architecture jointly with a supervised signal. For the latter, the basic setup is as follows: 1) choose an unsupervised learning algorithm; 2) choose a model with a deep architecture; 3) plug the unsupervised learning into any (or all) layers of the architecture as an auxiliary task; and 4) train the supervised and unsupervised tasks using the same architecture simultaneously [21].
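A minimal sketch of this joint setup follows (assuming a PyTorch-style implementation; the architecture, auxiliary reconstruction task, and loss weighting are illustrative, not taken from [21]):

```python
# Joint supervised + unsupervised training: an autoencoder reconstruction task
# is attached to a hidden layer of a classifier, and both losses are trained
# simultaneously, so unlabeled data can also shape the representation.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, n_in=20, n_hidden=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.classifier = nn.Linear(n_hidden, n_classes)  # supervised head
        self.decoder = nn.Linear(n_hidden, n_in)          # unsupervised auxiliary head

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.decoder(h)

model = JointModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

x_lab = torch.randn(32, 20)            # labeled batch (synthetic stand-in)
y_lab = torch.randint(0, 2, (32,))
x_unl = torch.randn(128, 20)           # unlabeled batch

for _ in range(100):
    opt.zero_grad()
    logits, recon_lab = model(x_lab)
    _, recon_unl = model(x_unl)
    # Supervised loss on labeled data + reconstruction loss on all data.
    loss = ce(logits, y_lab) + 0.1 * (mse(recon_lab, x_lab) + mse(recon_unl, x_unl))
    loss.backward()
    opt.step()
```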
… principal components analysis) is often used to …

2.8. Comparison of Different Methods and Ensemble Methods

Table 1 compares the advantages and disadvantages of traditional data mining (DM) and machine learning (ML) methods.

Ensemble methods increase the accuracy of classification or prediction. Bagging, boosting, and random forests are the three most common methods in ensemble learning. The bootstrap (or bagged) classifier is often better than a single classifier derived from the original training set; the increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, a bagged predictor improves the accuracy over a single predictor and is robust to overfitting and noisy data. Bootstrap methods can be used not only to assess a model's discrepancy but also to improve its accuracy. Bagging and boosting use a combination of models and combine the results of more than one method. Both bagging and boosting can be used for classification as well as prediction [6,7,8,18].
Table 1. Advantages and Disadvantages of Traditional DM/ML Methods

The k-means method [22,23]
Advantages:
• Relatively efficient
• Can process large data sets.
Disadvantages:
• Often terminates at a local optimum.
• Applicable only when the mean is defined.
• Not applicable for categorical data.
• Unable to handle noisy data.
• Very difficult in handling data of mixed types.

Support vector machine (SVM) [5,15,22]
Advantages:
• Robust to outliers on the predictors
• Can utilize predictive power of linear combinations of inputs
• Good prediction in a variety of situations
• Low generalization error
• Easy to interpret results
Disadvantages:
• Weak in natural handling of mixed data types and computational scalability
• Very black box
• Sensitive to tuning parameters and kernel choice
• Training an SVM on a large data set can be slow
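As a concrete illustration of the k-means row of Table 1, the following minimal sketch (assuming scikit-learn and synthetic data) shows the standard mitigation for local optima, namely several random restarts:

```python
# k-means is efficient on large numeric data sets but only finds a local
# optimum, so several random initializations (n_init) are standard practice.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)  # keep the best of 10 runs
print(km.inertia_, km.cluster_centers_.shape)
```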
Bagging, which stands for bootstrap aggregation, is an ensemble classification method that uses multiple bootstrap samples (drawn with replacement) from the input training data to create slightly different training sets [1]. Bagging is the idea of collecting a random sample of observations into a bag; multiple bags are made up of observations randomly selected from the original training dataset [14]. Bagging is a voting method: it uses the bootstrap to build different training sets, uses those training sets to build different base learners, and employs the combination of base learners to make a better prediction [7].
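A minimal sketch of bagging (assuming scikit-learn and synthetic data; the tree base learner and estimator count are illustrative):

```python
# Bagging: each base learner is trained on a bootstrap sample drawn with
# replacement, and their votes are combined.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(single, n_estimators=50, bootstrap=True, random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```

Averaging many high-variance trees is what drives the variance reduction described above.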
Boosting is also an ensemble method; it attempts to build better learning algorithms by combining multiple simpler algorithms [24]. Boosting is similar to bagging, but it constructs the base learners in sequence, where each successive learner is built on the prediction residuals of the preceding learner. To create a complementary learner, it uses the mistakes made by previous learners to train the next base learner. Boosting trains the base classifiers on different samples [1,7]. Boosting can fail to perform well if there is insufficient data or if the weak models are overly complex; it is also susceptible to noise [14]. The most popular boosting algorithm is AdaBoost, which is "adaptive." AdaBoost is extremely simple to use and implement (far simpler than SVMs) and often gives very effective results [24]. AdaBoost works with numeric and nominal values. It has low generalization error, is easy to code, works with most classifiers, and has no parameters to adjust; however, it is sensitive to outliers [5].
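A minimal AdaBoost sketch (assuming scikit-learn and synthetic data; the estimator count is illustrative):

```python
# AdaBoost: shallow trees are fit in sequence, each one reweighting the
# examples that the previous learners misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, random_state=0)  # default base learner: decision stumps
print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```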
Although bagging and randomization yield similar results, it sometimes pays to combine them because they introduce randomness in different and perhaps complementary ways. A popular algorithm for learning random forests builds a randomized decision tree in each iteration of the bagging algorithm and often produces excellent predictors [16]. The random forests method is a tree-based ensemble approach that is actually a combination of many models [1,15]. It is an ensemble classifier that consists of many decision trees [25]. A random forest grows many classification trees, obtaining multiple results from a single input: it uses the majority of votes from all the decision trees to classify data, or an average output for regression [7].
Random forest models are generally very competitive with nonlinear classifiers such as artificial neural networks and support vector machines. A random forest model is a good choice for model building because it requires very little pre-processing of the data and no data normalization, and it is resilient to outliers. The need for variable selection is avoided because the algorithm effectively does its own. Because many trees are built using two levels of randomness (observations and variables), each tree is effectively an independent model. The random forest algorithm builds multiple decision trees using a concept called bagging to introduce random sampling into the whole process. In building each decision tree, the random forest algorithm generally does not perform any pruning. Overfitted models tend not to perform well on new data; however, a random forest of overfitted trees can deliver a very good model that performs well on new data [14].
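A minimal random forest sketch (assuming scikit-learn and synthetic data; the hyperparameters are illustrative) showing the two levels of randomness described above:

```python
# Random forest: bagging over observations plus random feature selection at
# each split, with unpruned trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)
rf = RandomForestClassifier(n_estimators=200,     # many effectively independent trees
                            max_features="sqrt",  # second level of randomness: features per split
                            random_state=0)
print("random forest accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```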
3. Big Data in Service Infrastructure and IT Challenges

As enterprise data challenges continue to grow (see Table 2 [26]), traditional technologies have challenges in handling unstructured, Cloud, and Big Data sources. Table 3 [27] shows Big Data as part of a virtualized service infrastructure. Hardware infrastructure is virtualized with cloud computing technologies; on top of this cloud-based infrastructure, Software as a Service (SaaS) is offered; and on top of SaaS, Business Processes as a Service (BPaaS) can be built. In parallel, Big Data will be offered as a service and embedded as the precondition for Knowledge services, e.g., the integration of Semantic Technologies for the analysis of unstructured and aggregated data. Big Data as a Service can be treated as an extended layer between PaaS and SaaS. Knowledge workers or data scientists are needed to run Big Data and Knowledge services.
Table 2. Enterprise Needs, Systems and Data, and IT Challenges

Business Needs:
• Access all information of value
• Business capability and value driven
• Virtualized & unified semantic business views of data
• Fast, iterative, self-service, pervasive
• Right information to right user at right time

Systems and Data:
• Inventory System (MS SQL Server)
• Billing System (Web Service-REST)
• Customer Relationship Management (CRM) (MySQL)
• Big Data, Cloud (Hadoop, Web)
• Customer Voice (Internet, Unstructured)
• Product Catalog (Web Service-SOAP)
• Product Data (CSV)
• Log Files (.txt/.log files)

IT Challenges:
• Data silos
• Exponential data growth
• Unstructured, Web & Big Data
• IT complexity, rigidity
• Inherent latency
• Move to Cloud
• High costs
Table 3. Big Data as Part of a Virtualized Service Infrastructure

Layer 1: Business Process as a Service (BPaaS), Knowledge as a Service (KaaS)
Layer 2: Software as a Service (SaaS), Big Data as a Service (BDaaS)
Layer 3 (Cloud Infrastructure): Platform as a Service (PaaS)
Layer 4 (Cloud Infrastructure): Infrastructure as a Service (IaaS)
4. Data Mining and Machine Learning in Big Data Analytics

Hadoop is a tool of Big Data analytics and an open-source implementation of MapReduce. The following brief list identifies the MapReduce implementations of …

… models [32]. The Variety characteristic of Big Data analytics focuses on the variation of the input data types and domains in big data. Domain adaptation during learning is an important focus of study in deep learning, where the distribution of the training data differs from the distribution of the test data. In some big data domains, e.g., cyber security, the input corpus consists of a mix of …
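As background for the MapReduce model that Hadoop implements (mentioned at the start of this section), here is a minimal sketch of the pattern in Hadoop Streaming style; the word-count job, function names, and local sort standing in for the shuffle phase are all illustrative:

```python
# MapReduce pattern: a mapper emits key-value pairs and a reducer
# aggregates the values per key.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):  # pairs must arrive sorted by key
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    mapped = sorted(mapper(sys.stdin))  # the shuffle/sort phase, done locally here
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```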
… however, it is very difficult for the method to deal with big data with complex models. Bagging, boosting, and random forests are the three most common ensemble methods that use a combination of models to increase accuracy. Traditional technologies have challenges in handling unstructured and big data sources. Big Data as a Service (BDaaS) can be an extended layer in the service infrastructure. Traditional data mining and machine learning (ML) techniques such as k-means, k-NN, decision trees, and SVM are unsuitable for handling big data. Deep learning has potential in dealing with big data, although there are challenges.

References

[1] Zaki MJ, Meira W Jr. Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press; 2014 May 12.
[2] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006 Jul 28; 313(5786): 504-507.
[3] Wikibooks. Data Mining Algorithms in R. PDF generated using the open-source mwlib toolkit (http://code.pediapress.com/); 2014 Jul 14.
[4] Jackson J. Data Mining: A Conceptual Overview. Communications of the Association for Information Systems. 2002 Mar 22; 8(1): 19.
[5] Harrington P. Machine learning in action. Greenwich, CT: Manning; 2012 Apr 16.
[6] Giudici P. Applied data mining: statistical methods for business and industry. John Wiley & Sons Ltd; 2003.
[7] Yu-Wei CD. Machine learning with R cookbook. Packt Publishing Ltd; 2015 Mar 26.
[8] Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011 Jun 9.
[9] Tutorialspoint. Data Mining: data pattern evaluation. Tutorials Point (I) Pvt. Ltd; 2014.
[10] Andreopoulos B. Literature Survey of Clustering Algorithms. Workshop, Department of Computer Science and Engineering, York University, Toronto, Canada; 2006 June 27.
[11] Sharma S, Gupta RK. Intrusion Detection System: A Review. International Journal of Security and Its Applications. 2015; 9(5): 69-76.
[12] Kumar V, Wu X, editors. The top ten algorithms in data mining. CRC Press; 2009.
[13] Bramer M. Principles of data mining. London: Springer; 2007 Mar 6.
[14] Williams G. Data mining with Rattle and R: The art of excavating data for knowledge discovery. Springer Science & Business Media; 2011 Aug 4.
[15] Clark M. An introduction to machine learning: with applications in R. University of Notre Dame, USA; 2013.
[16] Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2016 Oct 1.
[17] Shmueli G, Patel NR, Bruce PC. Data Mining in Excel: Lecture Notes and Cases. Resampling Stats, Inc., USA; 2005 December 30.
[18] Ledolter J. Data mining and business analytics with R. John Wiley & Sons; 2013 May 28.
[19] LISA Lab. Deep Learning Tutorial. University of Montreal, Canada; 2015 September.
[20] LeCun Y, Ranzato M. Deep learning tutorial. In: Tutorials of the International Conference on Machine Learning (ICML'13); 2013 Jun.
[21] Weston J, Ratle F, Mobahi H, Collobert R. Deep learning via semi-supervised embedding. In: Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg; 2012, 639-655.
[22] Andreopoulos B. Literature Survey of Clustering Algorithms. Workshop, Department of Computer Science and Engineering, York University, Toronto, Canada; 2006 June 27.
[23] Sharma S, Gupta RK. Intrusion Detection System: A Review. International Journal of Security and Its Applications. 2015; 9(5): 69-76.
[24] Hertzmann A, Fleet D. Machine Learning and Data Mining Lecture Notes. Computer Science Department, University of Toronto; 2010.
[25] Karatzoglou A. Machine Learning in R. Workshop, Telefonica Research, Barcelona, Spain; 2010 December 15.
[26] Viña A. Data Virtualization Goes Mainstream. White Paper, Denodo Technologies; 2015.
[27] Curry E, Kikiras P, Freitas A, et al. Big Data Technical Working Groups. White Paper, BIG Consortium; 2012.
[28] Suthaharan S. Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning. Performance Evaluation Review. 2014 Mar; 41(4): 70-73.
[29] Hido S, Tokui S, Oda S. Jubatus: An Open Source Platform for Distributed Online Machine Learning. Technical Report of the Joint Jubatus Project by Preferred Infrastructure Inc. and NTT Software Innovation Center, Tokyo, Japan; NIPS 2013 Workshop on Big Learning, Lake Tahoe; 2013 Dec 9, 1-6.
[30] Chen CLP, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. 2014; 275(10): 314-347.
[31] Lee KM. Grid-based Single Pass Classification for Mixed Big Data. International Journal of Applied Engineering Research. 2014; 9(21): 8737-8746.
[32] Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. Journal of Big Data. 2015; 2(1).
[33] Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. Journal of Big Data. 2015 Feb 24; 2(1): 1.