
International Transaction of Electrical and Computer Engineers System, 2017, Vol. 4, No. 2, 55-61
Available online at http://pubs.sciepub.com/iteces/4/2/2
©Science and Education Publishing
DOI: 10.12691/iteces-4-2-2

Data Mining, Machine Learning and Big Data Analytics


Lidong Wang*

Department of Engineering Technology, Mississippi Valley State University, Itta Bena, MS, USA
*Corresponding author: [email protected]

Abstract This paper analyses deep learning and traditional data mining and machine learning methods; compares the advantages and disadvantages of the traditional methods; and introduces enterprise needs, systems and data, IT challenges, and Big Data in an extended service infrastructure. The feasibility and challenges of the applications of deep learning and traditional data mining and machine learning methods in Big Data analytics are also analyzed and presented.
Keywords: big data, Big Data analytics, data mining, machine learning, deep learning, information technology,
data engineering
Cite This Article: Lidong Wang, “Data Mining, Machine Learning and Big Data Analytics.” International
Transaction of Electrical and Computer Engineers System, vol. 4, no. 2 (2017): 55-61. doi: 10.12691/iteces-4-2-2.

1. Introduction

Data mining focuses on the knowledge discovery of data. Machine learning concentrates on prediction based on training and learning. Data mining uses many machine learning methods; machine learning also uses data mining methods as pre-processing for better learning and accuracy. Machine learning includes both supervised and unsupervised learning methods. Data mining has six main tasks: clustering, classification, regression, anomaly or outlier detection, association rule learning, and summarization. The feasibility and challenges of the applications of data mining and machine learning in big data have been a research topic, although there are many challenges. Data dimension reduction is one of the issues in processing big data.

High-dimensional data can cause problems for data mining and machine learning, although high dimensionality can help in certain situations, for example, nonlinear classification. Nevertheless, it is important to check whether the dimensionality can be reduced while preserving the essential properties of the full data matrix [1]. Dimensionality reduction facilitates the classification, communication, visualization, and storage of high-dimensional data. The most widely used method in dimensionality reduction is principal component analysis (PCA). PCA is a simple method that finds the directions of greatest variance in the dataset and represents each data point by its coordinates along each of these directions [2]. The direction with the largest projected variance is called the first principal component. The orthogonal direction that captures the second largest projected variance is called the second principal component, and so on [1]. PCA is useful when there are a large number of variables within the data and there is some redundancy in those variables. In this situation, redundancy means that some of the variables are correlated with one another. Because of this redundancy, PCA can be used to reduce the observed variables into a smaller number of principal components [3].

Factor analysis is another method for dimensionality reduction. It is useful for understanding the underlying reasons for the correlations among a group of variables. The main applications of factor analysis are reducing the number of variables and detecting structure in the relationships among variables. Therefore, factor analysis is often used as a structure detection or data reduction method. Specifically, it is used to find the hidden factors behind observed variables and reduce the number of intercorrelated variables. In factor analysis, it is assumed that some unobservable latent variables generate the observed data. The data is assumed to be a linear combination of the latent variables and some noise. The number of latent variables is possibly less than the number of variables in the observed data, which fulfils the dimensionality reduction [4,5].

In practical applications, the proportions of 75% and 25% are often used for the training and validation datasets, respectively. However, the most frequently used method, especially in the field of neural networks, is dividing the data set into three blocks: training, validation, and testing. The testing data will not be used in the modelling phase [6]. The k-fold cross-validation technique is a common technique that is used to estimate the performance of a classifier because it overcomes the problem of over-fitting [7]. In k-fold cross-validation, the initial data is randomly partitioned into k mutually exclusive subsets or "folds". Training and testing are performed k times. Each sample is used the same number of times for training and once for testing [8]. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). There are many methods for data normalization, such as min-max normalization, z-score normalization, and normalization by decimal scaling.
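As a concrete illustration of these preprocessing steps (an editorial sketch, not part of the original paper), the following Python fragment applies min-max normalization, reduces dimensionality with PCA, and estimates classifier accuracy with k-fold cross-validation. It assumes scikit-learn is available; the bundled dataset is only a placeholder.

```python
# Illustrative sketch only (assumes scikit-learn); the dataset is a placeholder.
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Min-max normalization keeps large-range attributes from dominating distance
# computations; PCA then keeps the directions of greatest variance.
model = make_pipeline(MinMaxScaler(),
                      PCA(n_components=2),
                      KNeighborsClassifier())

# 5-fold cross-validation: each sample is used exactly once for testing.
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold accuracies:", scores, "mean:", round(scores.mean(), 3))
```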

The purposes of this paper are to: 1) analyze deep learning and traditional data mining and machine learning methods (including k-means, k-nearest neighbor, support vector machines, decision trees, logistic regression, Naive Bayes, neural networks, bagging, boosting, and random forests); 2) compare the advantages and disadvantages of the traditional methods; 3) introduce enterprise needs, systems and data, IT challenges, and Big Data in an extended service infrastructure; and 4) discuss the feasibility and challenges of the applications of deep learning and traditional data mining and machine learning methods in Big Data analytics.

2. Some Methods in Data Mining and Machine Learning

2.1. k-means, k-modes, k-prototypes and Clustering Analysis

Clustering methods can be classified into the following categories: partitioning methods, hierarchical methods, model-based methods, grid-based methods, density-based methods, and constraint-based methods. The main advantage of clustering over classification is its adaptability to changes and its help in singling out useful features that distinguish different groups [9]. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of clustering depends upon the appropriateness of the method for the dataset, the (dis)similarity measure used, and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns. Types of data in clustering analysis include nominal (categorical) variables, interval-scaled variables, binary variables, ordinal variables, and mixed types [10].

k-means uses a greedy iterative approach to find a clustering that minimizes the sum of squared errors (SSE). It possibly converges to a local optimum instead of a global optimum [1]. Important properties of the k-means algorithm include [11]: 1) it is efficient in processing large data sets; 2) it works only on numerical values; 3) clusters have convex shapes. Users need to specify k (the number of clusters) in advance. The method possibly terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms. The k-means method is not applicable to categorical data, while k-modes is a method for categorical data that uses modes. k-modes uses new dissimilarity measures to deal with categorical objects and a frequency-based method to update the modes of clusters. The k-prototypes method can deal with a mixture of categorical and numerical data [10].
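The short sketch below (an editorial illustration, not from the paper) runs k-means with several random restarts, since the greedy optimization can stop at a local optimum of the SSE. It assumes scikit-learn and uses synthetic numeric data as a placeholder.

```python
# Illustrative k-means sketch (assumes scikit-learn); k-means works only on
# numerical values and k must be chosen in advance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # synthetic cluster 1
               rng.normal(5, 1, (100, 2))])     # synthetic cluster 2

# n_init restarts the greedy optimization from several random initializations,
# which lowers (but does not remove) the risk of a poor local optimum.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
print("SSE (inertia):", round(km.inertia_, 2))
```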
2.2. k-Nearest Neighbors

k-nearest neighbor (k-NN) classification finds a group of k objects in the training set that are closest to the test object and bases the assignment of a label on the predominance of a particular class in this neighborhood. k-NN involves assigning an object the class of its nearest neighbor or of the majority of its nearest neighbors. Specifically speaking, the k-NN classification finds the k training instances that are closest to the unseen instance and takes the most commonly occurring classification for these k instances. There are several key issues that affect the performance of k-NN. One is the choice of k. If k is too small, the result can be sensitive to noise points. On the other hand, if k is too large, the neighborhood may include too many points from other classes. An estimate of the best value for k can be obtained by cross-validation. Given enough samples, larger values of k are more resistant to noise [12,13]. The k-NN algorithm for classification is a very simple 'instance-based' learning algorithm. Despite its simplicity, it can offer very good performance on some problems [3]. Important properties of the k-NN algorithm are [11]: 1) it is simple to implement and use; 2) it needs a lot of space to store all objects.
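As suggested above, k can be estimated by cross-validation. The following sketch (an editorial illustration, not from the paper) does this with a small grid search over k; it assumes scikit-learn, and the dataset and candidate values of k are placeholders.

```python
# Illustrative k-NN sketch (assumes scikit-learn); k is chosen by
# cross-validation, and features are scaled so no attribute dominates.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 15]},
    cv=5,
)
search.fit(X, y)
print("best k:", search.best_params_["kneighborsclassifier__n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```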
2.3. Support Vector Machine

Support vector machines (SVM) is a supervised learning method used for classification and regression tasks [3]. SVM has been found to work well on problems that are sparse, nonlinear, and high-dimensional. An advantage of the method is that building the model only uses the support vectors rather than the whole training dataset. Hence, the size of the training set is usually not a problem. Also, the model is less affected by outliers due to only using the support vectors to build it. A disadvantage is that the algorithm is sensitive to the choice of tuning options (e.g., the type of transformations to perform). This makes it time-consuming and harder to use for the best model. Another disadvantage is that the transformations are performed during both building the model and scoring new data. This makes it computationally expensive. SVM works with numeric and nominal values; SVM classification supports both binary and multiclass targets [14].
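The following sketch (an editorial illustration, not from the paper) trains an SVM classifier with an RBF kernel; it assumes scikit-learn, and the kernel choice and parameter values stand in for the tuning options discussed above.

```python
# Illustrative SVM sketch (assumes scikit-learn). The kernel and its
# parameters are the tuning options the text warns about; scaling the
# inputs first is usually essential for RBF kernels.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

# Only the support vectors are needed to score new data.
print("support vectors per class:", clf.named_steps["svc"].n_support_)
print("test accuracy:", round(clf.score(X_test, y_test), 3))
```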
2.4. Trees and Logistic Regression

Decision trees used in data mining include two main types: 1) classification trees for predicting the class to which the data belongs; and 2) regression trees for predicting an outcome that is a real number. Classification trees and regression trees provide different approaches to prediction [15]. When constructing a tree, measures such as statistical significance, information gain, the Gini index, and so on can be used to assess the performance of a split. When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Pruned trees tend to be smaller and less complex, and thus easier to comprehend. They are usually faster and better at correctly classifying independent test data [8]. There are two approaches to prune a tree: 1) pre-pruning, in which the tree is pruned by halting its construction early; and 2) post-pruning, in which a sub-tree is removed from a fully grown tree [9]. A strategy of post-pruning (sometimes called backward pruning) rather than pre-pruning (or forward pruning) is often adopted after building a complete tree [16]. Both recursive partitioning trees and conditional inference trees are nonparametric, work on both classification and regression problems, and are very flexible and easy to interpret, while they are prone to over-fitting. Conditional inference trees are less prone to bias than recursive partitioning trees [7].

Logistic regression is a regression model where the dependent variable is categorical. It is computationally inexpensive, easy to implement, good in knowledge representation, and easy to interpret. However, it is prone to underfitting and may have low accuracy [5].
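The sketch below (an editorial illustration, not from the paper) fits a post-pruned classification tree and a logistic regression model to the same placeholder data; it assumes scikit-learn, where cost-complexity pruning via ccp_alpha is one available form of post-pruning.

```python
# Illustrative sketch (assumes scikit-learn) of a post-pruned classification
# tree and a logistic regression model fitted to the same placeholder data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha > 0 applies cost-complexity (post-)pruning to a fully grown tree,
# trading a little training accuracy for a smaller, less overfitted tree.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("tree leaves:", tree.get_n_leaves(),
      "test accuracy:", round(tree.score(X_test, y_test), 3))

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train)
# predict_proba returns class-membership probabilities, one of the
# advantages of logistic regression noted in Table 1.
print("logistic regression test accuracy:", round(logreg.score(X_test, y_test), 3))
```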
2.5. Naïve Bayes

The Naïve Bayes classifier is a method of classification that does not use rules, a decision tree, or any other explicit representation of the classifier. Rather, it uses probability theory to find the most probable classifications [13]. Naïve Bayes works with a small amount of data and nominal values [5]. Important properties of the Naive Bayes algorithm are [11]: 1) it is very easy to construct and training is also easy and fast; and 2) it is highly scalable.

The Naive Bayes classifier's beauty is in its simplicity, computational efficiency, and good classification performance. In fact, it often outperforms more sophisticated classifiers even when the underlying assumption of independent predictors is far from true. This advantage is especially pronounced when the number of predictors is very large. There are more features of Naive Bayes. First, the Naive Bayes classifier requires a very large number of records to obtain good results. Second, where a predictor category is not present in the training data, Naive Bayes assumes that a new record with that category of the predictor has zero probability. This can be a problem if this rare predictor value is important. Finally, good performance is obtained when the goal is classification or ranking of records according to their probability of belonging to a certain class. However, when the goal is to actually estimate the probability of class membership, this method provides very biased results. For this reason, the Naive Bayes method is rarely used in credit scoring [17].
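The sketch below (an editorial illustration, not from the paper) fits a Naive Bayes classifier to toy categorical data; it assumes scikit-learn, and the Laplace smoothing parameter alpha is one common way to soften the zero-probability problem described above.

```python
# Illustrative Naive Bayes sketch (assumes scikit-learn). The alpha parameter
# applies Laplace smoothing, which avoids assigning zero probability to
# predictor categories unseen with a class during training.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy categorical data (placeholder, not from the paper):
# each row = (attribute_1, attribute_2), encoded as small integers.
X = np.array([[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1], [0, 0], [2, 0]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = CategoricalNB(alpha=1.0).fit(X, y)
print("class probabilities for a new record:", clf.predict_proba([[1, 1]]))
```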
2.6. Neural Networks

Neural networks, also called artificial neural networks, are models for classification and prediction [17]. Neural network algorithms are inherently parallel. Parallelization methods can be used to speed up the computation process. In addition, several techniques have recently been developed for the extraction of rules from trained neural networks. This contributes to the application of neural networks for classification and prediction in data mining [6]. Important properties of neural networks are as follows [17]:
• First, although neural networks are capable of generalizing from a set of examples, extrapolation is still a serious danger. If the network sees only cases in a certain range, then its predictions outside this range can be completely invalid.
• Second, neural networks do not have a built-in variable selection mechanism. This means that there is need for careful consideration of predictors. Combination with classification and regression trees and other dimension reduction techniques (e.g., principal components analysis) is often used to identify key predictors.
• Third, the extreme flexibility of the neural network relies heavily on having sufficient data for training purposes. A neural network performs poorly when the training set size is insufficient, even if the relationship between the response and predictors is very simple.
• Fourth, a technical problem is the risk of obtaining weights that lead to a local optimum rather than the global optimum.
• Finally, neural networks involve much computation and require longer runtime than other classifiers. The run time increases greatly when the number of predictors grows.

The most popular neural network algorithm is backpropagation. Backpropagation uses a method of gradient descent. The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction) [6]. The tradeoff between under- and over-fitting should decide the size of the hidden layer. Using too few nodes might not be sufficient to capture complex relationships. On the other hand, too many nodes may result in overfitting. A rule of thumb is to start with p (the number of predictors) nodes and gradually decrease or increase that number a bit while checking for overfitting [17].

Advantages of neural networks include their good predictive performance and tolerance of noisy data, as well as their ability to classify patterns on which they have not been trained. They can be used when you may have little knowledge of the relationships between attributes and classes. They are well-suited for continuous-valued inputs and outputs, unlike most decision tree algorithms [6,17]. Neural networks are very general and can approximate complicated relationships. Their weakest point is in providing insight into the structure of the relationship, and hence their "black-box" reputation. The user of neural networks must make many modelling assumptions, such as the number of hidden layers and the number of units in each hidden layer, and usually there is little guidance on how to do this. Furthermore, back-propagation can be quite slow if the learning constant is not chosen correctly [17,18].

Reducing the data dimensionality can be performed with neural networks. High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. An effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes has been proposed. It works better than principal components analysis as a tool to reduce the dimensionality of data [2].
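The following sketch (an editorial illustration, not from the paper) trains a small feed-forward network with backpropagation; it assumes scikit-learn, and the hidden-layer size simply follows the rule of thumb above of starting near p, the number of predictors.

```python
# Illustrative neural-network sketch (assumes scikit-learn). Inputs are
# scaled first; early stopping holds out a validation split to watch for
# overfitting while the hidden-layer size starts near p predictors.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
p = X.shape[1]  # number of predictors

net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(p,), max_iter=2000,
                  early_stopping=True, random_state=0),
)
net.fit(X_train, y_train)
print("test accuracy:", round(net.score(X_test, y_test), 3))
```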
2.7. Deep Learning

Deep learning is a new area in machine learning research, which has been introduced with the objective of moving machine learning closer to one of its original goals: artificial intelligence. Deep learning is about learning multiple levels of representation and abstraction that help to make sense of data [19]. Deep machines are more efficient for representing certain classes of functions; particularly for those involved in visual recognition, they can represent more complex functions with less "hardware". SVMs and kernel methods are not deep. Classification trees are not deep either, because there is no hierarchy of features. Deep learning involves non-convex loss functions, and deep supervised learning is non-convex [20]. Deep learning has the potential to deal with big data, although there are challenges.

Some methods have been proposed for using unlabeled data in deep neural network-based architectures. These methods either perform a greedy layer-wise pre-training of weights using unlabeled data alone followed by supervised fine-tuning, or learn unsupervised encodings at multiple levels of the architecture jointly with a supervised signal. For the latter, the basic setup is as follows: 1) choose an unsupervised learning algorithm; 2) choose a model with a deep architecture; 3) the unsupervised learning is plugged into any (or all) layers of the architecture as an auxiliary task; and 4) train supervised and unsupervised tasks using the same architecture simultaneously [21].
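As a simplified stand-in for the unsupervised pre-training idea (an editorial sketch, not the architectures of [21]), the fragment below learns unsupervised features with a restricted Boltzmann machine and then trains a supervised classifier on top; it assumes scikit-learn, and the dataset and layer sizes are placeholders.

```python
# Simplified sketch of unsupervised feature learning followed by a supervised
# classifier (assumes scikit-learn); a stand-in for greedy layer-wise
# pre-training, not a full deep learning implementation.
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("scale", MinMaxScaler()),                          # RBM expects values in [0, 1]
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05,
                         n_iter=20, random_state=0)),   # unsupervised feature layer
    ("clf", LogisticRegression(max_iter=2000)),         # supervised layer on top
])
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```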
2.8. Comparison of Different Methods and Ensemble Methods

Table 1 compares the advantages and disadvantages of traditional data mining (DM) and machine learning (ML) methods.

Ensemble methods increase the accuracy of classification or prediction. Bagging, boosting, and random forests are the three most common methods in ensemble learning. The bootstrap (or bagged) classifier is often better than a single classifier that is derived from the original training set. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, a bagged predictor improves the accuracy over a single predictor. It is robust to overfitting and noisy data. Bootstrap methods can be used not only to assess a model's discrepancy, but also to improve the accuracy. Bagging and boosting methods use a combination of models and combine the results of more than one method. Both bagging and boosting can be used for classification as well as prediction [6,7,8,18].
Table 1. Advantages and Disadvantages of Traditional DM/ML Methods

The k-means method [22,23]
Advantages: relatively efficient; can process large data sets.
Disadvantages: often terminates at a local optimum; applicable only when the mean is defined; not applicable for categorical data; unable to handle noisy data; not suitable to discover clusters with non-convex shapes.

k-nearest neighbor (k-NN) classifier [7,15]
Advantages: nonparametric; zero cost in the learning process; classifies any data whenever similarity measures of any given instances can be found; intuitive approach; robust to outliers on the predictors.
Disadvantages: expensive computation for a large dataset; hard to interpret the result; the performance relies on the number of dimensions; lack of explicit model training; susceptible to correlated inputs and irrelevant features; very difficult in handling data of mixed types.

Support vector machine (SVM) [5,15,22]
Advantages: can utilize the predictive power of linear combinations of inputs; good prediction in a variety of situations; low generalization error; easy to interpret results.
Disadvantages: weak in natural handling of mixed data types and computational scalability; very black box; sensitive to tuning parameters and kernel choice; training an SVM on a large data set can be slow; testing data should be near the training data.

Decision trees [7,15]
Advantages: some tolerance to correlated inputs; a single tree is highly interpretable; can handle missing values; able to handle both numerical and categorical data; perform well with large datasets.
Disadvantages: cannot work on (linear) combinations of features; relatively less predictive in many situations; practical decision-tree learning algorithms cannot guarantee to return the globally optimal decision tree; decision trees can lead to overfitting.

Logistic regression [7]
Advantages: provides model logistic probability; easy to interpret; provides confidence intervals; quickly updates the classification model to incorporate new data.
Disadvantages: does not handle missing values of continuous variables; suffers from multicollinearity; sensitive to extreme values of continuous variables.

Naïve Bayes [5,7]
Advantages: suitable for a relatively small training set; can easily obtain the probability for a prediction; relatively simple and straightforward to use; can deal with some noisy and missing data; can handle multiple classes.
Disadvantages: prone to bias when increasing the number of training sets; assumes all features are independent and equally important, which is unlikely in real-world cases; sensitive to how the input data is prepared.

Neural networks [15]
Advantages: good prediction generally; some tolerance to correlated inputs; incorporate the predictive power of different combinations of inputs.
Disadvantages: not robust to outliers; susceptible to irrelevant features; difficult in dealing with big data with complex models.

Bagging, which stands for bootstrap aggregation, is an ensemble classification method that uses multiple bootstrap samples (with replacement) from the input training data to create slightly different training sets [1]. Bagging is the idea of collecting a random sample of observations into a bag. Multiple bags are made up of randomly selected observations obtained from the original observations in the training dataset [14]. Bagging is a voting method that uses the bootstrap to generate different training sets and uses those training sets to make different base learners. The bagging method employs a combination of base learners to make a better prediction [7].

Boosting is also an ensemble method, which attempts to build better learning algorithms by combining multiple simpler algorithms [24]. Boosting is similar to the bagging method. It constructs the base learners in sequence, where each successive learner is built on the prediction residuals of the preceding learner. As a means to create a complementary learner, it uses the mistakes made by previous learners to train the next base learner. Boosting trains the base classifiers on different samples [1,7]. Boosting can fail to perform if there is insufficient data or if the weak models are overly complex. Boosting is also susceptible to noise [14]. The most popular boosting algorithm is AdaBoost, which is "adaptive." AdaBoost is extremely simple to use and implement (far simpler than SVMs), and often gives very effective results [24]. AdaBoost works with numeric and nominal values. It has low generalization error, is easy to code, works with most classifiers, and has no parameters to adjust. However, it is sensitive to outliers [5].

Although bagging and randomization yield similar results, it sometimes pays to combine them because they introduce randomness in different and perhaps complementary ways. A popular algorithm for learning random forests builds a randomized decision tree in each iteration of the bagging algorithm and often produces excellent predictors [16]. The random forests method is a tree-based ensemble approach that is actually a combination of many models [1,15]. It is an ensemble classifier that consists of many decision trees [25]. A random forest grows many classification trees, obtaining multiple results from a single input. It uses the majority of votes from all the decision trees to classify data, or uses an average output for regression [7].

Random forest models are generally very competitive with nonlinear classifiers such as artificial neural nets and support vector machines. A random forest model is a good choice for model building because of very little pre-processing of the data, no requirement for data normalization, and resilience to outliers. The need for variable selection is avoided because the algorithm effectively does its own. Because many trees are built using two levels of randomness (observations and variables), each tree is effectively an independent model. The random forest algorithm builds multiple decision trees using a concept called bagging to introduce random sampling into the whole process. In building each decision tree, the random forest algorithm generally does not perform any pruning of the decision tree. Overfitted models tend not to perform well on new data. However, a random forest of overfitted trees can deliver a very good model that performs well on new data [14].
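The following sketch (an editorial illustration, not from the paper) compares a single decision tree with bagging, AdaBoost, and a random forest on the same placeholder data; it assumes scikit-learn.

```python
# Illustrative ensemble sketch (assumes scikit-learn) comparing bagging,
# AdaBoost, and a random forest against a single decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              RandomForestClassifier)

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: bootstrap samples (with replacement) plus majority voting.
    "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=100, random_state=0),
    # Boosting: each learner focuses on the mistakes of the previous ones.
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    # Random forest: bagging plus random feature selection at each split.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```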
3. Big Data in Service Infrastructure and IT Challenges

As enterprise data challenges continue to grow (see Table 2 [26]), traditional technologies have challenges in handling unstructured, Cloud, and Big Data sources. Table 3 [27] shows Big Data as part of a virtualized service infrastructure. Hardware infrastructure is virtualized with cloud computing technologies; on top of this cloud-based infrastructure, Software as a Service (SaaS) can be provided; and on top of SaaS, Business Processes as a Service (BPaaS) can be built. In parallel, Big Data will be offered as a service and embedded as the precondition for Knowledge services, e.g., the integration of Semantic Technologies for the analysis of unstructured and aggregated data. Big Data as a Service can be treated as an extended layer between PaaS and SaaS. Knowledge workers or data scientists are needed to run Big Data and Knowledge services.
Table 2. Enterprise Needs, Systems and Data, and IT Challenges

Business Needs: access all information of value; business capability and value driven; virtualized & unified semantic business views of data; fast, iterative, self-service, pervasive; right information to the right user at the right time.

Systems and Data: Inventory System (MS SQL Server); Billing System (Web Service-REST); Customer Relationship Management (CRM) (MySQL); Big Data, Cloud (Hadoop, Web); Customer Voice (Internet, Unstructured); Product Catalog (Web Service-SOAP); Product Data (CSV); Log Files (.txt/.log files).

IT Challenges: data silos; exponential data growth; unstructured, Web & Big Data; IT complexity, rigidity; inherent latency; move to Cloud; high costs.

Table 3. Big Data in an Extended Service Infrastructure

Layer 1: Business Process as a Service (BPaaS), Knowledge as a Service (KaaS)
Layer 2: Software as a Service (SaaS), Big Data as a Service (BDaaS)
Layer 3 (Cloud Infrastructure): Platform as a Service (PaaS)
Layer 4 (Cloud Infrastructure): Infrastructure as a Service (IaaS)

4. Data Mining and Machine Learning in Big Data Analytics

Hadoop is a tool of Big Data analytics and the open-source implementation of MapReduce. The following brief list identifies the MapReduce implementations of three algorithms [5]:
• Naïve Bayes: this is one of a few algorithms that is naturally implementable in MapReduce. It is easy to calculate sums in MapReduce. Given a class, the probability of a feature can be calculated in the Naïve Bayes method; the results from a given class can be given to an individual mapper, and the Reducer can be used to sum up the results. (A minimal sketch of this map/reduce decomposition follows this list.)
• Support vector machines (SVMs): there is also an approximate version of SVM called proximal SVM, which computes a solution much faster and is easily used in a MapReduce framework.
• Singular value decomposition: the Lanczos algorithm is an efficient method for approximating eigenvalues. This algorithm can be used in a series of MapReduce jobs to efficiently find the singular values of a large matrix.
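The fragment below (an editorial illustration, not from the paper) is a minimal, single-machine stand-in for the map/reduce decomposition described in the first bullet: mappers emit per-class feature counts for their partition of the records, and a reducer sums them. The function names, toy records, and labels are hypothetical; a real Hadoop or Spark job would distribute these functions across nodes.

```python
# Minimal stand-in for the MapReduce decomposition sketched above: mappers
# emit (class, feature) counts for one partition of the records and a
# reducer sums them into the class-conditional counts Naive Bayes needs.
from collections import Counter
from itertools import chain

def mapper(records):
    """Emit ((label, feature), 1) pairs for one partition of the data."""
    for label, features in records:
        for f in features:
            yield (label, f), 1

def reducer(pairs):
    """Sum the counts for each (label, feature) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

# Two toy partitions of records: (class label, list of observed features).
partition_a = [("spam", ["offer", "win"]), ("ham", ["meeting"])]
partition_b = [("spam", ["win", "win"]), ("ham", ["meeting", "offer"])]

counts = reducer(chain(mapper(partition_a), mapper(partition_b)))
print(counts)
```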
However, the above three methods cannot be used in Big Data analytics. Traditional machine learning (ML) techniques are unsuitable for big data classification because: (1) an ML technique that is trained on a particular labelled dataset or data domain may not be suitable for another dataset or data domain; (2) an ML technique is in general trained using a certain number of class types, whereas a large variety of class types is found in dynamically growing big data; (3) an ML technique is developed based on a single learning task, and thus is not suitable for the multiple learning tasks and knowledge transfer requirements of Big Data analytics [28]; and (4) memory constraints are a challenge: although algorithms typically assume that training data samples exist in main memory, big data does not fit into it [29].

Big data mining is more challenging compared with traditional data mining algorithms. Taking clustering as an example, a natural way of clustering big data is to extend existing methods (such as k-means) so that they can cope with the huge workloads. Most extensions usually rely on analyzing a certain number of samples of big data, and vary in how the sample-based results are used to derive a partition for the overall data [30]. The k-NN classifiers do not construct any classifier model explicitly; instead, they keep all training data in memory. Hence, they are not amenable to big data applications [31]. Splitting criteria of decision trees are chosen based on quality measures such as information gain, which requires handling the entire data set at each expanding node. This makes it difficult for decision trees to be applied to big data applications. The support vector machine (SVM) shows good performance for data sets of a moderate size. It has inherent limitations for big data applications [31].

Deep machine learning has the potential to deal with big data. However, it has some challenges in big data applications because it requires a significant amount of training time [31,32]. Deep learning challenges in Big Data analytics lie in incremental learning for non-stationary data, high-dimensional data, and large-scale models [32]. The Variety characteristic of Big Data analytics focuses on the variation of the input data types and domains in big data. Domain adaptation during learning is an important focus of study in deep learning, where the distribution of the training data is different from the distribution of the test data. In some big data domains, e.g., cyber security, the input corpus consists of a mix of both labelled and unlabeled data. In such cases, deep learning algorithms can incorporate semi-supervised training methods towards the goal of defining criteria for good data representation learning [33].

Representation-learning algorithms help supervised learning techniques to achieve high classification accuracy with computational efficiency. They transform the data, while preserving the original characteristics of the data, to another domain so that the classification algorithms can improve accuracy, reduce computational complexity, and increase processing speed. However, Big Data classification requires a multi-domain representation-learning (MDRL) technique because of its large and growing data domain. The MDRL technique includes feature variable learning, feature extraction learning, and distance-metric learning. Several representation-learning techniques have been proposed in machine learning research. The recently proposed cross-domain representation-learning (CDRL) technique may be suitable for big data classification along with the suggested network model; however, the implementation of the CDRL technique for big data classification will encounter several challenges, including the difficulty in selecting relevant features, constructing geometric representations, extracting suitable features, and separating various types of data. Also, the continuity parameter of big data introduces problems that need to be addressed by lifelong learning techniques. The learning of big data characteristics in the short term may not be suitable for the long term. Hence, machine lifelong learning (ML3) techniques should be used. The concept of ML3 provides a framework that can retain learned knowledge with training examples throughout the learning phases [31].

5. Conclusions

Dimensionality reduction can aid data visualization. PCA is the most commonly used technique for dimensionality reduction. Factor analysis can be used as a data reduction or structure detection method. The k-means method is relatively efficient, but it possibly terminates at a local optimum.

k-NN is simple to implement and robust to outliers on the predictors; however, it is very difficult for it to handle data with mixed types. SVM works well on problems that are sparse, nonlinear, and high-dimensional, but it is weak in natural handling of mixed data types and computational scalability. Decision trees perform well with large datasets, but can lead to overfitting. Tree pruning is performed to remove anomalies in the training data due to noise or outliers. Logistic regression is computationally inexpensive, but it is prone to underfitting and may have low accuracy. The Naive Bayes algorithm is easy to construct and training is fast; it is suitable for a relatively small training set and prone to bias.
Neural networks have good predictive performance and tolerance of noisy data; however, it is very difficult for the method to deal with big data with complex models. Bagging, boosting, and random forests are the three most common ensemble methods that use a combination of models to increase accuracy.

Traditional technologies have challenges in handling unstructured and big data sources. Big Data as a Service (BDaaS) can be an extended layer in the service infrastructure. Traditional data mining and machine learning (ML) techniques such as k-means, k-NN, decision trees, and SVM are unsuitable for handling big data. Deep learning has the potential in dealing with big data although there are challenges.

References

[1] Zaki MJ, Meira Jr W. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press; 2014 May 12.
[2] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006 Jul 28; 313(5786): 504-507.
[3] Wikibooks. Data Mining Algorithms In R. Wikibooks, open books for an open world; PDF generated using the open source mwlib toolkit, http://code.pediapress.com/, 2014 Jul 14.
[4] Jackson J. Data Mining: A Conceptual Overview. Communications of the Association for Information Systems. 2002 Mar 22; 8(1): 19.
[5] Harrington P. Machine Learning in Action. Greenwich, CT: Manning; 2012 Apr 16.
[6] Paolo G. Applied Data Mining: Statistical Methods for Business and Industry. John Wiley & Sons Ltd; 2003.
[7] Yu-Wei CD. Machine Learning with R Cookbook. Packt Publishing Ltd; 2015 Mar 26.
[8] Han J, Pei J, Kamber M. Data Mining: Concepts and Techniques. Elsevier; 2011 Jun 9.
[9] Tutorialspoint. Data Mining: Data Pattern Evaluation. Tutorials Point (I) Pvt. Ltd; 2014.
[10] Andreopoulos B. Literature Survey of Clustering Algorithms. Workshop, Department of Computer Science and Engineering, York University, Toronto, Canada; 2006 June 27.
[11] Sharma S, Gupta RK. Intrusion Detection System: A Review. International Journal of Security and Its Applications. 2015; 9(5): 69-76.
[12] Kumar V, Wu X, editors. The Top Ten Algorithms in Data Mining. CRC Press; 2009.
[13] Bramer M. Principles of Data Mining. London: Springer; 2007 Mar 6.
[14] Williams G. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Springer Science & Business Media; 2011 Aug 4.
[15] Clark M. An Introduction to Machine Learning: with Applications in R. University of Notre Dame, USA; 2013.
[16] Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann; 2016 Oct 1.
[17] Galit S, Nitin P, Peter B. Data Mining in Excel: Lecture Notes and Cases. Resampling Stats, Inc., USA; 2005 December 30.
[18] Ledolter J. Data Mining and Business Analytics with R. John Wiley & Sons; 2013 May 28.
[19] LISA Lab. Deep Learning Tutorial. University of Montreal, Canada; 2015 September.
[20] LeCun Y, Ranzato M. Deep Learning Tutorial. Tutorials in International Conference on Machine Learning (ICML'13); 2013 Jun.
[21] Weston J, Ratle F, Mobahi H, Collobert R. Deep learning via semi-supervised embedding. In: Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg; 2012, 639-655.
[22] Andreopoulos B. Literature Survey of Clustering Algorithms. Workshop, Department of Computer Science and Engineering, York University, Toronto, Canada; 2006 June 27.
[23] Sharma S, Gupta RK. Intrusion Detection System: A Review. International Journal of Security and Its Applications. 2015; 9(5): 69-76.
[24] Hertzmann A, Fleet D. Machine Learning and Data Mining Lecture Notes. Computer Science Department, University of Toronto; 2010.
[25] Karatzoglou A. Machine Learning in R. Workshop, Telefonica Research, Barcelona, Spain; 2010 December 15.
[26] Viña A. Data Virtualization Goes Mainstream. White Paper, Denodo Technologies; 2015.
[27] Curry E, Kikiras P, Freitas A, et al. Big Data Technical Working Groups. White Paper, BIG Consortium; 2012.
[28] Suthaharan S. Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning. Performance Evaluation Review. 2014 March; 41(4): 70-73.
[29] Hido S, Tokui S, Oda S. Jubatus: An Open Source Platform for Distributed Online Machine Learning. Technical Report of the Joint Jubatus Project by Preferred Infrastructure Inc. and NTT Software Innovation Center, Tokyo, Japan; NIPS 2013 Workshop on Big Learning, Lake Tahoe, December 9, 2013, pp. 1-6.
[30] Chen CLP, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. 2014; 275(10): 314-347.
[31] Lee KM. Grid-based Single Pass Classification for Mixed Big Data. International Journal of Applied Engineering Research. 2014; 9(21): 8737-8746.
[32] Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. Journal of Big Data. 2015; 2(1).
[33] Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. Journal of Big Data. 2015 Feb 24; 2(1): 1.
