Machine Learning Models and Algorithms For Big Data Classification - Suthaharan
Shan Suthaharan
Machine Learning Models and Algorithms for Big Data Classification
Thinking with Examples for Effective Learning
Integrated Series in Information Systems
Volume 36
Series Editors
Ramesh Sharda
Oklahoma State University, Stillwater, OK, USA
Stefan Voß
University of Hamburg, Hamburg, Germany
Shan Suthaharan
Department of Computer Science
UNC Greensboro
Greensboro, NC, USA
The interest in writing this book began at the IEEE International Conference on
Intelligence and Security Informatics held in Washington, DC (June 11–14, 2012),
where Mr. Matthew Amboy, the editor of Business and Economics: OR and MS,
published by Springer Science+Business Media, expressed the need for a book on
this topic, mainly focusing on topics in the data science field. The interest grew even
deeper when I attended the workshop conducted by Professor Bin Yu (Department
deeper when I attended the workshop conducted by Professor Bin Yu (Department
of Statistics, University of California, Berkeley) and Professor David Madigan (De-
partment of Statistics, Columbia University) at the Institute for Mathematics and its
Applications, University of Minnesota on June 16–29, 2013.
Data science is one of the emerging fields in the twenty-first century. This field
has been created to address the big data problems encountered in the day-to-day
operations of many industries, including financial sectors, academic institutions,
information technology divisions, health care companies, and government
organizations. One of the important big data problems that needs immediate attention is
big data classification. Network intrusion detection, public space intruder
detection, fraud detection, spam filtering, and forensic linguistics are some
practical examples of such problems.
We need significant collaboration between the experts in many disciplines, in-
cluding mathematics, statistics, computer science, engineering, biology, and chem-
istry to find solutions to this challenging problem. Educational resources, like books
and software, are also needed to train students to be the next generation of research
leaders in this emerging research field. One field that currently brings
interdisciplinary experts, educational resources, and modern technologies under one
roof is machine learning, a subfield of artificial intelligence.
Many models and algorithms for standard classification problems are available
in the machine learning literature. However, only a few of them are suitable for big data
classification. Big data classification is dependent not only on the mathematical and
software techniques but also on the computer technologies that help store, retrieve,
and process the data with efficient scalability, accessibility, and computability fea-
tures. One such recent technology is the distributed file system. A particular system
that has become popular and provides these features is the Hadoop distributed file
system, which uses the MapReduce programming model (or framework), in which
Mapper and Reducer functions operate on (key, value) pairs. Machine learning
techniques such as the decision tree
(a hierarchical approach), random forest (an ensemble hierarchical approach), and
deep learning (a layered approach) are highly suitable for the system that addresses
big data classification problems. Therefore, the goal of this book is to present some
of the machine learning models and algorithms, and discuss them with examples.
The general objective of this book is to help readers, especially students and
newcomers to the field of big data and machine learning, to gain a quick under-
standing of the techniques and technologies; therefore, the theory, examples,
and programs (Matlab and R) presented in this book have been simplified,
hardcoded, repeated, or spaced for readability. They provide vehicles to
test and understand the complicated concepts of various topics in the field. It
is expected that readers will adapt these programs to experiment with the
examples, and then modify or write their own programs toward advancing their
knowledge for solving more complex and challenging problems.
The presentation format of this book focuses on simplicity, readability, and de-
pendability so that both undergraduate and graduate students as well as new re-
searchers, developers, and practitioners in this field can easily trust and grasp the
concepts, and learn them effectively. The goal of the writing style is to reduce the
mathematical complexity and help the vast majority of readers to understand the
topics and get interested in the field. This book consists of four parts, with a total of
14 chapters. Part I mainly focuses on the topics that are needed to help analyze and
understand big data. Part II covers the topics that can explain the systems required
for processing big data. Part III presents the topics required to understand and select
machine learning techniques to classify big data. Finally, Part IV concentrates on
the topics that explain the scaling up of machine learning, an important solution for
modern big data problems.
The journey of writing this book would not have been possible without the sup-
port of many people, including my collaborators, colleagues, students, and family.
I would like to thank all of them for their support and contributions toward the suc-
cessful development of this book. First, I would like to thank Mr. Matthew Amboy
(Editor, Business and Economics: OR and MS, Springer Science+Business Media)
for giving me an opportunity to write this book. I would also like to thank both Ms.
Christine Crigler (Assistant Editor) and Mr. Amboy for helping me throughout the
publication process.
I am grateful to Professors Ratnasingham Shivaji (Head of the Department of
Mathematics and Statistics at the University of North Carolina at Greensboro) and
Fadil Santosa (Director of the Institute for Mathematics and its Applications at Uni-
versity of Minnesota) for the opportunities that they gave me to attend a machine
learning workshop at the institute. Professors Bin Yu (Department of Statistics,
University of California, Berkeley) and David Madigan (Department of Statistics,
Columbia University) delivered an excellent short course on applied statistics and
machine learning at the institute, and the topics covered in this course motivated
me and equipped me with techniques and tools to write various topics in this book.
My sincere thanks go to them. I would also like to thank Jinzhu Jia, Adams Blo-
niaz, and Antony Joseph, the members of Professor Bin Yu’s research group at the
Department of Statistics, University of California, Berkeley, for their valuable
discussions on many machine learning topics.
My appreciation goes out to University of California, Berkeley, and University of
North Carolina at Greensboro for their financial support and the research assignment
award in 2013 to attend the University of California, Berkeley as a visiting scholar—
this visit helped me better understand the deep learning techniques. I would also
like to show my appreciation to Mr. Brent Ladd (Director of Education, Center for
the Science of Information, Purdue University) and Mr. Robert Brown (Managing
Director, Center for the Science of Information, Purdue University) for their sup-
port to develop a course on big data analytics and machine learning at University of
North Carolina at Greensboro through a sub-award approved by the National Sci-
ence Foundation. I am also thankful to Professor Richard Smith, Director of the
1 Science of Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Technological Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Technological Advancement . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Big Data Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Facts and Statistics of a System . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Big Data Versus Regular Data . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Machine Learning Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Modeling and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Supervised and Unsupervised . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Collaborative Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 A Snapshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.1 The Purpose and Interests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.2 The Goal and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.3 The Problems and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Chapter 1
Science of Information
Abstract The main objective of this chapter is to provide an overview of the modern
field of data science and some of the current progress in this field. The overview
focuses on two important paradigms: (1) big data paradigm, which describes a prob-
lem space for the big data analytics, and (2) machine learning paradigm, which
describes a solution space for the big data analytics. It also includes a preliminary
description of the important elements of data science. These important elements
are the data, the knowledge (also called responses), and the operations. The terms
knowledge and responses will be used interchangeably in the rest of the book.
Preliminary information on the data format, the data types, and classification is also
presented in this chapter. This chapter emphasizes the importance of collaboration
between the experts from multiple disciplines and provides the information on some
of the current institutions that show collaborative activities with useful resources.
Data science is an emerging field in the twenty-first century. The article by Mike
Loukides on the O'Reilly website [1] provides an overview and discusses data
sources and data scalability. We can define data science as the management and
analysis of data sets, the extraction of useful information, and the understanding of
the systems that produce the data. The system can be a single unit (e.g., a com-
puter network or a wireless sensor network) that is formed by many interconnecting
subunits (computers or sensors) that can collaborate under a certain set of prin-
ciples and strategies to carry out tasks, such as the collection of data, facts, or
statistics of an environment the system is expected to monitor. Some examples of
these systems include network intrusion detection systems [2], climate-change
detection systems [3], and public space intruder detection systems [4]. These real-world
systems may produce massive amounts of data, called big data, from many data
sources that are highly complex, unstructured, and hard to manage, process, and
The current advancements in the technology include the modern distributed file sys-
tems and the distributed machine learning. One such technology is called Hadoop
[15, 16], which facilitates distributed machine learning using external libraries, like
the scikit-learn library [17], to process big data. Most of the machine-learning
techniques in these libraries are based on classical models and algorithms and may
not be suitable for big data processing. However, some techniques,
like the decision tree learning and the deep learning, are suitable for big data clas-
sification, and they may help develop better supervised learning techniques in the
upcoming years. The classification techniques that evolved from these models and
algorithms are the main focus, and they will be discussed in detail in the rest of the
book.
In this book, it is assumed that the big data paradigm consists of a big data system
and an environment. The goal of a system is to observe an environment and learn
its characteristics to make accurate decisions. For example, the goal of a network
intrusion detection system is to learn traffic characteristics and detect intrusions
to improve the security of a computer network. Similarly, the goal of a wireless
sensor network is to monitor changes in the weather to learn the weather patterns
for forecasting. The environment generates events, and the system collects the facts
and statistics, transforms them into knowledge with suitable operations, learns the
event characteristics, and predicts the environmental characteristics.
1.2.1.1 Data
Data can be described as the hidden digital facts that the monitoring system collects.
Hidden digital facts are the digitized facts that are not obvious to the system without
further comprehensive processing. The definition of data should be based on the
knowledge that must be gained from it. One of the important requirements for the
data is the format. For example, the data could be presented mathematically or in a
two-dimensional tabular representation. Another important requirement is the type
of data. For example, the data could be labeled or not labeled. In the labeled data,
the digital facts are not hidden and can be used for training the machine-learning
techniques. In the unlabeled data, the digital facts are hidden and can be used for
testing or validation as a part of the machine-learning approach.
1.2.1.2 Knowledge
Knowledge can be described as the learned information acquired from the data.
For example, the knowledge could be the detection of patterns in the data, the
classification of the varieties of patterns in the data, the calculation of unknown
statistical distributions, or the computation of the correlations of the data. It forms
the responses for the system, and it is called the “knowledge set” or “response
set” (sometimes called the “labeled set”). The data forms the domain, called “data
domain,” on which the responses are generated using a model f as illustrated in
Fig. 1.1. In addition to these two elements (i.e., the data and the knowledge), a
monitoring system needs three operations, called physical operations, mathemati-
cal operations, and logical operations in this book. The descriptions of these three
important operations are presented in the following subsections.
Physical operations describe the steps involved in the processes of data capture,
data storage, data manipulation, and data visualization [18]. These are the important
contributors to the development of a suitable data domain for a system so that the
machine-learning techniques can be applied efficiently. Big data also means massive
data, and the assumption is that it cannot be handled with a single file or on a single
machine. Hence, the indexing and distribution of the big data over a distributed
network becomes necessary. One of the popular tools available in the market for this
purpose is the Hadoop distributed file system (http://hadoop.apache.org/), which
uses the MapReduce framework (http://hadoop.apache.org/mapreduce/) to accom-
plish these objectives. These modern tools help enhance the physical operations of a
system which, in turn, helps generate sophisticated, supervised learning models and
algorithms for big data classifications.
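The Mapper, Reducer, and (key, value) workflow described above can be illustrated independently of Hadoop with a minimal word-count sketch in plain Python; the input lines are invented for illustration, and the shuffle step is a stand-in for the grouping that the framework performs at scale.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (key, value) pair for every word in one line of text.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine the grouped values for one key into a single result.
    return (key, sum(values))

lines = ["big data needs big systems", "big data needs scalable learning"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])   # the word "big" appears three times in the input
```

In a real Hadoop deployment, the mapper and reducer would run as distributed tasks over file blocks rather than in a single process.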
1.2.2.1 Scenario
the system. As an example, the source bytes, destination count, and protocol type
information found in a packet can serve as features of the computer network traf-
fic data. The changes in the values of feature variables determine the type (or the
class) of an event. To determine the correct class for an event, the event must be
transformed into knowledge.
In summary, the parameter n represents the number of observations captured
by a system at time t, which determines the size (volume) of the data set, and the
parameter p represents the number of features that determines the dimension of the
data and contributes to the number of classes (variety) in the data set. In addition,
the ratio between the parameters n and t determines the data rate (velocity) term as
described in the standard definition of big data [6].
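As a small illustration of these definitions, the sketch below computes n, p, and the velocity n/t for a hypothetical captured data set; the feature values and the capture interval t are invented.

```python
# A hypothetical data set: n observations (rows), each with p feature values.
data = [
    [491, 9, 2.1],
    [512, 7, 1.8],
    [480, 11, 2.4],
    [530, 8, 1.9],
]

n = len(data)        # number of observations -> size (volume)
p = len(data[0])     # number of features     -> dimension
t = 2.0              # capture interval in seconds (assumed)

velocity = n / t     # observations per unit time -> data rate (velocity)
print(n, p, velocity)
```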
Now referring back to Fig. 1.2, the horizontal axis represents p (i.e., the dimen-
sion) and the vertical axis represents n (i.e., the size or the volume). The domain
defined by n and p is divided into four subdomains (small, large, high dimension,
and massive) based on the magnitudes of n and p. The arc boundary identifies the
regular data and massive data regions, and the massive data region becomes big data
when velocity and variety are included.
A data set may be defined in mathematical or tabular form. The tabular form is vi-
sual, and it can be easily understood by nonexperts. Hence this section first presents
the data representation tool in a tabular form, and it will be defined mathematically
from Chap. 2 onward. The data sets generally contain a large number of events as
mentioned earlier. Let us denote these events by E1, E2, ..., Emn. Now assume that
these observations can be divided into n separable classes denoted by C1, C2, ..., Cn
(where n is much smaller than mn), where C1 is a set of events E1, E2, ..., Em1, C2
is a set of events E1, E2, ..., Em2, and so on (where m1 + m2 + ... = mn). These
classes of events may be listed in the first column of a table. The last column of
the table identifies the corresponding class types. In addition, every set of events
depends on p features that are denoted by F1, F2, ..., Fp, and the values associated
with these features can be presented in the other columns of the table. For example,
the values associated with feature F1 of the first set E1, E2, ..., Em1 can be denoted
by x11, x12, ..., x1m1, indicating that event E1 takes x11, event E2 takes x12, and so on.
The same pattern can be followed for the other sets of events.
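The tabular representation described above can be sketched with a small hypothetical table; the events, feature values, and class labels below are invented for illustration.

```python
# Hypothetical events E1..E3, features F1 and F2, and classes C1 and C2,
# arranged as described above: one row per event, one column per feature,
# and a final column identifying the class type.
header = ["Event", "F1", "F2", "Class"]
rows = [
    ["E1", 0.2, 1.5, "C1"],   # x11 = 0.2 is the value feature F1 takes on E1
    ["E2", 0.3, 1.4, "C1"],   # x12 = 0.3 is the value feature F1 takes on E2
    ["E3", 2.7, 0.1, "C2"],
]

# Recover the feature-F1 values associated with the class-C1 set of events.
c1_f1_values = [r[1] for r in rows if r[3] == "C1"]
print(c1_f1_values)   # the values x11, x12 for feature F1
```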
The term modeling refers to both mathematical and statistical modeling of data.
The goal of modeling is to develop a parametrized mapping between the data
domain and the response set. This mapping could be a parametrized function or a
parametrized process that learns the characteristics of a system from the input
(labeled) data. The term algorithm can be confusing in the context of machine
learning. For a computer scientist, an algorithm means step-by-step systematic
instructions for a computer to solve a problem. In machine learning, the modeling,
itself, may have several algorithms to derive a model; however, the term algorithm
here refers to a learning algorithm. The learning algorithm is used to train, validate,
and test the model using a given data set to find an optimal value for the parameters,
validate it, and evaluate its performance.
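The train, validate, and test roles of a learning algorithm can be sketched with a one-parameter threshold model; the data sets and the candidate parameter values below are invented for illustration.

```python
# Hypothetical labeled data: (feature value, class label) pairs.
train = [(0.1, 0), (0.3, 0), (0.4, 0), (1.2, 1), (1.5, 1)]
validate = [(0.2, 0), (1.1, 1), (1.4, 1)]
test = [(0.35, 0), (1.3, 1)]

def model(x, theta):
    # A parametrized mapping f from the data domain to the response set.
    return 1 if x >= theta else 0

def accuracy(data, theta):
    return sum(model(x, theta) == y for x, y in data) / len(data)

# Training: search candidate parameter values and keep the one that
# performs best on the training set (theta is the model's parameter).
candidates = [0.05, 0.5, 0.8, 1.35]
best_theta = max(candidates, key=lambda th: accuracy(train, th))

# Validation and testing evaluate the chosen parameter on held-out data.
val_acc = accuracy(validate, best_theta)
test_acc = accuracy(test, best_theta)
print(best_theta, val_acc, test_acc)
```

Real learning algorithms optimize the parameters rather than enumerating candidates, but the three-way split plays the same role.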
It is best to define supervised learning and unsupervised learning based on the class
definition. In supervised learning, the classes are known and class boundaries are
well defined in the given (training) data set, and the learning is done using these
classes (i.e., class labels). Hence, it is called classification. In unsupervised learning,
we assume the classes or class boundaries are not known, hence the class labels
themselves are also learned, and classes are defined based on this. Hence, the class
boundaries are statistical and not sharply defined, and it is called clustering.
1.3.2.1 Classification
f : R^l → {0, 1, 2, ..., n}    (1.1)
In this function definition, the range {0, 1, 2, ..., n} is the knowledge set, which
assigns the discrete values (labels) 0, 1, 2, ..., n to different classes. This
mathematical function helps us to define suitable classifiers for the classification of the
data. Several classification techniques have been proposed in the machine learning
literature, and some of the well-known techniques are: support vector machine [19],
decision tree [20], random forest [21], and deep learning [22]. These techniques will
be discussed in detail in this book with programming and examples.
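As a concrete (if deliberately simple) instance of the mapping in Eq. (1.1), the sketch below learns a nearest-centroid rule; this stand-in is not one of the techniques cited above, and the two-feature data set is invented for illustration.

```python
# Hypothetical labeled events in a 2-dimensional data domain (l = 2),
# with labels drawn from the knowledge set {0, 1}.
X = [[0.1, 1.0], [0.2, 0.9], [0.9, 0.2], [1.0, 0.1]]
y = [0, 0, 1, 1]

def centroid(points):
    # Mean of a set of points, coordinate by coordinate.
    return [sum(c) / len(points) for c in zip(*points)]

# "Training": one centroid per class summarizes the labeled data.
centroids = {label: centroid([x for x, cls in zip(X, y) if cls == label])
             for label in set(y)}

def f(x):
    # The learned mapping f: R^l -> {0, 1}: assign the label of the
    # nearest class centroid (squared Euclidean distance).
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(x, centroids[c])))

print(f([0.95, 0.15]))   # a new event close to the class-1 events
```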
1.3.2.2 Clustering
In clustering problems [23, 24], we assume data sets are available to generate rules,
but they are not labeled. Hence, we can only derive an approximated rule that can
help to label new data that do not have labels. Figure 1.4 illustrates this example.
It shows a set of points labeled with white dots; however, a geometric pattern that
determines two clusters can be found. These clusters form a rule that helps to assign
a label to the given data points and thus to a new data point. As a result, the data may
only be clustered, not classified. Hence, the clustering problem can be defined with
an approximated rule, and it may be addressed mathematically based on the
data-to-knowledge transformation mentioned earlier.
Once again, let us assume a data set is given, and its domain D is R^l, indicating that
the events of the data set depend on l features and form an l-dimensional vector
space. If we extract structures (e.g., statistical or geometrical) and estimate that there
are n̂ classes, then we can define the knowledge function as follows:

f̂ : R^l → {0, 1, 2, ..., n̂}    (1.2)
The range {0, 1, 2, ..., n̂} is the knowledge set, which assigns the discrete labels
0, 1, 2, ..., n̂ to different classes. This function helps us to assign suitable labels to
new data. Several clustering algorithms have been proposed in machine learning:
k-Means clustering, Gaussian mixture clustering, and hierarchical clustering [23].
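The k-means idea can be sketched in a few lines of plain Python; the one-dimensional data and the choice of n̂ = 2 clusters are invented, and a practical implementation would handle initialization and convergence more carefully.

```python
# Hypothetical unlabeled one-dimensional data with two visible groups.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]

# k-means with n_hat = 2 clusters: alternate between assigning each point
# to the nearest center and recomputing each center as its cluster's mean.
centers = [data[0], data[-1]]          # simple initialization
for _ in range(10):                    # fixed number of iterations
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda j: abs(x - centers[j]))
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) for c in clusters]

# The learned rule assigns a discrete label in {0, 1} to any new point.
label = min(range(2), key=lambda j: abs(4.5 - centers[j]))
print(centers, label)
```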
1.4 Collaborative Activities
Big data means big research. Without strong collaborative efforts between the
experts from many disciplines (e.g., mathematics, statistics, computer science, med-
ical science, biology, and chemistry) and the dissemination of educational resources
in a timely fashion, the goal of advancing the field of data science may not be prac-
tical. These issues have been realized not only by researchers and academics but
also by government agencies and industries. This momentum has been noticeable
over the last several years. Some of the recent collaborative efforts and resources
that can provide long-term impacts in the field of big data science are:
• Simons Institute UC Berkeley—http://simons.berkeley.edu/
• Statistical Applied Mathematical Science Institute—http://www.samsi.info/
• New York University Center for Data science—http://datascience.nyu.edu/
• Institute for Advanced Analytics—http://analytics.ncsu.edu/
• Center for Science of Information, Purdue University—http://soihub.org/
• Berkeley Institute for Data Science—http://bids.berkeley.edu/
• Stanford and Coursera—https://www.coursera.org/
• Institute for Data Science—http://www.rochester.edu/data-science/
• Institute for Mathematics and its Applications—http://www.ima.umn.edu/
• Data Science Institute—http://datascience.columbia.edu/
• Data Science Institute—https://dsi.virginia.edu/
• Michigan Institute for Data Science—http://minds.umich.edu/
An important note to the readers: the websites (or web links) cited throughout this
book may change rapidly; please be aware of this. My plan is to keep the information
in this book current by updating it at the following website:
http://www.uncg.edu/cmp/downloads/
1.5 A Snapshot
A snapshot of the entire book helps readers by informing them of the topics covered
ahead of time. This allows them to conceptualize, summarize, and
understand the theory and applications. This section provides a snapshot of this book
under three categories: the purpose and interests, the goals and objectives, and the
problems and challenges.
The purpose of this book is to provide information on big data classification and
the related topics with simple examples and programming. Several interesting top-
ics contribute to big data classification, including the characteristics of data, the
relationships between data and knowledge, the models and algorithms that can help