
Integrated Series in Information Systems 36

Series Editors: Ramesh Sharda · Stefan Voß

Shan Suthaharan

Machine Learning
Models and
Algorithms for Big
Data Classification
Thinking with Examples for Effective
Learning
Integrated Series in Information Systems
Volume 36

Series Editors
Ramesh Sharda
Oklahoma State University, Stillwater, OK, USA

Stefan Voß
University of Hamburg, Hamburg, Germany

More information about this series at http://www.springer.com/series/6157


Shan Suthaharan

Machine Learning Models


and Algorithms for Big Data
Classification
Thinking with Examples for Effective
Learning

Shan Suthaharan
Department of Computer Science
UNC Greensboro
Greensboro, NC, USA

ISSN 1571-0270 ISSN 2197-7968 (electronic)


Integrated Series in Information Systems
ISBN 978-1-4899-7640-6 ISBN 978-1-4899-7641-3 (eBook)
DOI 10.1007/978-1-4899-7641-3

Library of Congress Control Number: 2015950063

Springer New York Heidelberg Dordrecht London


© Springer Science+Business Media New York 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)
It is the quality of our work which will please
God and not the quantity – Mahatma Gandhi
If you can’t explain it simply, you don’t
understand it well enough – Albert Einstein
Preface

The interest in writing this book began at the IEEE International Conference on
Intelligence and Security Informatics held in Washington, DC (June 11–14, 2012),
where Mr. Matthew Amboy, the editor of Business and Economics: OR and MS,
published by Springer Science+Business Media, expressed the need for a book on
this topic, mainly focusing on a topic in the data science field. The interest went even
deeper when I attended the workshop conducted by Professor Bin Yu (Department
of Statistics, University of California, Berkeley) and Professor David Madigan (De-
partment of Statistics, Columbia University) at the Institute for Mathematics and its
Applications, University of Minnesota on June 16–29, 2013.
Data science is one of the emerging fields in the twenty-first century. This field
has been created to address the big data problems encountered in the day-to-day
operations of many industries, including financial sectors, academic institutions, in-
formation technology divisions, health care companies, and government organiza-
tions. One of the important big data problems that needs immediate attention is
big data classification. Network intrusion detection, public space intruder
detection, fraud detection, spam filtering, and forensic linguistics are some of the
practical examples of big data classification problems.
We need significant collaboration between the experts in many disciplines, in-
cluding mathematics, statistics, computer science, engineering, biology, and chem-
istry to find solutions to this challenging problem. Educational resources, like books
and software, are also needed to train students to be the next generation of research
leaders in this emerging research field. One of the current fields that brings the in-
terdisciplinary experts, educational resources, and modern technologies under one
roof is machine learning, which is a subfield of artificial intelligence.
Many models and algorithms for standard classification problems are available
in the machine learning literature. However, only a few of them are suitable for big data
classification. Big data classification is dependent not only on the mathematical and
software techniques but also on the computer technologies that help store, retrieve,
and process the data with efficient scalability, accessibility, and computability fea-
tures. One such recent technology is the distributed file system. A particular system
that has become popular and provides these features is the Hadoop distributed file
system, which uses the modern technique called the MapReduce programming model
(or framework), with Mapper and Reducer functions that adopt the concept of
(key, value) pairs. Machine learning techniques such as the decision tree
(a hierarchical approach), random forest (an ensemble hierarchical approach), and
deep learning (a layered approach) are highly suitable for systems that address
big data classification problems. Therefore, the goal of this book is to present some
of the machine learning models and algorithms, and discuss them with examples.

The general objective of this book is to help readers, especially students and
newcomers to the field of big data and machine learning, to gain a quick under-
standing of the techniques and technologies; therefore, the theory, examples,
and programs (Matlab and R) presented in this book have been simplified,
hardcoded, repeated, or spaced for improvements. They provide vehicles to
test and understand the complicated concepts of various topics in the field. It
is expected that the readers adopt these programs to experiment with the ex-
amples, and then modify or write their own programs toward advancing their
knowledge for solving more complex and challenging problems.

The presentation format of this book focuses on simplicity, readability, and de-
pendability so that both undergraduate and graduate students as well as new re-
searchers, developers, and practitioners in this field can easily trust and grasp the
concepts, and learn them effectively. The goal of the writing style is to reduce the
mathematical complexity and help the vast majority of readers to understand the
topics and get interested in the field. This book consists of four parts, with a total of
14 chapters. Part I mainly focuses on the topics that are needed to help analyze and
understand big data. Part II covers the topics that can explain the systems required
for processing big data. Part III presents the topics required to understand and select
machine learning techniques to classify big data. Finally, Part IV concentrates on
the topics that explain the scaling-up machine learning, an important solution for
modern big data problems.

Greensboro, NC, USA Shan Suthaharan


Acknowledgements

The journey of writing this book would not have been possible without the sup-
port of many people, including my collaborators, colleagues, students, and family.
I would like to thank all of them for their support and contributions toward the suc-
cessful development of this book. First, I would like to thank Mr. Matthew Amboy
(Editor, Business and Economics: OR and MS, Springer Science+Business Media)
for giving me an opportunity to write this book. I would also like to thank both Ms.
Christine Crigler (Assistant Editor) and Mr. Amboy for helping me throughout the
publication process.
I am grateful to Professors Ratnasingham Shivaji (Head of the Department of
Mathematics and Statistics at the University of North Carolina at Greensboro) and
Fadil Santosa (Director of the Institute for Mathematics and its Applications at Uni-
versity of Minnesota) for the opportunities that they gave me to attend a machine
learning workshop at the institute. Professors Bin Yu (Department of Statistics,
University of California, Berkeley) and David Madigan (Department of Statistics,
Columbia University) delivered an excellent short course on applied statistics and
machine learning at the institute, and the topics covered in this course motivated
me and equipped me with techniques and tools to write various topics in this book.
My sincere thanks go to them. I would also like to thank Jinzhu Jia, Adams Blo-
niaz, and Antony Joseph, the members of Professor Bin Yu’s research group at the
Department of Statistics, University of California, Berkeley, for their valuable dis-
cussions in many machine learning topics.
My appreciation goes out to University of California, Berkeley, and University of
North Carolina at Greensboro for their financial support and the research assignment
award in 2013 to attend University of California, Berkeley as a Visiting scholar—
this visit helped me better understand the deep learning techniques. I would also
like to show my appreciation to Mr. Brent Ladd (Director of Education, Center for
the Science of Information, Purdue University) and Mr. Robert Brown (Managing
Director, Center for the Science of Information, Purdue University) for their sup-
port to develop a course on big data analytics and machine learning at University of
North Carolina at Greensboro through a sub-award approved by the National Sci-
ence Foundation. I am also thankful to Professor Richard Smith, Director of the
Statistical and Applied Mathematical Sciences Institute at North Carolina, for
the opportunity to attend the workshops on low-dimensional structure in high-
dimensional systems and to conduct research at the institute as a visiting research
fellow during spring 2014. I greatly appreciate the resources that he provided during
this visiting appointment. I also greatly appreciate the support and resources that the
University of North Carolina at Greensboro provided during the development of this
book.
The research work conducted with Professor Vaithilingam Jeyakumar and Dr.
Guoyin Li at the University of New South Wales (Australia) helped me simplify the
explanation of support vector machines. The technical report written by Michelle
Dunbar under Professor Jeyakumar’s supervision also contributed to the enhance-
ment of the chapter on support vector machines. I would also like to express my
gratitude to Professors Sat Gupta, Scott Richter, and Edward Hellen for sharing
their knowledge of some of the statistical and mathematical techniques. Professor
Steve Tate’s support and encouragement, as the department head and as a colleague,
helped me engage in this challenging book project for the last three semesters. My
sincere gratitude also goes out to Professor Jing Deng for his support and engage-
ment in some of my research activities.
My sincere thanks also go to the following students who recently contributed di-
rectly or indirectly to my research and knowledge that helped me develop some of
the topics presented in this book: Piyush Agarwal, Mokhaled Abd Allah, Michelle
Bayait, Swarna Bonam, Chris Cain, Tejo Sindhu Chennupati, Andrei Craddock,
Luning Deng, Anudeep Katangoori, Sweta Keshpagu, Kiranmayi Kotipalli, Varnika
Mittal, Chitra Reddy Musku, Meghana Narasimhan, Archana Polisetti, Chadwik
Rabe, Naga Padmaja Tirumal Reddy, Tyler Wendell, and Sumanth Reddy Yanala.
Finally, I would like to thank my wife, Manimehala Suthaharan, and my lovely
children, Lovepriya Suthaharan, Praveen Suthaharan, and Prattheeba Suthaharan,
for their understanding, encouragement, and support which helped me accomplish
this project. This project would not have been completed successfully without their
support.

Greensboro, NC, USA Shan Suthaharan


June 2015
About the Author

Shan Suthaharan is a Professor of Computer Science at the University of North
Carolina at Greensboro (UNCG), North Carolina, USA. He also serves as the Di-
rector of Undergraduate Studies at the Department of Computer Science at UNCG.
He has more than 25 years of university teaching and administrative experience and
has taught both undergraduate and graduate courses. His aspiration is to educate
and train students so that they can prosper in the computer field by understand-
ing current real-world and complex problems, and develop efficient techniques and
technologies. His current teaching interests include big data analytics and machine
learning, cryptography and network security, and computer networking and anal-
ysis. He earned his doctorate in Computer Science from Monash University, Aus-
tralia. Since then, he has been actively working on disseminating his knowledge and
experience through teaching, advising, seminars, research, and publications.
Dr. Suthaharan enjoys investigating real-world, complex problems, and develop-
ing and implementing algorithms to solve those problems using modern technolo-
gies. The main theme of his current research is the signature discovery and event
detection for a secure and reliable environment. The ultimate goal of his research
is to build a secure and reliable environment using modern and emerging technolo-
gies. His current research primarily focuses on the characterization and detection
of environmental events, the exploration of machine learning techniques, and the
development of advanced statistical and computational techniques to discover key
signatures and detect emerging events from structured and unstructured big data.
Dr. Suthaharan has authored or co-authored more than 75 research papers in the
areas of computer science, and published them in international journals and refer-
eed conference proceedings. He also invented a key management and encryption
technology, which has been patented in Australia, Japan, and Singapore. He also re-
ceived visiting scholar awards from and served as a visiting researcher at the Univer-
sity of Sydney, Australia; the University of Melbourne, Australia; and the University
of California, Berkeley, USA. He was a senior member of the Institute of Electri-
cal and Electronics Engineers, and volunteered as an elected chair of the Central
North Carolina Section twice. He is a member of Sigma Xi, the Scientific Research
Society and a Fellow of the Institution of Engineering and Technology.
Contents

1 Science of Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Technological Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Technological Advancement . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Big Data Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Facts and Statistics of a System . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Big Data Versus Regular Data . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Machine Learning Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Modeling and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Supervised and Unsupervised . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Collaborative Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 A Snapshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.1 The Purpose and Interests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.2 The Goal and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.3 The Problems and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

Part I Understanding Big Data

2 Big Data Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.1 Big Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Big Data Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 Big Data Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.3 Big Data Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.4 Big Data Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Big Data Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Distributed File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Classification Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3 Big Data Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


2.3.1 High-Dimensional Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Low-Dimensional Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Big Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


3.1 Analytics Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Choices of Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Pattern Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Statistical Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Graphical Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3 Coding Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Patterns of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Standardization: A Coding Example . . . . . . . . . . . . . . . . . . . . 47
3.3.2 Evolution of Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.3 Data Expansion Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.4 Deformation of Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.5 Classification Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Low-Dimensional Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.1 A Toy Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.2 A Real Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Part II Understanding Big Data Systems

4 Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


4.1 Hadoop Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.1 Hadoop Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.2 MapReduce Programming Model . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Hadoop System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.1 Operating System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.2 Distributed System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.3 Programming Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 Hadoop Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.1 Essential Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.2 Installation Guidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.3 RStudio Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4 Testing the Hadoop Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.1 Standard Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.2 Alternative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.5 Multinode Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


4.5.1 Virtual Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.2 Hadoop Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5 MapReduce Programming Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


5.1 MapReduce Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1.1 Parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.2 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 MapReduce Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.1 Mapper Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.2 Reducer Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.3 MapReduce Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.4 A Coding Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 MapReduce Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.1 Naming Convention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.2 Coding Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3.3 Application of Coding Principles . . . . . . . . . . . . . . . . . . . . . . . 110
5.4 File Handling in MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.1 Pythagorean Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.2 File Split Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4.3 File Split Improved . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

Part III Understanding Machine Learning

6 Modeling and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123


6.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.1.2 Domain Division Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.1.3 Data Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.1.4 Domain Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2 Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.1 Mathematical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2.2 Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.2.3 Layered Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2.4 Comparison of the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3.2 Types of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

7 Supervised Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145


7.1 Supervised Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.1.1 Parametrization Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.2 Optimization Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2.1 Continuous Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.2 Theory of Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.3.1 Discrete Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.3.2 Mathematical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.4 Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.4.1 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.4.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.5 Layered Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.5.1 Shallow Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.5.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

8 Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183


8.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.1.1 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.1.3 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.1.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.2.1 Tenfold Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.2.2 Leave-One-Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.2.3 Leave-p-Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
8.2.4 Random Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.2.5 Dividing Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.3 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.3.1 Quantitative Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.3.2 Qualitative Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.4 A Simple 2D Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

9 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207


9.1 Linear Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.1.1 Linear Classifier: Separable Linearly . . . . . . . . . . . . . . . . . . . . 208
9.1.2 Linear Classifier: Nonseparable Linearly . . . . . . . . . . . . . . . . . 218
9.2 Lagrangian Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.2.1 Modeling of LSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.2.2 Conceptualized Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.2.3 Algorithm and Coding of LSVM . . . . . . . . . . . . . . . . . . . . . . . 220

9.3 Nonlinear Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223


9.3.1 Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.3.2 Kernel Trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.3.3 SVM Algorithms on Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.3.4 Real Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

10 Decision Tree Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237


10.1 The Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
10.1.1 A Coding Example—Classification Tree . . . . . . . . . . . . . . . . . 241
10.1.2 A Coding Example—Regression Tree . . . . . . . . . . . . . . . . . . . 244
10.2 Types of Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.2.1 Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
10.2.2 Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.3 Decision Tree Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
10.3.1 Parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
10.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
10.4 Quantitative Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
10.4.1 Entropy and Cross-Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
10.4.2 Gini Impurity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
10.4.3 Information Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
10.5 Decision Tree Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.5.1 Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.5.2 Validation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.5.3 Testing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.6 Decision Tree and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.6.1 Toy Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

Part IV Understanding Scaling-Up Machine Learning

11 Random Forest Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273


11.1 The Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1.1 Parallel Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.1.2 Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.1.3 Gain/Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
11.1.4 Bootstrapping and Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
11.2 Random Forest Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.2.1 Parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
11.2.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
11.3 Random Forest Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
11.3.1 Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
11.3.2 Testing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

11.4 Random Forest and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284


11.4.1 Random Forest Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
11.4.2 Big Data Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

12 Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
12.2 Deep Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.2.1 No-Drop Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.2.2 Dropout Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.2.3 Dropconnect Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
12.2.4 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
12.2.5 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
12.2.6 MapReduce Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
12.3 Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
12.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
12.3.2 Parameters Mapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
12.4 Implementation of Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
12.4.1 Analysis of Domain Divisions . . . . . . . . . . . . . . . . . . . . . . . . . 303
12.4.2 Analysis of Classification Accuracies . . . . . . . . . . . . . . . . . . . 303
12.5 Ensemble Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

13 Chandelier Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309


13.1 Unit Circle Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.1.1 UCA Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
13.1.2 Improved UCA Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.1.3 A Coding Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
13.1.4 Drawbacks of UCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.2 Unit Circle Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.2.1 UCM Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.2.2 A Coding Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
13.2.3 Drawbacks of UCM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
13.3 Unit Ring Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
13.3.1 A Coding Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
13.3.2 Unit Ring Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
13.3.3 A Coding Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
13.3.4 Drawbacks of URM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
13.4 Chandelier Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
13.4.1 CDT-Based Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
13.4.2 Extension to Random Chandelier . . . . . . . . . . . . . . . . . . . . . . . 328
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

14 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329


14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
14.2 Feature Hashing Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
14.2.1 Standard Feature Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.2.2 Flagged Feature Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.3 Proposed Feature Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
14.3.1 Binning and Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
14.3.2 Mitigation Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
14.3.3 Toy Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
14.4 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
14.4.1 A Matlab Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
14.4.2 A MapReduce Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 337
14.5 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
14.5.1 Eigenvector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
14.5.2 Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.5.3 The Principal Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
14.5.4 A 2D Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
14.5.5 A 3D Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
14.5.6 A Generalized Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 352
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Chapter 1
Science of Information

Abstract The main objective of this chapter is to provide an overview of the modern
field of data science and some of the current progress in this field. The overview
focuses on two important paradigms: (1) big data paradigm, which describes a prob-
lem space for the big data analytics, and (2) machine learning paradigm, which
describes a solution space for the big data analytics. It also includes a preliminary
description of the important elements of data science. These important elements
are the data, the knowledge (also called responses), and the operations. The terms
knowledge and responses will be used interchangeably in the rest of the book. Pre-
liminary information on the data format, the data types, and the classification is also
presented in this chapter. This chapter emphasizes the importance of collaboration
between the experts from multiple disciplines and provides the information on some
of the current institutions that show collaborative activities with useful resources.

1.1 Data Science

Data science is an emerging field in the twenty-first century. The article by Mike
Loukides at the O’Reilly website [1] provides an overview, and it discusses data
sources and data scalability. We can define data science as the management and
analysis of data sets, the extraction of useful information, and the understanding of
the systems that produce the data. The system can be a single unit (e.g., a com-
puter network or a wireless sensor network) that is formed by many interconnecting
subunits (computers or sensors) that can collaborate under a certain set of prin-
ciples and strategies to carry out tasks, such as the collection of data, facts, or
statistics of an environment the system is expected to monitor. Some examples of
these systems include network intrusion detection systems [2], climate-change det-
ection systems [3], and public space intruder detection systems [4]. These real-world
systems may produce massive amounts of data, called big data, from many data
sources that are highly complex, unstructured, and hard to manage, process, and

analyze. This is currently a challenging problem for many industries, institutions,


and organizations, including businesses, health care sectors, information technology
divisions, government agencies, and research organizations. To address this prob-
lem, a separate field, big data science, has been created and requires a new direction
in research and educational efforts for its speedy and successful advancements [5].
One of the research problems in big data science is big data classification,
as reported in [6, 7], which involves the classification of different types of data
and the extraction of useful information from the massive and complex data sets.
The big data classification research requires technology that can handle problems
caused by the data characteristics (volume, velocity, and variety) of big data [5].
It also requires mathematical models and algorithms to classify the data efficiently
using appropriate technology, and these mathematical models and algorithms form
the field of machine learning discussed in [8–10].

1.1.1 Technological Dilemma

One of the technological dilemmas in big data science is the nonexistence of a
technology that can manage and analyze dynamically growing massive data effi-
ciently and extract useful information. Another dilemma is the lack of intelligent ap-
proaches that can select suitable techniques from many design choices (i.e., models
and algorithms) to solve big data problems. Additionally, if we invest in expensive
and modern technology, assuming that the data in hand is big data, and we later find
out that the data is not big data (which could have been solved by simple technology
and tools), then the investment is basically lost. In this case, the machine-learning
techniques like the supervised learning [11] and the dimensionality reduction [8, 12]
techniques are useful. A simple explanation on supervised learning can be found
at the MATLAB website [13]. One of the dimensionality reduction approaches is
called principal component analysis (PCA), and a simple tutorial on PCA can be
found at Brian Russell’s website [14]. In addition to these techniques, a framework
(or a systematic design) to test and validate the data early is also required and a
framework for this purpose is presented in Chap. 3.
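
As a quick illustration of the dimensionality reduction idea mentioned above, the following short R sketch applies PCA to a small simulated data set using the built-in prcomp function; the data and its size are made up purely for illustration.

# A tiny PCA sketch on simulated data (values are made up for illustration only).
set.seed(4)
X <- matrix(rnorm(200), ncol = 4)              # 50 observations with 4 features
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                                   # proportion of variance per principal component
pca$x[1:3, 1:2]                                # first three observations in the reduced 2D space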

1.1.2 Technological Advancement

The current advancements in the technology include the modern distributed file sys-
tems and the distributed machine learning. One such technology is called Hadoop
[15, 16], which facilitates distributed machine learning using external libraries, like
the scikit-learn library [17], to process big data. Most of the machine-learning
techniques in these libraries are based on classical models and algorithms and
may not be suitable for big data processing. However, some techniques,
like the decision tree learning and the deep learning, are suitable for big data clas-
sification, and they may help develop better supervised learning techniques in the
upcoming years. The classification techniques evolved from these models and alg-
orithms are the main focus, and they will be discussed in detail in the rest of the
book.

1.2 Big Data Paradigm

In this book, it is assumed that the big data paradigm consists of a big data system
and an environment. The goal of a system is to observe an environment and learn
its characteristics to make accurate decisions. For example, the goal of a network
intrusion detection system is to learn traffic characteristics and detect intrusions
to improve the security of a computer network. Similarly, the goal of a wireless
sensor network is to monitor changes in the weather to learn the weather patterns
for forecasting. The environment generates events, and the system collects the facts
and statistics, transforms them into knowledge with suitable operations, learns the
event characteristics, and predicts the environmental characteristics.

1.2.1 Facts and Statistics of a System

To understand a system and develop suitable technology, mathematical/statistical
models, and algorithms, we need clear definitions for two important terms, data and
knowledge, and for three operations, physical, mathematical, and logical operations.
The descriptions of these terms and operations are presented below.

1.2.1.1 Data

Data can be described as the hidden digital facts that the monitoring system collects.
Hidden digital facts are the digitized facts that are not obvious to the system without
further comprehensive processing. The definition of data should be based on the
knowledge that must be gained from it. One of the important requirements for the
data is the format. For example, the data could be presented mathematically or in a
two-dimensional tabular representation. Another important requirement is the type
of data. For example, the data could be labeled or not labeled. In the labeled data,
the digital facts are not hidden and can be used for training the machine-learning
techniques. In the unlabeled data, the digital facts are hidden and can be used for
testing or validation as a part of the machine-learning approach.

Fig. 1.1 Transformation of data into knowledge

1.2.1.2 Knowledge

Knowledge can be described as the learned information acquired from the data.
For example, the knowledge could be the detection of patterns in the data, the
classification of the varieties of patterns in the data, the calculation of unknown
statistical distributions, or the computation of the correlations of the data. It forms
the responses for the system, and it is called the “knowledge set” or “response
set” (sometimes called the “labeled set”). The data forms the domain, called “data
domain,” on which the responses are generated using a model f as illustrated in
Fig. 1.1. In addition to these two elements (i.e., the data and the knowledge), a
monitoring system needs three operations, called physical operations, mathemati-
cal operations, and logical operations in this book. The descriptions of these three
important operations are presented in the following subsections.

1.2.1.3 Physical Operation

Physical operations describe the steps involved in the processes of data capture,
data storage, data manipulation, and data visualization [18]. These are the important
contributors to the development of a suitable data domain for a system so that the
machine-learning techniques can be applied efficiently. Big data also means mas-
sive data, and the assumption is that it cannot be handled with a single file or a single
machine. Hence, the indexing and distribution of the big data over a distributed net-
work becomes necessary. One of the popular tools available in the market for this
purpose is the Hadoop distributed file system (http://hadoop.apache.org/), which
uses the MapReduce framework (http://hadoop.apache.org/mapreduce/) to accom-
plish these objectives. These modern tools help enhance the physical operations of a
system which, in turn, helps generate sophisticated, supervised learning models and
algorithms for big data classifications.
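
To make the (key, value) idea concrete, the following R sketch simulates a Mapper and a Reducer on a single machine; the event data and the aggregation chosen here are hypothetical, and no Hadoop installation is involved.

# Single-machine simulation of the MapReduce idea (hypothetical data; no Hadoop used).
events <- data.frame(protocol = c("tcp", "udp", "tcp", "icmp", "tcp"),
                     bytes    = c(120, 40, 300, 60, 90))
# "Mapper": emit a (key, value) pair for every event.
mapped <- lapply(seq_len(nrow(events)),
                 function(i) list(key = events$protocol[i], value = events$bytes[i]))
# "Reducer": collect the values that share a key and aggregate them (total bytes per protocol).
keys   <- sapply(mapped, function(kv) kv$key)
values <- sapply(mapped, function(kv) kv$value)
tapply(values, keys, sum)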

1.2.1.4 Mathematical Operation

Mathematical operations describe the theory and applications of appropriate
mathematical and statistical techniques and tools required for the transformation
of data into knowledge. This transformation can be written as a knowledge function
f : D ⇒ K as illustrated in Fig. 1.1, where the set D stands for the data domain and
the set K stands for the knowledge or response set. In this knowledge function, if the
data (i.e., the data domain) is structured, then the executions of these operations are
not difficult. Even if the structured data grows exponentially, these operations are not
difficult because they can be carried out using existing resources and tools. Hence,
the size of the data does not matter in the case of structured data in general.
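
For instance, a knowledge function can be written directly in R; the rule and the values below are made up only to show the transformation of a data domain D into a response set K.

# A hand-written knowledge function f : D => K (illustrative rule and values only).
f <- function(x1, x2) ifelse(x1 + x2 > 1, 1, 0)   # maps an observation to the set K = {0, 1}
D <- data.frame(x1 = c(0.2, 0.9, 0.4, 0.8),       # a small structured data domain
                x2 = c(0.1, 0.7, 0.9, 0.1))
K <- f(D$x1, D$x2)                                # the knowledge (responses) produced from D
cbind(D, response = K)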

1.2.1.5 Logical Operation

Logical operations describe the logical arguments, justifications, and interpretations
of the knowledge, which can be used to derive meaningful facts. For example, the
knowledge function f : D ⇒ K can divide (classify) the data domain and provide
data patterns, and then the logical operations and arguments must be used to justify
and interpret the class types from the patterns.

1.2.2 Big Data Versus Regular Data

In addition to the terminologies mentioned earlier, we also need to understand the
distinction between the new definition of big data and the definition of regular data.
Figure 1.2 demonstrates this distinction. Before we understand the information in
this figure, we need to understand three parameters of a system, n, p, and t, because
they determine the characteristics of the data, that is, whether it is a big data set or
a regular data set.

1.2.2.1 Scenario

An element of a monitoring system’s data can also be called an observation (or an
event). This book will use the term “observation” and the term “event” interchange-
ably. For example, an observation of a network intrusion detection system is the
traffic packet captured at a particular instance. Millions of events (n) may be cap-
tured within a short period of time (t) using devices like sensors and network routers
and analyzed using software tools to measure the environmental characteristics. An
observation generally depends on many independent variables called features, and
they form a space called feature space. The number of features (p) determines the
dimensionality of the system, and it controls the complexity of processing the data.
The features represent the characteristics of the environment that is monitored by

Fig. 1.2 Big data and regions of interest

the system. As an example, the source bytes, destination count, and protocol type
information found in a packet can serve as features of the computer network traf-
fic data. The changes in the values of feature variables determine the type (or the
class) of an event. To determine the correct class for an event, the event must be
transformed into knowledge.
In summary, the parameter n represents the number of observations captured
by a system at time t, which determines the size (volume) of the data set, and the
parameter p represents the number of features that determines the dimension of the
data and contributes to the number of classes (variety) in the data set. In addition,
the ratio between the parameters n and t determines the data rate (velocity) term as
described in the standard definition of big data [6].
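
As a simple numerical illustration (the numbers are invented), the R lines below show how n, p, and t translate into the volume, dimension, and velocity of a data set.

# Invented numbers illustrating the three parameters of a system.
n <- 2000000          # observations captured (volume)
p <- 42               # features per observation (dimension)
t <- 120              # capture period in seconds
velocity <- n / t     # data rate: about 16,667 observations per second
c(volume = n, dimension = p, velocity = velocity)
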
Now referring back to Fig. 1.2, the horizontal axis represents p (i.e., the dimen-
sion) and the vertical axis represents n (i.e., the size or the volume). The domain
defined by n and p is divided into four subdomains (small, large, high dimension,
and massive) based on the magnitudes of n and p. The arc boundary identifies the
regular data and massive data regions, and the massive data region becomes big data
when velocity and variety are included.

1.2.2.2 Data Representation

A data set may be defined in mathematical or tabular form. The tabular form is vi-
sual, and it can be easily understood by nonexperts. Hence this section first presents
the data representation tool in a tabular form, and it will be defined mathematically
from Chap. 2 onward. The data sets generally contain a large number of events as
mentioned earlier. Let us denote these events by E_1, E_2, ..., E_{m_n}. Now assume that
these observations can be divided into n separable classes denoted by C_1, C_2, ..., C_n
(where n is much smaller than m_n), where C_1 is a set of events E_1, E_2, ..., E_{m_1}, C_2
is a set of events E_1, E_2, ..., E_{m_2}, and so on (where m_1 + m_2 + ... = m_n). These
classes of events may be listed in the first column of a table. The last column of
the table identifies the corresponding class types. In addition, every set of events
depends on p features that are denoted by F_1, F_2, ..., F_p, and the values associated
with these features can be presented in the other columns of the table. For example,
the values associated with feature F_1 of the first set E_1, E_2, ..., E_{m_1} can be denoted
by x_{11}, x_{12}, ..., x_{1m_1}, indicating the event E_1 takes x_{11}, event E_2 takes x_{12}, and so on.
The same pattern can be followed for the other sets of events.
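
A toy version of this tabular representation can be written as an R data frame; the events, feature values, and class types shown here are invented purely to mirror the layout described above.

# Invented toy table: events in the first column, features F1-F3 next, class type last.
table_data <- data.frame(event = c("E1", "E2", "E3", "E4"),
                         F1    = c(0.8, 0.7, 0.2, 0.1),
                         F2    = c(120, 150, 30, 25),
                         F3    = c(1, 1, 0, 0),
                         class = c("C1", "C1", "C2", "C2"))
table_data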

1.3 Machine Learning Paradigm

Machine learning is about the exploration and development of mathematical models
and algorithms to learn from data. Its paradigm focuses on classification objectives
and consists of modeling an optimal mapping between the data domain and the
knowledge set and developing the learning algorithms. The classification is also
called supervised learning, which requires a training (labeled) data set, a validation
data set, and a test data set. The definitions and the roles of these data sets will be
discussed in Chap. 2. However, to briefly explain, the training data set helps find the
optimal parameters of a model, the validation data set helps avoid overfitting of the
model, and the test data set helps determine the accuracy of the model.

1.3.1 Modeling and Algorithms

The term modeling refers to both mathematical and statistical modeling of data.
The goal of modeling is to develop a parametrized mapping between the data
domain and the response set. This mapping could be a parametrized function or a
parametrized process that learns the characteristics of a system from the input (labeled)
data. The term algorithm can be confusing in the context of machine learning.
For a computer scientist, an algorithm means step-by-step, systematic instructions
for a computer to solve a problem. In machine learning, the modeling itself
may involve several algorithms to derive a model; however, the term algorithm
here refers to a learning algorithm. The learning algorithm is used to train, validate,
and test the model with a given data set: it finds optimal values for the parameters,
validates them, and evaluates the model's performance.
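The following Python sketch is a minimal illustration built on scikit-learn with synthetic data; the logistic regression model, the split proportions, and the candidate parameter values are assumptions made for this example, not recommendations from this book. It shows how a learning algorithm trains a parametrized model on the training set, uses the validation set to choose among candidate settings, and reports accuracy on the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Illustrative sketch only: a parametrized model (logistic regression) whose
# parameters are learned from training data, whose setting is chosen on a
# validation set, and whose accuracy is reported on a held-out test set.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):                 # validation guards against overfitting
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_model, best_score = model, score

print("test accuracy:", best_model.score(X_test, y_test))
```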

1.3.2 Supervised and Unsupervised

It is best to define supervised learning and unsupervised learning based on the class
definition. In supervised learning, the classes are known and the class boundaries are
well defined in the given (training) data set, and the learning is done using these
classes (i.e., class labels); hence, it is called classification. In unsupervised learning,
the classes or class boundaries are assumed to be unknown, so the class labels
themselves must also be learned, and the classes are defined from them. As a result, the class
boundaries are statistical rather than sharply defined, and the task is called clustering.

1.3.2.1 Classification

[Fig. 1.3 Classification is defined]

In classification problems [11], we assume labeled data (classes) are available to
generate rules (i.e., to generate classifiers through training) that can help to assign a
label to new data (i.e., testing data) that does not have labels. In this case, we can derive
an exact rule because of the availability of the labels. Figure 1.3 illustrates this
example. It shows two classes, labeled with white dots and black dots, and a straight-
line rule that helps to assign a label to a new data point. As stated before, labeled
data sets are available for the purpose of evaluating and validating machine-learning
techniques; hence the classification problem can be defined mathematically.
The classification problem may be addressed mathematically based on the data-
to-knowledge transformation mentioned earlier. Suppose a data set is given, and its
data domain D is $\mathbb{R}^l$, indicating that the events of the data set depend on l features
and form an l-dimensional vector space. If we assume that there are n classes, then
we can define the knowledge function (i.e., the model) as follows:

$$f : \mathbb{R}^l \rightarrow \{0, 1, 2, \ldots, n\} \quad (1.1)$$

In this function definition, the range $\{0, 1, 2, \ldots, n\}$ is the knowledge set, which
assigns the discrete values (labels) $0, 1, 2, \ldots, n$ to the different classes. This mathematical
function helps us to define suitable classifiers for the classification of the
data. Several classification techniques have been proposed in the machine learning
literature, and some of the well-known techniques are: support vector machine [19],
decision tree [20], random forest [21], and deep learning [22]. These techniques will
be discussed in detail in this book with programming and examples.
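As a minimal sketch of this setting (an illustration using scikit-learn and synthetic two-dimensional data, not the book's own program), a linear support vector machine can learn a straight-line rule from two labeled classes and then assign labels to new data points, in the spirit of the knowledge function in Eq. (1.1):

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch (not the book's code): two labeled classes ("white" = 0, "black" = 1)
# and a linear rule, analogous to the straight-line classifier of Fig. 1.3.
rng = np.random.default_rng(2)
white = rng.normal(loc=[-2, -2], size=(50, 2))    # class 0
black = rng.normal(loc=[+2, +2], size=(50, 2))    # class 1
X = np.vstack([white, black])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)              # learn a knowledge function f: R^2 -> {0, 1}
print(clf.predict([[-1.5, -1.0], [2.5, 1.8]]))    # assign labels to new, unlabeled points
```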

1.3.2.2 Clustering

[Fig. 1.4 Clustering is defined]

In clustering problems [23, 24], we assume data sets are available to generate rules,
but they are not labeled. Hence, we can only derive an approximate rule that can
help to label new data that do not have labels. Figure 1.4 illustrates this example.
It shows a set of points, all shown as white dots; however, a geometric pattern that
determines two clusters can be found. These clusters form a rule that helps to assign
a label to the given data points and thus to a new data point. As a result, the data may
only be clustered, not classified; hence, the clustering problem is defined with an
approximate rule. The clustering problem may also be addressed
mathematically based on the data-to-knowledge transformation mentioned earlier.
Once again, let us assume a data set is given, and its domain D is $\mathbb{R}^l$, indicating that
the events of the data set depend on l features and form an l-dimensional vector
space. If we extract structures (e.g., statistical or geometrical) and estimate that there are
$\hat{n}$ classes, then we can define the knowledge function as follows:

$$\hat{f} : \mathbb{R}^l \rightarrow \{0, 1, 2, \ldots, \hat{n}\} \quad (1.2)$$

The range $\{0, 1, 2, \ldots, \hat{n}\}$ is the knowledge set, which assigns the discrete labels
$0, 1, 2, \ldots, \hat{n}$ to the different classes. This function helps us to assign suitable labels to
new data. Several clustering algorithms have been proposed in machine learning:
k-means clustering, Gaussian mixture clustering, and hierarchical clustering [23].
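A corresponding sketch for clustering (again only an illustration with synthetic data; k-means is used here as one of the algorithms listed above) estimates $\hat{n} = 2$ clusters from unlabeled points and uses the resulting rule to label a new data point, in the spirit of Eq. (1.2):

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch (illustrative only): unlabeled points with two geometric groups,
# clustered into n_hat = 2 estimated classes, analogous to Fig. 1.4.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=[-2, -2], size=(50, 2)),
               rng.normal(loc=[+2, +2], size=(50, 2))])   # no labels are given

n_hat = 2
km = KMeans(n_clusters=n_hat, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                      # approximate labels assigned by the clustering rule
print(km.predict([[2.1, 1.9]]))            # label a new data point with the learned rule
```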

1.4 Collaborative Activities

Big data means big research. Without strong collaborative efforts between
experts from many disciplines (e.g., mathematics, statistics, computer science, medical
science, biology, and chemistry) and the timely dissemination of educational resources,
the goal of advancing the field of data science may not be achievable.
These issues have been recognized not only by researchers and academics but
also by government agencies and industries, and the resulting momentum has been noticeable
over the last several years. Some of the recent collaborative efforts and resources that can
provide long-term impact in the field of big data science are:
• Simons Institute UC Berkeley—http://simons.berkeley.edu/
• Statistical Applied Mathematical Science Institute—http://www.samsi.info/
• New York University Center for Data science—http://datascience.nyu.edu/
• Institute for Advanced Analytics—http://analytics.ncsu.edu/
• Center for Science of Information, Purdue University—http://soihub.org/
• Berkeley Institute for Data Science—http://bids.berkeley.edu/
• Stanford and Coursera—https://www.coursera.org/
• Institute for Data Science—http://www.rochester.edu/data-science/
• Institute for Mathematics and its Applications—http://www.ima.umn.edu/
• Data Science Institute—http://datascience.columbia.edu/
• Data Science Institute—https://dsi.virginia.edu/
• Michigan Institute for Data Science—http://minds.umich.edu/
An important note to readers: the websites (or web links) cited throughout this book
may change rapidly; please be aware of this. My plan is to keep the information in
this book current by posting updates at the following website:
http://www.uncg.edu/cmp/downloads/

1.5 A Snapshot

A snapshot of the entire book helps readers by informing them of the topics covered
in the book ahead of time, which allows them to conceptualize, summarize, and
understand the theory and applications. This section provides a snapshot of this book
under three categories: the purpose and interests, the goals and objectives, and the
problems and challenges.

1.5.1 The Purpose and Interests

The purpose of this book is to provide information on big data classification and
the related topics with simple examples and programming. Several interesting topics
contribute to big data classification, including the characteristics of data, the
relationships between data and knowledge, the models and algorithms that can help
