Big Data Analytics with Applications in Insider Threat Detection
Bhavani Thuraisingham
Mohammad Mehedy Masud
Pallabi Parveen
Latifur Khan
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the
validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to
publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let
us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users.
For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been
arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Chapter 1 Introduction...................................................................................................................1
1.1 Overview............................................................................................................1
1.2 Supporting Technologies.................................................................................... 2
1.3 Stream Data Analytics........................................................................................ 3
1.4 Applications of Stream Data Analytics for Insider Threat Detection................ 3
1.5 Experimental BDMA and BDSP Systems......................................................... 4
1.6 Next Steps in BDMA and BDSP........................................................................4
1.7 Organization of This Book................................................................................. 5
1.8 Next Steps........................................................................................................... 9
Chapter 11 Classification and Novel Class Detection in Concept-Drifting Data Streams.......... 127
11.1 Introduction.................................................................................................... 127
11.2 ECSMiner....................................................................................................... 127
11.2.1 Overview........................................................................................... 127
11.2.2 High Level Algorithm....................................................................... 128
11.2.3 Nearest Neighborhood Rule.............................................................. 129
11.2.4 Novel Class and Its Properties.......................................................... 130
11.2.5 Base Learners.................................................................................... 131
11.2.6 Creating Decision Boundary during Training.................................. 132
11.3 Classification with Novel Class Detection..................................................... 133
11.3.1 High-Level Algorithm....................................................................... 133
11.3.2 Classification..................................................................................... 134
11.3.3 Novel Class Detection....................................................................... 134
11.3.4 Analysis and Discussion.................................................................... 137
11.3.4.1 Justification of the Novel Class Detection Algorithm....... 137
11.3.4.2 Deviation between Approximate and Exact q-NSC
Computation...................................................................... 138
11.3.4.3 Time and Space Complexity.............................................. 140
11.4 Experiments.................................................................................................... 141
11.4.1 Datasets............................................................................................. 141
11.4.1.1 Synthetic Data with only Concept Drift (SynC)................ 141
11.4.1.2 Synthetic Data with Concept Drift and Novel Class
(SynCN)............................................................................. 141
11.4.1.3 Real Data—KDDCup 99 Network Intrusion Detection
(KDD)................................................................................ 141
11.4.1.4 Real Data—Forest Covers Dataset from UCI
Repository (Forest)............................................................ 142
11.4.2 Experimental Set-Up......................................................................... 142
11.4.3 Baseline Approach............................................................................ 142
11.4.4 Performance Study............................................................................ 143
11.4.4.1 Evaluation Approach......................................................... 143
11.4.4.2 Results................................................................................ 143
11.5 Summary and Directions................................................................................ 148
References................................................................................................................. 148
Chapter 12 Data Stream Classification with Limited Labeled Training Data............................. 149
12.1 Introduction.................................................................................................... 149
12.2 Description of ReaSC..................................................................................... 149
12.3 Training with Limited Labeled Data.............................................................. 152
12.3.1 Problem Description.......................................................................... 152
12.3.2 Unsupervised K-Means Clustering.................................................... 152
12.3.3 K-Means Clustering with Cluster-Impurity Minimization............... 152
12.3.4 Optimizing the Objective Function with Expectation
Maximization (E-M)......................................................................... 154
12.3.5 Storing the Classification Model....................................................... 155
Chapter 22 Stream Mining and Big Data for Insider Threat Detection...................................... 251
22.1 Introduction.................................................................................................... 251
22.2 Discussion....................................................................................................... 251
22.3 Future Work.................................................................................................... 252
22.3.1 Incorporate User Feedback............................................................... 252
22.3.2 Collusion Attack................................................................................ 252
22.3.3 Additional Experiments.................................................................... 252
22.3.4 Anomaly Detection in Social Network and Author Attribution....... 252
22.3.5 Stream Mining as a Big Data Mining Problem................................. 253
22.4 Summary and Directions................................................................................ 253
References................................................................................................................. 254
Conclusion to Part III................................................................................................ 257
Chapter 24 Big Data Analytics for Multipurpose Social Media Applications............................ 289
24.1 Introduction.................................................................................................... 289
24.2 Our Premise....................................................................................................290
24.3 Modules of InXite........................................................................... 291
24.3.1 Overview........................................................................................... 291
24.3.2 Information Engine........................................................................... 291
24.3.2.1 Entity Extraction................................................................ 292
24.3.2.2 Information Integration..................................................... 293
24.3.3 Person of Interest Analysis................................................................ 293
24.3.3.1 InXite Person of Interest Profile Generation and
Analysis............................................................................. 293
24.3.3.2 InXite POI Threat Analysis............................................... 294
24.3.3.3 InXite Psychosocial Analysis............................................ 296
24.3.3.4 Other features.................................................................... 297
24.3.4 InXite Threat Detection and Prediction............................................ 298
24.3.5 Application of SNOD........................................................................300
24.3.5.1 SNOD++...................................................................300
24.3.5.2 Benefits of SNOD++...................................................300
24.3.6 Expert Systems Support....................................................................300
24.3.7 Cloud-Design of InXite to Handle Big Data...................................... 301
24.3.8 Implementation..................................................................................302
24.4 Other Applications.........................................................................................302
24.5 Related Work.................................................................................................. 303
24.6 Summary and Directions................................................................................304
References.................................................................................................................304
Chapter 25 Big Data Management and Cloud for Assured Information Sharing........................307
25.1 Introduction....................................................................................................307
25.2 Design Philosophy..........................................................................................308
25.3 System Design................................................................................................309
25.3.1 Design of CAISS...............................................................................309
25.3.2 Design of CAISS++.................................................................. 312
25.3.2.1 Limitations of CAISS........................................................ 312
25.3.3 Formal Policy Analysis..................................................................... 321
25.3.4 Implementation Approach................................................................. 321
25.4 Related Work.................................................................................................. 321
Chapter 28 A Semantic Web-Based Inference Controller for Provenance Big Data................... 355
28.1 Introduction.................................................................................................... 355
28.2 Architecture for the Inference Controller....................................................... 356
28.3 Semantic Web Technologies and Provenance................................................360
28.3.1 Semantic Web-Based Models............................................................360
28.3.2 Graphical Models and Rewriting...................................................... 361
28.4 Inference Control through Query Modification............................................. 361
28.4.1 Our Approach.................................................................................... 361
28.4.2 Domains and Provenance.................................................................. 362
28.4.3 Inference Controller with Two Users................................................ 363
28.4.4 SPARQL Query Modification...........................................................364
Chapter 30 Unified Framework for Secure Big Data Management and Analytics...................... 391
30.1 Overview........................................................................................................ 391
30.2 Integrity Management and Data Provenance for Big Data Systems.............. 391
30.2.1 Need for Integrity.............................................................................. 391
30.2.2 Aspects of Integrity........................................................................... 392
30.2.3 Inferencing, Data Quality, and Data Provenance.............................. 393
30.2.4 Integrity Management, Cloud Services and Big Data....................... 394
30.2.5 Integrity for Big Data........................................................................ 396
30.3 Design of Our Framework.............................................................................. 397
30.4 The Global Big Data Security and Privacy Controller...................................400
30.5 Summary and Directions................................................................................ 401
References................................................................................................................. 401
Chapter 33 Toward a Case Study in Healthcare for Big Data Analytics and Security................ 433
33.1 Introduction.................................................................................................... 433
Preface
and Chapter 11 of Book #2 (for multimedia data mining). Book #5 (XML, Databases and the
Semantic Web) described XML technologies related to data management. It elaborated on Chapter
11 of Book #3. Book #6 (Web Data Mining and Applications in Business Intelligence and Counter-
terrorism) elaborated on Chapter 9 of Book #3. Book #7 (Database and Applications Security)
examined security for technologies discussed in each of our previous books. It focuses on the tech-
nological developments in database and applications security. It is essentially the integration of
Information Security and Database Technologies. Book #8 (Building Trustworthy Semantic Webs)
applies security to semantic web technologies and elaborates on Chapter 25 of Book #7. Book #9
(Secure Semantic Service-Oriented Systems) is an elaboration of Chapter 16 of Book #8. Book #10
(Developing and Securing the Cloud) is an elaboration of Chapters 5 and 25 of Book #9.
Our second series of books at present consists of five books. Book #1 is Design and Implementation of Data Mining Tools. Book #2 is Data Mining Tools for Malware Detection. Book #3 is Secure Data Provenance and Inference Control with Semantic Web. Book #4 is Analyzing and Securing Social Networks. Book #5, which is the current book, is Big Data Analytics with Applications in Insider Threat Detection. For this series, we are converting some of the practical aspects of our work with students into books. The relationships between our texts are illustrated in Appendix A.
information and knowledge management as separate areas, in this book we take a different approach
to data, information, and knowledge by differentiating between these terms as much as possible.
For us, data is usually some value such as numbers, integers, and strings. Information is obtained when some meaning or semantics is associated with the data, such as the fact that John's salary is 20K. Knowledge is something acquired through reading and learning that enables one to understand the data and information and to take action. That is, data and information can be transformed into knowledge when uncertainty about the data and information is removed from someone's mind. It should be noted that it is rather difficult to give strict definitions of data, information, and knowledge, and we will sometimes use these terms interchangeably. Our framework for data management, discussed in the appendix, helps clarify some of the differences. To be consistent with the terminology in our previous books, we will also distinguish between database systems and database management systems. A database management system is the component that manages the database containing persistent data; a database system consists of both the database and the database management system.
FINAL THOUGHTS
The goal of this book is to explore big data analytics techniques and apply them for cyber secu-
rity including insider threat detection. We will discuss various concepts, technologies, issues, and
challenges for both BDMA and BDSP. In addition, we also present several of the experimental
systems in cloud computing and secure cloud computing that we have designed and developed at
The University of Texas at Dallas. We have used some of the material in this book together with
the numerous references listed in each chapter for graduate level courses at The University of Texas
at Dallas on “Big Data Analytics” as well as on “Developing and Securing the Cloud.” We have also
provided several experimental systems developed by our graduate students.
It should be noted that the field is expanding very rapidly with several open source tools and
commercial products for managing and analyzing big data. Therefore, it is important for the reader
to keep up with developments in the various big data systems. However, security cannot be an afterthought: while the technologies for big data are being developed, it is important to include security from the outset.
Acknowledgments
We thank the administration at the Erik Jonsson School of Engineering and Computer Science
at The University of Texas at Dallas for giving us the opportunity to conduct our research. We
also thank Ms. Rhonda Walls, our project coordinator, for proofreading and editing the chapters.
Without her hard work this book would not have been possible. We thank many additional people
who have supported our work or collaborated with us.
• Dr. Robert Herklotz (retired) from the Air Force Office of Scientific Research for funding
our research on insider threat detection as well as several of our experimental systems.
• Dr. Victor Piotrowski from the National Science Foundation for funding our capacity
building work on assured cloud computing and secure mobile computing.
• Dr. Ashok Agrawal, formerly of National Aeronautics and Space Administration, for fund-
ing our research on stream data mining.
• Professor Jiawei Han and his team from the University of Illinois at Urbana-Champaign as well as Dr. Charu Aggarwal from IBM Research for collaborating with us on stream data
mining.
• Our colleagues Dr. Murat Kantarcioglu, Dr. Kevin Hamlen, Dr. Zhiqiang Lin, Dr. Kamil
Sarac and Dr. Alvaro Cardenas at The University of Texas at Dallas for discussions on our
work.
• Our collaborators on Assured Information Sharing at King’s College, University of London
(Dr. Maribel Fernandez and the late Dr. Steve Barker), the University of Insubria, Italy
(Dr. Elena Ferrari and Dr. Barbara Carminati), Purdue University (Dr. Elisa Bertino), and
the University of Maryland, Baltimore County (Dr. Tim Finin and Dr. Anupam Joshi).
• The following people for their technical contributions: Dr. Murat Kantarcioglu for his contributions to Chapters 25, 26, 28, 31, and 34; Mr. Ramkumar Paranthaman from Amazon for his contributions to Chapter 7; Dr. Tyrone Cadenhead from Blue Cross Blue Shield for his contributions to Chapter 28 (part of his PhD thesis); Dr. Farhan Husain and Dr. Arindam Khaled, both from Amazon, for their contributions to Chapter 23 (part of Husain’s PhD thesis); Dr. Satyen Abrol, Dr. Vaibhav Khadilkar, and Mr. Gunasekar Rajasekar for their contributions to Chapter 24; Dr. Vaibhav Khadilkar and Dr. Jyothsna Rachapalli for their contributions to Chapter 25; Mr. Pranav Parikh from Yahoo for his contributions to Chapter
26 (part of his MS thesis); Dr. David Lary and Dr. Vibhav Gogate, both from The University
of Texas at Dallas, for their contributions to Chapter 33; Dr. Alvaro Cardenas for his contri-
butions to Chapter 31; Dr. Zhiqiang Lin for his contributions to Chapters 32 and 34.
Permissions
Chapter 8: Challenges for Stream Data Classification
A practical approach to classify evolving data streams: Training with limited amount of labeled
data. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: ICDM ’08: Proceedings
of the 2008 International Conference on Data Mining, pp. 929–934, Pisa, Italy, Dec. 15–19, 2008.
Copyright 2008 IEEE. Reprinted with permission from IEEE Proceedings.
Integrating novel class detection with classification for concept-drifting data streams. M.
M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: Buntine, W., Grobelnik, M.,
Mladenić, D., Shawe-Taylor, J. (eds). Machine Learning and Knowledge Discovery in Databases.
ECML PKDD 2009. Lecture Notes in Computer Science, Vol. 5782. Springer, Berlin. Copyright
2009, with permission of Springer.
A multi-partition multi-chunk ensemble technique to classify concept-drifting data streams. M.
M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: PAKDD09: Proceedings of the
13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 363–375, Bangkok,
Thailand, Apr. 27–30, 2009. Springer-Verlag. Also Advances in Knowledge Discovery and Data
Mining. Copyright 2009, with permission of Springer.
Classification and novel class detection in concept-drifting data streams under time constraints.
M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: IEEE Transactions on
Knowledge and Data Engineering, Vol. 23, no. 6, pp. 859–874, June 2011. Copyright 2011 IEEE.
Reprinted with permission from IEEE.
Chapter 11: Classification and Novel Class Detection in Concept-Drifting Data Streams
A practical approach to classify evolving data streams: Training with limited amount of labeled
data. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: ICDM ’08: Proceedings
of the 2008 International Conference on Data Mining, pp. 929–934, Pisa, Italy, December 15–19,
2008. Copyright 2008 IEEE. Reprinted with permission from IEEE Proceedings.
Integrating novel class detection with classification for concept-drifting data streams. M.
M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: Buntine, W., Grobelnik, M.,
Mladenić, D., Shawe-Taylor, J. (eds). Machine Learning and Knowledge Discovery in Databases.
ECML PKDD 2009. Lecture Notes in Computer Science, Vol. 5782. Springer, Berlin. Copyright
2009, with permission of Springer.
A multi-partition multi-chunk ensemble technique to classify concept-drifting data streams. M.
M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: PAKDD09: Proceedings of the
13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 363–375, Bangkok,
Thailand, April 27–30, 2009. Springer-Verlag. Also Advances in Knowledge Discovery and Data
Mining. Copyright 2009, with permission of Springer.
Classification and novel class detection in concept-drifting data streams under time con-
straints. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: IEEE Transactions
on Knowledge and Data Engineering, Vol. 23, no. 6, pp. 859–874, June 2011. doi: 10.1109/
TKDE.2010.61. Copyright 2011 IEEE. Reprinted with permission from IEEE.
Chapter 12: Data Stream Classification with Limited Labeled Training Data
Facing the reality of data stream classification: Coping with scarcity of labeled data. M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. Hamlen, and B. M. Thuraisingham. Journal of Knowledge and Information Systems, Vol. 33, no. 1, pp. 213–244, 2012. Copyright 2012, with per-
mission of Springer.
A practical approach to classify evolving data streams: Training with limited amount of labeled
data. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: ICDM ’08: Proceedings
of the 2008 International Conference on Data Mining, pp. 929–934, Pisa, Italy, December 15–19,
2008. Copyright 2008 IEEE. Reprinted with permission from IEEE Proceedings.
A multi-partition multi-chunk ensemble technique to classify concept-drifting data streams. M.
M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. In: PAKDD09: Proceedings of the
13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 363–375, Bangkok,
Thailand, April 27–30, 2009. Springer-Verlag. Also Advances in Knowledge Discovery and Data
Mining. Copyright 2009, with permission of Springer.
Chapter 23: Cloud Query Processing System for Big Data Management
Heuristics-based query processing for large RDF graphs using cloud computing. M. F. Husain, J.
P. McGlothlin, M. M. Masud, L. R. Khan, IEEE Transactions on Knowledge and Data Engineering,
Vol. 23, no. 9, pp. 1312–1327, 2011. Copyright 2011 IEEE. Reprinted with permission from IEEE
Transactions on Knowledge and Data Engineering.
A token-based access control system for RDF data in the clouds. A. Khaled, M. F. Husain, L.
Khan, K. W. Hamlen. In: The 2010 IEEE Second International Conference on Cloud Computing
Technology and Science (CloudCom), pp. 104–111, 2010. Copyright 2010 IEEE. Reprinted with
permission from IEEE Proceedings.
Chapter 25: Big Data Management and Cloud for Assured Information Sharing
Cloud-centric assured information sharing. V. Khadilkar, J. Rachapalli, T. Cadenhead, M.
Kantarcioglu, K. W. Hamlen, L. Khan, M. F. Husain. Lecture Notes in Computer Science 7299, 2012,
pp. 1–26. Proceedings of Intelligence and Security Informatics—Pacific Asia Workshop, PAISI
2012, Kuala Lumpur, Malaysia, May 29, 2012. Springer-Verlag, Berlin, 2012. Copyright 2012, with
permission from Springer. DOI 10.1007/978-3-642-30428-6_1, Print ISBN 978-3-642-30427-9.
Chapter 29: Confidentiality, Privacy, and Trust for Big Data Systems
Administering the semantic web: Confidentiality, privacy and trust management. B. M.
Thuraisingham, N. Tsybulnik, A. Alam, International Journal of Information Security and Privacy,
Vol. 1, no. 1, pp. 18–34. Copyright 2007, with permission from IGI Global.
Authors
Dr. Bhavani Thuraisingham is the Louis A. Beecherl, Jr. Distinguished Professor in the Erik
Jonsson School of Engineering and Computer Science at The University of Texas at Dallas (UTD)
and the executive director of UTD’s Cyber Security Research and Education Institute. Her current
research is on integrating cyber security, cloud computing, and data science. Prior to joining UTD,
she worked at the MITRE Corporation for 16 years including a 3-year stint as a program director
at the NSF. She initiated the Data and Applications Security program at NSF and was part of the
Cyber Trust theme. Prior to MITRE, she worked for the commercial industry for 6 years including
at Honeywell. She is the recipient of numerous awards including the IEEE Computer Society 1997
Technical Achievement Award, the ACM SIGSAC 2010 Outstanding Contributions Award, 2012
SDPS Transformative Achievement Gold Medal, 2013 IBM Faculty Award, 2017 ACM CODASPY
Research Award, and 2017 IEEE Computer Society Services Computing Technical Committee
Research Innovation Award. She is a 2003 Fellow of the IEEE and the AAAS and a 2005 Fellow of
the British Computer Society. She has published over 120 journal articles, 250 conference papers,
15 books, has delivered over 130 keynote addresses, and is the inventor of five patents. She has
chaired conferences and workshops for women in her field including Women in Cyber Security,
Women in Data Science, and Women in Services Computing/Cloud and has delivered featured
addresses at SWE, WITI, and CRA-W.
Dr. Mohammad Mehedy Masud is currently an associate professor at the College of Information
Technology (CIT) at United Arab Emirates University (UAEU). Prior to joining UAEU in January
2012, Dr. Masud worked at The University of Texas at Dallas as a research associate for 2 years.
He earned his PhD in computer science from The University of Texas at Dallas, USA, in December
2009. Dr. Masud’s research interests include big data mining, data stream mining, machine
learning, healthcare data analytics, and e-health. His research also contributes to cyber security
(network security, intrusion detection, and malware detection) using machine learning and data
mining. He has published more than 50 research articles in high impact factor journals includ-
ing IEEE Transactions on Knowledge and Data Engineering (TKDE), Journal of Knowledge and
Information Systems (KAIS), and top tier conferences including IEEE International Conference on
Data Mining (ICDM). He is the lead author of the book Data Mining Tools for Malware Detection
and is also the principal inventor of a U.S. patent. He is the principal investigator of several presti-
gious research grants funded by government and private funding organizations.
Dr. Pallabi Parveen has been a principal big data engineer at AT&T since 2017, where she conducts research, design, and development activities on big data analytics for various applications. Prior to her work at AT&T, she was a senior software engineer at VCE/EMC2 for 4 years, where she was involved in research and prototyping efforts on big data systems. She completed her PhD at UT Dallas in 2013 on Big Data Analytics with Applications for Insider Threat Detection. She has also conducted research on facial recognition systems. Prior to her PhD, she worked for Texas Instruments on embedded software systems. She is an expert on big data management and analytics technologies and has published her research in top tier journals and conferences.
Dr. Latifur Khan is a professor of computer science and director of data analytics at The
University of Texas at Dallas (UTD) where he has been teaching and conducting research in data
management and data analytics since September 2000. He earned his PhD in computer science
from the University of Southern California in August of 2000. Dr. Khan is an ACM Distinguished
Scientist and has received prestigious awards including the IEEE Technical Achievement Award
for Intelligence and Security Informatics. Dr. Khan has published over 250 papers in prestigious journals and in peer-reviewed top tier conference proceedings. He is also the author of four books and has delivered keynote addresses at various conferences and workshops. He is the inventor of a number of patents and is involved in technology transfer activities. His research focuses on big data management and analytics, machine learning for cyber security, and complex data management including geospatial and multimedia data. He has served as the program chair for
multiple conferences.
1 Introduction
1.1 OVERVIEW
The U.S. Bureau of Labor Statistics (BLS) defines big data as a collection of large datasets
that cannot be analyzed with normal statistical methods. The datasets can represent numerical,
textual, and multimedia data. Big data is popularly defined in terms of five Vs: volume, velocity,
variety, veracity, and value. Big data management and analytics (BDMA) requires handling huge
volumes of data, both structured and unstructured, arriving at high velocity. By harnessing big
data, we can achieve breakthroughs in several key areas such as cyber security and healthcare,
resulting in increased productivity and profitability. Big data spans several important fields: busi-
ness, e-commerce, finance, government, healthcare, social networking, and telecommunications,
as well as several scientific fields such as atmospheric and biological sciences. BDMA is evolving
into a field called data science that not only includes BDMA, but also machine learning, statistical
methods, high-performance computing, and data management.
Data scientists aggregate, process, analyze, and visualize big data in order to derive useful
insights. BLS projected both computer programmers and statisticians to have high employment
growth during 2012–2022. Other sources have reported that by 2018, the United States alone could
face a shortage of 140,000–190,000 skilled data scientists. The demand for data science experts is
on the rise as the roles and responsibilities of a data scientist are steadily taking shape. Currently,
there is little debate that data science skill sets are not developing in proportion to industry demand. Therefore, it is imperative to bring data science research, development, and education efforts into the mainstream of computer science. Data are being collected by every organization, whether in industry, academia, or government, and organizations want to analyze these data to gain a competitive edge. Therefore, the demand for data scientists, including those with expertise in BDMA techniques, is growing severalfold every year.
While BDMA is evolving into data science with significant progress over the past 5 years, big
data security and privacy (BDSP) is becoming a critical need. With the recent emergence of the
quantified self (QS) movement, personal data collected by wearable devices and smartphone apps
are being analyzed to guide users in improving their health or personal life habits. These data are also being shared with other service providers (e.g., retailers) using cloud-based services, offering potential benefits to users (e.g., information about health products). But such data collection and sharing are often carried out without the users’ knowledge, bringing the grave danger that personal data may be used for improper purposes. Privacy violations could easily get out of control if data collectors could aggregate financial and health-related data with tweets, Facebook activity, and purchase patterns. In addition, the massive amounts of data collected have to be stored and access to them controlled. Yet few tools and techniques exist for protecting privacy in QS applications or controlling access to the data.
While securing big data and ensuring the privacy of individuals are crucial tasks, BDMA tech-
niques can be used to solve security problems. For example, an organization can outsource activities
such as identity management, email filtering, and intrusion detection to the cloud. This is because
massive amounts of data are being collected for such applications and this data has to be analyzed.
Cloud data management is just one example of big data management. The question is: how can the
developments in BDMA be used to solve cyber security problems? These problems include malware
detection, insider threat detection, intrusion detection, and spam filtering.
We have written this book to elaborate on some of the challenges in BDMA and BDSP as well
as to provide some details of our ongoing efforts on big data analytics and its applications in cyber
security. The specific BDMA techniques we will focus on include stream data analytics. Also, the
specific cyber security applications we will discuss include insider threat detection. We will also
describe some of the experimental systems we have designed relating to BDMA and BDSP as well
as provide some of our views on the next steps including developing infrastructures for BDMA and
BDSP to support education and experimentation.
This chapter details the organization of this book. The organization of this chapter is as follows.
Supporting technologies for BDMA and BDSP will be discussed in Section 1.2. Our research
and experimental work in stream data analytics including processing of massive data streams is
discussed in Section 1.3. Application of stream data analytics to insider threat detection is discussed
in Section 1.4. Some of the experimental systems we have designed and developed in topics related
to BDMA and BDSP will be discussed in Section 1.5. The next steps, including developing education
and experimental programs in BDMA and BDSP as well as some emerging topics such as Internet
of things (IoT) security as it relates to BDMA and BDSP are discussed in Section 1.6. Organization
of this book will be given in Section 1.7. We conclude this chapter with useful resources in Section
1.8. It should be noted that the contents of Sections 1.2 through 1.5 will be elaborated in Parts I
through V of this book. Figure 1.1 illustrates the contents covered in this chapter.
With respect to data security and privacy, we will describe database security issues, security
policy enforcement, access control, and authorization models for database systems, as well as data
privacy issues. With respect to data mining, which we will also refer to as data analytics, we will
introduce the concept and provide an overview of the various data mining techniques to lay the
foundations for some of the techniques to be discussed in Parts II through V. With respect to data
mining applications in security, we will provide an overview of how some of the data mining tech-
niques discussed may be applied for cyber security applications. With respect to cloud computing
and semantic web, we will provide some of the key points including cloud data management and
technologies such as resource description framework for representing and managing large amounts
of data. With respect to data mining and insider threat detection, we will discuss some of our
work on applying data mining for insider threat detection that will provide the foundations for the
concepts to be discussed in Parts II and III. Finally, with respect to BDMA technologies, we will
discuss infrastructures and frameworks, data management, and data analytics systems that will be
applied throughout the various sections in this book.
sensitive product designs and sell them to the competitors. This could be achieved manually or often
via cyber espionage. The malicious processes in the system can also carry out such covert operations.
Data mining techniques have been applied for cyber security problems including insider threat
detection. Techniques such as support vector machines and other supervised learning methods have
been applied. Unfortunately, the training process for supervised learning methods tends to be
time-consuming and expensive and generally requires large amounts of well-balanced training
data to be effective. Also, traditional training methods do not scale well for massive amounts of
insider threat data. Therefore, we have applied BDMA techniques for insider threat detection.
We have designed and developed several BDMA techniques for detecting malicious insiders.
In particular, we have adapted our stream data analytics techniques to handle massive amounts of
data and detect malicious insiders in Part III of this book. The concepts addressed in Part III are
illustrated in Figure 1.4.
FIGURE 1.4 Stream data analytics for insider threat detection: survey of insider threat; ensemble-based insider threat detection; learning classes; insider threat detection for sequence data; experimental results for nonsequence data; experimental results for sequence data; scalability and stream mining.
FIGURE 1.5 Experimental systems in big data, cloud, and security: cloud query processing for big data; big data analytics for social media; big data and cloud for assured information sharing; big data for secure information integration; big data analytics for malware detection; semantic web-based inference controller for provenance big data.
we have also begun developing both experimental and educational infrastructures for both BDMA
and BDSP.
The chapters in Part V will discuss the research, infrastructures, and educational challenges in
BDMA and BDSP. In particular, we will discuss the integration of confidentiality, privacy, and trust in
big data systems. We will also discuss big data challenges for securing IoT systems. We will discuss
our work in smartphone security as an example of an IoT system. We will also describe a proposed
case study for applying big data analytics techniques as well as discuss the experimental infrastructure
and education programs we have developed for both BDMA and BDSP. Finally, we will discuss the
research issues in BDSP. The topics to be covered in Part V are illustrated in Figure 1.6.
classification and describes our approach to meet those challenges. Chapter 9 discusses related work
in data stream classification, semisupervised clustering, and novelty detection. Chapter 10 describes
the multiple partitions of multiple chunks ensemble classification technique. Chapter 11 explains
ECSMiner, our novel class detection technique, in detail. Chapter 12 describes the limited labeled
data problem and our solution, ReaSC. Chapter 13 discusses our findings and provides directions for
further work in stream data analytics in general and stream data classification in particular.
Part III, consisting of nine chapters, will discuss the applications of stream analytics for insider
threat detection. In Chapter 14, we cast insider threat detection as a stream mining problem and dis-
cuss techniques for efficiently detecting anomalies in stream data. In Chapter 15, we present related
work with regard to insider threat and stream mining as well as related work with respect to BDMA.
In Chapter 16, we discuss ensemble-based classification methods. In Chapter 17, we describe the
different classes of learning techniques for nonsequence data. In Chapter 18, we discuss our test-
ing methodology and experimental results for the techniques discussed in Chapters 16 and 17. In
Chapter 19, we describe insider threat detection for sequence data, an ordered list of objects (or
events). Experiment results for sequence data are provided in Chapter 20. Scalability issues for
dealing with very large data streams are discussed in Chapter 21. Finally, stream mining techniques
with respect to big data for insider threat detection are discussed in Chapter 22.
Part IV, consisting of six chapters, will discuss some of the experimental systems we have
developed based on BDMA and BDSP. Some of these systems are also discussed in our previous
books. Our goal is to give some practical examples of systems based on the concepts discussed in
Parts I through III. The first is a cloud query processing system and will be discussed in Chapter
23. Chapter 24 discusses a stream-based social media analytics system that we have designed and
developed. Chapter 25 describes assured information sharing in the cloud. That is, we discuss the
policy-based information sharing system we have developed that operates in a cloud. Chapter 26
describes how information can be integrated in the cloud. Chapter 27 shows how some other cyber
security applications such as malware detection could be improved by implementing the data ana-
lytics techniques in the cloud. Finally, in Chapter 28, we describe the inference controller we have
developed for controlling unauthorized inference with provenance data and the use of big data
techniques to improve the performance.
Part V, consisting of seven chapters, discusses some of the challenges for BDMA and BDSP. In
particular, Chapter 29 describes how the notions of security, privacy, and trust management can be
incorporated into big data management systems. Chapter 30 describes a framework for BDMA and
BDSP systems. In particular, the design of a global inference controller automated with reason-
ing engines, policy enforcement systems, and data management systems is discussed. Chapter 31
describes secure IoT systems with respect to BDMA and BDSP. An example of a secure IoT system,
which is essentially a collection of connected smartphones, is discussed in Chapter 32. Chapter 33
describes a proposed case study for BDMA and BDSP based on scientific applications. In Chapter 34,
we discuss our planned experimental infrastructure and educational program for BDMA and BDSP.
Finally, Chapter 35 presents the results of the NSF workshop on BDSP and the research directions
discussed at this workshop.
Each part begins with an introduction and ends with a conclusion. Furthermore, each of the
Chapters 2 through 35 starts with an overview and ends with a summary and references. Chapter 36
summarizes this book and discusses future directions. We have included Appendix A that provides
an overview of data management and discusses the relationship between the texts we have written.
This has been the standard practice with all of our books. In Appendix B, we discuss database
systems management. Much of the work discussed in this book has evolved from the technologies
discussed in Appendix B.
We have essentially developed a five-layer framework to explain the concepts in this book. This
framework is illustrated in Figure 1.7. Layer 1 is the supporting technologies layer and covers the
chapters in Part I of this book. Layer 2 is the stream data analytics layer and covers the chapters
in Part II. Layer 3 is the stream data analytics for insider threat applications layer and covers the
chapters in Part III. Layer 4 is the experimental systems layer and covers the chapters in Part IV.
Layer 5 is the big data analytics and security layer consisting of the chapters in Part V. The relation-
ship between the various parts of this book is given in Figure 1.8.
Before we discuss the big data management and analytics (BDMA) and big data security and pri-
vacy (BDSP) systems that we have designed and developed in Parts II through IV, we need to
provide some of the technologies that are needed for an understanding of our systems. Part I will
provide an overview of such technologies.
Part I, consisting of six chapters, will describe supporting technologies for BDMA and BDSP.
Chapter 2 will describe security technologies. In particular, we will discuss various aspects of data
security and privacy. In Chapter 3, we provide some background information about general data
mining techniques so that the reader can have an understanding of the field. In Chapter 4, we will
discuss ways of applying data mining for cyber security. In particular, we will discuss the threats
to computers and networks and describe the applications of data mining to detect such threats and
attacks. In Chapter 5, we will provide an overview of cloud computing and semantic web technolo-
gies. This is because several of our experimental systems discussed in Part IV utilized cloud and
semantic web technologies. In Chapter 6, we will discuss how data mining technologies could be
applied for insider threat detection in the cloud. First, we will discuss how semantic web technolo-
gies may be used to represent the communication between insiders and then discuss our approach to
insider threat detection. Finally, in Chapter 7, we will discuss some of the BDMA technologies we
have used in the experimental systems we have designed and developed.
It should be noted that the big data systems supporting technologies that we have discussed have
evolved from database systems technology. Therefore, a basic knowledge of database systems tech-
nology is essential for an understanding of big data systems. An overview of database systems is
discussed in Appendix B of this book.
2 Data Security and Privacy
2.1 OVERVIEW
As we have stated in Chapter 1, secure big data technologies integrate big data technologies with
security technologies. In this chapter, we will discuss security technologies. In particular, we will
discuss various aspects of data security and privacy. Big data technologies will be discussed in
Chapter 7 after we provide an overview of some related technologies such as data mining and cloud
computing.
Since much of the discussion in this book is on big data analytics and security, we will provide a
fairly comprehensive overview of access control in data management systems. In particular, we will
discuss security policies as well as enforcing the policies in database systems. Our focus will be on
discretionary security policies. We will also discuss data privacy aspects. More details on secure
data management can be found in [FERR00] and [THUR05a].
The most popular discretionary security policy is the access control policy. Access control
policies were studied for operating systems back in the 1960s and then for database systems in
the 1970s. The two prominent database systems, System R and INGRES, were the first to investi-
gate access control for database systems (see [GRIF76] and [STON74]). Since then several varia-
tions of access control policies have been reported including role-based access control (RBAC) and
attribute-based access control [NIST]. Other discretionary policies include administration policies.
We also discuss identification and authentication under discretionary policies. Note that much of
the discussion in this chapter will focus on discretionary security in relational database systems.
Many of the principles are applicable to other systems such as object database systems, distributed
database systems, and cloud data management systems (see e.g., [THUR94]).
Before one designs a secure system, the first question that must be answered is what is the
security policy to be enforced by the system? Security policy is essentially a set of rules that enforce
security. Security policies include mandatory security policies and discretionary security policies.
Mandatory security policies are the policies that are “mandatory” in nature and enforced by the
systems. Discretionary security policies are policies that are specified by the administrator or the
owner of the data.
By policy enforcement, we mean the mechanisms to enforce the policies. For example, back
in the 1970s, the relational database system products such as System R and INGRES developed
techniques such as the query modification mechanisms for policy enforcement (see e.g., [GRIF76]
and [STON74]). The query language Structured Query Language (SQL) has been extended to spec-
ify security policies and access control rules. More recently languages such as eXtensible Markup
Language (XML) and resource description framework (RDF) have been extended to specify
security policies (see e.g., [BERT02] and [CARM04]).
The organization of this chapter is as follows. In Section 2.2, we introduce discretionary security
including access control and authorization models for database systems. We also discuss RBAC
systems. In Section 2.3, we discuss ways of enforcing discretionary security including a discus-
sion of query modification. We also provide an overview of the various commercial products. This
chapter is summarized in Section 2.4. Figure 2.1 illustrates the concepts discussed in this chapter.
We assume that the reader has some knowledge of data management. For more details on this topic
we refer the reader to some texts such as [DATE90] and [THUR97]. We also provide an overview
of database systems in Appendix B.
FIGURE 2.1 Concepts discussed in this chapter: data security (access control; policy enforcement) and data privacy (privacy-preserving data mining; privacy policy processing).
the context in which the data is displayed. Such rules can be enforced for discretionary security
also. For example, in the case of content-based constraints, John has read access to tuples only in
DEPT D100. In the case of context or association-based constraints, John does not have read access
to names and salaries taken together; however, he can have access to individual names and salaries.
In the case of event-based constraints, after the election, John has access to all elements in relation
EMP.
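To make these three types of constraints concrete, the following is a minimal sketch written in an SQL-like assertion syntax similar to the one used later in this chapter. The GRANT/DENY syntax here (including the ON, TAKEN TOGETHER, and AFTER EVENT clauses) is illustrative only and, like the assertions discussed later in this chapter, is not part of any standard; the relation and attribute names are assumptions for the example.

-- Content-based constraint: John may read only the tuples of employees in department D100.
GRANT JOHN READ ON EMP
WHERE EMP.D# = 'D100'

-- Context (association-based) constraint: John may not read names and salaries together,
-- although he may read each attribute individually.
DENY JOHN READ ON EMP (NAME, SALARY) TAKEN TOGETHER

-- Event-based constraint: after the election, John may read all elements of EMP.
GRANT JOHN READ ON EMP
AFTER EVENT ELECTION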
Authorization rules in a role hierarchy: the division manager has all the access that a department manager has, and the department manager has all the access that a group manager has.
role hierarchy? What happens to the child nodes? That is, does access propagate downward? For
example, if a department manager has access to certain information, then do his subordinates
have access to that information? Are there cases where the subordinates have access to data that
the department manager does not have? What happens if an employee has to report to two super-
visors, one his department manager and the other his project manager? What happens when the
department manager is working on a project and has to report to his project leader who also
works for him?
RBAC has been examined for relational systems, object systems, distributed systems, and
now some of the emerging technologies such as data warehouses, knowledge management sys-
tems, semantic web, e-commerce systems, and digital libraries. Furthermore, object models have
been used to represent roles and activities (see e.g., Proceedings of the IFIP Database Security
Conference series and more recently the Proceedings of the ACM Conference series on Data and
Applications Security and Privacy).
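As a simple illustration of how such a role hierarchy might be set up in practice, the sketch below uses the CREATE ROLE and GRANT ... TO statements available in SQL:1999-style systems; the role and table names are hypothetical. Granting the junior role to the senior role makes access propagate upward through the hierarchy, while a privilege granted only to the senior role remains invisible to the subordinate roles, which is one way of answering the propagation questions raised above.

-- Define the roles in the hierarchy.
CREATE ROLE group_manager;
CREATE ROLE department_manager;
CREATE ROLE division_manager;

-- Privileges granted to the junior role.
GRANT SELECT ON project_reports TO group_manager;

-- The department manager has all the access that a group manager has,
-- and the division manager has all the access that a department manager has.
GRANT group_manager TO department_manager;
GRANT department_manager TO division_manager;

-- A privilege granted only to the senior role does not propagate downward.
GRANT SELECT ON salary_summary TO division_manager;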
the user. These techniques are showing a lot of promise and are already being used. We can expect
widespread use of biometric techniques as face recognition technologies advance.
systems. Figure 2.8 illustrates the various aspects involved in enforcing security policies. These
include specification, implementation, and visualization (where visualization tools are used to view the policies).
If we are to grant John read access to the employees who earn <30K, then this assertion is specified
as follows.
GRANT JOHN READ
EMP
WHERE EMP.SALARY < 30K
Note that the assertions we have specified have not been incorporated into any standards. These are
some of our ideas. We need to explore ways of incorporating these assertions into the standards.
Policy Specification:
SQL extensions have also been proposed for RBAC. In fact, products such as Oracle’s Trusted
database product enforce RBAC. The access control rules are specified in an SQL-like language.
Note that there are many other specification languages that have been developed. These include
XML, RDF, and related languages for the web and the semantic web. Semantic web is essentially an
intelligent web. SQL-like languages have been specified for XML and RDF. For example, XML-QL
was developed for XML which then evolved into a language called XQuery. SPARQL is now the
query language for RDF (see [THUR07]). We will use such languages in our systems to be discussed
in Part IV. Figure 2.9 illustrates specification aspects for security policies.
Here we assume that the attributes of EMP are, say, name, salary, age, and department. Essentially, the “where” clause of the modified query carries all the constraints associated with the relation. We can also have constraints that span multiple relations. For example, we could have two relations, EMP and DEPT, joined by Dept #. Then the query is modified as follows:
Select * from EMP, DEPT
Where EMP.Salary < 30K
And EMP.D# = DEPT.D#
And DEPT.Name <> 'Security'
We have used some simple examples for query modification. The detailed algorithms can be
found in [DWYE87] and [STON74]. The high level algorithm is illustrated in Figure 2.10.
That is, once the query is modified, then the query tree has to be built. The idea is to push selections
and projections down in the query tree and carry out the join operation later.
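As a rough illustration of the query modification idea, the following sketch conjoins the constraints associated with each relation onto the WHERE clause of an incoming query before execution. The constraint strings and function name are ours for illustration; they are not the detailed algorithms of [STON74] or [DWYE87].

```python
# Query modification sketch: the security constraints associated with each
# relation are conjoined onto the WHERE clause of the user's query.
CONSTRAINTS = {
    "EMP": ["EMP.Salary < 30000"],
    "DEPT": ["DEPT.Name <> 'Security'"],
}

def modify_query(select_clause, from_relations, where_conditions=None):
    conditions = list(where_conditions or [])
    for rel in from_relations:
        conditions.extend(CONSTRAINTS.get(rel, []))
    query = f"SELECT {select_clause} FROM {', '.join(from_relations)}"
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query

if __name__ == "__main__":
    print(modify_query("*", ["EMP", "DEPT"], ["EMP.D# = DEPT.D#"]))
    # SELECT * FROM EMP, DEPT WHERE EMP.D# = DEPT.D#
    #   AND EMP.Salary < 30000 AND DEPT.Name <> 'Security'
```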
Other functions are also impacted by security constraints. Let us consider transaction man-
agement. Bertino and Musto have developed algorithms for integrity constraint processing for
transaction management (see [BERT89]). We have examined their techniques for mandatory
security constraint processing during transaction management. The techniques may be adapted for
discretionary security constraints. The idea is to ensure that the constraints are not violated during
transaction execution.
Constraints may be enforced on the metadata. For example, one could grant and revoke access to
users to the metadata relations. Discretionary security constraints for metadata could be handled in
the same way they are handled for data. Other database functions include storage management. The
issues in storage management include developing appropriate access methods and index strategies.
One needs to examine the impact of the security constraints on the storage management functions.
That is, can one partition the relations based on the constraints and store them in such a way so
that the relations can be accessed efficiently? We need to develop secure indexing technologies for
database systems. Some work on secure indexing for geospatial information systems is reported
in [ATLU04]. Databases are audited to determine whether any security violation has occurred.
Furthermore, views have been used to grant access to individuals for security purposes. We need
efficient techniques for auditing as well as for view management.
In this section, we have examined the impact of security on some of the database functions includ-
ing query management, transaction processing, metadata management, and storage management.
We need to also investigate the impact of security on other functions such as integrity constraint
processing and fault-tolerant computing. Figure 2.11 illustrates the impact of security on the database
functions. It should be noted that some of the discussions in this section have been extended for big
data management. This will be our focus especially in Parts IV and V of this book.
Following early work on data mining for counter-terrorism applications, there has been increasing interest in this topic over the past 15 years. Much research has been reported on balancing the need for privacy against the need for security.
The first effort on privacy-preserving data mining was reported in [AGRA00]. Several other efforts
on this topic followed since the early 2000s [KANT04]. In addition, treating the privacy problem as
a variation of the inference problem was studied in [THUR05b].
With the developments in big data technologies, there is significant interest in data privacy.
For example, a National Science Foundation workshop on Big Data Security and Privacy was
held in September 2014 and the results have been reported in [NSF14]. We will discuss some of
the findings at this workshop in Part V. With advancements in technology such as data analytics
and the interest in data privacy among the policy makers, lawyers, social scientists, and computer
scientists, we can expect significant developments in protecting the privacy of individuals as well
as ensuring their security. More details on big data security and privacy will be provided in Part
V of this book.
Policy management in the cloud and big data is an active area of research. Our work includes
access control as well as policy-based information sharing in the cloud. The experimental systems
we have developed on security policy enforcement in the cloud are discussed in Part IV.
REFERENCES
[AGRA00]. R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining,” SIGMOD Conference, pp. 439–450,
2000.
[ATLU04]. V. Atluri and S. Chun, “An Authorization Model for Geospatial Data,” IEEE Transaction on
Dependable and Secure Computing, 1 (4), 238–254, 2004.
[BERT89]. E. Bertino and D. Musto, “Integrity Constraint Processing During Transaction Processing,”
Acta Informatica, 26 (1–2), 25–57, 1988.
[BERT02]. E. Bertino et al., “Access Control for XML Documents,” Data and Knowledge Engineering, 43 (3),
2002.
[BERT06]. E. Bertino, “Digital Identity Management and Protection,” Proceedings of the 2006 International
Conference on Privacy, Security and Trust, Ontario, Canada, 2006.
[CARM04]. B. Carminati et al., “Security for RDF,” Proceedings of the DEXA Conference Workshop on Web
Semantics, Zaragoza, Spain, August, 2004.
[DATE90]. C. Date, An Introduction to Database Systems, Addison-Wesley, Reading, MA, 1990.
[DWYE87]. P. Dwyer et al., “Multilevel Security for Relational Database Systems,” Computers and Security,
6 (3), 252–260, 1987.
[FERR00]. E. Ferrari and B. Thuraisingham, “Secure Database Systems,” In Advances in Database
Management, M. Piatini and O. Diaz, editors, Artech House, UK, 2000.
[GRIF76]. P. Griffiths and B. Wade, “An Authorization Mechanism for a Relational Database System,”
ACM Transactions on Database Systems, 1 (3), 242–255, 1976.
[KANT04]. M. Kantarcioglu and C. Clifton, “Privacy-Preserving Distributed Mining of Association Rules
on Horizontally Partitioned Data,” IEEE Transactions on Knowledge and Data Engineering, 16 (9),
1026–1037, 2004.
[NIST]. Guide to Attribute-Based Access Control (ABAC) Definition and Considerations, NIST Special
Publication 800-162, 2014.
[NSF14]. National Science Foundation Workshop, http://csi.utdallas.edu/events/NSF/NSF-workhop-Big-Data-
SP-Feb9-2015_FINAL.pdf
[PARK04]. J. Park and R. Sandhu, “The UCON Usage Control Model,” ACM Transactions on Information and
Systems Security, 7 (#1), 128–174, 2004.
[SAND96]. R. Sandhu et al., “Role-Based Access Control Models,” IEEE Computer, 29 (2), 38–47, 1996.
[SQL3]. en.wikipedia.org/wiki/SQL, American National Standards Institute, Draft, Maynard, MN, 1992.
[STON74]. M. Stonebraker and E. Wong, “Access Control in a Relational Database Management System by
Query Modification,” Proceedings of the ACM Annual Conference, ACM Press, NY, 1974.
[THUR87]. B. Thuraisingham “Security Checking in Relational Database Management Systems Augmented
with Inference Engines,” Computers and Security, 6 (6), 479–492, 1987.
[THUR89]. B. Thuraisingham and P. Stachour, “SQL Extensions for Security Assertions,” Computer Standards
and Interface Journal, 11 (1), 5–14, 1989.
[THUR93]. B. Thuraisingham, W. Ford, and M. Collins, “Design and Implementation of a Database Inference
Controller,” Data and Knowledge Engineering, 11 (3), 5–14, 1993.
[THUR94]. B. Thuraisingham, “Security Issues for Federated Database Systems,” Computers and Security, 13
(6), 509–525, 1994.
[THUR97]. B. Thuraisingham, Data Management Systems: Evolution and Interoperation, CRC Press, Boca
Raton, FL, 1997.
[THUR05a]. B. Thuraisingham, Database Security, Integrating Database Systems and Information Security,
CRC Press, Boca Raton, FL, 2005.
[THUR05b]. B. M. Thuraisingham, “Privacy Constraint Processing in a Privacy-Enhanced Database
Management System,” Data and Knowledge Engineering, 55 (2), 159–188, 2005.
[THUR07]. B. Thuraisingham, Building Trustworthy Semantic Webs, CRC Press, Boca Raton, FL, 2007.
3 Data Mining Techniques
3.1 INTRODUCTION
We have used data mining and analytics techniques in several of our efforts for various applications
such as social media systems and intrusion detection systems. For example, in our previous book
[THUR16], we discussed algorithms for location-based data mining that will extract the locations
of the various social media (e.g., Twitter) users. These algorithms can be extended to extract other
demographics data. Our prior research has also developed data mining tools for sentiment analysis
as well as for cyber security applications. In Parts II and III, we will discuss scalability aspects of
stream data mining and will apply the techniques for cyber security applications (e.g., insider threat
detection). In this chapter, we provide some background information about general data mining
techniques so that the reader can have an understanding of the field. Cyber security applications of
data mining will be discussed in Chapter 4.
Data mining outcomes (also called tasks) include classification, clustering, forming associations,
as well as detecting anomalies. Our tools have mainly focused on classification as the outcome, and
we have developed classification tools. The classification problem is also referred to as supervised
learning in which a set of labeled examples is learned by a model, and then a new example with
unknown labels is presented to the model for prediction.
There are many prediction models that have been used such as the Markov model, decision
trees, artificial neural networks (ANNs), support vector machines (SVMs), association rule mining
(ARM), among others. Each of these models has its strengths and weaknesses. However, there is a
common weakness among all of these techniques, which is the inability to suit all applications. The
reason that there is no such ideal or perfect classifier is that each of these techniques was initially
designed to solve specific problems under certain assumptions.
In this chapter, we discuss the data mining techniques that have been commonly used. Specifically,
we present the Markov model, SVM, ANN, ARM, the problem of multiclassification, as well as
image classification, which is an aspect of image mining. In our research and development, we
have designed hybrid models to improve the prediction accuracy of data mining algorithms in vari-
ous applications, namely, intrusion detection, social media analytics, WWW prediction, and image
classification [AWAD09].
The organization of this chapter is as follows. In Section 3.2, we provide an overview of various
data mining tasks and techniques. The techniques that we have used in our work are discussed in
Sections 3.3 through 3.8. In particular, neural networks, SVM, Markov models, and ARM, as well
as some other classification techniques will be described. This chapter is summarized in Section
3.9. It should be noted that a breakthrough data mining technique we have designed, called novel
class detection, to be discussed in Part III has evolved from the experiences we have gained from
using techniques such as SVM, ANN, and ARM. It should be noted that we have used the terms data
mining and data analytics interchangeably throughout this book.
[Figure 3.1: Data mining outcomes — classification (build profiles of terrorists and classify terrorists), clustering (divide a population, e.g., people from country X of a certain religion; people from country Y interested in airplanes), association (John and James often seen together after an attack), link analysis (follow the chain from A to B to C to D), and anomaly detection (John registers at a flight school but does not care about take off or landing).]
[Figure 3.2: Data mining techniques.]
is an anomaly or not. Descriptive tasks in general include making associations and forming clusters.
Therefore, classification, anomaly detection, making associations and forming clusters are also
thought to be data mining tasks.
Next, the data mining techniques can be predictive, descriptive, or both. For example, neural networks can perform classification as well as clustering. Classification techniques include decision trees, SVMs, and memory-based reasoning. ARM techniques are generally used to make associations. Link analysis can also make associations between links and predict new links. Clustering techniques include K-means clustering. An overview of the data mining tasks (i.e., the outcomes of data mining) is illustrated in Figure 3.1. The techniques (e.g., neural networks, SVMs) are illustrated in Figure 3.2.
[Figure: A single perceptron — inputs x0 = 1, x1, …, xn are weighted by w0, w1, …, wn, summed, and thresholded: O = 1 if Σ_{i=0}^{n} w_i x_i > 0 and −1 otherwise.]

$$
o(x_1, \ldots, x_n) =
\begin{cases}
1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\
-1 & \text{otherwise}
\end{cases}
\tag{3.1}
$$

Notice that $w_i$ corresponds to the contribution of the input vector component $x_i$ to the perceptron output. Also, in order for the perceptron to output a 1, the weighted combination of the inputs $\left(\sum_{i=1}^{n} w_i x_i\right)$ must be greater than the threshold $-w_0$.
Learning the perceptron involves choosing values for the weights w0, w1, …, wn.
Initially, random weight values are given to the perceptron. Then the perceptron is applied to each
training example updating the weights of the perceptron whenever an example is misclassified. This
process is repeated many times until all training examples are correctly classified. The weights are
updated according to the following rule:
$$
w_i = w_i + \delta w_i, \qquad \delta w_i = \eta (t - o)\, x_i
\tag{3.2}
$$
where η is a learning constant, o is the output computed by the perceptron, and t is the target output
for the current training example.
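The perceptron training rule of Equations 3.1 and 3.2 can be sketched in a few lines of Python; the toy dataset, learning rate, and epoch count below are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron learning rule: w_i = w_i + eta * (t - o) * x_i.
    X is an (m, n) array of examples, t holds targets in {-1, +1}."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])       # prepend x0 = 1 for the bias w0
    w = np.random.uniform(-0.5, 0.5, n + 1)    # small random initial weights
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            o = 1 if np.dot(w, x) > 0 else -1  # thresholded output (Eq. 3.1)
            w += eta * (target - o) * x        # changes w only if misclassified
    return w

if __name__ == "__main__":
    # A linearly separable toy problem (logical OR mapped to -1/+1 labels).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    t = np.array([-1, 1, 1, 1])
    w = train_perceptron(X, t)
    print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))
```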
The computation power of a single perceptron is limited to linear decisions. However, the
perceptron can be used as a building block to compose powerful multilayer networks. In this case,
a more complicated updating rule is needed to train the network weights. In this work, we employ
an ANN of two layers and each layer is composed of three building blocks (see Figure 3.4). We use
the back-propagation algorithm for learning the weights. The back-propagation algorithm attempts
to minimize the squared error function.
A typical training example in WWW prediction is 〈[kt−τ+1, …, kt−1, kt]T, d〉, where [kt−τ+1, …, kt−1, kt]T
is the input to the ANN and d is the target web page. Notice that the input units of the ANN in Figure
3.5 are τ previous pages that the user has recently visited, where k is a web page ID. The output of the
network is a Boolean value, not a probability. We will see later how to approximate the probability
of the output by fitting a sigmoid function after ANN output. The approximated probabilistic output
[Figure 3.5: The ANN design for WWW prediction — input units X1, X2, …, Xn with weights (e.g., w10, w11, w21, wn1) feeding sigmoid units; the output is pt+1.]
becomes o′ = f(o(I)) = pt+1, where I is an input session and pt+1 = p(d|kt−τ+1, …, kt). We choose the sigmoid function (Equation 3.3) as the transfer function so that the ANN can handle nonlinearly separable datasets [MITC97]. Notice that in our ANN design (Figure 3.5), we use a sigmoid transfer function (3.3) in each building block. In Equation 3.3, I is the input to the network, o is the output of the network, w is the matrix of weights, and σ is the sigmoid function:
$$
o = \sigma(w \cdot I), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}
\tag{3.3}
$$
$$
E(W) = \frac{1}{2} \sum_{k \in D} \sum_{i \in \mathrm{outputs}} (t_{ik} - o_{ik})^2
\tag{3.4}
$$
$$
w_{ji} = w_{ji} + \delta w_{ji}, \qquad \delta w_{ji} = -\eta\, \frac{\partial E_d}{\partial w_{ji}}
\tag{3.5}
$$

$$
\delta w_{ji}(n) = -\eta\, \frac{\partial E_d}{\partial w_{ji}} + \alpha\, \delta w_{ji}(n-1)
\tag{3.6}
$$
We implement the back-propagation algorithm for training the weights. The back-propagation algorithm employs gradient descent to attempt to minimize the squared error between the network output values and the target values of these outputs. The sum of the error over all of the network output units is defined in Equation 3.4, where outputs is the set of output units in the network, D is the training set, and tik and oik are the target and output values associated with the ith output unit and training example k, respectively. A specific weight wji in the network is updated for each training example as in Equation 3.5, where η is the learning rate and wji is the weight associated with the ith input to network unit j (for details see [MITC97]). As we can see from Equation 3.5, the search direction δwji is computed using gradient descent, which guarantees convergence only toward a local minimum. To mitigate this, we add a momentum term to the weight update rule so that the weight update direction δwji(n) depends partially on the update direction in the previous iteration, δwji(n − 1). The new weight update direction is shown in Equation 3.6, where n is the current iteration and α is the momentum constant. Notice that in Equation 3.6, the step size
is slightly larger than that in Equation 3.5. This contributes to a smooth convergence of the search
in regions where the gradient is unchanging [MITC97].
In our implementation, we set the step size η dynamically based on the distribution of the classes
in the dataset. Specifically, we set the step size to large values when updating the training examples
that belong to low-distribution classes and vice versa. This is because when the distribution of
the classes in the dataset varies widely (e.g., a dataset might have 5% positive examples and 95%
negative examples), the network weights converge toward the examples from the class of the larger
distribution, which causes a slow convergence. Furthermore, we adjust the learning rates slightly
by applying the momentum constant (3.6) to speed up the convergence of the network [MITC97].
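The following minimal sketch illustrates the momentum-based update of Equation 3.6 on a simple one-dimensional quadratic error surface; the error function, learning rate, and momentum constant are illustrative assumptions rather than the settings used in our experiments.

```python
# Gradient descent with momentum (Eq. 3.6): the update direction depends
# partially on the update direction of the previous iteration.
def descend(grad, w0=5.0, eta=0.1, alpha=0.8, iterations=50):
    w, delta_prev = w0, 0.0
    for _ in range(iterations):
        delta = -eta * grad(w) + alpha * delta_prev
        w += delta
        delta_prev = delta
    return w

if __name__ == "__main__":
    # E(w) = (w - 2)^2 has gradient 2 * (w - 2); the minimum is at w = 2.
    print(round(descend(lambda w: 2.0 * (w - 2.0)), 3))
```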
$$
\min_{(w,\,b)} \ \frac{1}{2}\, w^{T} w \qquad \text{subject to } y_i (w \cdot x_i - b) \ge 1
\tag{3.7}
$$
[Figure 3.7: The SVM separator that causes the maximum margin — panel (a) shows Class 1 and Class 2 separated by a hyperplane (points with w · x + b > 0 on one side and w · x + b < 0 on the other); panel (b) shows the maximum-margin hyperplane w · x + b = 0 with margin boundaries w · x + b = +1 and w · x + b = −1.]
$$
f(x) = \operatorname{sign}(w \cdot x - b)
\tag{3.8}
$$

$$
\text{maximize } L(w, b, \alpha) = \frac{1}{2}\, w^{T} w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i - b) + \sum_{i=1}^{N} \alpha_i
\tag{3.9}
$$

$$
f(x) = \operatorname{sign}(w \cdot x - b) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x - b) \right)
\tag{3.10}
$$
Notice that Equation 3.8 computes the sign of the functional margin of point x in addition to the prediction label of x; that is, the functional margin of x equals w · x − b.
The SVM optimization problem is a convex quadratic programming problem (in w, b) over a convex set (Equation 3.7). We can solve the Wolfe dual instead (as in Equation 3.9) with respect to α, subject to the constraints that the gradient of L(w, b, α) with respect to the primal variables w and b vanishes and that αi ≥ 0. The primal variables are eliminated from L(w, b, α) (see [CRIS00] for more details). When we solve for the αi, we obtain $w = \sum_{i=1}^{N} \alpha_i y_i x_i$, and we can classify a new object x using Equation 3.10. Note
that the training vectors occur only in the form of the dot product and that there is a Lagrangian mul-
tiplier αi for each training point, which reflects the importance of the data point. When the maximal
margin hyper-plane is found, only points that lie closest to the hyper-plane will have αi > 0, and these
points are called support vectors. All other points will have αi = 0 (see Figure 3.8a). This means that
only those points that lie closest to the hyper-plane give the representation of the hypothesis/classifier.
These most important data points serve as support vectors. Their values can also be used to give an independent bound on the reliability of the hypothesis/classifier [BART99].
Figure 3.8a shows two classes and their boundaries, that is, margins. The support vectors are
represented by solid objects, while the empty objects are nonsupport vectors. Notice that the mar-
gins are only affected by the support vectors, that is, if we remove or add empty objects, the margins
will not change. Meanwhile any change in the solid objects, either adding or removing objects,
could change the margins. Figure 3.8b shows the effects of adding objects in the margin area. As
we can see, adding or removing objects far from the margins, for example, data point 1 or −2, does
not change the margins. However, adding and/or removing objects near the margins, for example,
data point 2 and/or −1, has created new margins.
FIGURE 3.8 (a) The values of α for support vectors and nonsupport vectors. (b) The effect of adding new data points on the margins. [Figure: panel (a) marks the support vectors (α > 0) lying on the margins; panel (b) shows the old margins and how points added near the margins (e.g., points 2 and −1) change them, while points far from the margins (e.g., points 1 and −2) do not.]
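As an illustration of these ideas, the sketch below trains a linear SVM on a toy two-class dataset using the scikit-learn library and prints the resulting support vectors (the points with αi > 0); the data and parameter settings are made up for illustration.

```python
from sklearn.svm import SVC
import numpy as np

# Two toy classes in the plane; labels are +1 and -1 as in the text.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],     # class +1
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the points closest to the separating hyperplane become support
# vectors (alpha_i > 0); all other points have alpha_i = 0.
print("support vectors:\n", clf.support_vectors_)
print("prediction for [3, 3]:", clf.predict([[3.0, 3.0]]))
```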
typically considered as Markov chains that are then fed as input. The basic concept of the Markov model
is that it predicts the next action depending on the result of previous action or actions. Actions can mean
different things for different applications. For the purpose of illustration, we will consider actions specific
for the WWW prediction application. In WWW prediction, the next action corresponds to the prediction
of the next page to be traversed. The previous actions correspond to the previous web pages to be con-
sidered. Based on the number of previous actions considered, Markov models can have different orders:
The 0th-order Markov model is the unconditional probability of the state (or web page) (Equation
3.11). In Equation 3.11, Pk is a web page and Sk is the corresponding state. The first-order Markov
model, Equation 3.12, can be computed by taking page-to-page transitional probabilities or the
n-gram probabilities of {P1, P2}, {P2, P3}, …, {Pk−1, Pk}.
In the following, we present an illustrative example of different orders of the Markov model and
how it can predict.
EXAMPLE:
Imagine a website of six web pages: P1, P2, P3, P4, P5, and P6. Suppose we have user sessions as
given in Table 3.1. This table depicts the navigation of many users of that website. Figure 3.9 shows
the first-order Markov model, where the next action is predicted based on only the last action
TABLE 3.1
Collection of User Sessions and Their Frequencies
Session Frequency
P1, P2, P4 5
P1, P2, P6 1
P5, P2, P6 6
P5, P2, P3 3
[Figure 3.9: First-order Markov model corresponding to Table 3.1 — states include S, P2, P3, P5, P6, and F, with each arc labeled by its frequency and transition probability (e.g., the arc from P2 to P3 is labeled (3, 0.2)).]
performed, that is, the last page traversed by the user. States S and F correspond to the initial and
final states, respectively. The probability of each transition is estimated by the ratio of the number
of times the sequence of states was traversed and the number of times the anchor state was visited.
Next to each arc in Figure 3.9, the first number is the frequency of that transition, and the second number is the transition probability. For example, the transition probability of the transition (P2 to P3) is 0.2 because the number of times users traverse from page 2 to page 3 is 3 and the number of times page 2 is visited is 15 (i.e., 0.2 = 3/15).
Notice that the transition probability is used to make predictions. For example, given that a user has already visited P2, the most probable page he or she visits next is P6. That is because the transition probability from P2 to P6 is the highest.
Notice that the transition probability might not be available for some pages. For example, the
transition probability from P2 to P5 is not available because no user has visited P5 after P2. Hence,
these transition probabilities are set to zero. Similarly, in the Kth-order Markov model, the prediction is computed after considering the last K actions performed by the users (Equation 3.13). In WWW prediction, the Kth-order Markov model gives the probability that a user visits the kth page given his or her previous k − 1 page visits.
Figure 3.10 shows the second-order Markov model that corresponds to Table 3.1. In the second-order
model, we consider the last two pages. The transition probability is computed in a similar fashion. For
example, the transition probability of the transition (P1, P2) to (P2, P6) is approximately 0.16 because the number of times users traverse from state (P1, P2) to state (P2, P6) is 1 and the number of times state (P1, P2) is visited is 6 (i.e., 0.16 ≈ 1/6). The transition probability is used for prediction.
[Figure 3.10: Second-order Markov model corresponding to Table 3.1 — states such as (P1, P2), (P5, P2), (P2, P3), (P2, P4), and (P2, P6), plus S and F, with each arc labeled by its frequency and transition probability.]
For example,
given that a user has visited P1 and P2, he or she most probably visits P4 because the transition prob-
ability from state (P1, P2) to state (P2, P4) is greater than that from state (P1, P2) to state (P2, P6).
The order of the Markov model is related to the sliding window. The Kth-order Markov model
corresponds to a sliding window of size K − 1.
Notice that there is another concept that is similar to the sliding window concept, which is the number of hops. Here, we use the number of hops and the sliding window interchangeably.
In WWW prediction, Markov models are built based on the concept of n-gram. The n-gram can
be represented as a tuple of the form 〈x1, x2, …, xn〉 to depict sequences of page clicks by a population
of users surfing a website. Each component of the n-gram takes a specific page ID value that reflects
the surfing path of a specific user surfing a webpage. For example, the n-gram 〈P10, P21, P4, P12〉 for
some user U states that the user U has visited the pages 10, 21, 4, and finally 12 in a sequence.
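The transition probabilities of the first-order model can be estimated directly from the session counts, as in the following sketch based on Table 3.1; the variable names are ours.

```python
from collections import defaultdict

# User sessions and their frequencies from Table 3.1.
sessions = [(["P1", "P2", "P4"], 5), (["P1", "P2", "P6"], 1),
            (["P5", "P2", "P6"], 6), (["P5", "P2", "P3"], 3)]

counts = defaultdict(lambda: defaultdict(int))
for pages, freq in sessions:
    path = ["S"] + pages + ["F"]                 # add initial and final states
    for a, b in zip(path, path[1:]):
        counts[a][b] += freq

# Transition probability = frequency of the transition / visits to the anchor state.
probabilities = {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
                 for a, nexts in counts.items()}

print(probabilities["P2"])   # e.g., P2 -> P3 has probability 3/15 = 0.2
```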
For example, if the current session is 〈A, B, C〉 and the user then visits page D, the new active session becomes 〈B, C, D〉 when using a sliding window of size 3. Notice that page A is dropped, and 〈B, C, D〉 will be used for prediction. The rationale is that most users go back and forth while surfing the web trying to find the desired information, so it may be most appropriate to use the recent portion of the user history to generate recommendations/predictions [MOBA01].
Mobasher et al. [MOBA01] proposed a recommendation engine that matches an active user ses-
sion with the frequent itemsets in the database and predicts the next page the user most probably vis-
its. The engine works as follows. Given an active session of size w, the engine finds all the frequent
itemsets of length w + 1, satisfying some minimum support minsup and containing the current active
session. Prediction for the active session A is based on the confidence (ψ) of the corresponding asso-
ciation rule. The confidence (ψ) of an association rule X→z is defined as ψ(X→z) = σ(X∪z)/σ(X),
where the length of z is 1. Page p is recommended/predicted for an active session A iff the rule A→p has the highest confidence among the candidate rules.
The engine uses an acyclic graph called the frequent itemset graph. The graph is an extension of the
lexicographic tree used in the tree projection algorithm of Agrawal et al. [AGRA01]. The graph is
organized in levels. The nodes in level l contain itemsets of size l. For example, the sizes of the nodes
(i.e., the size of the itemsets corresponding to these nodes) in levels 1 and 2 are 1 and 2, respectively.
The root of the graph, level 0, is an empty node corresponding to an empty itemset. A node X in
level l is linked to a node Y in level l + 1 if X⊂Y. To further explain the process, suppose we have
the following sample web transactions involving pages 1, 2, 3, 4, and 5 as given in Table 3.2. The
Apriori algorithm produces the itemsets as given in Table 3.3, using a minsup = 0.49. The frequent
itemset graph is shown in Figure 3.11.
Suppose we are using a sliding window of size 2, and the current active session A = 〈2,3〉.
To predict/recommend the next page, in the frequent itemset graph, we first start at level 2 and
extract all the itemsets at level 3 linked to A. From Figure 3.11, the node {2, 3} is linked to the nodes {1, 2, 3} and {2, 3, 5}. Using the support counts in Table 3.3, the corresponding confidences are ψ({2, 3}→1) = 5/5 = 1.0 and ψ({2, 3}→5) = 4/5 = 0.8.
TABLE 3.2
Sample Web Transaction
Transaction ID Items
T1 1, 2, 4, 5
T2 1, 2, 5, 3, 4
T3 1, 2, 5, 3
T4 2, 5, 2, 1, 3
T5 4, 1, 2, 5, 3
T6 1, 2, 3, 4
T7 4, 5
T8 4, 5, 3, 1
TABLE 3.3
Frequent Itemsets Generated by the Apriori Algorithm
Size 1 Size 2 Size 3 Size 4
{2}(6) {2, 3}(5) {2, 3, 1}(5) {2, 3, 1, 5}(4)
{3}(6) {2, 4}(4) {2, 3, 5}(4)
{4}(6) {2, 1}(6) {2, 4, 1}(4)
{1}(7) {2, 5}(5) {2, 1, 5}(5)
{5}(7) {3, 4}(4) {3, 4, 1}(4)
{3, 1}(6) {3, 1, 5}(5)
{3, 5}(5) {4, 1, 5}(4)
{4, 1}(5)
{4, 5}(5)
{1, 5}(6)
The recommended page is 1 because its confidence is larger. Notice that, in recommendation
engines, the order of the clickstream is not considered, that is, there is no distinction between the
sessions 〈1, 2, 4〉 and 〈1, 4, 2〉. This is a disadvantage of such systems because the order of pages
visited might bear important information about the navigation patterns of users.
[Figure 3.11: The frequent itemset graph — depth 0 is the empty root; depth 1 contains {1}, {2}, {3}, {4}, {5}; depth 2 contains the size-2 frequent itemsets (e.g., {1, 2}, {2, 3}, {4, 5}); depth 3 contains the size-3 frequent itemsets (e.g., {1, 2, 3}, {2, 3, 5}); depth 4 contains {2, 3, 1, 5}.]
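The confidence computation described above can be sketched as follows, using the support counts from Table 3.3 for the active session 〈2, 3〉; the function and variable names are illustrative.

```python
# Support counts of the relevant frequent itemsets from Table 3.3.
support = {
    frozenset({2, 3}): 5,
    frozenset({1, 2, 3}): 5,
    frozenset({2, 3, 5}): 4,
}

def confidence(antecedent, page):
    """psi(X -> z) = sigma(X union {z}) / sigma(X)."""
    return support[frozenset(antecedent) | {page}] / support[frozenset(antecedent)]

active_session = {2, 3}
candidates = {1: confidence(active_session, 1),   # 5/5 = 1.0
              5: confidence(active_session, 5)}   # 4/5 = 0.8
print(max(candidates, key=candidates.get))        # recommends page 1
```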
On the other hand, the size of the training set for each classifier is small because we exclude all
instances that do not belong to that pair of classes.
One-VS-all: The one-VS-all approach creates a classifier for each class in the dataset. The training set is preprocessed such that, for classifier j, instances that belong to class j are marked as class (+1) and instances that do not belong to class j are marked as class (−1). In the one-VS-all scheme, we compute n classifiers, where n is the number of pages that users have visited (at the end of each session). A new instance x is predicted by assigning it to the class whose classifier outputs the largest positive value (i.e., the maximal margin), as in Equation 3.15. We can compute the margin of point x as in Equation 3.14. Notice that the recommended/predicted page is determined by the sign of the margin value of that page (see Equation 3.10):
$$
f(x) = w \cdot x - b = \sum_{i} \alpha_i y_i (x_i \cdot x - b)
\tag{3.14}
$$
In Equation 3.15, M is the number of classes, x = 〈x1, x2, …, xn〉 is the user session, and fi is the
classifier that separates class i from the rest of the classes. The prediction decision in Equation
3.15 resolves to the classifier fc that is the most distant from the testing example x. This might be
explained as fc has the most separating power, among all other classifiers, of separating x from the
rest of the classes.
The advantage of this scheme (one-VS-all) compared to the one-VS-one scheme is that it has fewer
classifiers. On the other hand, the size of the training set is larger for a one-VS-all scheme than that
for a one-VS-one scheme because we use the whole original training set to compute each classifier.
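The one-VS-all scheme can be sketched as follows, using linear SVMs from scikit-learn as the per-class classifiers and assigning a test point to the class with the largest margin; the toy data and the use of scikit-learn are our illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y):
    """Train one binary classifier per class: class j vs. the rest."""
    classifiers = {}
    for label in np.unique(y):
        binary = np.where(y == label, 1, -1)
        classifiers[label] = LinearSVC(C=1.0).fit(X, binary)
    return classifiers

def predict(classifiers, x):
    """Assign x to the class whose classifier gives the largest margin."""
    margins = {label: clf.decision_function([x])[0]
               for label, clf in classifiers.items()}
    return max(margins, key=margins.get)

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
    y = np.array([1, 1, 2, 2, 3, 3])
    clfs = train_one_vs_all(X, y)
    print(predict(clfs, np.array([5.2, 5.4])))   # expected: class 2
```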
of a “red rose.” If we undertake the annotation of images with keywords, a typical way to publish an
image data repository is to create a keyword-based query interface addressed to an image database.
If all images came with a detailed and accurate description, image retrieval would be convenient
based on current powerful pure text search techniques. These search techniques would retrieve the
images if their descriptions/annotations contained some combination of the keywords specified
by the user. However, the major problem is that most images are not annotated. It is a laborious, error-prone, and subjective process to manually annotate a large collection of images. Many images contain the desired semantic information, even though they do not contain the user-specified keywords. Furthermore, keyword-based search is useful mainly to a user who knows what keywords are used to index the images and who can therefore easily formulate queries. This approach is problematic, however, when the user does not have a clear goal in mind, does not know what is in the database, and does not know what kinds of semantic concepts are involved in the domain.
Image mining is a more challenging research problem than retrieving relevant images in CBIR
systems. The goal of image mining is to find an image pattern that is significant for a given set of
images and helpful to understand the relationships between high-level semantic concepts/descrip-
tions and low-level visual features. Our focus is on aspects such as feature selection and image
classification. It should be noted that image mining and analytics are important for social media, as the members often post numerous images. These images could be used to embed messages that could penetrate computing systems. Images in social media could also be analyzed to extract various demographics such as location.
3.8.2 Feature Selection
Usually, data saved in databases with well-defined semantics, such as numbers or structured data entries, is structured data. In comparison, data with ill-defined semantics, such as images, audio, and video, is unstructured data. In the domain of image processing, images are represented by derived data or features such as color, texture, and shape. Many of these features are multivalued (e.g., color histograms and moment descriptors). When people generate these derived data or features, they generally generate as many features as possible, since they are not aware of which features are most relevant. Therefore, the dimensionality of derived image data is
usually very high. Actually, some of the selected features might be duplicated or may not even be
relevant to the problem. Including irrelevant or duplicated information is referred to as noise. Such
problems are referred to as the “curse of dimensionality.” Feature selection is the research topic
for finding an optimal subset of features. In this chapter, we will discuss this curse and feature
selection in detail.
We developed a wrapper-based simultaneous feature weighting and clustering algorithm. The clustering algorithm bundles similar image segments together and generates a finite set of visual symbols (i.e., blob-tokens). Based on histogram analysis and chi-square values, we assign the features of image segments different weights instead of removing some of them. Feature weight evaluation is wrapped in the clustering algorithm. In each iteration of the algorithm, the feature weights of image segments are re-evaluated based on the clustering result. The re-evaluated feature weights affect the clustering results in the next iteration.
3.8.3 Automatic Image Annotation
Unfortunately, the automatic image annotation problem has not been solved in general, and perhaps this problem is impossible to solve.
However, in certain subdomains, it is still possible to obtain some interesting results. Many
statistical models have been published for image annotation. Some of these models took feature
dimensionality into account and applied singular value decomposition (SVD) or principal compo-
nent analysis (PCA) to reduce dimension. But none of them considered feature selection or feature
weight. We proposed a new framework for image annotation based on a translation model (TM).
In our approach, we applied our weighted feature selection algorithm and embedded it in an image
annotation framework. Our weighted feature selection algorithm improves the quality of visual
tokens and generates better image annotations.
3.8.4 Image Classification
Image classification is an important area, especially in the medical domain, because it helps manage large medical image databases and has great potential as a diagnostic aid in real-world clinical settings. We describe our experiments for the ImageCLEF medical image retrieval task. The class sizes of the ImageCLEF medical image dataset are not balanced, which is a serious problem for all classification algorithms. To solve this problem, we resample the data by generating subwindows. The K-nearest neighbor (KNN) algorithm, distance-weighted KNN, fuzzy KNN, the nearest prototype classifier, and evidence theory-based KNN are implemented and studied. Results show that evidence-based KNN has the best performance based on classification accuracy.
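As a simple illustration of the KNN family of classifiers mentioned above (not the evidence theory-based variant), the sketch below uses scikit-learn's distance-weighted KNN on toy feature vectors; the data and parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors standing in for image descriptors; labels are class IDs.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 0])

# Distance-weighted KNN: closer neighbors get a larger vote, which can help
# when class sizes are unbalanced.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X, y)
print(knn.predict([[0.7, 0.7]]))
```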
3.9 SUMMARY
In this chapter, we first provided an overview of the various data mining tasks and techniques,
and then discussed some of the techniques that we have used in our work. These include neural
networks, SVM, and ARM. We have utilized a combination of these techniques together with some
other techniques in the literature as well as our own techniques to develop data analytics techniques
for very large databases. Some of these techniques are discussed in Parts II through V.
Numerous data mining techniques have been designed and developed, and many of them are
being utilized in commercial tools. Several of these techniques are variations of some of the basic
classification, clustering, and ARM techniques. One of the major challenges today is to determine
the appropriate technique for various applications. We still need more benchmarks and perfor-
mance studies. In addition, the techniques should result in fewer false positives and negatives. While
there is still much to be done, the progress over the last decade has been extremely promising. Our
challenge is to develop data mining techniques for big data systems.
REFERENCES
[AGRA93]. R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules between Sets of Items in Large
Database,” Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C.,
May, pp. 207–216, 1993.
[AGRA94]. R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Database,”
Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA,
pp. 487–499, 1994.
[AGRA01]. R. Agrawal, C. Aggarawal, V. Prasad, “A Tree Projection Algorithm for Generation of Frequent
Itemsets,” Journal of Parallel and Distributed Computing Archive 61 (3), 350–371, 2001.
[AWAD09]. M. Awad, L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools.
CRC Press, Boca Raton, FL, 2009.
[BART99]. P. Bartlett and J. Shawe-Taylor, “Generalization Performance of Support Vector Machines and
Other Pattern Classifiers,” Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C.
Burges, A. J. Smola (eds.), MIT Press, Cambridge, MA, pp. 43–53, 1999.
[CRIS00]. N. Cristianini and J. Shawe-Taylor, Introduction to Support Vector Machines, 1st ed.
Cambridge University Press, Cambridge, pp. 93–122, 2000.
[HOUT95]. M. Houtsma and A. Swami, “Set-Oriented Mining of Association Rules in Relational
Databases,” Proceedings of the 11th International Conference on Data Engineering, Washington, D.C.,
pp. 25–33, 1995.
[LI07]. C. Li, L. Khan, B. M. Thuraisingham, M. Husain, S. Chen, F. Qiu, “Geospatial Data Mining for
National Security: Land Cover Classification and Semantic Grouping,” In ISI’07: Proceedings of the
IEEE Conference on Intelligence and Security Informatics, May 23−24, New Brunswick, NJ, 2007.
[LIU99]. B. Liu, W. Hsu, Y. Ma, “Mining Association Rules with Multiple Minimum Supports,” Proceedings of
the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego,
CA, pp. 337–341, 1999.
[MITC97]. T. M. Mitchell, Machine Learning. McGraw Hill, New York, NY, Chapter 3, 1997.
[MOBA01]. B. Mobasher, H. Dai, T. Luo, M. Nakagawa, “Effective Personalization Based on Association Rule
Discovery from Web Usage Data,” In WIDM01: Proceedings of the ACM Workshop on Web Information
and Data Management, pp. 9–15, 2001.
[PIRO96]. P. Pirolli, J. Pitkow, R. Rao, “Silk from a Sow’s Ear: Extracting Usable Structures from the Web,”
In CHI-96: Proceedings of the 1996 Conference on Human Factors in Computing Systems, Vancouver,
British Columbia, Canada, pp. 118–125, 1996.
[THUR16]. B. Thuraisingham, S. Abrol, R. Heatherly, M. Kantarcioglu, V. Khadilkar, L. Khan, Analyzing and
Securing Social Networks. CRC Press, Boca Raton, FL, 2016.
[VAPN95]. V. N. Vapnik, The Nature of Statistical Learning Theory. Springer, New York, NY, 1995.
[VAPN98]. V. N. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[VAPN99]. V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, 1999.
[YANG01]. Q. Yang, H. Zhang, T. Li, “Mining Web Logs for Prediction Models in WWW Caching and
Prefetching,” The 7th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Aug. 26–29, pp. 473–478, 2001.
4 Data Mining for Security
Applications
4.1 OVERVIEW
Data mining has many applications in security including in national security (e.g., surveillance)
as well as in cyber security (e.g., virus detection). The threats to national security include attack-
ing buildings and destroying critical infrastructures such as power grids and telecommunication
systems [BOLZ05]. Data mining techniques are being investigated to find out who the suspicious
people are and who is capable of carrying out terrorist activities [THUR03]. Cyber security is
involved with protecting the computer and network systems against corruption due to Trojan horses
and viruses. Data mining is also being applied to provide solutions such as intrusion and malware
detection and auditing [MASU11]. In this chapter, we will focus mainly on data mining for cyber
security applications.
To understand the mechanisms to be applied to safeguard the nation and the computers and net-
works, we need to understand the types of threats. In [THUR03] we described real-time threats as well as non-real-time threats. A real-time threat is a threat that must be acted upon within a certain time to prevent some catastrophic situation. Note that a non-real-time threat could become a real-
time threat over time. For example, one could suspect that a group of terrorists will eventually per-
form some act of terrorism. However, when we set time bounds such as a threat that will likely occur
say before July 1, 2018, then it becomes a real-time threat and we have to take actions immediately.
If the time bounds are tighter such as “a threat will occur within 2 days,” then we cannot afford to
make any mistakes in our response.
There has been a lot of work on applying data mining for both national security and cyber secu-
rity and our previous books have focused on both aspects (e.g., [THUR03] and [MASU11]). Our
focus in this chapter will be mainly on applying data mining for cyber security. In Section 4.2, we
will discuss data mining for cyber security applications. In particular, we will discuss the threats
to computers and networks and describe the applications of data mining to detect such threats and
attacks. Some of the data mining tools for security applications developed at The University of
Texas at Dallas will be discussed in Section 4.3. We are reimplementing some of our tools to ana-
lyze massive amounts of data. That is, we are developing big data analytics tools for cyber security
applications and some of our current work will be discussed later in this book. This chapter is sum-
marized in Section 4.4. Figure 4.1 illustrates data mining applications in security.
43
44 Big Data Analytics with Applications in Insider Threat Detection
[Figure: Cyber security threats.]
Attacks on computers as well as networks, databases, and the Internet could be devastating to businesses. We hear almost daily about cyber attacks on businesses. It is estimated that cyber terrorism could cost businesses billions of dollars. For example, consider a banking information system. If terrorists attack such a system and deplete accounts of their funds, then the bank could lose millions and perhaps billions of dollars. By crippling the computer system, millions of hours of productivity could be lost, and that equates to money in the end. Even a simple power outage at work caused by some accident could result in several hours of productivity loss and, as a result, a major financial loss. Therefore, it is critical that our information systems be secure. We discuss various types of cyber terrorist attacks. One is spreading malware that can wipe away files and other important documents; another is intruding into computer networks.
Note that threats can occur from outside or from the inside of an organization. Outside attacks
are attacks on computers from someone outside the organization. We hear of hackers breaking into
computer systems and causing havoc within an organization. These hackers infect the computers
with malware that can not only cause great damage to the files stored in the systems but also spread
to other systems via the networks. But a more sinister problem is the insider threat problem. People
inside an organization who have studied the business practices develop schemes to cripple the orga-
nization’s information assets. These people could be regular employees or even those working at
computer centers and contractors. The problem is quite serious as someone may be masquerading
as someone else and causing all kinds of damage. Malicious processes in the system can also mas-
querade as benign processes and cause damage. Data mining techniques have been applied to detect
the various attacks. We discuss some of these attacks next. Part III will elaborate on applying data
mining for the insider threat problem.
[Figure: Attacks on critical infrastructures.]
[Figure: Data mining services for cyber security — intrusion detection, malware detection, insider threat detection, and the inference problem.]
[Figure: Data mining tools at UT Dallas.]
For malware detection, we extract n-gram features from both the assembly code and the binary code. We first train the
data mining tool using the SVM technique and then test the model. The classifier will determine
whether the code is malicious or not. For buffer overflow detection, we assume that malicious mes-
sages contain code while normal messages contain data. We train SVM and then test to see if the
message contains code or data.
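As a rough illustration of the n-gram feature extraction step, the sketch below counts byte n-grams in a binary and builds a feature vector over a fixed vocabulary; the sample bytes, n-gram length, and function names are illustrative assumptions, and the resulting vectors would then be fed to an SVM classifier as described above.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4):
    """Count the byte n-grams occurring in a binary (a common malware feature)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def feature_vector(data: bytes, vocabulary, n: int = 4):
    """Binary feature vector over a fixed n-gram vocabulary."""
    grams = byte_ngrams(data, n)
    return [1 if g in grams else 0 for g in vocabulary]

if __name__ == "__main__":
    sample = bytes.fromhex("4d5a9000030000000400")   # toy byte sequence
    vocab = sorted(byte_ngrams(sample))
    print(feature_vector(sample, vocab))
    # These vectors would then be used to train an SVM that labels an
    # executable as malicious or benign.
```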
We have also reimplemented some of our data mining tools to operate in a cloud. Essentially,
we have applied big data analytics techniques for malware detection and showed the significant
improvement we can get by using big data analytics versus data mining. This is the approach we
have taken for the insider threat detection problems discussed in this book. That is, we discuss
stream analytics techniques that we have developed and show how they can be implemented in the
cloud for detecting insider threats. We believe that due to the very large amounts of malware data
that are dynamic and heterogeneous in nature, we need big data mining tools to analyze such data
to detect for security violations.
REFERENCES
[AWAD09]. M. Awad, L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining
Tools, CRC Press, Boca Raton, FL, 2009.
[BOLZ05]. F. Bolz, K. Dudonis, D. Schulz, The Counterterrorism Handbook: Tactics, Procedures, and
Techniques, Third Edition Practical Aspects of Criminal & Forensic Investigations, CRC Press, Boca
Raton, FL, 2005.
[CHAN99]. P. Chan, W. Fan, A. Prodromidis, S. Stolfo, “Distributed Data Mining in Credit Card Fraud
Detection,” IEEE Intelligent Systems, 14, #6, 67–74, 1999.
[MASU11]. M. Masud, L. Khan, B. Thuraisingham, Data Mining Tools for Malware Detection, CRC Press,
Boca Raton, FL, 2011.
[THUR02]. B. Thuraisingham, Data Mining, National Security, Privacy and Civil Liberties, SIGKDD
Explorations, Vol. 4, #2, New York, NY, December 2002.
[THUR03]. B. Thuraisingham, Web Data Mining Technologies and Their Applications in Business Intelligence
and Counter-Terrorism, CRC Press, Boca Raton, FL, 2003.
[THUR04]. B. Thuraisingham, Data mining for security applications. Managing Threats to Web Databases
and Cyber Systems, Issues, Solutions and Challenges, V. Kumar, J. Srivastava, Al. Lazarevic, editors,
Kluwer, MA, 2004.
[THUR05]. B. Thuraisingham, Database and Applications Security, CRC Press, Boca Raton, FL, 2005.
5 Cloud Computing and
Semantic Web Technologies
5.1 INTRODUCTION
Chapters 2 through 4 have discussed concepts in data security and privacy, data mining, and data
mining for cyber security. These three supporting technologies are part of the foundational tech-
nologies for the concepts discussed in this book. For example, Section II describes stream data
analytics for large datasets. In particular, we discuss an innovative technique called “novel class
detection” where we integrate data mining with stream data management technologies. Section III
describes our approach to applying the techniques for stream mining discussed in Section II for
insider threat detection. We utilize the cloud platform for managing and analyzing large datasets.
We will see throughout this book that cloud computing is at the heart of managing large datasets.
In addition, for some of our experimental systems, to be discussed in Section IV, we have utilized
semantic web technologies. Therefore, in this chapter, we discuss two additional technologies that
we have used in several of the chapters in this book. They are cloud computing and semantic web
technologies.
Cloud computing has emerged as a powerful computing paradigm for service-oriented com-
puting. Many of the computing services are being outsourced to the cloud. Such cloud-based ser-
vices can be used to host the various cyber security applications such as insider threat detection
and identity management. Another concept that is being used for a variety of applications is the
notion of the semantic web. A semantic web is essentially a collection of technologies to produce
machine-understandable web pages. These technologies can also be used to represent any type of
data including schema for big data and malware data. We have based some of our analytics and
security investigations on data represented using semantic web technologies.
The organization of this chapter is as follows. Section 5.2 discusses cloud computing concepts.
Concepts in semantic web are discussed in Section 5.3. Semantic web and security concepts are
discussed in Section 5.4. Cloud computing frameworks based on semantic web are discussed in
Section 5.5. This chapter is concluded in Section 5.6. Figure 5.1 illustrates the concepts discussed
in this chapter.
[Figure 5.1: Cloud computing and semantic web technologies.]
Cloud computing has emerged to address the explosive growth of web-connected devices and handle massive amounts
of data. It is defined and characterized by massive scalability and new Internet-driven economics.
In this chapter, we will discuss some preliminaries in cloud computing and semantic web. We
will first introduce what is meant by cloud computing. While various definitions have been pro-
posed, we will adopt the definition provided by the National Institute of Standards and Technology
(NIST). This will be followed by a service-based paradigm for cloud computing. Next, we will
discuss the various key concepts including virtualization and data storage in the cloud. We will also
discuss some of the technologies such as Hadoop and MapReduce.
The organization of this chapter is as follows. Cloud computing preliminaries will be discussed
in Section 5.2.2. Virtualization will be discussed in Section 5.2.3. Cloud storage and data manage-
ment issues will be discussed in Section 5.2.4. Cloud computing tools will be discussed in Section
5.2.5. Figure 5.2 illustrates the components addressed in this section.
5.2.2 Preliminaries
As stated in [CLOUD], cloud computing delivers computing as a service, while in traditional com-
puting, it is provided in the form of a product. Therefore, users pay for the services based on a
pay-as-you-go model. The services provided by a cloud may include hardware services, systems
services, data services, and storage services. Users of the cloud need not know where the software
and data are located; that is, the software and data services provided by the cloud are transparent to
the user. NIST has defined cloud computing to be the following [NIST]:
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared
pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that
can be rapidly provisioned and released with minimal management effort or service provider interaction.
[Figure 5.2: Cloud computing components.]
The cloud model is composed of multiple deployment models and service models. These models
are described next.
5.2.3 Virtualization
Virtualization essentially means creating something virtual and not actual. It could be hardware,
software, memory, and data. The notion of virtualization has existed for decades with respect to
computing. Back in the 1960s, the concept of virtual memory was introduced. This virtual memory
gives the application program the illusion that it has contiguous working memory. Mapping is developed to map the virtual memory to the actual physical memory.
[Figure: Cloud deployment models.]
[Figure: Cloud service models.]
Hardware virtualization is a basic notion in cloud computing. This essentially creates virtual
machines hosted on a real computer with an OS. This means while the actual machine may be run-
ning a Windows OS, through virtualization it may provide a Linux machine to the users. The actual
machine is called the host machine while the virtual machine is called the guest machine. The term
virtual machine monitor, also known as the hypervisor, is the software that runs the virtual machine
on the host computer.
Other types of virtualization include OS level virtualization, storage virtualization, data virtu-
alization, and database virtualization. In OS level virtualization, multiple virtual environments are
created within a single OS. In storage virtualization, the logical storage is abstracted from the physi-
cal storage. In data virtualization, the data is abstracted from the underlying databases. In network
virtualization, a virtual network is created. Figure 5.5 illustrates the various types of virtualizations.
As we have stated earlier, at the heart of cloud computing is the notion of hypervisor or the
virtual machine monitor. Hardware virtualization techniques allow multiple OSs (called guests)
to run concurrently on a host computer. These multiple OSs share virtualized hardware resources.
Hypervisor is not a new term; it was first used in the mid 1960s in the IBM 360/65 machines.
There are different types of hypervisors; in one type the hypervisor runs on the host hardware and
manages the guest OSs. Both VMware and XEN which are popular virtual machines are based on
this model. In another model, the hypervisor runs within a conventional OS environment. Virtual
machines are also incorporated into embedded systems and mobile phones. Embedded hypervisors
have real-time processing capability. Some details of virtualization are provided in [VIRT].
[Figure 5.5: Types of virtualization — hardware virtualization, operating system virtualization, and database virtualization, among others.]
5.2.4 Cloud Storage and Data Management
[Figure 5.6: Cloud storage management — virtual storage mapped onto physical storage.]
In the cloud, data storage is provided by the hosting companies. The actual location of the data is transparent to the users. What is presented
to the users is virtualized storage; the storage managers will map the virtual storage with the actual
storage and manage the data resources for the customers. A single object (e.g., the entire video data-
base of a customer) may be stored in multiple locations. Each location may store objects for multiple
customers. Figure 5.6 illustrates cloud storage management.
Virtualizing cloud storage has many advantages. Users need not purchase expensive storage
devices. Data could be placed anywhere in the cloud. Maintenance such as backup and recovery are
provided by the cloud. The goal is for users to have rapid access to the cloud. However, due to the
fact that the owner of the data does not have complete control of his data, there are serious security
concerns with respect to storing data in the cloud.
A database that runs on the cloud is a cloud database manager. There are multiple ways to uti-
lize a cloud database manager. In the first model, for users to run databases on the cloud, a virtual
machine image must be purchased. The database is then run on the virtual machines. The second
model is the database as a service model; the service provider will maintain the databases. The
users will make use of the database services and pay for the service. An example is the Amazon
relational database service which is a Structured Query Language (SQL) database service and has
a MySQL interface [AMAZ]. A third model is the cloud provider which hosts a database on behalf
of the user. Users can either utilize the database service maintained by the cloud or they can run
their databases on the cloud. A cloud database must optimize its query, storage, and transaction
processing to take full advantage of the services provided by the cloud. Figure 5.7 illustrates cloud
data management.
[Figure 5.7: Cloud data management.]
5.2.5.2 MapReduce
A MapReduce job consists of three phases: (1) A “map” phase in which each slave node performs
some computation on the data blocks of the input that it has stored. The output of this phase is a
key–value pair based on the computation that is performed. (2) An intermediate “sort” phase in
which the output of the map phase is sorted based on keys. (3) A “reduce” phase in which a reducer
aggregates various values for a shared key and then further processes them before producing the
desired result.
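The three phases can be illustrated with a small pure-Python simulation of the classic word-count job; this is only a sketch of the programming model, not Hadoop's actual API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(block):
    """Map: emit a (key, value) pair for every word in the input block."""
    return [(word, 1) for word in block.split()]

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

blocks = ["big data analytics", "data stream analytics", "big data"]

# Map each block, then sort the intermediate pairs by key, then reduce.
intermediate = sorted(pair for block in blocks for pair in map_phase(block))
results = [reduce_phase(key, [v for _, v in group])
           for key, group in groupby(intermediate, key=itemgetter(0))]
print(results)   # [('analytics', 2), ('big', 2), ('data', 3), ('stream', 1)]
```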
5.2.5.3 CouchDB
Apache CouchDB is a distributed, document-oriented database which can be queried and indexed in
a MapReduce fashion [ANDE10]. Data is managed as a collection of JSON documents [CROC06].
Users can access the documents with a web browser via HTTP, as well as query, combine, and transform documents with JavaScript.
5.2.5.4 HBase
Apache HBase is a distributed, versioned, column-oriented store modeled after Google’s Bigtable,
written in Java. Organizations such as Mendeley, Facebook, and Adobe are using HBase [GEOR11].
5.2.5.5 MongoDB
It is an open source, schema-free, (JSON) document-oriented database written in C++ [CHOD10].
It is developed and supported by 10gen and is part of the NoSQL family of database systems.
MongoDB stores structured data as JSON-like documents with dynamic schemas (MongoDB calls
the format BSON), making the integration of data in certain types of applications easier and faster.
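A minimal sketch of working with MongoDB's schema-free documents through the pymongo driver is shown below; it assumes a MongoDB server running locally on the default port, and the database and collection names are made up.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017/")
db = client["analytics_demo"]

# Documents are schema-free JSON-like structures (stored internally as BSON).
db.users.insert_one({"name": "alice", "logins": 42,
                     "roles": ["analyst", "admin"]})
print(db.users.find_one({"name": "alice"}))
```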
5.2.5.6 Hive
Apache Hive is a data warehousing framework that provides the ability to manage, query, and
analyze large datasets stored in HDFS or HBase [THUS10]. Hive provides basic tools to perform
extract-transform-load (ETL) operations over data, project structure onto the extracted data, and query
the structured data using a SQL-like language called HiveQL. HiveQL performs query execution
using the MapReduce paradigm, while allowing advanced Hadoop programmers to plug in their
custom-built MapReduce programs to perform advanced analytics not supported by the language.
Some of the design goals of Hive include dynamic scale-out, user-defined analytics, fault-tolerance,
and loose coupling with input formats.
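The following sketch shows how a HiveQL query might be issued from Python through the third-party PyHive client. The HiveServer2 host and port and the table name are assumptions made for illustration and are not part of Hive itself.

from pyhive import hive    # third-party HiveServer2 client, assumed installed

conn = hive.connect(host="localhost", port=10000)   # assumed HiveServer2 endpoint
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce jobs over data in HDFS.
cursor.execute("""
    SELECT user_id, COUNT(*) AS logins
    FROM access_log            -- hypothetical table
    GROUP BY user_id
    ORDER BY logins DESC
    LIMIT 10
""")
for user_id, logins in cursor.fetchall():
    print(user_id, logins)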
[Figure: The semantic web layered stack — foundations, XML, RDF, OWL, SWRL, and trust.]
advantage of the technologies of the previous layer. The lowest layer is the protocol layer and this is
usually not included in the discussion of the semantic technologies. The next layer is the XML layer.
XML is a document representation language. While XML is sufficient to specify syntax, semantics such as "the creator of document D is John" are hard to specify in XML. Therefore, the W3C developed RDF, which uses XML syntax. The semantic web community then went further and came up with a specification of ontologies in languages such as the Web Ontology Language (OWL). Note that OWL addresses the inadequacies of RDF. In order to reason about various policies, the semantic web community has come up with web rule languages such as the Semantic Web Rule Language (SWRL).
Next, we will describe the various technologies that constitute the semantic web.
5.3.1 XML
XML is needed due to the limitations of the HyperText Markup Language (HTML) and the complexities of the Standard Generalized Markup Language (SGML). XML is an extensible markup language specified by the W3C and designed to make the interchange of structured documents over the Internet easier. An important aspect of XML used to be Document Type Definitions, which define the role of each element of text in a formal model. XML schemas have now become critical for specifying the structure of data. XML
schemas are also XML documents [BRAY97].
5.3.2 RDF
The RDF is a standard for describing resources on the semantic web. It provides a common frame-
work for expressing this information so it can be exchanged between applications without loss
of meaning. RDF is based on the idea of identifying things using web identifiers (called uniform
resource identifiers (URIs)), and describing resources in terms of simple properties and property
values [KLYN04].
The RDF terminology T is the union of three pairwise disjoint infinite sets of terms: the set U of
URI references, the set L of literals (itself partitioned into two sets, the set Lp of plain literals and
the set Lt of typed literals), and the set B of blanks. The set U ∪ L of names is called the vocabulary.
An RDF triple can be viewed as an arc from s to o, where p is used to label the arc; this is represented as s →p o. We also refer to the ordered triple (s, p, o) as the subject, predicate, and object of a triple.
RDF has a formal semantics which provide a dependable basis for reasoning about the meaning
of an RDF graph. This reasoning is usually called entailment. Entailment rules state which implicit
information can be inferred from explicit information. In general, it is not assumed that complete
information about any resource is available in an RDF query. A query language should be aware
of this and tolerate incomplete or contradicting information. The notion of class and operations on
classes are specified in RDF though the concept of RDF schema [ANTO08].
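As a small illustration of the triple model, the following Python sketch builds the statement "the creator of document D is John" with the rdflib library and serializes it in N-Triples form; the namespace and resource names are hypothetical.

from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")                 # hypothetical vocabulary
g = Graph()
g.add((EX.documentD, EX.creator, Literal("John")))     # (subject, predicate, object)
print(g.serialize(format="nt"))                        # one triple per line in N-Triples syntax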
5.3.3 SPARQL
SPARQL (SPARQL Protocol and RDF Query Language) [PRUD06] is a powerful query language. It is a key semantic web technology and was standardized by the RDF Data Access Working Group of the W3C. SPARQL syntax is similar to that of SQL, but it has the advantage of enabling queries to span multiple disparate data sources that consist of heterogeneous and semistructured data. SPARQL is based around graph pattern matching [PRUD06].
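The sketch below, again using rdflib with hypothetical resources, runs a simple graph-pattern query over a tiny in-memory RDF graph to show the SELECT/WHERE style of SPARQL.

from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.doc1, EX.creator, Literal("John")))
g.add((EX.doc2, EX.creator, Literal("Mary")))

# Find every resource whose ex:creator is "John" (graph pattern matching).
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?doc WHERE { ?doc ex:creator "John" . }
""")
for row in results:
    print(row.doc)    # http://example.org/doc1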
5.3.4 OWL
The OWL [MCGU04] is an ontology language that has more expressive power and reasoning capa-
bilities than RDF and RDF schema (RDF-S). It has additional vocabulary along with a formal
semantics. OWL has three increasingly expressive sublanguages: OWL Lite, OWL DL, and OWL
Full. These are designed for use by specific communities of implementers and users. The formal semantics of OWL is based on description logics (DL), which are decidable fragments of first-order logic.
5.3.5 Description Logics
DL is a family of knowledge representation (KR) formalisms that represent the knowledge of an
application domain [BAAD03]. It defines the concepts of the domain (i.e., its terminology) as sets
of objects called classes, and it uses these concepts to specify properties of objects and individu-
als occurring in the domain. A DL is characterized by a set of constructors that allow one to build
complex concepts and roles from atomic ones.
ALCQ: A DL language ALCQ consists of a countable set of individuals Ind, a countable set
of atomic concepts CS, a countable set of roles RS and the concepts built on CS and RS as follows:
C, D := A | ¬A | C ⊓ D | C ⊔ D | ∃R·C | ∀R·C | (≤ nR·C) | (≥ nR·C)
where A ∈ CS, R ∈ RS, C, and D are concepts and n is a natural number. Also, individuals are
denoted by a, b, c, … (e.g., lower case letters of the alphabet).
This language includes only concepts in negation normal form. The complement of a concept ¬C is inductively defined, as usual, by using the law of double negation, the de Morgan laws, and the dualities for quantifiers. Moreover, the constants ⊤ and ⊥ abbreviate A ⊔ ¬A and A ⊓ ¬A, respectively, for some A ∈ CS.
An interpretation I consists of a nonempty domain Δ^I and a mapping ·^I that assigns to every atomic concept A ∈ CS a set A^I ⊆ Δ^I, to every role R ∈ RS a binary relation R^I ⊆ Δ^I × Δ^I, and to every individual a ∈ Ind an element a^I ∈ Δ^I. The mapping is extended to complex concepts as follows:

¬A^I = Δ^I \ A^I
(C ⊔ D)^I = C^I ∪ D^I
(C ⊓ D)^I = C^I ∩ D^I
(∃R·C)^I = {x ∈ Δ^I | ∃y ((x, y) ∈ R^I ∧ y ∈ C^I)}
(∀R·C)^I = {x ∈ Δ^I | ∀y ((x, y) ∈ R^I ⇒ y ∈ C^I)}
(≤ nR·C)^I = {x ∈ Δ^I | #{y | (x, y) ∈ R^I ∧ y ∈ C^I} ≤ n}
(≥ nR·C)^I = {x ∈ Δ^I | #{y | (x, y) ∈ R^I ∧ y ∈ C^I} ≥ n}
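These set-theoretic definitions can be checked mechanically on a finite interpretation. The following Python sketch uses a hypothetical toy interpretation (not taken from the book) to compute the extensions of ∃R·C and ∀R·C.

# Toy interpretation: domain, extension of atomic concept C, extension of role R.
domain = {"a", "b", "c"}
C_ext = {"b", "c"}
R_ext = {("a", "b"), ("a", "c"), ("b", "a")}

def exists_R_C(dom, R, C):
    # (∃R·C)^I: elements with at least one R-successor in C.
    return {x for x in dom if any((x, y) in R and y in C for y in dom)}

def forall_R_C(dom, R, C):
    # (∀R·C)^I: elements all of whose R-successors are in C (vacuously true if none).
    return {x for x in dom if all((x, y) not in R or y in C for y in dom)}

print(exists_R_C(domain, R_ext, C_ext))   # {'a'}
print(forall_R_C(domain, R_ext, C_ext))   # {'a', 'c'}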
We can define the notion of a knowledge base and its models. An ALCQ knowledge base is the
union of the following.
1. A finite terminological set (TBox) of inclusion axioms that have the form ⊤ ⊑ C, where C is called the inclusion concept.
2. A finite assertional set (ABox) of assertions of the form a:C (concept assertion) or (a, b):R (role assertion), where R is called the assertional role and C is called the assertional concept.
An interpretation I satisfies
• An inclusion axiom ⊤ ⊑ C (written I ⊨ ⊤ ⊑ C) if C^I = Δ^I
• A concept assertion a:C (I ⊨ a:C) if a^I ∈ C^I
• A role assertion (a, b):R (I ⊨ (a, b):R) if (a^I, b^I) ∈ R^I
I is a model of the knowledge base if it satisfies every axiom and assertion in it.
Entailment can be decided in ExpTime. Moreover, the inconsistency problem is reducible to the entailment problem, and so deciding entailment is an ExpTime-complete problem as well.
5.3.6 Inferencing
The basic inference problem for DL is checking knowledge base consistency: a knowledge base K is consistent if it has a model. The additional inference problems include concept satisfiability, concept subsumption, and instance checking. All these reasoning problems can be reduced to KB consistency. For example, concept C is satisfiable with regard to the knowledge base K if K ∪ {C(a)} is consistent, where a is an individual not occurring in K.
5.3.7 SWRL
The SWRL extends the set of OWL axioms to include horn-like rules, and it extends the Horn-like
rules to be combined with an OWL knowledge base [HORR04].
The proposed rules are of the form of an implication between an antecedent (body) and a conse-
quent (head). The intended meaning can be read as: whenever the conditions specified in the ante-
cedent hold, the conditions specified in the consequent must also hold. Both the antecedent (body)
and consequent (head) consist of zero or more atoms. An empty antecedent is treated as trivially
true (i.e., satisfied by every interpretation), so the consequent must also be satisfied by every inter-
pretation. An empty consequent is treated as trivially false (i.e., not satisfied by any interpretation),
so the antecedent must not be satisfied by any interpretation.
Multiple atoms are treated as a conjunction, and both the head and body can contain conjunctions of such atoms. Note that rules with conjunctive consequents could easily be transformed (via Lloyd–Topor transformations) into multiple rules, each with an atomic consequent. Atoms in these rules can be of the form C(x), P(x, y), SameAs(x, y), or DifferentFrom(x, y), where C is an OWL description, P is an OWL property, and x, y are either variables, OWL individuals, or OWL data values. For example, the rule hasParent(?x, ?y) ∧ hasBrother(?y, ?z) → hasUncle(?x, ?z) asserts the uncle relationship whenever the parent and brother relationships hold.
else needs to be done so that the information on the web can be managed, integrated, and exchanged
securely. Logic, proof, and trust are at the highest layers of the semantic web. That is, how can we
trust the information that the web gives us? Next we will discuss the various security issues for
XML, RDF, ontologies, and rules.
5.4.1 XML Security
Various research efforts have been reported on XML security (see, e.g., [BERT02]). We briefly discuss
some of the key points. The main challenge is whether to give access to all the XML documents or
to parts of the documents. Bertino and Ferrari have developed authorization models for XML. They
have focused on access control policies as well as on dissemination policies. They also considered
push and pull architectures. They specified the policies in XML. The policy specification contains
information about which users can access which portions of the documents. In [BERT02], algorithms
for access control as well as computing views of the results are presented. In addition, architectures
for securing XML documents are also discussed. In [BERT04] and [BHAT04], the authors go further
and describe how XML documents may be published on the web. The idea is for owners to publish
documents, subjects request access to the documents, and untrusted publishers give the subjects the
views of the documents they are authorized to see. W3C is specifying standards for XML secu-
rity. The XML security project is focusing on providing the implementation of security standards
for XML. The focus is on XML-Signature Syntax and Processing, XML-Encryption Syntax and
Processing, and XML Key Management. While the standards are focusing on what can be imple-
mented in the near term, much research is needed on securing XML documents (see also [SHE09]).
5.4.2 RDF Security
RDF is the foundation of the semantic web. While XML is limited in providing machine under-
standable documents, RDF handles this limitation. As a result, RDF provides better support for
interoperability as well as searching and cataloging. It also describes contents of documents as well
as relationships between various entities in the document. While XML provides syntax and nota-
tions, RDF supplements this by providing semantic information in a standardized way [ANTO08].
The basic RDF model has three components: they are resources, properties, and statements.
Resource is anything described by RDF expressions. It could be a web page or a collection of pages.
Property is a specific attribute used to describe a resource. RDF statements are resources together with
a named property plus the value of the property. Statement components are subject, predicate, and
object. So, for example, if we have a sentence of the form "John is the creator of xxx," then xxx is the subject (resource), the property (predicate) is "creator," and the object (literal) is "John." There are RDF diagrams, very much like entity-relationship diagrams or object diagrams, to represent statements. It is important that the intended interpretation be used for RDF sentences. This is accomplished by RDF-S. The schema is a sort of dictionary and has interpretations of the various terms used in sentences.
More advanced concepts in RDF include the container model and statements about statements.
The container model has three types of container objects and they are bag, sequence, and alterna-
tive. A bag is an unordered list of resources or literals. It is used to mean that a property has multiple
values but the order is not important. A sequence is a list of ordered resources. Here the order is
important. Alternative is a list of resources that represent alternatives for the value of a property.
Various tutorials in RDF describe the syntax of containers in more detail. RDF also provides sup-
port for making statements about other statements. For example, with this facility, one can make
statements of the form “The statement A is false,” where A is the statement “John is the creator of
X.” Again, one can use object-like diagrams to represent containers and statements about state-
ments. RDF also has a formal model associated with it. This formal model has a formal grammar.
The query language to access RDF document is SPARQL. For further information on RDF, we refer
to the excellent discussion in the book by Antoniou and van Harmelen [ANTO08].
Now to make the semantic web secure, we need to ensure that RDF documents are secure. This
would involve securing XML from a syntactic point of view. However, with RDF, we also need to
ensure that security is preserved at the semantic level. The issues include the security implications of the
concepts resource, properties, and statements. That is, how is access control ensured? How can state-
ments and properties about statements be protected? How can one provide access control at a finer grain
of granularity? What are the security properties of the container model? How can bags, lists, and alter-
natives be protected? Can we specify security policies in RDF? How can we resolve semantic inconsis-
tencies for the policies? What are the security implications of statements about statements? How can we
protect RDF-S? These are difficult questions and we need to start research to provide answers. XML
security is just the beginning. Securing RDF is much more challenging (see also [CARM04]).
5.5.1 RDF Integration
We have developed an RDF-based policy engine for use in the cloud for various applications includ-
ing social media and information sharing applications. The reasons for using RDF as our data model
are as follows: (1) RDF allows us to achieve data interoperability between the seemingly disparate
sources of information that are cataloged by each agency/organization separately. (2) The use of RDF
allows participating agencies to create data-centric applications that make use of the integrated data
that is now available to them. (3) Since RDF does not require the use of an explicit schema for data
generation, it can be easily adapted to ever-changing user requirements. The policy engine’s flex-
ibility is based on its accepting high-level policies and executing them as rules/constraints over a
directed RDF graph representation of the provenance and its associated data. The strength of our pol-
icy engine is that it can handle any type of policy that could be represented using RDF technologies,
horn logic rules (e.g., SWRL), and OWL constraints. The power of these semantic web technologies
can be successfully harnessed in a cloud computing environment to provide the user with capability
to efficiently store and retrieve data for data-intensive applications. Storing RDF data in the cloud brings a number of new features, such as scalability, on-demand resources and services, the ability to pay for services and capacity as needed, location independence, and guaranteed quality of service for users in terms of hardware/CPU performance, bandwidth, and memory capacity. We have examined the following efforts in developing our framework for RDF integration.
In [SUN10], the authors adopted the idea of Hexastore and considered both RDF data model and
HBase capability. They stored RDF triples into six HBase tables (S_PO, P_SO, O_SP, PS_O, SO_P
and PO_S), which covered all combinations of RDF triple patterns. They indexed the triples with
HBase-provided index structure on row key. They also proposed a MapReduce strategy for SPARQL
basic graph pattern (BGP) processing, which is suitable for their storage schema. This strategy uses
multiple MapReduce jobs to process a typical BGP. In each job, it uses a greedy method to select
join key and eliminates multiple triple patterns. Their evaluation result indicated that their approach
worked well against large RDF datasets. In [HUSA09], the authors described a framework that uses
Hadoop to store and retrieve large numbers of RDF triples. They described a schema to store RDF data
in the HDFS. They also presented algorithms to answer SPARQL queries. This made use of Hadoop’s
MapReduce framework to actually answer the queries. In [HUAN11], the authors introduced a scal-
able RDF data management system. They introduced techniques for (1) leveraging state-of-the-art
single node RDF-store technology and (2) partitioning the data across nodes in a manner that helps
accelerate query processing through locality optimizations. In [PAPA12], the authors presented
H2RDF, which is a fully distributed RDF store that combines the MapReduce processing framework
with a NoSQL distributed data store. Their system features unique characteristics that enable efficient
processing of both simple and multijoin SPARQL queries on virtually unlimited number of triples.
These include join algorithms that execute joins according to query selectivity to reduce processing,
and include adaptive choice among centralized and distributed (MapReduce-based) join execution for
fast query responses. They claim that their system can efficiently answer both simple joins and com-
plex multivariate queries, as well as scale up to 3 billion triples using a small cluster consisting of nine
worker nodes. In [KHAD12b], the authors designed a Jena-HBase framework. Their HBase-backed
triple store can be used with the Jena framework. Jena-HBase provides end users with a scalable stor-
age and querying solution that supports all features from the RDF specification.
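To make the six-table layout of [SUN10] concrete, the following minimal Python sketch, an in-memory stand-in for HBase with hypothetical triples, indexes each triple under all six permutations so that any combination of bound and unbound positions in a SPARQL triple pattern can be answered from a single table.

from collections import defaultdict

TABLES = ("S_PO", "P_SO", "O_SP", "PS_O", "SO_P", "PO_S")
store = {t: defaultdict(set) for t in TABLES}

def index_triple(s, p, o):
    # Write the triple under every permutation of its components.
    store["S_PO"][(s,)].add((p, o))
    store["P_SO"][(p,)].add((s, o))
    store["O_SP"][(o,)].add((s, p))
    store["PS_O"][(p, s)].add((o,))
    store["SO_P"][(s, o)].add((p,))
    store["PO_S"][(p, o)].add((s,))

index_triple("ex:alice", "ex:emailed", "ex:bob")

# A pattern with bound subject and predicate (object unknown) is served by PS_O.
print(store["PS_O"][("ex:emailed", "ex:alice")])   # {('ex:bob',)}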
REFERENCES
[ABRA10]. J. Abraham, P. Brazier, A. Chebotko, J. Navarro, A. Piazza, “Distributed Storage and Querying
Techniques for a Semantic Web of Scientific Workflow Provenance,” In Proceedings Services Computing
(SCC), 2010 IEEE International Conference on Services Computing, Miami, FL, 2010.
[AKOU13]. S. Akoush, R. Sohan, A. Hopper, “HadoopProv: Towards Provenance as a First Class Citizen in MapReduce,” In Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance, Lombard, IL, 2013.
[AMAZ]. Amazon Relational Database Service, http://aws.amazon.com/rds/
[ANDE10]. J. C. Anderson, J. Lehnardt, N. Slater, CouchDB: The Definitive Guide, O’Reilly Media, Sebastopol, CA, 2010.
[ANTO08]. G. Antoniou and F. van Harmelen, A Semantic Web Primer, MIT Press, Cambridge, MA, 2008.
[BAAD03]. F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider (eds.), The Description Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press, Cambridge, UK, 2003.
[BERT02]. E. Bertino and E. Ferrari, “Secure and Selective Dissemination of XML Documents,” ACM
Transactions on Information and System Security (TISSEC), 5, (3), 290–331, 2002.
[BERT04]. E. Bertino, G. Guerrini, M. Mesiti, “A Matching Algorithm for Measuring the Structural Similarity between an XML Document and a DTD and its Applications,” Information Systems, 29 (1), 23–46, 2004.
[BHAT04]. R. Bhatti, E. Bertino, A. Ghafoor, J. Joshi, “XML-Based Specification for Web Services Document Security,” Computer, 37 (4), 41–49, 2004.
[BRAY97]. T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, F. Yergeau, “Extensible Markup Language (XML),” World Wide Web Journal, 2 (4), 1997.
[CARM04]. B. Carminati, E. Ferrari, B.M. Thuraisingham, “Using RDF for Policy Specification and
Enforcement,” DEXA Workshops, Zaragoza, Spain, 2004.
[CATT11]. R. Cattell, “Scalable SQL and NoSQL Data Stores,” ACM SIGMOD Record, 39 (4), 12–27, 2011.
[CHEB13]. A. Chebotko, J. Abraham, P. Brazier, A. Piazza, A. Kashlev, S. Lu, “Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase,” IEEE International Workshop on Scientific Workflows, Santa Clara, CA, 2013.
[CHOD10]. K. Chodorow and M. Dirolf, MongoDB: The Definitive Guide, O’Reilly Media, Sebastopol, CA, 2010.
6.1 INTRODUCTION
The discussions in Chapters 2 through 5 provide the background for applying data mining for
insider threat detection. Effective detection of insider threats requires monitoring mechanisms that are far more fine-grained than those used for external threat detection. These monitors must be efficiently and reliably deployable in the software environments in which actions endemic to malicious insider missions can be caught in a timely manner. Such environments typically include user-level applications, such as word processors, email clients, and web browsers, for which reliable monitoring of internal events by conventional means is difficult.
To monitor the activities of the insiders, tools are needed to capture the communications and
relationships between the insiders, store the captured relationships, query the stored relationships,
and ultimately analyze the relationships so that patterns can be extracted that would give the analyst
better insights into the potential threats. Over time, the number of communications and relation-
ships between the insiders could be in the billions. Using the tools developed under our project,
the billions of relationships between the insiders can be captured, stored, queried, and analyzed to
detect malicious insiders.
In this chapter, we will discuss how data mining technologies could be applied for insider threat
detection in the cloud. First, we will discuss how semantic web technologies may be used to rep-
resent the communication between insiders. Next, we will discuss our approach to insider threat
detection. Finally, we will provide an overview of our framework for insider threat detection that
also incorporated some other techniques.
The organization of this chapter is as follows. In Section 6.2, we provide an overview of insider
threat detection. In Section 6.3, we will discuss the challenges, related work, and our approach to
this problem. Our approach will be discussed in detail in Section 6.4. Our framework will be dis-
cussed in Section 6.5. This chapter is concluded in Section 6.6. Figure 6.1 illustrates the contents
of this chapter. It should be noted that while the discussion in this chapter provides our overall
approach to insider threat detection using data mining, more details of the big data analytics tools
we have designed and developed for insider threat detection will be the subject of Section III of this book. Therefore, this chapter is essentially a snapshot of the contents to be described in Section III.
[Figure 6.1: Contents of this chapter — insider threat detection, semantic web-based architecture, and a comprehensive framework.]
and competitive advantage are often carried out through abusing access rights, theft of materials,
and mishandling physical devices. Insiders do not always act alone and may not be aware they are
aiding a threat actor (i.e., the unintentional insider threat). It is vital that organizations understand
normal employee baseline behaviors and also ensure employees understand how they may be used
as a conduit for others to obtain information.”
The activities carried out by the malicious insiders could generate massive volumes of data
over time. Our challenge is to analyze this data and detect whether the activities are malicious or
not. One traditional approach to the insider threat detection problem is supervised learning which
builds data classification models from training data. Unfortunately, the training process for super-
vised learning methods tends to be time-consuming and expensive, and generally requires large
amounts of well-balanced training data to be effective. In our experiments, we observe that <3% of
the data in realistic datasets for this problem is associated with insider threats (the minority class);
over 97% of the data is associated with nonthreats (the majority class). Hence, traditional support
vector machines (SVM) ([CHAN11], [MANE02]) trained from such imbalanced data are likely to
perform poorly on test datasets.
After an extensive investigation of the various data mining techniques for insider threat detec-
tion, we believe that the best way to handle the insider threat problem is to conceptualize it as a
stream mining problem that applies to continuous data streams. Whether using a supervised or
unsupervised learning algorithm, the method chosen must be highly adaptive to correctly deal with
concept drifts under these conditions. In Section III of this book, we describe the various big data
analytics techniques for data streams that we have developed for insider threat detection. In this
chapter, we will discuss our preliminary investigation of modeling the activities of the insider as
a collection of graphs and discuss our approach to mining the graphs to extract the patterns of the
insiders.
1. Storing these large graphs in an expressive and unified manner in secondary storage.
2. Devising scalable solutions for querying the large graphs to find relevant data.
3. Identifying relevant features for the complex graphs and subsequently detecting insider
threats in a dynamic environment that changes over time.
The motivation behind our approach is to address these challenges. We have developed solu-
tions based on cloud computing to (i) characterize graphs containing up to billions of nodes and
edges between nodes representing activities (e.g., credit card transactions), e-mail, or text mes-
sages. Since the graphs will be massive, we have developed technologies for efficient and persis-
tent storage. (ii) In order to facilitate novel anomaly detection, we require an efficient interface to
fetch relevant data in a timely manner from this persistent storage. Therefore, we have developed
efficient query techniques on the stored graphs. (iii) The fetched relevant data can then be used for
further analysis to detect anomalies. In order to do this, first we have to identify relevant features
from the complex graphs and subsequently develop techniques for mining large graphs to extract
the nuggets.
Insider threat detection is a difficult problem to solve. The problem becomes increasingly
complex with more data originating from heterogeneous sources and sensors. Recently, some efforts have focused on anomaly-based insider threat detection from graphs [EBER09]. This method is based on the minimum description length principle. The solution proposed by [EBER09] has some limitations. First, scalability is an issue with their approach; they do not discuss how it would handle large graphs. Second, the heterogeneity issue has not been addressed. Finally, it is unclear how their algorithm will deal with a dynamic environment that changes over time.
There are also several graph mining techniques that have been developed especially for social
network analysis ([COOK06], [TONG09], [CARM09], [THUR09]). The scalability of these tech-
niques is still an issue. There is some work from the mathematics research community to apply
linear programming techniques for graph analysis [BERR07]. Whether these techniques will work
in a real-world setting is not clear.
For a solution to be viable, it must be highly scalable and support multiple heterogeneous data
sources. Current state-of-the-art solutions do not scale well or preserve accuracy. By leveraging
Hadoop technology, our solution will be highly scalable. Furthermore, by utilizing the flexible
semantic web RDF data model, we are able to easily integrate and align heterogeneous data. Thus,
our approach will create a scalable solution in a dynamic environment. No existing threat detection
tools offer this level of scalability and interoperability. We have combined these technologies with
novel data mining techniques to create a complete insider threat detection solution.
We have exploited the cloud computing framework based on Hadoop/MapReduce technologies.
The insiders and their relationships are represented by nodes and links in the form of graphs. In
particular, in our approach, the billions of nodes and links are represented as resource description framework (RDF) graphs. By exploiting the RDF representation, we have addressed heterogeneity. We
have developed mechanisms to efficiently store the RDF graphs, query the graphs using SPARQL
technologies, and mine the graphs to extract patterns within the cloud computing framework.
We are assuming that the large graphs already exist. To facilitate persistent storage and
efficient retrieval of this data, we use a distributed framework based on the cloud computing
framework Hadoop [HADO]. By leveraging the Hadoop technology, our framework is readily
fault-tolerant and scalable. To support large amounts of data, we can simply add more nodes
to the Hadoop cluster. All the nodes of a cluster are commodity class machines; there is no
need to buy expensive server machines. To handle large complex graphs, we exploit the Hadoop
Distributed File System (HDFS) and MapReduce framework. The former is the storage layer
which stores data in multiple nodes with replication. The latter is the execution layer where
MapReduce jobs can be run. We use HDFS to store RDF data and the MapReduce framework
to answer queries.
With regard to feature selection, we need to use a class label for supervised data. Here, for the
message we may not have a class label; however, we know the source/sender and the destination/
recipient of a message. Now, we would like to use this knowledge to construct an artificial label. The
sender and destination pair will form a unique class label and all messages sent from this sender to
the recipient will serve as data points. Hence, our goal is to find appropriate features that will have
discriminating power across all these class labels based on these messages. There are several meth-
ods for feature selection that are widely used in the area of machine learning, such as information
gain (IG) ([MITC97], [MASU10a], [MASU10b]), the Gini index, chi-square statistics, subspace clustering [AHME09], and so on. Here, we present IG, which is very popular; for the text domain, we can also use subspace clustering for feature selection.
IG can be defined as a measure of the effectiveness of a feature in classifying the training data
[MITC97]. If we split the training data on these attribute values, then IG provides the measurement
of the expected reduction in entropy after the split. The more an attribute can reduce entropy in the
training data, the better the attribute in classifying the data. IG of an attribute A on a collection of
examples S is given by
Gain(S, A) ≡ Entropy(S) − ∑_{v ∈ Values(A)} (|Sv|/|S|) Entropy(Sv)    (6.1)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which
attribute A has value v. Entropy of S is computed using the following equation:
Entropy(S) = ∑_{i=1}^{c} −p_i log₂ p_i

where p_i is the proportion of S belonging to class i and c is the number of classes. For feature selection using subspace clustering [AHME09], the clusters and per-cluster dimension weights are obtained by minimizing the following objective function:

F(W, Z, Λ) = ∑_{l=1}^{k} ∑_{j=1}^{n} ∑_{i=1}^{m} w_lj^f λ_li^q D_lij (1 + Imp_l) + γ ∑_{l=1}^{k} ∑_{i=1}^{m} λ_li^q χ²_li

where

D_lij = (z_li − x_ji)²
Subject to

∑_{l=1}^{k} w_lj = 1,  1 ≤ j ≤ n,  0 ≤ w_lj ≤ 1

∑_{i=1}^{m} λ_li = 1,  1 ≤ l ≤ k,  0 ≤ λ_li ≤ 1
In this objective function, W, Z, and Λ represent the cluster membership, cluster centroid, and
dimension weight matrices, respectively. Also, the parameter f controls the fuzziness of the mem-
bership of each data point, q further modifies the weight of each dimension of each cluster (λ_li), and finally, γ controls the strength of the incentive given to the chi-square component and dimension weights. It is also assumed that there are n documents in the training dataset, m features for each of the data points, and k subspace clusters are generated during the clustering process. Imp_l indicates the cluster impurity, whereas χ²_li indicates the chi-square statistic. Details about these notations and how the clustering is done can be found in our prior work, funded by NASA [AHME09]. It should
be noted that feature selection using subspace clustering can be considered as an unsupervised
approach toward feature selection as no label information is required during an unsupervised clus-
tering process.
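A minimal Python sketch of Equation 6.1 is given below. The messages, the single binary feature, and the sender–recipient class labels are hypothetical; the code only illustrates how IG scores a candidate feature.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i log2 p_i.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Gain(S, A) per Equation 6.1: entropy reduction from splitting S on attribute A.
    total = len(labels)
    gain = entropy(labels)
    for v in set(e[attribute] for e in examples):
        subset = [lab for e, lab in zip(examples, labels) if e[attribute] == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

examples = [{"has_attachment": 1}, {"has_attachment": 1},
            {"has_attachment": 0}, {"has_attachment": 0}]
labels = ["alice->bob", "alice->bob", "carol->dave", "carol->dave"]
print(information_gain(examples, labels, "has_attachment"))   # 1.0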
Once we select features, a message between two nodes will be represented as a vector using these
features. Each vector’s individual value can be binary or weighted. Hence, this will be a compact
representation of the original message and it can be loaded into main memory along with graph
structure. In addition, the location or URL of the original message will be kept in the main memory
data structure. If needed, we fetch the original message. Over time, the feature vector may change due to the dynamic nature of the content [MASU10a], and hence the feature set may evolve. Based on our prior work for evolving streams with dynamic feature sets [MASU10b], we investigate alternative options.
[Figure: Preprocessor components — an N-Triple convertor for data in RDF/XML, PS and POS files, and summary statistics.]
Our MapReduce framework has three subcomponents in it. It takes the SPARQL query from the
user and passes it to the input selector and plan generator. This component will select the input files
and decide how many MapReduce jobs are needed and pass the information to the Join Executer
component which runs the jobs using MapReduce framework. It will then relay the query answer
from Hadoop to the user.
6.4.4 Data Storage
We store the data in the N-Triples format because in this format we have a complete RDF triple
(subject, predicate and object) in one line of a file, which is very convenient to use with MapReduce
jobs. The data is dictionary encoded for increased efficiency. Dictionary encoding means replacing each text string with a unique binary number. This not only reduces the disk space required for storage but also speeds up query answering, because handling primitive data types is much faster than string matching. The processing steps to get the data into our intended format are described below.
the files with predicates, for example, all the triples containing a predicate p1:pred go into a file
named p1-pred. However, in case we have a variable predicate in a triple pattern and if we cannot
determine the type of the object, we have to consider all files. If we can determine the type of the
object, then we consider all files having that type of object.
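The following Python sketch illustrates the two ideas just described, dictionary encoding and predicate-based file splitting, on a single toy N-Triples line. It is a simplification for illustration, not the production preprocessor, and the example URIs are hypothetical.

import os
from collections import defaultdict

dictionary = {}

def encode(term):
    # Replace each distinct string with a unique integer identifier.
    return dictionary.setdefault(term, len(dictionary))

def split_by_predicate(ntriples_lines, out_dir="encoded"):
    os.makedirs(out_dir, exist_ok=True)
    files = defaultdict(list)
    for line in ntriples_lines:
        s, p, o = line.rstrip(" .\n").split(" ", 2)
        # e.g., triples with predicate <.../p1#pred> go into a file named "p1-pred".
        fname = p.strip("<>").split("/")[-1].replace("#", "-")
        files[fname].append((encode(s), encode(p), encode(o)))
    for fname, triples in files.items():
        with open(os.path.join(out_dir, fname), "w") as f:
            f.writelines(f"{s} {p} {o}\n" for s, p, o in triples)

split_by_predicate(['<http://ex.org/alice> <http://ex.org/p1#emailed> <http://ex.org/bob> .'])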
• Tools that will analyze and model benign and anomalous mission.
• Techniques to identify right dimensions and activities and apply pruning to discard irrel-
evant dimensions.
• Techniques to cope with changes and novel class/anomaly detection.
In a typical data stream classification task, it is assumed that the total number of classes is
fixed. This assumption may not be valid in insider threat detection cases, where new classes may
evolve. Traditional data stream classification techniques are not capable of recognizing novel class
instances until the appearance of the novel class is manually identified, and labeled instances of
that class are presented to the learning algorithm for training. The problem becomes more chal-
lenging in the presence of concept drift, when the underlying data distribution changes over time.
We have proposed a novel and efficient technique that can automatically detect the emergence of a
novel class (i.e., brand new anomaly) by quantifying cohesion among unlabeled test instances, and
separating the test instances from training instances. Our goal is to use the available data and build
this model.
One interesting aspect of this model is that it should capture the dynamic nature of dimensions of
the mission, as well as filter out the noisy behaviors. The dimensions (both benign and anomalous)
have dynamic nature because they tend to change over time, which we denote as concept drift. A
major challenge of the novel class detection is to differentiate the novel class from concept drift and
noisy data. We are exploring this challenge in our current work.
[Figure: Framework modules — an in-line reference monitoring tool to generate features and a machine learning tool for feature analysis.]
the heart of our framework is the module that implements in-line reference monitor (IRM)-based techniques for feature collection. This feature collection process will be aided by two modules: one uses a game-theoretic approach and the other uses a natural language-based approach to determine which features could be collected. The fourth module implements machine learning techniques to analyze the collected features. In summary, the relationship between the four approaches can be characterized as follows:
Details of our framework are provided in [HAML11]. We assume that the IRM tool, game
theoretic tool, and honey token generation tool will select and refine the features we need. Our
data mining tools will analyze the features and determine whether there is a potential for insider
threat.
We have started implementing parts of the framework. In particular, we have developed a number
of data and stream mining techniques for insider threat detection, some of which will be discussed
in Section III. Evidence of malicious insider activity is often buried within large data streams such
as system logs accumulated over months or years. Ensemble-based stream mining leverages mul-
tiple classification models to achieve highly accurate anomaly detection in such streams even when
the stream is unbounded, evolving, and unlabeled. This makes the approach effective for identify-
ing insider threats who attempt to conceal their activities by varying their behaviors over time. Our
approach applies ensemble-based stream mining, unsupervised learning, supervised learning, and
graph-based anomaly detection to the problem of insider threat detection, demonstrating that the
ensemble-based approach is significantly more effective than traditional single-model methods. We
further investigate suitability of various learning strategies for evolving insider threat data. We also
developed unsupervised machine learning algorithms for insider threat detection. Our implementa-
tion is being hosted on the cloud. More information can also be found in [PALL12]. For more infor-
mation on ensemble-based stream mining applications, we also refer to the chapters in Sections II and III of this book. Details of our algorithms are presented in [MASU10a].
REFERENCES
[AHME09]. M.S. Ahmed and L. Khan, “SISC: A Text Classification Approach Using Semi supervised
Subspace Clustering,” DDDM ’09: The 3rd International Workshop on Domain Driven Data Mining in
Conjunction with ICDM 2009, Dec. 6, Miami, FL, pp. 1–6, 2009.
[BERR07]. M.W. Berry, M. Browne, A. Langville, V.P. Pauca, and R.J. Plemmons, “Algorithms and Applications
for Approximate Nonnegative Matrix Factorization,” Computational Statistics and Data Analysis, 52 (1),
155–173, 2007.
[BRAC04]. R.C. Brackney and R.H. Anderson, editors, Understanding the Insider Threat. RAND Corporation,
Arlington, VA, 2004.
[CARM09]. B. Carminati, E. Ferrari, R. Heatherly, M. Kantarcioglu, B. Thuraisingham, “A Semantic Web
Based Framework for Social Network Access Control” SACMAT 2009, Stresa, Italy, pp. 177–186, 2009.
[CHAN11]. C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[COOK06]. D. Cook and L. Holder, Mining Graph Data, Wiley Interscience, NY, 2006.
[DHS14]. Department of Homeland Security, Combating Insider Threat, National Cyber Security and Communications Integration Center, Department of Homeland Security, May 2014.
[EBER09]. W. Eberle and L. Holder, “Applying Graph-Based Anomaly Detection Approaches to the Discovery
of Insider Threats,” IEEE International Conference on Intelligence and Security Informatics (ISI), Dallas,
TX, June, pp. 206–208, 2009.
[HADO]. Apache Hadoop http://hadoop.apache.org/
[HAML11]. K. Hamlen, L. Khan, M. Kantarcioglu, V. Ng, and B. Thuraisingham. Insider Threat Detection,
UTD Technical Report, April 2011.
[HAMP99]. M.P. Hampton and M. Levi, “Fast Spinning into Oblivion? Recent Developments in Money-
Laundering Policies and Offshore Finance Centres,” Third World Quarterly, 20 (3), 645–656, 1999.
[MANE02]. L.M. Manevitz and M. Yousef. “One-Class SVMs for Document Classification,” The Journal of
Machine Learning Research, 2, 139–154, 2002.
[MASU10a]. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “Classification and Novel Class Detection
in Concept-Drifting Data Streams under Time Constraints,” IEEE Transactions on Knowledge and Data
Engineering (TKDE), April 2010, IEEE Computer Society, 2010.
[MASU10b]. M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, B. Thuraisingham, “Classification and Novel
Class Detection of Data Streams in a Dynamic Feature Space,” Proceedings of European Conference on
Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2010, Barcelona, Spain, Sept.
20–24, 2010, Springer 2010, ISBN 978-3-642-15882-7, pp. 337–352, 2010.
[MATZ04]. S. Matzner and T. Hetherington, “Detecting Early Indications of a Malicious Insider,” IA Newsletter,
7 (2), 42–45, 2004.
[MITC97]. T. Mitchell, Machine Learning. McGraw Hill, New York, NY, 1997.
[PALL12]. P. Parveen, N. McDaniel, V.S. Hariharan, B.M. Thuraisingham, L. Khan. “Unsupervised Ensemble
Based Learning for Insider Threat Detection,” SocialCom/PASSAT 2012, pp. 718–727, 2012.
[SALE11]. M.B. Salem and S.J. Stolfo. “Modeling User Search Behavior for Masquerade Detection,”
Proceedings of Recent Advances in Intrusion Detection (RAID), Menlo Park, CA, pp. 101–200, 2011.
[THUR09]. B. Thuraisingham, M. Kantarcioglu, and L. Khan, “Building a Geosocial Semantic Web for
Military Stabilization and Reconstruction Operations,” PAISI 2009, 1, 2009.
[TONG09]. H. Tong, Fast Algorithms for Querying and Mining Large Graphs, CMU Report, ML-09-112,
September 2009.
7 Big Data Management and
Analytics Technologies
7.1 INTRODUCTION
Over the past 10 years or so, numerous big data management and analytics systems have emerged. Various cloud service providers have implemented big data solutions, and infrastructures/platforms for big data systems have also been developed. Notable among the big data systems are MongoDB, Google’s BigQuery, and Apache Hive. Big data solutions are being developed by cloud providers including Amazon, IBM, Google, and Microsoft. In addition, infrastructures/platforms based on products such as Apache’s Hadoop, Spark, and Storm have been developed.
Selecting the products to discuss is a difficult task, because almost every database vendor, cloud computing vendor, and analytics tool vendor is now marketing products as big data management and analytics (BDMA) solutions. When we combine the products offered by all vendors and include the open-source products, there are hundreds of products to discuss. Therefore, we have selected the products that we are most familiar with, either by discussing them in the courses we teach or by using them in our experimentation. In other words,
we have only selected the service providers, products, and frameworks that we are most familiar
with and those that we have examined in our work. Describing all of the service providers, products,
and frameworks is beyond the scope of this book. Furthermore, we are not endorsing any product
in this book.
The organization of this chapter is as follows. In Section 7.2, we will describe the various infra-
structure products that host big data systems. Examples of big data systems are discussed in Section
7.3. Section 7.4 discusses the big data solutions provided by some cloud service providers. This
chapter is summarized in Section 7.5. Figure 7.1 illustrates the concepts discussed in this chapter.
FIGURE 7.1 Big data management and analytics systems and tools.
The framework takes care of scheduling tasks, monitoring them, and re-executes the failed tasks.”
More details of the MapReduce model are given in [MAPR].
Apache Spark: Apache Spark is an open-source distributed computing framework for processing
massive amounts of data. The application programmers use Spark through an interface that consists
of a data structure called the resilient distributed dataset (RDD). Spark was developed to overcome
the limitations in the MapReduce programming model. The RDD data structure of Spark provides
the support for distributed shared memory. Due to the in-memory processing capabilities, Spark
offers good performance. Spark has interfaces with various NoSQL-based big data systems such as
Cassandra and Amazon’s cloud platform. Spark supports SQL capabilities with Spark SQL. More
details on Spark can be found in [SPAR].
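A minimal PySpark sketch of the RDD interface is shown below; it assumes a local Spark installation and uses a toy in-memory dataset.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")    # run locally on all cores

# An RDD of (user, count) pairs, aggregated in memory with reduceByKey.
events = sc.parallelize([("alice", 1), ("bob", 1), ("alice", 1)])
counts = events.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g., [('alice', 2), ('bob', 1)] (order may vary)
sc.stop()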
Apache Pig: Apache Pig is a scripting platform for analyzing and processing large datasets.
Apache Pig enables Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig converts Pig Latin scripts into MapReduce jobs. The MapReduce
jobs are then executed by Hadoop for the data stored in HDFS. Pig Latin programming is similar to
specifying a query execution plan. That is, the Pig Latin scripts can be regarded to be an execution
plan. This makes it simpler for the programmers to carry out their tasks. More details on Pig can
be found in [PIG].
Apache Storm: Apache Storm is an open-source distributed real-time computation system for
processing massive amounts of data. Storm is essentially a real-time framework for processing
streaming data and real-time analytics. It can be integrated with the HDFS. It provides features
like scalability, reliability, and fault tolerance. The latest version of Storm supports streaming SQL,
predictive modeling, and integration with systems such as Kafka. In summary, Storm is for real-
time processing and Hadoop is for batch processing. More details on Storm can be found in [STOR].
Apache Flink: Flink is an open-source scalable stream processing framework. As stated in
[FLIN], Flink consists of the following features: (i) provides results that are accurate, even in the
case of out-of-order or late-arriving data, (ii) is stateful and fault tolerant and can seamlessly recover
from failures while maintaining exactly once application state, and (iii) performs at large scale,
running on thousands of nodes with very good throughput and latency characteristics. Flink is
essentially a distributed data flow engine implemented in Scala and Java. It executes programs both
in parallel and pipelined modes. It supports Java, Python, and SQL programming environments.
While it does not have its own data storage, it integrates with systems such as HDFS, Kafka, and
Cassandra.
Apache Kafka: Kafka was initially developed by LinkedIn and then further developed as an
open-source Apache project. It is also implemented in Scala and Java and is a distributed stream
processing system. It is highly scalable and handles massive amounts of streaming data. Its storage
layer is based on a pub/sub messaging queue architecture. The design is essentially based on distrib-
uted transaction logs. Transaction logs are used in database systems to recover from the failure of
the transactions. More details on Kafka can be found in [KAFK].
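The sketch below uses the third-party kafka-python client to publish and then read messages; the broker address and the topic name are assumptions made for illustration.

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")    # assumed broker
producer.send("audit-events", b'{"user": "alice", "action": "login"}')
producer.flush()

consumer = KafkaConsumer("audit-events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:      # streams records from the topic's log
    print(record.value)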
7.3.2 Google BigQuery
BigQuery is essentially a data warehouse that manages petabyte scale data. It runs on Google’s
infrastructure and can process SQL queries or carry out analytics extremely fast. For example,
terabyte data can be accessed in seconds, while petabyte data can be accessed in minutes. The
BigQuery data is stored in different types of tables: native tables store the BigQuery data, Views
store the virtual tables, and External tables store the external data. BigQuery can be accessed in many ways, such as command-line tools, a RESTful interface or a web user interface, and client libraries (e.g., Java, .NET, and Python). More details on BigQuery can be found in [BIGQ].
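As an illustration of the SQL interface, the following Python sketch uses the google-cloud-bigquery client library to run a query against one of Google's public sample datasets. It assumes that Google Cloud credentials and a billing project have already been configured.

from google.cloud import bigquery

client = bigquery.Client()    # picks up the configured project and credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():    # waits for the job and streams rows back
    print(row.name, row.total)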
7.3.3 NoSQL Database
NoSQL database is a generic term for an essentially nonrelational database designed for scalability on the web. It is known as a nonrelational, high-performance database. The data models for NoSQL databases may include graphs, document structures, and key–value pairs. It can be argued that the databases that were developed in the 1960s, such as IBM’s Information Management System (IMS) and those based on the network data model, are NoSQL databases. The object-oriented data models developed in the 1990s also led the way to the NoSQL databases of the 2000s. What differentiates NoSQL databases from the older hierarchical, network, and object databases is that NoSQL databases have been designed with the web in mind. That is, the goal is to access massive amounts of data on the web rapidly.
The most popular NoSQL database model is the key–value pair. Relational databases consist of a collection of relations, where each relation has a collection of attributes; these attributes are labeled and included in the schema. NoSQL databases have tables with two columns: key and value. The key could be anything, such as a person’s name or the index of a stock. The value could be a collection of attributes, such as the name of the stock, the value of the stock, and other information such as whether to buy the stock and, if so, the recommended quantity. Therefore, all the information pertaining to a stock can be retrieved without having to perform many joins. Some of
the popular NoSQL databases will be discussed in this section (e.g., MongoDB and HBase). For a
detailed discussion of NoSQL databases, we refer the reader to [NOSQ].
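The stock example above can be pictured as a single key–value record. The tiny Python sketch below, with hypothetical data, shows how one lookup by key returns the whole bundle of attributes without any joins.

# Key = ticker symbol; value = all attributes bundled together.
store = {
    "ACME": {
        "name": "Acme Corp.",
        "price": 41.25,
        "recommendation": {"action": "buy", "quantity": 100},
    }
}

record = store["ACME"]             # one lookup retrieves everything
print(record["recommendation"])    # {'action': 'buy', 'quantity': 100}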
7.3.4 Google BigTable
BigTable is one of the early NoSQL databases running on top of the Google file system (GFS). It is
now provided as a service in the cloud. BigTable maps the row key and the column together with a
time stamp into a byte array. That is, it is essentially a NoSQL database that is based on the key value
pair model. It was designed to handle petabyte-sized data. It uses compression algorithms when the
data gets too large. Each table in BigTable has many dimensions and may be divided into what is
called tablets to work with GFS. BigTable is used by many applications including Google’s YouTube,
Google Maps, Google Earth, and Gmail. More details on BigTable can be found in [BIGT].
7.3.6 MongoDB
MongoDB is a NoSQL database. It is a cross-platform open-source distributed database. It has
been used to store and manage documents. That is, it is mainly a document-oriented database. The
documents are stored in a JSON-like format. It supports both field and range queries and regular
expression-based searches. It supports data replication and load balancing which occurs through
horizontal scaling. The batch processing of data as well as aggregation operations can be carried
out through MapReduce. More details of MongoDB can be found at [MONG].
recently the NoSQL database. The NoSQL database is based on the key–value pair model. Each row has a unique key and a value that is of arbitrary length and interpreted by the application. Oracle NoSQL Database is a shared-nothing system and is distributed across what are called multiple
shards in a cluster. The data is replicated in the storage nodes within a shard for availability. The
data can be accessed via programs such as those written in Java, C, Python as well as RESTful web
services. More details on the Oracle NoSQL database can be found at [ORAC].
7.3.10 Weka
Weka is an open-source software product that implements a collection of data mining techniques, from association rule mining to classification to clustering. It has been designed, developed, and maintained by the University of Waikato in New Zealand. Weka 3, a version of Weka, operates on big datasets. While earlier versions of Weka required the entire dataset to be loaded into memory to carry out, say, classification, the big data version carries out incremental loading and classification. Weka 3 also supports distributed data mining with map and reduce tasks. It also provides wrappers for Hadoop and Spark. More details on Weka can be found in [WEKA].
As stated in [COSM], “the core type system of Azure Cosmos DB’s database engine is atom-record-
sequence based. Atoms consist of a small set of primitive types, for example, string, Boolean,
number, and so on. Records are structs and sequences are arrays consisting of atoms, records, or
sequences.” Developers use the Cosmos DB by provisioning a database account. The notion of a
container is used to store the stored procedures, triggers, and user-defined functions. The entities
under that database account include the containers as well as the databases and permissions. These
entities are called resources. Data in containers is horizontally partitioned. More details of the
Cosmos DB can be found in [COSM].
REFERENCES
[HADO]. http://hadoop.apache.org/
[SPAR]. http://spark.apache.org/
[PIG]. https://pig.apache.org/
[HIVE]. https://hive.apache.org/
[STOR]. http://storm.apache.org/
[MONG]. https://www.mongodb.com/
[CASS]. http://cassandra.apache.org/
[BIGT]. https://cloud.google.com/bigtable/
[BIGQ]. https://cloud.google.com/bigquery/
[WEKA]. http://www.cs.waikato.ac.nz/ml/weka/bigdata.html
[ORAC]. https://www.oracle.com/big-data/index.html
[IBM]. https://www-01.ibm.com/software/data/bigdata/
[GOOG]. https://cloud.google.com/solutions/big-data/
[COSM]. https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/
[DYNA]. https://aws.amazon.com/dynamodb/
[KAFK]. http://kafka.apache.org/
[FLIN]. https://flink.apache.org/
[MAHO]. http://mahout.apache.org/
[NOSQ]. http://nosql-database.org/
[COUC]. http://couchdb.apache.org/
[MAPR]. https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[HBAS]. https://hbase.apache.org/
Conclusion to Part I
Part I, consisting of six chapters, described supporting technologies for BDMA and BDSP. In Chapter
2, we provided an overview of discretionary security policies in database systems. We started with
a discussion of access control policies including authorization policies and role-based access con-
trol policies. Then we discussed administration policies. We briefly discussed identification and
authentication. We also discussed auditing issues as well as views for security. Next, we discussed
policy enforcement as well as Structured Query Language (SQL) extensions for specifying poli-
cies as well as provided an overview of query modification. Finally, we provided a brief overview
of data privacy aspects. In Chapter 4, we provided an overview of data mining for cyber security
applications. In particular, we discussed the threats to computers and networks and described the
applications of data mining to detect such threats and attacks. Some of the data mining tools for
security applications developed at The University of Texas at Dallas were also discussed. Chapter
5 introduced the notions of the cloud and semantic web technologies. This is because some of the
experimental systems discussed in Part IV utilize these technologies. We first discussed concepts in
cloud computing including aspects of virtualization, deployment models, and cloud functions. We
also discussed technologies for the semantic web including eXtensible Markup Language (XML),
resource description framework (RDF), Ontologies, and OWL. In Chapter 6, we discussed the prob-
lem of insider threat and our approach to insider threat detection. We represented the insiders and
their communication as RDF graphs and then queried and mined the graphs to extract the nug-
gets. We also provided a comprehensive framework for insider threat detection. In Chapter 7, we
discussed three types of big data systems. First, we discussed what we call infrastructures (which
we also call frameworks). These are essentially massive data processing platforms such as Apache
Hadoop, Spark, Storm, and Flink. Then we discussed various big data management systems. These
included SQL-based systems and NoSQL-based systems. This was followed by a discussion of big
data analytics systems. Finally, we discussed cloud platforms that provide the capability for man-
agement of massive amounts of data.
The chapters in Part I lay the foundations of the discussions in Parts II through V. Stream data
mining, which is essentially data mining for streaming data, will be discussed in Part II. Applying
stream data mining for insider threat detection will be discussed in Part III. Some of the experi-
mental big data systems we have developed will be discussed in Part IV. Finally, the next steps in
big data management and analytics and big data security and privacy will be discussed in Part V.
Part II
Stream Data Analytics
Introduction to Part II
Now that we have provided an overview of the various supporting technologies including data min-
ing and cloud computing in Part I, the chapters in Part II will describe various stream data mining
techniques that we have designed and developed. Note that we use the terms data mining and data
analytics interchangeably.
Part II, consisting of six chapters, provides a detailed overview of the novel class detection tech-
niques for data streams. These techniques are part of stream data mining.
Chapter 8 focuses on the various challenges associated with data stream classification and
describes our approach to meet those challenges. Data stream classification mainly consists of two
steps: building (or learning) a classification model using historical labeled data, and classifying (or
predicting the class of) future instances using the model. The focus in Chapter 8 will mainly be
on the challenges involved in data stream classification. Chapter 9 discusses related work in data
stream classification, semisupervised clustering, and novelty detection. First, we discuss various
data stream classification techniques that solve the infinite length and concept-drift problems. Also,
we describe how our proposed multiple partition and multiple chunk (MPC) ensemble technique is
different from the existing techniques. Second, we discuss various novelty/anomaly detection tech-
niques and their differences from our ECSMiner approach. Finally, we describe different semisu-
pervised clustering techniques and the advantages of our ReaSC approach over them. Chapter 10
describes the MPC ensemble classification technique. First, we present an overview of the approach.
Then we establish theoretical justification for using this approach over other approaches. Finally, we
show the experimental results on real and synthetic data. Chapter 11 explains ECSMiner, our novel
class detection technique, in detail. First, we provide a basic idea about the concept-evolution prob-
lem and give an outline of our solution. Then, we discuss the algorithm in detail and show how to
efficiently detect a novel class within given time constraints and limited memory. Next, we analyze
the algorithm’s efficiency in correctly detecting the novel classes. Finally, we present experimental
results on different benchmark datasets. Chapter 12 describes the limited labeled data problem, and
our solution, ReaSC. First, we give an overview of the data stream classification problem, and a top
level description of ReaSC. Then, we describe the semisupervised clustering technique to efficiently
learn a classification model from scarcely labeled training data. Next, we discuss various issues
related to stream evolution. Last, we provide experimental results on a number of datasets. Finally,
Chapter 13 discusses our findings and provides directions for further work in stream data analyt-
ics, in general, and stream data classification, in particular. In addition, we will discuss stream data
analytics for handling massive amounts of data.
8 Challenges for Stream
Data Classification
8.1 INTRODUCTION
Data streams are continuous flows of data being generated from various computing machines such
as clients and servers in networks, sensors, call centers, and so on. Analyzing these data streams has
become critical for many applications, including those involving network, financial, and sensor data.
However, mining these ever-growing data is a big challenge to the data mining community ([CAI04],
[CHEN02], [FAN04a], [GABE05], [GANT01]). Data stream classification ([CHI05], [DING02],
[DOMI00], [GANT02], [GEHR99], [HULT01], [JIN03], [KUNC08], [LAST02], [MASU09a],
[SCHO05], [WANG03], [WANG06], [WANG07]) is one major aspect of data stream mining. Data
stream classification mainly consists of two steps: building (or learning) a classification model using
historical labeled data, and classifying (or predicting the class of) future instances using the model.
Building a classification model from a data stream is more challenging than building a model from
static data because of several unique properties of data streams. In this chapter, we will discuss the
challenges involved in analyzing such streams.
The organization of this chapter is as follows. Section 8.2 provides an overview of the challenges.
The notions of infinite length and concept drift in streaming data are discussed in Section 8.3. The
notion of concept evolution is discussed in Section 8.4. Aspects of limited labeled data are discussed
in Section 8.5. The experiments we have carried out are discussed in Section 8.6. Our contributions
to the field are discussed in Section 8.7. This chapter is summarized in Section 8.8.
8.2 CHALLENGES
First, data streams are assumed to have infinite length. It is impractical to store and use all the
historical data for learning, since it would require an infinite amount of storage and learning time.
Therefore, traditional classification algorithms that require several passes over the training data are
not directly applicable to data streams.
Second, data streams observe concept drift, which occurs when the underlying concept of the
data changes over time. For example, consider the problem of credit card fraud detection. Here
our goal is to detect whether a particular transaction is authentic or fraudulent. Since the behavior of
authentic users as well as the techniques of forgery change over time, what is considered authentic
now may appear to be fraudulent the next year, and vice versa. In other words, the characteristics/pat-
terns of these two classes of data (i.e., fraud/authentic) change over time. Therefore, the underlying
concept (i.e., class characteristics) of the data is dynamic. Traditional classification techniques that
assume static concept of the data are not applicable to data streams. In order to address concept
drift, a classification model must continuously adapt itself to the most recent concept.
Third, data streams also observe concept evolution which occurs when a novel class appears
in the stream. For example, consider an intrusion detection problem, where network traffic is ana-
lyzed to determine whether it is benign, or it contains some kind of intrusion. It is possible that
a completely new kind of intrusion occurs in the network. In that case, traditional classification
techniques, which assume that the total number of classes in the data is fixed, would misclassify
the new intrusion either as benign, or as a known intrusion. In order to cope with concept evolution,
a classification model must be able to automatically detect novel classes when they appear, before
being trained with the labeled instances of the novel class.
TABLE 8.1
Data Stream Classification Problems and Proposed Solutions
Proposed Technique             Infinite Length   Concept Drift   Concept Evolution   Limited Labeled Data
MPC ensemble (Chapter 10)            √                 √
ECSMiner (Chapter 11)                √                 √                 √
ReaSC (Chapter 12)                   √                 √                                        √
Finally, high-speed data streams suffer from insufficient labeled data. This is because manual
labeling is both costly and time-consuming. Therefore, the speed at which data points are labeled
lags far behind the speed at which data points arrive in the stream. As a result, most of the data
points in the stream would be left unlabeled. Most classification algorithms apply a supervised
learning technique, which requires completely labeled training data. By completely labeled we
mean that all instances in the training data are labeled. Therefore, supervised classification tech-
niques suffer from the scarcity of labeled data for learning, resulting in a poorly built classifier.
In order to deal with this scarcity of labeled data, we need a learning algorithm that is capable
of producing a good classification model even if it is supplied with partially labeled data for
training. By partially labeled, we mean that only P% (P < 100) instances in the training data are
labeled.
Most data stream classification techniques concentrate only on the first two issues, namely, infi-
nite length and concept drift. Our goal is to address all four issues, providing a more realistic
solution than the state of the art.
We propose three different techniques that address two or more of these problems. This is shown
in a tabular form in Table 8.1. For example, ECSMiner, the novel class detection technique proposed
in Chapter 11, solves the infinite length, concept-drift and concept-evolution problems.
As discussed earlier, we assume that the data stream is divided into equal-sized chunks. The chunk
size is chosen so that all the data in a chunk fits into main memory. Each chunk, when
labeled, is used to train classifiers. In our approach, there are three parameters that control the MPC
ensemble: v, r, and L. The parameter v determines the number of partitions (v = 1 means single-
partition ensemble), the parameter r determines the number of chunks (r = 1 means single-chunk
ensemble), and the parameter L controls the ensemble size. The MPC ensemble consists of Lv clas-
sifiers. This ensemble is updated whenever a new data chunk is labeled. We take the r most recent
consecutive labeled data chunks and train v classifiers using v-fold partitioning of these chunks.
We then update the ensemble by choosing the best (based on accuracy) Lv classifiers among the
newly trained v classifiers and the existing Lv classifiers. Thus, the total number of classifiers in the
ensemble is always kept constant.
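To make the update step concrete, the following is a minimal sketch (in Python; it is illustrative only and not taken from the original system). The helper names train_classifier and accuracy, and the choice of the latest labeled chunk as the evaluation set, are assumptions of the sketch.

import random
from itertools import chain

def update_mpc_ensemble(ensemble, recent_chunks, v, L, train_classifier, accuracy):
    """One MPC ensemble update, run whenever a new data chunk has been labeled.

    ensemble       : list of at most L*v previously trained classifiers
    recent_chunks  : the r most recently labeled data chunks, each a list of (x, y) pairs
    v              : number of partitions (v-fold partitioning of the pooled chunks)
    L              : ensemble-size parameter; the ensemble always keeps L*v classifiers
    train_classifier(data) -> classifier; accuracy(classifier, data) -> float
    """
    data = list(chain.from_iterable(recent_chunks))    # pool the r labeled chunks
    random.shuffle(data)
    folds = [data[i::v] for i in range(v)]             # v-fold partitioning

    # Train v classifiers, each on the pooled data minus one held-out fold.
    new_classifiers = [
        train_classifier(list(chain.from_iterable(f for j, f in enumerate(folds) if j != i)))
        for i in range(v)
    ]

    # Keep the best L*v classifiers among the existing ensemble and the v new
    # ones; accuracy is measured here on the latest labeled chunk (an assumption).
    candidates = ensemble + new_classifiers
    candidates.sort(key=lambda c: accuracy(c, recent_chunks[-1]), reverse=True)
    return candidates[:L * v]

At prediction time, the Lv classifiers in the ensemble vote with equal weight, as noted in the comparison with Wang et al. in Chapter 9.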
Figure 8.1 illustrates concept drift in a two-dimensional data stream. Each box represents a
chunk in the data stream. The white circles represent the negative class, and the black circles repre-
sent the positive class. The dark straight line (analogous to a hyperplane) inside each box separates
the data classes, and defines the current concept. The dotted straight line represents the concept
corresponding to the previous chunk. As a result of concept drift, some data points have different
class labels in the current concept than in the previous concept. Blue circles represent the data points
that are negative (positive) according to the current concept but positive (negative) according to the
previous concept.
It should be noted that when a new data point appears in the stream, it may not be labeled imme-
diately. We defer the ensemble updating process until the data points in the latest data chunk have
been labeled, but we keep classifying new unlabeled data using the current ensemble. For example,
consider the online credit card fraud detection problem. When a new credit card transaction takes
place, its class ({fraud,authentic}) is predicted using the current ensemble. Suppose a fraudulent
transaction has been misclassified as authentic. When the customer receives the bank statement, he
will identify this error and report to the authority. In this way, the actual labels of the data points
will be obtained, and the ensemble will be updated accordingly.
detection ([WANG03]). For example, in the case of intrusion detection, a novel kind of intrusion might
go undetected by traditional classifiers, but our approach should not only be able to detect the intru-
sion, but also deduce that it is a novel kind of intrusion. This discovery would lead to an intense
analysis of the intrusion by human experts in order to understand its cause, find a remedy, and make
the system more secure.
Figure 8.2 shows an example of concept evolution in data streams in a two-dimensional fea-
ture space. The left graph shows the data distribution of two different classes (+ and −) in a data
chunk. A rule-based learner learns two rules (shown in the figure) from this data chunk. The right
graph shows the data distribution in the next chunk, where a novel class (denoted by x) has evolved.
Instances of this class would be misclassified by the rules learned in the previous chunk since class
x was not present when the rules were learned. In fact, no traditional classification model can detect
the novel class. ECSMiner provides a solution to the concept-evolution problem by enriching each
classifier in the ensemble with a novel class detector. If the arrival of a novel class is discovered,
potential novel class instances are separated and classified as novel class. Thus, a novel class can be
automatically identified without manual intervention.
The novel class detection technique proposed in ECSMiner is different from traditional one-
class novelty detection techniques ([MARK03], [ROBE00], [YAMA01]) that can only distinguish
between the normal and anomalous data. That is, the traditional novelty detection techniques
assume that there is only one normal class and any instance that does not belong to the normal class
is an anomaly/novel class instance. Therefore, they are unable to distinguish among different types
of anomaly. But ECSMiner offers a multiclass framework for the novelty detection problem that
can distinguish between different classes of data and discover the emergence of a completely novel
class. Furthermore, traditional novelty detection techniques simply identify data points as outli-
ers/anomalies that deviate from the normal class. On the other hand, ECSMiner not only detects
whether a single data point deviates from the existing classes, but also discovers whether a group
of such outliers possesses the potential of forming a novel class by showing strong cohesion among
themselves. Therefore, ECSMiner is a synergy of a multiclass classification model and a novel class
detection model.
Traditional data stream classification techniques also make impractical assumptions about the
availability of labeled data. Most techniques ([CHEN08], [HULT01], [YANG05]) assume that
the true label of a data point can be accessed as soon as it has been classified by the classification
model. Thus, according to their assumption, the existing model can be updated immediately using
FIGURE 8.2 Concept evolution in a two-dimensional feature space: a novel class (denoted by x) emerges in the second data chunk. The classification rules learned from the first chunk are:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −
the labeled instance. In reality, we would not be so lucky in obtaining the label of a data instance
immediately, since manual labeling of data is time consuming and costly. For example, in a credit
card fraud detection problem, the actual labels (i.e., authentic/fraud) of the credit card transactions
by a customer usually become available in the next billing cycle after the customer reviews all his
transactions in the last statement and reports fraudulent transactions to the credit card company. Thus,
a more realistic assumption would be to have a data point labeled after Tl time units of its arrival.
For simplicity, we assume that the ith instance in the stream arrives at ith time unit. Thus, Tl can
be considered as a time constraint imposed on the data labeling process. Note that traditional
stream classification techniques assume Tl = 0. Finally, we impose another time constraint, Tc,
on classification decision. That is, an instance must be classified by the classification model within
Tc time units of its arrival. If it is assumed that there is no concept evolution, it is customary to have
Tc = 0, that is, an instance should be classified as soon as it arrives. However, when novel concepts
evolve, classification decisions may have to be postponed until enough instances are seen by the
model to gain confidence in deciding whether a novel class has emerged or not. Tc is the maximum
allowable time up to which the classification decision can be postponed for any instance. Note that
Tc < Tl must be maintained in any practical classification model. Otherwise, we would not need
the classifier at all; we could simply wait for the labels to arrive. We will discuss this issue in detail
in Chapter 10.
Figure 8.3 illustrates the significance of Tl and Tc with an example. Here xk is the last instance
that has arrived in the stream. Let xj be the instance that arrived Tc time units earlier, and xi be the
instance that arrived Tl time units earlier. Then xi and all instances that arrived before xi (shown
with dark-shaded area) are labeled, since all of them are at least Tl time units old. Similarly, xj and
all instances that arrived before xj (both the light-shaded and dark-shaded areas) are classified by
the classifier since they are at least Tc time units old. However, the instances inside the light-shaded
area are unlabeled. Instances that arrived after xj (age less than Tc) are unlabeled, and may or may
not be classified (shown with the unshaded area).
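As an illustration of how these two constraints partition the stream, the short sketch below (Python, illustrative only) computes the status of an instance from its age; the helper name instance_status and the example parameter values are assumptions, not part of the original formulation.

def instance_status(i, k, Tl, Tc):
    """Status of the ith instance at the moment the latest instance xk arrives.

    Assumes the jth instance arrives at time unit j, so the age of xi is k - i.
    Tl : delay after which an instance's true label becomes available
    Tc : maximum delay allowed before an instance must be classified (Tc < Tl)
    """
    age = k - i
    return {
        "labeled": age >= Tl,                  # dark-shaded region in Figure 8.3
        "must_be_classified": age >= Tc,       # light- plus dark-shaded regions
        "classification_may_be_deferred": age < Tc,
    }

# Example: with Tl = 1000 and Tc = 400, an instance that arrived 600 time units
# ago must already have been classified but is still unlabeled.
print(instance_status(i=9400, k=10000, Tl=1000, Tc=400))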
Integrating classification with novel class detection is a nontrivial task, especially in the pres-
ence of concept drift, and under time constraints. We assume an important property of each class:
the data points belonging to the same class should be closer to each other (cohesion) and should
be far apart from the data points belonging to other classes (separation). If a test instance is well
separated from the training data, it is identified as a Raw outlier. Raw outliers that possibly appear
as a result of concept drift or noise are filtered out. An outlier that passes the filter (called F outlier)
has potential to be a novel class instance. However, we must wait to see whether more such F outli-
ers appear in the stream that observe strong cohesion among themselves. If a sufficient number of
such strongly cohesive F outliers are observed, a novel class is assumed to have appeared, and the
F outliers are classified as novel class instances. However, we can defer the classification decision
of a test instance at most Tc time units after its arrival, which makes the problem more challenging.
Furthermore, we must keep detecting novel class instances in this ‘unsupervised’ fashion for at
least Tl time units from the arrival of the first novel class instance, since labeled training data of the
novel class(es) would not be available before that.
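The following is a highly simplified sketch of this decision logic (Python, illustrative only). The separation test, the cohesion score, and the threshold q are placeholders standing in for the detailed computations described in Chapter 11, and the noise filter applied to raw outliers is omitted.

import numpy as np

def detect_novel_class(test_points, training_centroids, radius, q):
    """Flag a potential novel class among buffered, still-unlabeled test points.

    A test point is treated as a raw outlier if it is farther than `radius` from
    every training centroid (the separation test); the filter that turns raw
    outliers into F-outliers is omitted here. If at least q such outliers show
    strong mutual cohesion relative to their separation from the existing
    classes, they are declared instances of a novel class. Assumes q >= 2.
    """
    test_points = np.asarray(test_points, dtype=float)
    centroids = np.asarray(training_centroids, dtype=float)

    # Separation test: distance from each test point to its nearest centroid.
    d_to_centroids = np.linalg.norm(test_points[:, None, :] - centroids[None, :, :], axis=2)
    outliers = test_points[d_to_centroids.min(axis=1) > radius]
    if len(outliers) < q:
        return False, outliers

    # Cohesion among the outliers: mean pairwise distance.
    pairwise = np.linalg.norm(outliers[:, None, :] - outliers[None, :, :], axis=2)
    cohesion = pairwise.sum() / (len(outliers) * (len(outliers) - 1))

    # Separation of the outliers from the existing classes.
    separation = np.linalg.norm(outliers[:, None, :] - centroids[None, :, :], axis=2).min(axis=1).mean()

    return cohesion < separation, outliers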
FIGURE 8.3 The time constraints Tl and Tc on the stream: xk is the latest instance, xj arrived Tc time units before xk, and xi arrived Tl time units before xk.
FIGURE 8.4 Illustrating the limited labeled data problem in data streams.
training data ([MASU08b], [WOOL09]). A summary of the statistics of the instances belonging to
each cluster is saved as a microcluster. The microclusters created from each chunk serve as a clas-
sification model for the nearest neighbor algorithm. In order to cope with concept drift, we keep an
ensemble of L models. Whenever a new model is built from a new data chunk, we update the ensem-
ble by choosing the best L models from the L + 1 models (previous L models and the new model),
based on their individual accuracies on the labeled training data of the new data chunk. Besides,
we refine the existing models in the ensemble whenever a novel class of data evolves in the stream.
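The sketch below (Python, illustrative only; the class and helper names are assumptions) shows the kind of microcluster summary such a model stores, how a chunk's microclusters can act as a nearest-centroid classifier, and how the ensemble keeps the best L of the L + 1 available models.

import numpy as np
from dataclasses import dataclass

@dataclass
class MicroCluster:
    centroid: np.ndarray    # mean of the points summarized by this cluster
    n: int                  # number of points summarized
    label_counts: dict      # class label -> count, over the (few) labeled points

    @property
    def label(self):
        # The cluster predicts the majority label among its labeled points.
        return max(self.label_counts, key=self.label_counts.get)

class MicroClusterModel:
    """Classification model built from one data chunk: a set of microclusters."""

    def __init__(self, microclusters):
        self.microclusters = microclusters

    def predict(self, x):
        # Nearest-neighbor classification over the microcluster centroids.
        d = [np.linalg.norm(np.asarray(x) - mc.centroid) for mc in self.microclusters]
        return self.microclusters[int(np.argmin(d))].label

def refresh_ensemble(models, new_model, labeled_chunk, L, accuracy):
    """Keep the best L of the previous L models plus the newly trained model."""
    candidates = models + [new_model]
    candidates.sort(key=lambda m: accuracy(m, labeled_chunk), reverse=True)
    return candidates[:L]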
8.6 EXPERIMENTS
We evaluate our approaches on several synthetic and real datasets. We use two kinds of synthetic
datasets:
• Synthetic data with only concept drift: This synthetic data was generated with a moving
hyperplane. It contains 10 real-valued attributes and two classes. The hyperplane is moved
gradually to simulate concept drift in the data. Details of this dataset are discussed in the
experiment sections in Chapters 10 through 12.
• Synthetic data with concept drift and concept evolution: This synthetic data was generated
with Gaussian distribution. It contains 20 real-valued attributes and 10 classes. Different
classes of the data were generated using different parameters of the (Gaussian) distribution.
Here concept drift is simulated by gradually changing the parameter values for each class.
Concept evolution is simulated by changing class distributions in such a way that novel
classes appear and old classes disappear at different times in the stream. Details of this
dataset are discussed in the experiment sections in Chapter 10.
We also use four different real datasets:
• Botnet dataset: We generate real peer-to-peer (P2P) botnet traffic in a controlled envi-
ronment, where we run a P2P bot named Nugache [LEMO06]. Here the goal is to
classify network traffic as either benign or botnet.
• KDD cup 1999 intrusion detection dataset: This dataset contains TCP connection
records extracted from LAN network traffic at MIT Lincoln Labs over a period of
2 weeks. We have used the 10% version of the dataset, which is more concentrated and
challenging than the full version. Each instance in the dataset refers either to a normal
connection or an attack. There are 22 types of attacks, such as buffer overflow, ports-
weep, guess-passwd, neptune, rootkit, smurf, spy, etc. So, there are 23 different classes
of data, including the normal class. Here the goal is to classify an instance into one of
the classes. This dataset is discussed in the experiment sections of Chapters 11 and 12.
• NASA Aviation Safety Reporting Systems (ASRS) dataset: This dataset contains around
150,000 text documents. Each document is actually a report corresponding to a flight
anomaly. There are a total of 55 anomalies. The goal is to classify a document into one
of these anomalies. Details of this dataset are discussed in Chapter 11.
• Forest cover dataset: This dataset contains geospatial descriptions of different types
of forests. It contains seven classes, and the goal is to classify an instance into one of
the forest classes. Details of this dataset are discussed in Chapter 10.
We compare our approaches with state-of-the-art data stream classification techniques on
these datasets. On each dataset, our approaches show significantly better performance in both run-
ning time and classification accuracy. Besides, we also analyze the sensitivity of the classification
accuracies and running times to the different parameters used in our techniques.
8.7 OUR CONTRIBUTIONS
Our contributions, which are elaborated in Chapters 10 through 12, are summarized as follows:
• We propose a generalized MPC ensemble technique that significantly reduces the expected
classification error over the existing single-partition, single-chunk ensemble methods. The
MPC ensemble technique addresses both infinite length and concept drift.
• We have theoretically justified the effectiveness of the MPC ensemble approach.
• We apply the MPC ensemble on synthetically generated data as well as on real botnet traf-
fic, and achieve better detection accuracies than other stream data classification techniques.
• To the best of our knowledge, no other data stream classification technique addresses the
concept-evolution problem. This is a major problem with data streams that must be dealt
with. In this light, ECSMiner offers a more realistic solution to data stream classification.
ECSMiner also addresses infinite length and concept-drift problems.
• ECSMiner offers a more practical framework for stream classification by introducing time
constraints for delayed data labeling and for making classification decisions.
We believe that the proposed methods provide promising, powerful, and practical techniques for
the stream classification problem in general.
REFERENCES
[AGGA06]. C.C. Aggarwal, J. Han, J. Wang, P.S. Yu. “A Framework for On-Demand Classification of Evolving
Data Streams,” IEEE Transactions on Knowledge and Data Engineering, 18 (5), 577–589, 2006.
[CAI04]. Y.D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, L. Auvil, “Maids: Mining Alarming Incidents from
Data Streams,” In 23rd ACM SIGMOD International Conference on Management of Data, Paris, France,
June 13–18, ACM, 2004.
[CHEN08]. S. Chen, H. Wang, S. Zhou, P.S. Yu, “Stop Chasing Trends: Discovering High Order Models in
Evolving Data,” In ICDE ’08: Proceedings of the 24th International Conference on Data Engineering,
Cancun, Mexico, April 7–12, pp. 923–932, IEEE Computer Society, 2008.
[CHEN02]. Y. Chen, G. Dong, J. Han, B.W. Wah, J. Wang, “Multi-Dimensional Regression Analysis of Time-
Series Data Streams,” In VLDB ’02: Proceedings of the 28th International Conference on Very Large
Data Bases, Hong Kong, China, August 20–23, pp. 323–334, VLDB Endowment, 2002.
[CHI05]. Y. Chi, P.S. Yu, H. Wang, R.R. Muntz, “Loadstar: A Load Shedding Scheme for Classifying Data
Streams,” In SDM ’05: Proceedings of the 2005 SIAM International Conference on Data Mining,
Newport Beach, CA, USA, April 21–23, p. 3, SIAM, 2005.
[CRUP04]. V. Crupi, E. Guglielmino, G. Milazzo, “Neural-Network-Based System for Novel Fault Detection
in Rotating Machinery,” Journal of Vibration and Control, 10(8):1137–1150, 2004.
[DING02]. Q. Ding, Q. Ding, W. Perrizo, “Decision Tree Classification of Spatial Data Streams Using Peano
Count Trees,” In SAC ’02: Proceedings of the 2002 ACM symposium on Applied Computing, Madrid,
Spain, March 10–14, pp. 413–417, ACM, 2002.
[DOMI00]. P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” In KDD ’00: Proceedings of
the 2000 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August
20–23, Boston, MA, USA, pp. 71–80, ACM, 2000.
[FAN04a]. W. Fan, Y.-A. Huang, H. Wang, P.S. Yu, “Active Mining of Data Streams,” In SDM ’04: Proceedings
of the 2004 SIAM International Conference on Data Mining, April 22–24, Lake Buena Vista, Florida,
USA, pp. 457–461, SIAM, 2004.
[FAN04b]. W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In KDD ’04:
Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, August 22–25, Seattle, WA, USA, pp. 128–137, ACM, 2004.
[GABE05]. M.M. Gaber, A. Zaslavsky, S. Krishnaswamy, “Mining Data Streams: A Review,” ACM SIGMOD
Record, 34 (2), 18–26, June 2005.
[GANT01]. V. Ganti, J. Gehrke, R. Ramakrishnan, “Demon: Mining and Monitoring Evolving Data,” IEEE
Transactions on Knowledge and Data Engineering, 13 (1), 50–63, 2001.
[GANT02]. V. Ganti, J. Gehrke, R. Ramakrishnan, “Mining Data Streams Under Block Evolution,” ACM
SIGKDD Explorations Newsletter, 3 (2), 1–10, 2002.
[GAO07]. J. Gao, W. Fan, J. Han. “On Appropriate Assumptions to Mine Data Streams,” In ICDM ’07:
Proceedings of the 2007 International Conference on Data Mining, October 28–31, Omaha, NE, USA,
pp. 143–152, IEEE Computer Society, 2007.
[GEHR99]. J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh, “Boat-Optimistic Decision Tree Construction,”
In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of
Data, Jun. 1–3, Philadelphia, PA, USA, pp. 169–180, ACM, 1999.
[HULT01]. G. Hulten, L. Spencer, P. Domingos, “Mining Time-Changing Data Streams,” In KDD ’01:
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, August 26–29, San Francisco, CA, USA, pp. 97–106, ACM, 2001.
[JIN03]. R. Jin and G. Agrawal, “Efficient Decision Tree Construction on Streaming Data,” In KDD ’03:
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, August 24–27, Washington, DC, pp. 571–576, ACM, 2003.
[KHAN07]. L. Khan, M. Awad, B.M. Thuraisingham, “A New Intrusion Detection System Using Support
Vector Machines and Hierarchical Clustering,” VLDB Journal, 16 (4), 507–521, 2007.
[KOLT03]. J.Z. Kolter and M.A. Maloof, “Dynamic Weighted Majority: A New Ensemble Method for Tracking
Concept Drift,” In ICDM ’03: Proceedings of the Third IEEE International Conference on Data Mining,
Nov. 19–22, Melbourne, FL, USA, IEEE Computer Society, pp. 123–130, 2003.
[KOLT05]. J.Z. Kolter and M.A. Maloof, “Using Additive Expert Ensembles to Cope with Concept Drift,” In
ICML ’05: Proceedings of the Twenty-Second International Conference on Machine Learning, August
7–11, Morgan Kaufmann, Bonn, Germany, pp. 449–456, 2005.
[KUNC08]. L.I. Kuncheva and J. Salvador Sánchez, “Nearest Neighbour Classifiers for Streaming Data with
Delayed Labelling,” In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on
Data Mining, December 15–19, Pisa, Italy, pp. 869–874, IEEE Computer Society, 2008.
[LAST02]. M. Last, “Online Classification of Nonstationary Data Streams,” Intelligent Data Analysis, 6(2),
129–147, 2002.
[LEMO06]. R. Lemos, Bot software looks to improve peerage, http://www.securityfocus.com/news/11390, 2006.
[MARK03a]. M. Markou and S. Singh, “Novelty Detection: A Review—Part 2: Neural Network Based
Approaches,” Signal Processing, 83(12), 2499–2521, 2003.
[MASU07a]. M.M. Masud, L. Khan, B.M. Thuraisingham, “E-mail Worm Detection Using Data Mining,”
International Journal of Information Security and Privacy, 1 (4), 47–61, 2007.
[MASU07b]. M.M. Masud, L. Khan, B.M. Thuraisingham, “Feature Based Techniques for Auto-Detection of
Novel Email Worms,” In PAKDD ’07: Proceedings of the 11th Pacific-Asia Conference on Knowledge
Discovery and Data Mining, May 22–25, Springer-Verlag, Nanjing, China, pp. 205–216, 2007.
[MASU06]. M.M. Masud, L. Khan, E. Al-Shaer, “Email Worm Detection Using Naïve Bayes and Support
Vector Machine,” In ISI ’06: Proceedings of the 2006 IEEE Intelligence and Security Informatics
Conference, May 23–24, San Diego, CA, USA, pp. 733–734, IEEE Computer Society, 2006.
[MASU07c]. M.M. Masud, L. Khan, B.M. Thuraisingham, “A Hybrid Model to Detect Malicious Executables,”
In ICC ’07: Proceedings of the 2007 IEEE International Conference on Communications, June 24–28,
Glasgow, Scotland, pp. 1443–1448, IEEE Computer Society, 2007.
[MASU08a]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM ’08: Proceedings
of the 2008 International Conference on Data Mining, December 15–19, Pisa, Italy, pp. 929–934, IEEE
Computer Society, 2008.
[MASU08b]. M.M. Masud, L. Khan, B.M. Thuraisingham, “A Scalable Multi-Level Feature Extraction
Technique to Detect Malicious Executables,” Information Systems Frontiers, 10 (1), 33–45, 2008.
[MASU09a]. M.M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “A Multi-Partition Multi-Chunk
Ensemble Technique to Classify Concept-Drifting Data Streams,” In PAKDD ’09: Proceedings of the
13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 27–30, Springer-Verlag,
Bangkok, Thailand, pp. 363–375, 2009. (also Advances in Knowledge Discovery and Data Mining).
[MASU09b]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Integrating Novel Class Detection
with Classification for Concept-Drifting Data Streams,” In ECML PKDD ’09: Proceedings of the 2009
European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in
Databases, volume II, Springer-Verlag, Bled, Slovenia, September 7–11, pp. 79–94, 2009.
[ROBE00]. S.J. Roberts, “Extreme Value Statistics for Novelty Detection in Biomedical Signal Processing,”
In Proceedings of the First International Conference on Advances in Medical Signal and Information
Processing, Bristol, UK, pp. 166–172, 2000.
[SCHO05]. M. Scholz and R. Klinkenberg, “An Ensemble Classifier for Drifting Concepts,” In IWKDDS ’05:
Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams, Porto,
Portugal, Oct. 3–7, pp. 53–64, 2005.
[WANG03]. H. Wang, W. Fan, P. S. Yu, J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” In KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Washington, DC, USA, August 24–27 pp. 226–235, ACM,
2003.
[WANG06]. H. Wang, J. Yin, J. Pei, P.S. Yu, J.X. Yu, “Suppressing Model Overfitting in Mining Concept-
Drifting Data Streams,” In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, New York, NY, USA, August 20–23, pp. 736–741, ACM,
2006.
[WANG07]. P. Wang, H. Wang, X. Wu, W. Wang, B. Shi, “A Low-Granularity Classifier for Data Streams with
Concept Drifts and Biased Class Distribution,” IEEE Transactions on Knowledge and Data Engineering,
19 (9), 1202–1213, 2007.
[WOOL09]. C. Woolam, M.M. Masud, L. Khan, “Lacking Labels in the Stream: Classifying Evolving Stream
Data with Few Labels,” In ISMIS ’09: Proceedings of the 18th International Symposium on Methodologies
for Intelligent Systems, Springer, Prague, Czech Republic, September 14–17, pp. 552–562, 2009.
[YAMA01]. K. Yamanishi and J. Takeuchi, “Discovering Outlier Filtering Rules from Unlabeled Data:
Combining a Supervised Learner with an Unsupervised Learner,” In KDD ’01: Proceedings of the
Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San
Francisco, CA, USA, August 26–29, pp. 389–394, ACM, 2001.
[YANG05]. Y. Yang, X. Wu, X. Zhu, “Combining Proactive and Reactive Predictions for Data Streams,” In KDD
’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in
Data Mining, Chicago, IL, USA, August 21–24, pp. 710–715, ACM, 2005.
9 Survey of Stream Data
Classification
9.1 INTRODUCTION
Data streams are continuously arriving data in applications such as finance, networks, and sensors.
These data streams have to be analyzed so that nuggets can be extracted, for example, to detect
network intrusions, predict stock market prices, and identify suspicious events. As discussed in Chapter 8, there
are many challenges that need to be addressed for analyzing data streams. These include infinite
length, concept drift, concept evolution, and limited labeled data. In Chapters 10 through 12, we
will discuss our approach to data stream analytics. Our approach builds upon several previous
works. Therefore, in this chapter we review the previous work in data stream classifica-
tion and novelty detection. Also, we discuss related works in semisupervised clustering which is an
important component of our data stream classification technique with limited labeled data.
The organization of this chapter is as follows. The general approach to data stream classification
is discussed in Section 9.2. Single model classification is discussed in Section 9.3. Ensemble clas-
sification is discussed in Section 9.4. Novel class detection is discussed in Section 9.5. Data stream
classification with limited labeled data is discussed in Section 9.6. Summary and directions are
provided in Section 9.7.
FIGURE 9.1 Data stream classification applied to network traffic: a classification model at the firewall separates benign traffic (forwarded to the server) from attack traffic (blocked and quarantined), and expert analysis and labeling of the traffic are used to update the model.
FIGURE 9.2 Illustrating the basic principle of data stream classification using the ensemble approach.
There have been many works in stream data classification. Their main difference lies in the way
the existing classification model is updated. There are two main approaches—single-model clas-
sification and ensemble classification.
[KOLT05], [SCHO05], [WANG03]). These ensemble approaches have the advantage that they can
be built more efficiently than updating a single model, and they achieve higher accuracy than their
single model counterparts [TUME96].
Among these approaches, our MPC ensemble approach (Chapter 10) is related to that of Wang
et al. [WANG03]. Wang et al. [WANG03] keep an ensemble of the L best classifiers. Each time a
new data chunk appears, a classifier is trained from that chunk. If this classifier shows better accu-
racy than any of the L classifiers in the ensemble, then the new classifier replaces the old one. When
classifying an instance, weighted voting among the classifiers in the ensemble is taken, where the
weight of a classifier is inversely proportional to its error. Figure 9.2 illustrates the basic principle of
the ensemble approaches. Here the last labeled chunk is used to update the existing ensemble. Then
the ensemble is used to classify the last data chunk, which is unlabeled.
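A minimal sketch of this error-weighted voting follows (Python, illustrative only); the text above states only that a classifier's weight is inversely proportional to its error, so the exact weighting formula used here is an assumption.

from collections import defaultdict

def weighted_vote(ensemble, x, errors, eps=1e-6):
    """Classify x by weighted voting over the classifiers in the ensemble.

    ensemble : list of classifiers, each exposing a predict(x) method
    errors   : errors[i] is classifier i's error on the most recent labeled
               chunk; its voting weight is inversely proportional to that error.
    """
    votes = defaultdict(float)
    for clf, err in zip(ensemble, errors):
        votes[clf.predict(x)] += 1.0 / (err + eps)
    return max(votes, key=votes.get)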
(a) The ensemble {M1, M2, M3}, built from the labeled chunks D1–D3, classifies the unlabeled chunk D4. (b) Once D4 is labeled, a new classifier M4 is trained from it and replaces M2; the updated ensemble {M1, M4, M3} classifies the next unlabeled chunk D5.
FIGURE 9.3 Illustrating the ensemble updating process of Wang et al. (Adapted from Wang et al. Mining
Concept-Drifting Data Streams Using Ensemble Classifiers. In KDD ‘03: Proceedings of the 9th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235, Washington,
DC, August 24–27, ACM, 2003.)
Therefore, it is replaced by the new model M4, and the ensemble is updated. This new ensemble is
used to classify the instances in the latest data chunk D5. This process continues indefinitely.
There are several differences between the MPC ensemble approach and the approach of Wang
et al. First, we apply multipartitioning of the training data to build multiple (i.e., v) classifiers from
that training data. Second, we train each classifier from r consecutive data chunks, rather than from
a single chunk. Third, when we update the ensemble, the v classifiers that are removed may come
from different chunks; thus, although some classifiers from a chunk may have been removed, other
classifiers from that chunk may still remain in the ensemble. In contrast, in the approach of Wang
et al., removal of a classifier means total removal of the knowledge obtained from one whole chunk.
Finally, we use simple voting, rather than weighted voting. Thus, our MPC ensemble approach is a
generalized form of the approach of Wang et al. and users have more freedom to optimize perfor-
mance by choosing the appropriate values of these two parameters (i.e., r and v).
9.5.1 Novelty Detection
ECSMiner is related to novelty/anomaly detection. Markou and Singh study novelty detection in
detail in [MARK03a], [MARK03b]. Most novelty detection techniques fall into one of two cat-
egories: parametric and nonparametric. Parametric approaches assume a particular distribution of
data, and estimate parameters of the distribution from the normal data. According to this assump-
tion, any test instance is assumed to be novel if it does not follow the distribution ([NAIR97],
[ROBE00]). ECSMiner is a nonparametric approach, and therefore, it is not restricted to any specific
data distribution. There are several nonparametric approaches available, such as the Parzen window
method [YEUN02], the k-nearest neighbor (k-NN)-based approach [YANG02], the kernel-based method
[AHME07], and the rule-based approach [MAHO03].
ECSMiner is different from the above novelty/anomaly detection techniques in three aspects.
First, existing novelty detection techniques only consider whether a test point is significantly differ-
ent from the normal data. However, we not only consider whether a test instance is sufficiently dif-
ferent from the training data, but also consider whether there are strong similarities among such test
instances. Therefore, existing techniques discover novelty individually in each test point, whereas
ECSMiner discovers novelty collectively among several coherent test points to detect the presence
of a novel class. Second, ECSMiner can be considered as a multiclass novelty detection technique,
since it can distinguish among different classes of data and also discover emergence of a novel class.
But existing novelty detection techniques can only distinguish between normal and novel, and,
therefore, can be considered as one-class classifiers. Finally, most of the existing novelty detection
techniques assume that the normal model is static, that is, there is no concept drift in the data. But
ECSMiner can detect novel classes even if concept drift occurs in the existing classes.
9.5.2 Outlier Detection
Novelty detection is also closely related to outlier detection techniques. There are many outlier detec-
tion techniques available, such as [AGAR05], [BAY03], [BREU00], [LAZA05], and [YAMA01].
Some of them are also applicable to data streams ([SUBR06], [TAND07]). However, the main
difference between these outlier detection techniques and the outlier detection performed in ECSMiner
is that the primary objective of ECSMiner is novel class detection, not outlier detection. Outliers are
a byproduct of intermediate computation steps in the ECSMiner algorithm. Thus, the precision of our
outlier detection technique is not critical to the overall performance of ECSMiner.
9.5.3 Baseline Approach
Spinosa et al. [SPIN08] propose a cluster-based novel concept detection technique that is applicable
to data streams. However, this is also a one-class novelty detection technique, where the authors assume
that there is only one normal class and all other classes are novel. Thus, it is not directly applicable
to a multiclass environment where more than one class is considered as normal or non-novel. But
ECSMiner can handle any number of existing classes, and also detect a novel class that does not
belong to any of the existing classes. Therefore, ECSMiner offers a more practical solution to the
novel class detection problem, which has been proved empirically.
ECSMiner extends our previous work [MASU09c], in which we proposed a novel class detec-
tion technique. However, in the previous work, we did not consider the time constraints Tl and
Tc. Therefore, ECSMiner addresses a more practical problem than the previous one. These time
constraints impose several restrictions on the classification algorithm, making classification more
challenging [MASU09a]. We encounter these challenges and provide efficient solutions.
9.6.1 Semisupervised Clustering
Semisupervised clustering techniques utilize a small amount of knowledge available in the form
of pairwise constraints (must-link, cannot-link), or class labels of the data points. According to
[BASU06], semisupervised clustering techniques can be subdivided into two categories: con-
straint-based and distance-based. Constraint-based approaches, such as [BASU02], [DEMI99],
and [WAGS01] try to cluster the data points without violating the given constraints. Distance-
based techniques use a specific distance metric or similarity measure (e.g., Euclidean distance),
but the distance metric is parameterized so that it can be adjusted to satisfy the given constraints.
Examples of the distance-based techniques are [COHN03], [HALK05], [KLEI02], and [XING03].
Some recent approaches for semisupervised clustering integrate the constraint-based and distance-
based techniques into a unified framework by applying pairwise constraints on top of the unsu-
pervised K-means clustering technique and formulating a constrained K-means clustering problem
([BASU04], [BASU06], [BILE04]). These approaches usually apply the expectation-maximization
(E-M) [DEMP77] technique to solve the constrained clustering problem.
ReaSC follows the constraint-based technique, but it is different from other constraint-based
approaches. Most constraint-based approaches use pairwise constraints (e.g., [BILE04]), whereas
we utilize a cluster-impurity measure based on the limited labeled data contained in each cluster
[MASU08]. If pairwise constraints are used, the running time per E-M step becomes quadratic in
the total number of labeled points, whereas the running time becomes linear if the impurity mea-
sures are used. So, the impurity measures are more practical for classifying high-speed stream data.
Although Basu et al. [BASU02] did not use any pairwise constraints, they did not use any cluster-
impurity measure either. However, a cluster-impurity measure was used by Demiriz et al. [DEMI99].
But they applied expensive genetic algorithms, and had to adjust weights given to different compo-
nents of the clustering objective function to obtain good clusters. On the contrary, we apply E-M,
and we do not need to tune parameters to get a better objective function. Furthermore, we use a
compound impurity measure rather than the simple impurity measures used in [DEMI99]. Besides,
to the best of our knowledge, no other work applies a semisupervised clustering technique to clas-
sify stream data.
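To make the idea of a label-based cluster-impurity measure concrete, here is a minimal sketch (Python, illustrative only). It uses the entropy of the labeled points in a cluster, which is just one possible impurity measure and is not necessarily the compound measure used by ReaSC.

import math
from collections import Counter

def cluster_impurity(labels_in_cluster):
    """Entropy of the labeled points assigned to one cluster.

    labels_in_cluster : class labels of the (few) labeled points in the cluster.
    Returns 0 for a pure cluster; higher values indicate more class mixing.
    The cost is linear in the number of labeled points, unlike pairwise
    constraints, which are quadratic.
    """
    counts = Counter(labels_in_cluster)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: a cluster whose labeled points are mostly one class has low impurity.
print(cluster_impurity(["benign", "benign", "benign", "attack"]))   # about 0.81
print(cluster_impurity(["benign", "benign", "benign", "benign"]))   # 0.0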
9.6.2 Baseline Approach
In ReaSC, we follow an ensemble classification approach, but it is different from other ensemble
approaches in two aspects. First, previous ensemble-based techniques use the underlying learn-
ing algorithm (such as decision tree, Naive Bayes, etc.) as a black-box and concentrate only on
optimizing the ensemble. But we concentrate mainly on building efficient classification models in
an evolving scenario. In this light, ReaSC is more closely related to the work of Aggarwal et al.
[AGGA06]. Second, previous techniques including [AGGA06] require completely labeled train-
ing data. But in practice, a very limited amount of labeled data may be available in the stream,
leading to poorly trained classification models. We show that high classification accuracy can be
achieved even with a limited amount of labeled data.
Aggarwal et al. [AGGA06] apply a supervised microclustering technique along with horizon-
fitting to classify evolving data streams. They have achieved higher accuracy than other approaches
that use fixed horizon or the entire dataset for training. We also apply a microclustering technique.
But there are two major differences between ReaSC and this approach. First, we do not use horizon-
fitting for classification. Rather, we use a fixed-size ensemble of classifiers. So, we do not need
to store historical snapshots, which allows us to save memory. Second, we apply semisupervised
clustering rather than supervised clustering. Thus, we need only a fraction of the training data to be labeled,
compared to the completely labeled data required by the approach of Aggarwal et al. [AGGA06].
Thus, ReaSC not only saves more memory, but it is also more applicable to a realistic scenario
where labeled data are scarce.
ReaSC extends our previous work [MASU08]. In the previous work, it was assumed that there
were two parallel, disjoint streams: a training stream and a test stream. The training stream con-
tained the labeled instances and was used to train the models. The test stream contained the unla-
beled instances and was used for testing. However, this assumption was not so realistic since in a
real-world scenario, labeled data may not be immediately available in the stream, and therefore,
it may not be possible to construct a separate training stream. So, ReaSC makes a more realistic
assumption that there is a single continuous stream. Each data chunk in the stream is first tested by
the existing ensemble, and then the same chunk is used for training, assuming that the instances
in the chunk have been labeled. Thus, all the instances in the stream are eventually tested by the
ensemble. Besides, in this book we have described our technique more elaborately and provided
detailed understanding and proof of the proposed framework. Finally, we have enriched the experi-
mental results by adding three more datasets, running more rigorous experiments, and reporting an
in-depth analysis of the results [MASU09b].
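The single-stream, test-then-train protocol described above can be sketched as follows (Python, illustrative only; the chunk representation, majority voting, and model-selection helper are simplifying assumptions).

from collections import Counter

def ensemble_predict(models, x):
    """Simple majority vote over the models in the ensemble."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

def process_stream(chunks, models, build_model, select_best, L):
    """Test-then-train over a single stream divided into chunks.

    Each incoming chunk is first classified by the existing ensemble of models;
    once its labels become available, the same chunk is used to train a new
    model and to refresh the ensemble, so every instance is eventually tested.
    """
    predictions = []
    for chunk in chunks:                                  # chunk: list of (x, true_label)
        predictions.extend(ensemble_predict(models, x) for x, _ in chunk)   # test first
        new_model = build_model(chunk)                                      # then train
        models = select_best(models + [new_model], chunk, L)                # keep best L
    return predictions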
REFERENCES
[AGAR05]. D. Agarwal, “An Empirical Bayes Approach to Detect Anomalies in Dynamic Multidimensional
Arrays,” In ICDM ‘05: Proceedings of the 5th IEEE International Conference on Data Mining, Nov.
27–30, Houston, TX, pp. 26–33, IEEE Computer Society, 2005.
[AGGA06]. C.C. Aggarwal, J. Han, J Wang, P.S. Yu, “A Framework for On-Demand Classification of Evolving
Data Streams,” IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
[AHME07]. T. Ahmed, M. Coates, A. Lakhina, “Multivariate Online Anomaly Detection Using Kernel
Recursive Least Squares,” In INFOCOM ‘07: Proceedings of the 26th Annual IEEE Conference on
Computer Communications, May 6–12, Anchorage, Alaska, pp. 625–633, IEEE Computer Society,
2007.
[BASU02]. S. Basu, A. Banerjee, R.J. Mooney, “Semi-Supervised Clustering by Seeding,” In ICML ‘02:
Proceedings of the 19th International Conference on Machine Learning, July 8–12, Sydney, Australia,
pp. 27–34, Morgan Kaufmann, 2002.
[BASU04]. S. Basu, A. Banerjee, R.J. Mooney, “Active Semi-Supervision for Pairwise Constrained Clustering,”
In SDM ‘04: Proceedings of the 2004 SIAM International Conference on Data Mining, April 22–24,
Lake Buena Vista, FL, pp. 333–344, SIAM, 2004.
[BASU06]. S. Basu, M. Bilenko, A. Banerjee, R.J. Mooney. Probabilistic Semi-Supervised Clustering with
Constraints. Semi-Supervised Learning, O. Chapelle, B. Schoelkopf, A. Zien, editors, MIT Press,
Cambridge, MA, pp. 73–102, 2006.
[BAY03]. S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in Near Linear Time with
Randomization and a Simple Pruning Rule,” In KDD ‘03: Proceedings of the 9th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, August 24–27, Washington, DC,
pp. 29–38, ACM, 2003.
[BILE04]. M. Bilenko, S. Basu, and R.J. Mooney, “Integrating Constraints and Metric Learning in Semi-
Supervised Clustering,” In ICML ‘04: Proceedings of the 21st International Conference on Machine
Learning, July 4–8, Banff, Canada, pp. 81–88, Morgan Kaufmann, 2004.
[BREU00]. M.M. Breunig, H-P. Kriegel, R.T. Ng, J. Sander, “LOF: Identifying Density-Based Local Outliers,”
ACM SIGMOD Record, 29(2):93–104, June 2000.
[COHN03]. D. Cohn, R. Caruana, A. McCallum. Semi-Supervised Clustering with User Feedback. Technical
Report TR2003-1892, Cornell University, 2003.
[DEMI99]. A. Demiriz, K.P. Bennett, M.J. Embrechts, “Semi-Supervised Clustering Using Genetic
Algorithms,” In ANNIE ‘99: Proceedings of the 1999 International Conference on Artificial Neural
Networks in Engineering, Nov. 7–10, St. Louis, MO, pp. 809–814, ASME Press, 1999.
[DEMP77]. A.P. Dempster, N.M. Laird, D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM
Algorithm,” Journal of the Royal Statistical Society B, 39:1–38, 1977.
[DOMI00]. P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” In KDD ‘00: Proceedings of
the 2000 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August
20–23, Boston, MA, pp. 71–80, ACM, 2000.
[FAN04]. W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In KDD ‘04:
Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, August 22–25, Seattle, WA, pp. 128–137, ACM, 2004.
[FREU96]. Y. Freund and R.E. Schapire, “Experiments with a New Boosting Algorithm,” In ICML ‘96:
Proceedings of the 13th International Conference on Machine Learning, Jul. 3–6, Bari, Italy, pp. 148–156,
Morgan Kaufmann, 1996.
[GAO07]. J. Gao, W. Fan, J. Han, “On appropriate Assumptions to Mine Data Streams,” In ICDM ‘07:
Proceedings of the 2007 International Conference on Data Mining, October 28–31, Omaha, NE, pp.
143–152, IEEE Computer Society, 2007.
[GEHR99]. J. Gehrke, V. Ganti, R. Ramakrishnan, W-Y. Loh, “Boat-Optimistic Decision Tree Construction,”
In SIGMOD ‘99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of
Data, Jun. 1–3, Philadelphia, PA, pp. 169–180, ACM, 1999.
[HALK05]. M. Halkidi, D. Gunopulos, N. Kumar, M. Vazirgiannis, C. Domeniconi, “A Framework for Semi-
Supervised Learning Based on Subjective and Objective Clustering Criteria,” In ICDM ‘05: Proceedings of
the 5th IEEE International Conference on Data Mining, November 27–30, Houston, TX, pp. 637–640,
IEEE Computer Society, 2005.
[HULT01]. G. Hulten, L. Spencer, P. Domingos, “Mining Time-Changing Data Streams,” In KDD ‘01:
Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, August 26–29, San Francisco, CA, pp. 97–106, ACM, 2001.
[KLEI02]. D. Klein, S.D. Kamvar, C.D. Manning, “From Instance-Level Constraints to Space-Level
Constraints: Making the Most of Prior Knowledge in Data Clustering,” In ICML ‘02: Proceedings of the
19th International Conference on Machine Learning, July 8–12, Sydney, Australia, pp. 307–314, Morgan
Kaufmann, 2002.
[KOLT05]. J.Z. Kolter and M.A. Maloof, “Using Additive Expert Ensembles to Cope with Concept Drift,”
In ICML ‘05: Proceedings of the 22nd International Conference on Machine Learning, August 7–11,
Bonn, Germany, pp. 449–456, Morgan Kaufmann, 2005.
[LAZA05]. A. Lazarevic and V. Kumar, “Feature Bagging for Outlier Detection,” In KDD ‘05: Proceedings
of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, August
21–24, Chicago, IL, pp. 157–166, ACM, 2005.
[MAHO03]. M.V. Mahoney and P.K. Chan, “Learning Rules for Anomaly Detection of Hostile Network
Traffic,” In ICDM ‘03: Proceedings of the 3rd International Conference on Data Mining, November
19–22, Melbourne, Florida, pp. 601–604, IEEE Computer Society, 2003.
[MARK03a]. M. Markou and S. Singh, “Novelty Detection: A Review—Part 2: Neural Network-Based
Approaches,” Signal Processing, 83(12):2499–2521, 2003.
[MARK03b]. M. Markou and S. Singh, “Novelty Detection: A Review—Part 1: Statistical Approaches,” Signal
Processing, 83(12):2481–2497, 2003.
[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM ‘08: Proceedings
of the 2008 International Conference on Data Mining, Dec. 15–19, Pisa, Italy, pp. 929–934, IEEE
Computer Society, 2008.
[MASU09a]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class
Detection in Concept-Drifting Data Streams under Time Constraints,” IEEE Transactions on Knowledge
and Data Engineering, 23(6):859–874, 2011.
[MASU09b]. M.M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. Hamlen, B.M. Thuraisingham, “Facing the
Reality of Data Stream Classification: Coping with Scarcity of Labeled Data,” Journal of Knowledge and
Information Systems, 33(1):213–244, 2012.
[MASU09c]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Integrating Novel Class Detection
with Classification for Concept-Drifting Data Streams,” In ECML PKDD ‘09: Proceedings of the 2009
European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in
Databases, Volume II, Sep 7–11, Bled, Slovenia, pp. 79–94, Springer-Verlag, 2009.
[NAIR97]. A. Nairac, T.A. Corbett-Clark, R. Ripley, N.W. Townsend, L. Tarassenko, “Choosing an Appropriate
Model for Novelty Detection,” In ICANN ‘97: Proceedings of the 7th International Conference on
Artificial Neural Networks, pp. 117–122, Lausanne, Switzerland, October 8–10, Springer, 1997.
[ROBE00]. S.J. Roberts, “Extreme Value Statistics for Novelty Detection in Biomedical Signal Processing,”
In Proceedings of the 1st International Conference on Advances in Medical Signal and Information
Processing, pp. 166–172, 2000.
[SCHO05]. M. Scholz and R. Klinkenberg, “An Ensemble Classifier for Drifting Concepts,” In IWKDDS ‘05:
Proceedings of the 2nd International Workshop on Knowledge Discovery in Data Streams, October 3–7,
Porto, Portugal, pp. 53–64, 2005.
[SPIN08]. E.J. Spinosa, A.P. de Leon F. de Carvalho, J. Gama, “Cluster-Based Novel Concept Detection
in Data Streams Applied to Intrusion Detection in Computer Networks,” In SAC ‘08: Proceedings of
the 23rd ACM symposium on Applied Computing, March 16–20, Ceara, Brazil, pp. 976–980, ACM,
2008.
[SUBR06]. S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, D. Gunopulos, “Online Outlier
Detection in Sensor Data Using Non-Parametric Models,” In VLDB ‘06: Proceedings of the 32nd
International Conference on Very Large Data Bases, September 12–15, Seoul, Korea, pp. 187–198,
VLDB Endowment, 2006.
[TAND07]. G. Tandon and P.K. Chan, “Weighting versus Pruning in Rule Validation for Detecting Network
and Host Anomalies,” In KDD ‘07: Proceedings of the 13th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, August 12–15, San Jose, CA, pp. 697–706, ACM, 2007.
[TUME96]. K. Tumer and J. Ghosh, “Error Correlation and Error Reduction in Ensemble Classifiers,”
Connection Science, 8(3–4):385–403, 1996.
[UTGO89]. P.E. Utgoff, “Incremental Induction of Decision Trees,” Machine Learning, 4:161–186, 1989.
[WAGS01]. K. Wagstaff, C. Cardie, S. Schroedl, “Constrained K-Means Clustering with Background Knowledge,” In
ICML ‘01: Proceedings of the 18th International Conference on Machine Learning, June 28–July 1, Williamstown,
MA, pp. 577–584, Morgan Kaufmann, 2001.
[WANG03]. H. Wang, W. Fan, P.S. Yu, J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” In KDD ‘03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, August 24–27, Washington, DC, pp. 226–235, ACM, 2003.
[XING03]. E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, “Distance Metric Learning, with Application to
Clustering with Side-Information,” Advances in Neural Information Processing Systems 15, 15:505–512,
2003.
[YAMA01]. K. Yamanishi and J. Takeuchi, “Discovering Outlier Filtering Rules from Unlabeled Data:
Combining a Supervised Learner with an Unsupervised Learner,” In KDD ‘01: Proceedings of the 7th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 26–29, San
Francisco, CA, pp. 389–394, ACM, 2001.
[YANG02]. Y. Yang, J. Zhang, J. Carbonell, C. Jin, “Topic-Conditioned Novelty Detection,” In KDD ‘02:
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery in Data
Mining, July 23–26, Edmonton, Alberta, Canada, pp. 688–693, ACM, 2002.
[YEUN02]. D-Y. Yeung and C. Chow, “Parzen-Window Network Intrusion Detectors,” In ICPR ‘02:
Proceedings of the 16th International Conference on Pattern Recognition, August 11–15, Quebec City,
Canada, pp. 385–388, 2002.
10 A Multi-Partition, Multi-Chunk Ensemble for Classifying Concept-Drifting Data Streams
10.1 INTRODUCTION
While the challenges and prior work for stream data classification were discussed in Chapters 8 and
9, in this chapter, we describe our innovative technique for classifying concept-drifting data streams
using a novel ensemble classifier originally discussed in [MASU09]. It is a multi-partition, multi-chunk
(MPC) ensemble classification technique for concept-drifting data streams. Existing ensemble techniques
for classifying concept-drifting data streams follow a single-partition, single-chunk (SPC) approach, in
which a single data chunk is used to train each classifier. In our approach, we train a collection of v
classifiers from r consecutive data chunks using v-fold partitioning of the data, and build an ensemble of
such classifiers. By introducing this MPC ensemble technique, we significantly reduce classification error
compared to SPC ensemble approaches. We theoretically justify the usefulness of our algorithm and
empirically demonstrate its effectiveness over other state-of-the-art stream classification techniques on
synthetic data and real botnet traffic.
The organization of this chapter is as follows. Ensemble development will be discussed in Section
10.2. Our experiments are discussed in Section 10.3. This chapter is summarized in Section 10.4. We
have developed several variations of the technique presented in this chapter and some of them are
discussed in Chapters 11 through 13. We have also applied a variation of the techniques discussed in
Chapters 10 through 12 for insider threat detection and these techniques are discussed in Section III.
FIGURE 10.1 Illustration: how data chunks are used to build an ensemble with MPC. The r most recent labeled chunks are divided into v partitions D = {d1, d2, …, dv}; v new classifiers M′ = {M1′, …, Mv′} are trained, where the training data for Mi′ is D − di; the new ensemble consists of the best Lv classifiers in M ∪ M′, where M = {M1, M2, …, MLv} is the current ensemble.
We compute the expected error of each classifier M′j on its corresponding test data dj. Finally, on
line 10, we select the best Lv classifiers from the Lv + v classifiers in M′ ∪ M. Note that any subset of
the nth batch of v classifiers may be included in the new ensemble.
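To make this update step concrete, the following is a minimal Java sketch of one MPC ensemble update under stated assumptions: a hypothetical Classifier interface with train/error methods, instances represented as double[] with integer labels, and existing ensemble members re-scored on the newest labeled chunk. It is an illustrative sketch, not the implementation used in our experiments.

```java
import java.util.*;
import java.util.function.Supplier;

// Illustrative sketch of one MPC ensemble update (hypothetical API, not the book's code).
public class MpcEnsembleUpdate {

    public interface Classifier {
        void train(List<double[]> x, List<Integer> y);
        double error(List<double[]> x, List<Integer> y);   // misclassification rate on given data
    }

    /**
     * ensemble  : current ensemble M (up to L*v classifiers)
     * partsX/Y  : the last r labeled chunks, merged and split into v equal partitions d1..dv
     * newestX/Y : the newest labeled chunk, used here to score the existing classifiers
     * maker     : factory for a fresh base classifier (e.g., a decision tree)
     * Lv        : number of classifiers to retain (L * v)
     */
    public static List<Classifier> update(List<Classifier> ensemble,
                                          List<List<double[]>> partsX, List<List<Integer>> partsY,
                                          List<double[]> newestX, List<Integer> newestY,
                                          Supplier<Classifier> maker, int Lv) {
        Map<Classifier, Double> score = new HashMap<>();
        // Score the existing classifiers on the newest labeled chunk.
        for (Classifier m : ensemble) score.put(m, m.error(newestX, newestY));

        int v = partsX.size();
        for (int i = 0; i < v; i++) {
            // Train M'_i on D - d_i and score it on its held-out partition d_i.
            List<double[]> tx = new ArrayList<>();
            List<Integer> ty = new ArrayList<>();
            for (int j = 0; j < v; j++) {
                if (j == i) continue;
                tx.addAll(partsX.get(j));
                ty.addAll(partsY.get(j));
            }
            Classifier m = maker.get();
            m.train(tx, ty);
            score.put(m, m.error(partsX.get(i), partsY.get(i)));
        }
        // New ensemble = best Lv classifiers (lowest error) among M ∪ M'.
        List<Classifier> all = new ArrayList<>(score.keySet());
        all.sort(Comparator.comparingDouble(score::get));
        return new ArrayList<>(all.subList(0, Math.min(Lv, all.size())));
    }
}
```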
Given an instance x, the posterior probability distribution of class a is p(a|x). For a two-class
classification problem, a = + or −. According to Tumer and Ghosh [TUME96], a classifier is
trained to learn a function f^a(\cdot) that approximates this posterior probability:

f^a(x) = p(a|x) + \eta^a(x) \quad (10.1)

where \eta^a(x) is the error of f^a(\cdot) relative to p(a|x). This is the error in addition to the Bayes error and
is usually referred to as the “added error.” This error occurs due to the bias of the learning
algorithm, the variance of the learned model, or both. According to [TUME96], the expected added
error can be obtained from the following formula:

\mathrm{Error} = \frac{\sigma^2_{\eta^a(x)}}{s}

where \sigma^2_{\eta^a(x)} is the variance of \eta^a(x), and s is the difference between the derivatives of p(+|x) and
p(−|x), which is independent of the learned classifier.
Let C = {C1, …, CL} be an ensemble of L classifiers, where each classifier Ci is trained from
a single data chunk (i.e., C is an SPC ensemble). If we average the outputs of the classifiers in an
L-classifier ensemble, then according to [TUME96], the ensemble output would be

f_C^a(x) = \frac{1}{L}\sum_{i=1}^{L} f_{C_i}^a(x) = p(a|x) + \bar{\eta}_C^a(x) \quad (10.2)

where f_C^a(x) is the output of the ensemble C, f_{C_i}^a(x) is the output of the ith classifier C_i, and \bar{\eta}_C^a(x) is the
average error of all classifiers, given by

\bar{\eta}_C^a(x) = \frac{1}{L}\sum_{i=1}^{L} \eta_{C_i}^a(x) \quad (10.3)

where \eta_{C_i}^a(x) is the added error of the ith classifier in the ensemble. Assuming the error variances
are independent, the variance of \bar{\eta}_C^a(x) is given by

\sigma^2_{\bar{\eta}_C^a(x)} = \frac{1}{L^2}\sum_{i=1}^{L}\sigma^2_{\eta_{C_i}^a(x)} = \frac{1}{L}\,\bar{\sigma}^2_{\eta_C^a(x)} \quad (10.4)

where \sigma^2_{\eta_{C_i}^a(x)} is the variance of \eta_{C_i}^a(x), and \bar{\sigma}^2_{\eta_C^a(x)} is their common variance. In order to simplify the
notation, we denote \sigma^2_{\eta_{C_i}^a(x)} by \sigma^2_{C_i}.
Let M be an ensemble of Lv classifiers M_1, M_2, …, M_{Lv}, where each M_i is a classifier trained using r consecutive data chunks (i.e., the MPC approach). The following lemma shows that MPC reduces error over
SPC by a factor of rv when the errors of the classifiers in the ensemble are independent.
Lemma 10.1

Let \sigma_C^2 be the error variance of SPC. If there is no concept drift, and the errors of the classifiers
in the ensemble M are independent, then the error variance of MPC is 1/(rv) times that of SPC,
that is,

\sigma_M^2 = \frac{1}{rv}\,\sigma_C^2
Proof: Each classifier M_i ∈ M is trained on r consecutive data chunks. If there is no concept drift,
then a classifier trained on r consecutive data chunks reduces the error of a single classifier
trained on one data chunk by a factor of r [WANG03]. So, it follows that

\sigma_{M_i}^2 = \frac{1}{r^2}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2 \quad (10.5)

where \sigma_{M_i}^2 is the error variance of classifier M_i, trained using data chunks {D_i ∪ D_{i+1} ∪ … ∪ D_{i+r−1}},
and \sigma_{C_j}^2 is the error variance of C_j, trained using a single data chunk D_j. Combining Equations 10.4
and 10.5 and simplifying, we get
\begin{aligned}
\sigma_M^2 &= \frac{1}{L^2 v^2}\sum_{i=1}^{Lv}\sigma_{M_i}^2 &&\text{(using Equation 10.4)}\\
&= \frac{1}{L^2 v^2}\sum_{i=1}^{Lv}\frac{1}{r^2}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2 &&\text{(using Equation 10.5)}\\
&= \frac{1}{L^2 v^2 r^2}\sum_{i=1}^{Lv}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2 = \frac{1}{L^2 v^2 r}\sum_{i=1}^{Lv}\frac{1}{r}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2\\
&= \frac{1}{L^2 v^2 r}\sum_{i=1}^{Lv}\bar{\sigma}_{C_i}^2 &&(\bar{\sigma}_{C_i}^2 \text{ is the common variance of } \sigma_{C_j}^2,\ j = i, \ldots, i+r-1)\\
&= \frac{1}{Lrv}\,\frac{1}{Lv}\sum_{i=1}^{Lv}\bar{\sigma}_{C_i}^2 = \frac{1}{Lrv}\,\bar{\sigma}_C^2 &&(\bar{\sigma}_C^2 \text{ is the common variance of } \bar{\sigma}_{C_i}^2,\ i = 1, \ldots, Lv)\\
&= \frac{1}{rv}\,\frac{1}{L}\,\bar{\sigma}_C^2 = \frac{1}{rv}\,\sigma_C^2 &&\text{(using Equation 10.4)}
\end{aligned} \quad (10.6)
However, since we train v classifiers from each r consecutive data chunks, the independence
assumption given above may not be valid since each pair of these v classifiers has overlapping
training data. We need to consider correlation among the classifiers to compute the expected error
reduction. The following lemma shows the error reduction considering error correlation.
Lemma 10.2

Let \sigma_C^2 be the error variance of SPC. If there is no concept drift, then the error variance of MPC is
at most (v − 1)/(rv) times that of SPC, that is,

\sigma_M^2 \le \frac{v-1}{rv}\,\sigma_C^2, \quad v > 1
Proof: According to [TUME96], the error variance of the ensemble M, given some amount of
correlation among the classifiers, is

\sigma_M^2 = \frac{1 + \delta(Lv-1)}{Lv}\,\bar{\sigma}_M^2 \quad (10.7)

where δ is the mean pairwise error correlation,

\delta = \frac{1}{Lv(Lv-1)}\sum_{m=1}^{Lv}\sum_{\substack{l=1\\ l\neq m}}^{Lv}\operatorname{corr}(\eta_m,\eta_l) \quad (10.8)

and corr(\eta_m, \eta_l) is the correlation between the errors of classifiers M_m and M_l.
To simplify the computation of the error correlation between M_m and M_l, we assume that
corr(\eta_m, \eta_l) = 1 if they are trained with identical training data, and corr(\eta_m, \eta_l) = 0 if they are trained
with completely disjoint training data. Given this assumption, the correlation between M_m and M_l can
be computed as follows:

\operatorname{corr}(\eta_m,\eta_l) = \begin{cases}\dfrac{v-2}{v-1} & \text{if } \{M_m, M_l\} \in \mathcal{M}_i\\[4pt] 0 & \text{otherwise}\end{cases} \quad (10.9)
The first case of Equation 10.9 says that the error correlation between M_m and M_l is (v − 2)/(v − 1)
if they are in the same batch of classifiers \mathcal{M}_i. In this case, each pair of classifiers has v − 2
partitions of common training data, and each one has a total of v − 1 partitions of training data. In
the worst case, all v classifiers of the ith batch will remain in the ensemble M. This may not be the
case most of the time, because, according to the ensemble updating algorithm, it is possible that some
classifiers of the ith batch will be replaced while others remain in the ensemble. Therefore, in the worst
case, the ensemble is updated each time by replacing a whole batch of v classifiers with a new batch of
v classifiers. In this case, each classifier is correlated with v − 1 other classifiers. So, the mean
correlation becomes

\delta \le \frac{1}{(Lv)(Lv-1)}\,Lv(v-1)\,\frac{v-2}{v-1} = \frac{v-2}{Lv-1}
Hence,

\begin{aligned}
\sigma_M^2 &\le \frac{1 + \frac{v-2}{Lv-1}(Lv-1)}{Lv}\,\bar{\sigma}_M^2 &&\text{(using Equation 10.7)}\\
&= \frac{v-1}{Lv}\,\bar{\sigma}_M^2 = \frac{v-1}{Lv}\,\frac{1}{Lv}\sum_{i=1}^{Lv}\sigma_{M_i}^2 = \frac{v-1}{Lv}\,\frac{1}{Lv}\sum_{i=1}^{Lv}\frac{1}{r^2}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2 &&\text{(using Equation 10.5)}\\
&= \frac{v-1}{Lv}\,\frac{1}{Lv}\sum_{i=1}^{Lv}\frac{1}{r}\,\frac{1}{r}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2 = \frac{v-1}{Lv}\,\frac{1}{Lv}\sum_{i=1}^{Lv}\frac{1}{r}\,\bar{\sigma}_{C_i}^2\\
&= \frac{v-1}{Lv}\,\frac{1}{r}\,\frac{1}{Lv}\sum_{i=1}^{Lv}\bar{\sigma}_{C_i}^2 = \frac{v-1}{Lv}\,\frac{1}{r}\,\bar{\sigma}_C^2 = \frac{v-1}{rv}\,\frac{1}{L}\,\bar{\sigma}_C^2 = \frac{v-1}{rv}\,\sigma_C^2 &&\text{(using Equation 10.4)}
\end{aligned}
Definition 10.1

The magnitude of drift, ρ_d, is the maximum error introduced to a classifier due to concept drift. That
is, every time a new data chunk appears, the error variance of a classifier is increased by a factor of
(1 + ρ_d) due to concept drift.

Thus, if classifier C_j is trained from data chunk D_j, then by the time the last data chunk in the window,
D_{i+r−1}, appears, its error variance has become

\hat{\sigma}_{C_j}^2 = (1+\rho_d)^{(i+r-1)-j}\,\sigma_{C_j}^2 \quad (10.10)

In other words, \hat{\sigma}_{C_j}^2 is the actual error variance of the jth classifier C_j in the presence of concept drift
when the last data chunk in the window, D_{i+r−1}, appears. Our next lemma deals with error reduction
in the presence of concept drift.
Lemma 10.3

Let \hat{\sigma}_M^2 be the error variance of MPC in the presence of concept drift, \sigma_C^2 be the error variance of
SPC, and ρ_d be the drift magnitude given by Definition 10.1. Then \hat{\sigma}_M^2 is bounded by

\hat{\sigma}_M^2 \le \frac{(v-1)(1+\rho_d)^{r-1}}{rv}\,\sigma_C^2
Proof: Replacing \sigma_{C_j}^2 with \hat{\sigma}_{C_j}^2 in Equation 10.6 and following Lemma 10.2, we get

\begin{aligned}
\hat{\sigma}_M^2 &\le \frac{v-1}{L^2 v^2}\sum_{i=1}^{Lv}\frac{1}{r^2}\sum_{j=i}^{r+i-1}\hat{\sigma}_{C_j}^2\\
&= \frac{v-1}{L^2 r^2 v^2}\sum_{i=1}^{Lv}\sum_{j=i}^{r+i-1}(1+\rho_d)^{(i+r-1)-j}\,\sigma_{C_j}^2 &&\text{(using Equation 10.10)}\\
&\le \frac{v-1}{L^2 r^2 v^2}\sum_{i=1}^{Lv}(1+\rho_d)^{r-1}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2\\
&= \frac{(v-1)(1+\rho_d)^{r-1}}{L^2 r^2 v^2}\sum_{i=1}^{Lv}\sum_{j=i}^{r+i-1}\sigma_{C_j}^2\\
&= \frac{(v-1)(1+\rho_d)^{r-1}}{Lrv}\,\bar{\sigma}_C^2\\
&= \frac{(v-1)(1+\rho_d)^{r-1}}{rv}\,\sigma_C^2, \quad r > 0
\end{aligned}
FIGURE 10.2 Relative error (E_R) of MPC with respect to SPC for increasing values of r (a) and v (b).

Thus, MPC yields a lower error than SPC as long as

\frac{(v-1)(1+\rho_d)^{r-1}}{rv} \le 1 \quad \text{or,} \quad E_R \le 1 \quad (10.11)
where ER is the ratio of MPC error to SPC error in the presence of concept drift. As we increase
r and v, the relative error keeps decreasing up to a certain point. After that, it becomes flat or starts
increasing. Next, we analyze the effect of parameters r and v on error reduction, in the presence of
concept drift.
For a given value of v, r can only be increased up to a certain value. After that, increasing r actu-
ally hurts the performance of our algorithm, because inequality (10.11) is violated. Figure 10.2a
shows the relative error ER for v = 2, and different values of ρd, for increasing r. It is clear from
the graph that for lower values of ρd, increasing r reduces the relative error by a greater margin.
However, in any case, after a certain value of r, ER becomes greater than 1. Although it may not be possible to
know the actual value of ρd from the data, we can determine the optimal value of r experimentally.
In our experiments, we found that for smaller chunk sizes, higher values of r work better, and vice
versa. However, the best performance-cost trade-off is found for r = 2 or 3. We have used r = 2
in our experiments. Figure 10.2b shows the relative error ER for r = 2, ρd = 0.3, and three cases
of correlation (no correlation, a classifier is correlated with one other classifier on average, and a
classifier is correlated with two other classifiers on average) for increasing v. We see that in all three
cases, relative error keeps decreasing as we increase v. This is true for any value of ρd. However,
after a certain value of v, the rate of improvement gradually diminishes. From our experiments, we
obtained the best performance-cost trade-off for v = 5.
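As an illustration of this trade-off, the following sketch evaluates the relative-error bound E_R = (v − 1)(1 + ρ_d)^{r−1}/(rv) from Equation 10.11 over a grid of r and v values. The drift magnitude ρ_d is an assumed input here, since it is not known from the data; the class name is illustrative.

```java
// Sketch: evaluate the relative-error bound E_R = (v-1)(1+rho_d)^(r-1) / (r*v)
// from Equation 10.11 for a grid of r and v values (rho_d is an assumed input).
public class RelativeErrorBound {

    static double relativeError(int r, int v, double rhoD) {
        return (v - 1) * Math.pow(1.0 + rhoD, r - 1) / (r * (double) v);
    }

    public static void main(String[] args) {
        double rhoD = 0.3;  // assumed drift magnitude
        System.out.println("r, v, E_R");
        for (int r = 1; r <= 6; r++) {
            for (int v = 2; v <= 8; v++) {
                System.out.printf("%d, %d, %.3f%n", r, v, relativeError(r, v, rhoD));
            }
        }
        // MPC is expected to beat SPC only where E_R < 1; with rho_d = 0.3 the best
        // performance-cost trade-off in our experiments is around r = 2 and v = 5.
    }
}
```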
10.3 EXPERIMENTS
We evaluate our proposed method on both synthetic data and botnet traffic generated in a controlled
environment, and compare with several baseline methods.
FIGURE 10.3 Error versus L and chunk size on synthetic data (a, b) and botnet data (c, d), for MPC, Wang, BestL, All, and Last.
Figure 10.3a shows the error rates for different values of L for each method, averaged over four
different chunk sizes on synthetic data, and Figure 10.3c shows the same for botnet data. Here,
decision tree is used as the base learner. It is evident that MPC has the lowest error among all
approaches. Besides, we observe that the error of MPC is lower for higher values of L. This is
expected, because a higher value of L means a larger ensemble and more error reduction. However,
accuracy does not improve much after L = 8. Wang and BestL show similar characteristics. All
and Last do not depend on L, so their error remains the same for any L. Figure 10.3b shows the error
rates for four different chunk sizes for each method (also using decision tree), averaged over different
values of L (2, 4, 6, 8) on synthetic data, and Figure 10.3d shows the same for botnet data. Again,
MPC has the lowest error of all. Besides, the error of MPC is lower for larger chunk sizes. This is
expected, because a larger chunk size means more training data for a classifier.
Tables 10.1 and 10.2 report the error of decision tree and Ripper learning algorithms, respec-
tively, on synthetic data, for different values of L and chunk sizes. The columns denoted by M2, W2,
and B2 represent MPC, Wang, and BestL, respectively, for L = 2. Other columns have similar inter-
pretations. In all the tables, we see that MPC has the lowest error for all values of L (shown in bold).
Figure 10.4 shows the sensitivity of r and v on error and running times on synthetic data for
MPC. Figure 10.4a shows the errors for different values of r for a fixed value of v (=5) and L (=8).
The highest reduction in error occurs when r is increased from 1 to 2. Note that r = 1 means single
chunk training. We observe no significant reduction in error for higher values of r, which follows
TABLE 10.1
Error of Different Approaches on Synthetic Data Using Decision Tree
S M2 W2 B2 M4 W4 B4 M6 W6 B6 M8 W8 B8 All Last
250 19.3 26.8 26.9 17.3 26.5 22.1 16.6 26.3 20.4 16.2 26.1 19.5 29.2 26.8
500 11.4 14.8 14.7 10.6 13.2 12.4 10.3 12.7 11.6 10.2 12.4 11.3 11.3 14.7
750 11.1 13.9 13.9 10.6 12.1 11.9 10.3 11.5 11.4 10.3 11.3 11.2 15.8 13.8
1000 11.4 14.3 14.3 10.7 12.8 12.2 10.5 12.2 11.7 10.3 11.9 11.4 12.6 14.1
TABLE 10.2
Error of Different Approaches on Synthetic Data Using Ripper
S M2 W2 B2 M4 W4 B4 M6 W6 B6 M8 W8 B8 All Last
250 19.2 26.5 26.0 17.6 26.2 22.4 17.1 26.0 21.3 16.8 25.9 20.9 30.4 26.3
500 11.5 14.2 13.9 10.8 13.0 12.3 10.6 12.6 11.8 10.5 12.5 11.5 11.6 14.1
750 11.0 13.4 13.3 10.6 12.1 12.0 10.5 11.7 11.6 10.5 11.5 11.5 15.7 13.3
1000 11.1 13.8 13.7 10.6 12.5 12.3 10.3 12.1 11.9 10.2 11.9 11.8 12.6 13.6
from our analysis of parameter r on concept-drifting data in Section 10.1.3. However, the running
time keeps increasing, as shown in Figure 10.4c. The best trade-off between running time and error
occurs for r = 2. The charts in Figure 10.4b, d show a similar trend for parameter v. Note that v = 1
is the base case, that is, the single partition ensemble approach, and v > 1 is the multiple partition
ensemble approach. We observe no real improvement after v = 5, although the running time keeps
increasing. This result is also consistent with our analysis of the upper bounds of v, explained in
Section 10.1.3. We choose v = 5 as the best trade-off between time and error.
Figure 10.5a shows the total running times of different methods on synthetic data for L = 8,
v = 5 and r = 2. Note that the running time of MPC is within five times of that of Wang. This
also supports our complexity analysis that the running time of MPC would be at most rv times the
running time of Wang. The running times of MPC on botnet data shown in Figure 10.5b also have
similar characteristics. The running times shown in Figure 10.5 include both training and testing
time. Although the total training time of MPC is higher than that of Wang, the total testing times
are almost the same for both techniques. Considering that training can be done offline, we may
conclude that both techniques have essentially the same runtime performance in classifying data streams.
Besides, users have the flexibility to choose either better performance or shorter training time just
by changing the parameters r and v.
FIGURE 10.4 Sensitivity of parameters r and v on error (a, b) and running time (c, d).
FIGURE 10.5 Chunk size versus running times on (a) synthetic data and (b) real data.
TABLE 10.3
Error Comparison with the Same Number of Classifiers
in the Ensemble
Chunk Size M2(J48) W10(J48) M2(Ripper) W10(Ripper)
250 19.9 26.1 21.0 26.1
500 11.7 12.5 12.2 12.6
1000 11.4 12.5 11.8 13.0
We also report the results of using an equal number of classifiers in MPC and Wang by setting
L = 10 in Wang, and L = 2, v = 5, and r = 1 in MPC, which is shown in Table 10.3. We observe
that the error of MPC is lower than that of Wang for all chunk sizes. The columns M2(J48) and W10(J48)
show the error of MPC (L = 2, v = 5, r = 1) and Wang (L = 10), respectively, for the decision tree
algorithm. The columns M2(Ripper) and W10(Ripper) show the same for the Ripper algorithm. For
example, for chunk size 250 and the decision tree algorithm, the MPC error is 19.9%, whereas the Wang
error is 26.1%. We can draw two important conclusions from this result. First, if the ensemble
size of Wang is simply increased v times (i.e., made equal to Lv), its error does not become as low
as that of MPC. Second, even if we use the same training set size in both methods (i.e., r = 1), the
error of Wang still remains higher than that of MPC. There are two possible reasons behind this
performance. First, when a classifier is removed during ensemble updating in Wang, all information
obtained from the corresponding chunk is forgotten, but in MPC, one or more classifiers from a
chunk may survive. Thus, the ensemble updating approach in MPC tends to retain more information
than that of Wang, leading to a better ensemble. Second, Wang requires at least Lv data chunks,
whereas MPC requires at least L + r − 1 data chunks to obtain Lv classifiers. Thus, Wang tends to
keep much older classifiers in the ensemble than MPC, and these outdated classifiers can have a
negative effect on the ensemble outcome.
In the future, we would also like to apply our technique on the classification and model evolution
of other real streaming data. As we stated earlier, several of our stream analytics techniques are based
on the approach discussed in this chapter. These techniques are presented in Chapters 11 and 12.
In addition, the applications of our techniques are discussed in Section III.
REFERENCES
[BARF06]. P. Barford and V. Yegneswaran, An Inside Look at Botnets. Advances in Information Security.
Springer, New York, 2006.
[GAO07]. J. Gao, W. Fan, J. Han, “On Appropriate Assumptions to Mine Data Streams,” ICDM ’07: Proceedings
of the 2007 International Conference on Data Mining, Oct. 28–31, Omaha, NE, pp. 143–152, IEEE
Computer Society, 2007.
[GRIZ07]. J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, D. Dagon, “Peer-to-Peer Botnets: Overview and
Case Study,” HotBots ’07: Proceedings of the 1st Workshop on Hot Topics in Understanding Botnets,
April 10, Cambridge, MA, pp. 1, 2007.
[LURH04]. LURHQ Threat Intelligence Group, Sinit p2p trojan analysis. 2004. http://www.lurhq.com/sinit.
html.
[LEMO06]. R. Lemos, Bot software looks to improve peerage. 2006, http://www.securityfocus.com/
news/11390.
[MASU08]. M. M. Masud, T. Al-khateeb, L. Khan, B. M. Thuraisingham, K. W. Hamlen, “Flow-Based
Identification of Botnet Traffic by Mining Multiple Log Files,” DFMA ’08: Proceedings of the 2008
International Conference on Distributed Frameworks and Applications, Oct. 21–22, Penang, Malaysia,
pp. 200–206, 2008.
[MASU09]. M. M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “A Multi-Partition Multi-Chunk
Ensemble Technique to Classify Concept-Drifting Data Streams,” PAKDD09: Proceedings of the 13th
Pacific–Asia Conference on Knowledge Discovery and Data Mining, Apr. 27–30, Bangkok, Thailand,
pp. 363–375, Springer-Verlag, 2009.
[TUME96]. K. Tumer and J. Ghosh, “Error Correlation and Error Reduction in Ensemble Classifiers,” Connection
Science, 8(3–4), 385–403, 1996.
[WANG03]. H. Wang, W. Fan, P. S. Yu, J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” KDD ’03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Aug. 24–27, Washington, DC, pp. 226–235, ACM, 2003.
11 Classification and Novel Class Detection in Concept-Drifting Data Streams
11.1 INTRODUCTION
As discussed in [MASU09], in a typical data stream classification task, it is assumed that the
total number of classes is fixed. This assumption may not be valid in a real streaming environment,
where new classes may evolve. Traditional data stream classification techniques are not
capable of recognizing novel class instances until the appearance of the novel class is manually
identified and labeled instances of that class are presented to the learning algorithm for training.
The problem becomes more challenging in the presence of concept drift, when the underlying
data distribution changes over time. We propose a novel and efficient technique that can
automatically detect the emergence of a novel class in the presence of concept drift by quantifying
cohesion among unlabeled test instances and separation of the test instances from the training
instances.
Our approach is nonparametric, meaning that it does not assume any underlying distribution of the
data. A comparison with state-of-the-art stream classification techniques demonstrates the
superiority of our approach. In this chapter, we discuss our proposed framework for classifying
data streams with an automatic novel class detection mechanism. It is based on our previous work
[MASU09].
The organization of this chapter is as follows. Our ECSMiner algorithm is discussed in Section
11.2. Classification with novel class detection is discussed in Section 11.3. Experiments are
discussed in Section 11.4. This chapter is summarized in Section 11.5.
11.2 ECSMINER
11.2.1 Overview
ECSMiner (pronounced like ExMiner) stands for Enhanced Classifier for Data Streams with novel
class Miner. Before describing ECSMiner, we mathematically formulate the data stream
classification problem (Figure 11.1).
• The data stream is a continuous sequence of data points: {x1, …,xnow}, where each xi is a
d-dimensional feature vector, x1 is the very first data point in the stream, and xnow is the
latest data point that has just arrived.
• Each data point xi is associated with two attributes: yi, and ti, being its class label and time
of arrival, respectively.
• For simplicity, we assume that ti+1 = ti + 1 and t1 = 1.
• The latest Tl instances in the stream: {xnow−Tl +1, …, xnow} are unlabeled, meaning their cor-
responding class labels are unknown. But the class labels of all other data points are known.
• We are to predict the class label of xnow before the time tnow + Tc, that is, before the data
point xnow+Tc arrives, and Tc < Tl.
FIGURE 11.1 Illustrating the mathematical formulation of the data stream classification problem under time constraints.
1: M ← Build-initial-ensemble()
2: buf ← empty //temporary buffer
3: U ← empty //unlabeled data buffer
4: L ← empty //labeled data buffer (training data)
5: while true do
6:   xj ← the latest data point in the stream
7:   Classify(M, xj, buf) //(Algorithm 11.2, Section 11.2)
8:   U ⇐ xj //enqueue
9:   if |U| > Tl then //time to label the oldest instance
10:    xk ⇐ U //dequeue the instance
11:    L ⇐ ⟨xk, yk⟩ //label it and save in training buffer
12:    if |L| = S then //training buffer is full
13:      M′ ← Train-and-save-decision-boundary(L) (Section 11.1.5)
14:      M ← Update(M, M′, L)
15:      L ← empty
16:    end if
17:  end if
18: end while
Algorithm 11.1 outlines the top level overview of our approach. The algorithm starts with building
the initial ensemble of models M = {M1, …, ML} with the first L labeled data chunks. The algorithm
maintains three buffers: buffer buf keeps potential novel class instances, buffer U keeps unlabeled
data points until they are labeled, and buffer L keeps labeled instances until they are used to train a
new classifier. After initialization, the while loop begins from line 5, which continues indefinitely. At
each iteration of the loop, the latest data point in the stream, xj is classified (line 7) using Classify()
(Algorithm 11.2). The novel class detection mechanism is implemented inside Algorithm 11.2. If the
class label of xj cannot be predicted immediately, it is stored in buf for future processing. Details of
FIGURE 11.2 Overview of the approach: the latest data point in the stream is tested against the current ensemble; outliers are buffered for novel class detection, non-outliers are classified immediately, and the ensemble is updated with each newly labeled chunk.
this step will be discussed in Section 11.3. xj is then pushed into the unlabeled data buffer U (line 8).
If the buffer size exceeds Tl, the oldest element xk is dequeued and labeled (lines 9 and 10), since Tl units of
time have elapsed since xk arrived in the stream (so it is time to label xk). The pair ⟨xk, yk⟩ is pushed
into the labeled data buffer L (line 11). When we have S instances in L, where S is the chunk size, a new
classifier M′ is trained using the chunk (line 13). Then the existing ensemble is updated (line 14) by
choosing the best L classifiers from the L + 1 classifiers in M ∪ {M′} based on their accuracies on L, and
the buffer L is emptied to receive the next chunk of training data (line 15).
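The following Java sketch mirrors the control flow of Algorithm 11.1 under assumed helper types (Instance, Ensemble, Model); Classify, Train-and-save-decision-boundary, and Update are stubs standing in for the operations described in this chapter, not the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the top-level ECSMiner loop (Algorithm 11.1) with assumed helper types.
public class EcsMinerLoop {

    record Instance(double[] x, int y) {}             // y is known only after labeling

    interface Ensemble {
        void classify(Instance xj, Deque<Instance> buf);     // Algorithm 11.2 (stub)
        void update(Model mPrime, List<Instance> labeled);    // keep the best L classifiers
    }
    interface Model {}

    static void run(Iterable<Instance> stream, Ensemble M, int Tl, int S) {
        Deque<Instance> buf = new ArrayDeque<>();     // potential novel class instances
        Deque<Instance> U   = new ArrayDeque<>();     // unlabeled data buffer
        List<Instance> L    = new ArrayList<>();      // labeled training buffer

        for (Instance xj : stream) {
            M.classify(xj, buf);                      // may enqueue xj into buf as an F outlier
            U.addLast(xj);                            // enqueue
            if (U.size() > Tl) {                      // time to label the oldest instance
                Instance xk = U.removeFirst();        // dequeue; its true label is now available
                L.add(xk);                            // save <xk, yk> in the training buffer
                if (L.size() == S) {                  // training buffer is full
                    Model mPrime = trainAndSaveDecisionBoundary(L);
                    M.update(mPrime, L);
                    L.clear();
                }
            }
        }
    }

    static Model trainAndSaveDecisionBoundary(List<Instance> chunk) {
        return new Model() {};                        // placeholder for base-learner training
    }
}
```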
Figure 11.2 illustrates the overview of our approach. A classification model is trained from the
last labeled data chunk. This model is used to update the existing ensemble. The latest data point
in the stream is tested by the ensemble. If it is found to be an outlier, it is temporarily stored in a
buffer. Otherwise, it is classified immediately using the current ensemble. The temporary buffer is
processed periodically to detect whether the instances in the buffer belong to a novel class.
As noted earlier, we refer to our algorithm as “ECSMiner” (pronounced like ExMiner), which
stands for Enhanced Classifier for Data Streams with novel class Miner. We believe that any base
learner can be enhanced with the proposed novel class detector and used in ECSMiner. The only
operation that needs to be treated specially for a particular base learner is Train-and-save-decision-
boundary. We illustrate this operation for two base learners in this section.
The λ_{c,q}-neighborhood, or λ_{c,q}(x), of any instance x is the set of q nearest neighbors of x within class c.
For example, let there be three classes c+, c−, and c0, denoted by the symbols “+,” “−,” and
black dots, respectively (Figure 11.3). Also, let q = 5; then λ_{c+,q}(x) of any arbitrary instance x is
the set of 5 nearest neighbors of x in class c+, and so on.
FIGURE 11.3 The λ_{c,q}-neighborhoods of an instance x for q = 5.

The mean distance from an instance x to its λ_{c,q}-neighborhood, denoted D_{c,q}(x), is defined as

D_{c,q}(x) = \frac{1}{q}\sum_{x_i\in\lambda_{c,q}(x)} D(x, x_i) \quad (11.1)
where D(xi,xj) is the distance between the data points xi and xj in some appropriate metric.
Let c_min be the class label such that D_{c_min,q}(x) is the minimum among all D_{c,q}(x), that is, λ_{c_min,q}(x)
is the nearest λ_{c,q}(x) neighborhood (or q-nearest neighborhood, q-NH) of x. For example, in
Figure 11.3, c_min = c0, that is, λ_{c0,q}(x) is the q-NH of x.
Let cmin be the class label of the instances in q-NH of x. According to the q-NH rule, the predicted
class label of x is cmin.
In the example of Figure 11.3, cmin = c0, therefore, the predicted class label of x is c0. Our novel
class detection technique is based on the assumption that any class of data follows the q-NH rule.
In this section, we discuss the similarity of this rule with k-NN rule and highlight its significance.
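As a concrete illustration, the following is a minimal Java sketch of classification by the q-NH rule, assuming numeric feature vectors and Euclidean distance; the class and method names are illustrative, not taken from the book's implementation.

```java
import java.util.*;

// Sketch of the q-NH rule: predict the class whose q-nearest-neighborhood is closest on average.
public class QNeighborhoodRule {

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // D_{c,q}(x): mean distance from x to its q nearest neighbors within class c (Equation 11.1).
    // Assumes the class has at least one instance.
    static double meanNeighborhoodDistance(double[] x, List<double[]> classInstances, int q) {
        List<Double> dists = new ArrayList<>();
        for (double[] p : classInstances) dists.add(euclidean(x, p));
        Collections.sort(dists);
        int k = Math.min(q, dists.size());
        double sum = 0;
        for (int i = 0; i < k; i++) sum += dists.get(i);
        return sum / k;
    }

    // Returns c_min, the class label with the smallest D_{c,q}(x).
    static int predict(double[] x, Map<Integer, List<double[]>> instancesByClass, int q) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, List<double[]>> e : instancesByClass.entrySet()) {
            double d = meanNeighborhoodDistance(x, e.getValue(), q);
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }
}
```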
Let M be the current ensemble of classification models. A class c is an existing class if at least one
of the models Mi ∈ M has been trained with the instances of class c. Otherwise, c is a novel class.
Therefore, if a novel class c appears in the stream, none of the classification models in the ensem-
ble will be able to correctly classify the instances of c. An important property of the novel class
follows from the q-NH rule.
Property 1 Let x be an instance belonging to a novel class c, and let c′ be an existing class. Then,
according to the q-NH rule, D_{c,q}(x), the average distance from x to λ_{c,q}(x), is less than D_{c′,q}(x), the
average distance from x to λ_{c′,q}(x), for any existing class c′. In other words, x is closer to the
neighborhood of its own class (cohesion) and farther from the neighborhood of any existing class (separation).
Figure 11.4 shows a hypothetical example of a decision tree and the appearance of a novel class.
A decision tree and its corresponding feature vector partitioning by its leaf nodes are shown in the
FIGURE 11.4 A decision tree, the corresponding partitioning of the feature space by its leaf nodes (regions A, B, C, and D), and the appearance of a novel class (shown with “x” symbols).
figure. The shaded portions of the feature space represent the training data. After the decision tree
is built, a novel class appears in the stream (shown with the “x” symbol). The decision tree model
misclassifies all the instances in the novel class as existing class instances, since the model is unaware
of the novel class. Our goal is to detect the novel class without having to train the model with that
class. Note that instances in the novel class follow Property 1, since the novel-class neighborhood of
any novel-class instance is much closer to the instance than the neighborhoods of any other classes.
If we observe this property in a collection of unlabeled test instances, we can detect the novel class.
This is not a trivial task, since we must decide when to classify an instance immediately and when
to postpone the classification decision and wait for more test instances, so that Property 1 can be
revealed among those instances. This is because, in order to discover Property 1 (cohesion), we need
to examine a collection of test instances simultaneously. Besides, we cannot defer the decision more
than Tc time units after the arrival of a test instance.
Therefore, the main challenges in novel class detection are deciding when to classify an instance
immediately and when to defer its classification, and detecting the novel class within the Tc time constraint.
11.2.5 Base Learners
We apply our technique on two different classifiers: decision tree, and k-nearest neighbor (k-NN).
When decision tree is used as a classifier, each training data chunk is used to build a decision tree.
When k-NN is used, each chunk is used to build a k-NN classification model. The simplest way to
build such a model is to just store all the data points of the training chunk in memory. But this strat-
egy would lead to an inefficient classification model, both in terms of memory and running time.
In order to make the model more efficient, we build K clusters with the training data [MASU08].
Note that we use small k as the parameter for k-NN and capital K to denote the number of clusters.
We apply a semisupervised clustering technique using expectation maximization (E-M) that tries to
minimize both intracluster dispersion (the same objective as unsupervised K-means) and cluster
impurity:

\sum_{i=1}^{K}\sum_{x\in X_i}\lVert x-\mu_i\rVert^2 \;+\; \sum_{i=1}^{K}\mathit{Imp}_i \quad (11.2)

where X_i is the set of data points in the ith cluster, \mu_i is its centroid, and \mathit{Imp}_i is the impurity of the
ith cluster.
The first term in Equation 11.2 is the same as in unsupervised K-means, which penalizes intracluster
dispersion. The second term penalizes cluster impurity. A cluster is considered pure if all data points
in the cluster come from the same class. We use entropy and the Gini index as impurity measures. After
building the clusters, we save the cluster summary of each cluster (centroid and frequencies of data
points belonging to each class) in a data structure called a “microcluster,” and discard the raw data points.
Since we store and use only K microclusters, both the time and memory requirements become
functions of K (a constant number). A test instance xj is classified as follows: we find the microcluster whose
centroid is nearest to xj and assign it the class label that has the highest frequency in that microcluster.
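The following is a minimal sketch of this nearest-microcluster classification, assuming each microcluster stores only a centroid and per-class frequency counts; the types and names are illustrative, not the Weka-based implementation used in our experiments.

```java
import java.util.*;

// Sketch of nearest-microcluster classification: each microcluster keeps a centroid
// and the frequency of each class among the raw points it summarized.
public class MicroClusterModel {

    static class MicroCluster {
        final double[] centroid;
        final Map<Integer, Integer> classCounts;   // class label -> frequency
        MicroCluster(double[] centroid, Map<Integer, Integer> classCounts) {
            this.centroid = centroid;
            this.classCounts = classCounts;
        }
        int majorityClass() {
            return Collections.max(classCounts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }

    private final List<MicroCluster> clusters;     // the K microclusters kept in memory
    MicroClusterModel(List<MicroCluster> clusters) { this.clusters = clusters; }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Classify a test instance by the most frequent class of the nearest microcluster centroid.
    int classify(double[] x) {
        MicroCluster nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (MicroCluster mc : clusters) {
            double d = dist(x, mc.centroid);
            if (d < best) { best = d; nearest = mc; }
        }
        return nearest.majorityClass();
    }
}
```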
FIGURE 11.5 Creating the decision boundary for the decision tree of Figure 11.4.
For each cluster X, we compute and store a summary (henceforth called a pseudopoint) consisting of the
following statistics:

• Weight, w: the total number of data points in the cluster, that is, w = |X|.
• Centroid, µ:

\mu = \frac{\sum_{x\in X} x}{w}

• Radius, ℛ: the distance between the centroid and the farthest data point in the cluster, that is,

\mathcal{R} = \max_{x\in X}\bigl(dist(\mu, x)\bigr)

where dist(x, y) is the distance between two data points x and y in some appropriate metric.
• Mean distance, µ_d: the mean distance from each point to the cluster centroid, that is,

\mu_d = \frac{\sum_{x\in X} dist(\mu, x)}{w}

So, w(h) denotes the “weight” value of a pseudopoint h, and so on. After computing the cluster
summaries, the raw data are discarded and only the pseudopoints are stored in memory.
Any pseudopoint having too few (less than three) instances is considered noise and is also
discarded. Thus, the memory requirement for storing the training data becomes constant, that
is, O(K).
Each pseudopoint h corresponds to a hypersphere in the feature space with center µ(h) and
radius ℛ(h). Let us denote the portion of the feature space covered by a pseudopoint h as the “region”
of h, or RE(h). Then RE(M_i) denotes the union of the regions of all pseudopoints h in the
classifier M_i, that is,

RE(M_i) = \bigcup_{h\in M_i} RE(h)

RE(M_i) forms a decision boundary for the training data of classifier M_i.
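A minimal sketch of this decision boundary test follows, under stated assumptions: a pseudopoint keeps a centroid and radius, an instance lies inside RE(h) if its Euclidean distance to the centroid is at most the radius, and an instance is an F outlier if it lies outside the regions of all pseudopoints of every classifier in the ensemble. The types are illustrative.

```java
import java.util.List;

// Sketch of the pseudopoint decision boundary and the F-outlier test.
public class DecisionBoundary {

    // Summary of one cluster: weight, centroid, and radius (the hypersphere RE(h)).
    record Pseudopoint(int weight, double[] centroid, double radius) {
        boolean covers(double[] x) {
            double s = 0;
            for (int i = 0; i < x.length; i++) { double d = x[i] - centroid[i]; s += d * d; }
            return Math.sqrt(s) <= radius;
        }
    }

    // A classifier's decision boundary RE(M_i) is the union of its pseudopoint regions.
    record ClassifierBoundary(List<Pseudopoint> pseudopoints) {
        boolean insideRegion(double[] x) {
            return pseudopoints.stream().anyMatch(h -> h.covers(x));
        }
    }

    // x is an F outlier if it is outside the decision boundary of every classifier in the ensemble.
    static boolean isFOutlier(double[] x, List<ClassifierBoundary> ensemble) {
        return ensemble.stream().noneMatch(m -> m.insideRegion(x));
    }
}
```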
5: end if
6: Filter(buf)
7: if fout = true then
8: buf ⇐ xj //enqueue
9: if buf.length > q and last trial + q ≤ ti then
10: last trial ← ti
11: novel ← DetectNovelClass(M,buf) // (Algorithm 11.3, Section 11.2.2)
12: if novel = true then remove novel (buf)
13: end if
14: end if
11.3.2 Classification
In line 2 of Algorithm 11.2, we first check whether the test instance xj is an F outlier, which is
defined shortly. If a test instance xj falls outside the decision boundary RE(M_i) of a classifier
M_i, then xj is an outlier for M_i. If xj is a novel class instance, it must be an outlier, as will be
justified shortly. However, xj may also appear to be an outlier for other reasons: noise, concept drift,
or insufficient training data for M_i. Therefore, we apply filtering so that most of the outliers that
appear for any reason other than being a novel class instance are filtered out. The outliers that pass
the filtering are called F outliers.
A test instance is an F outlier (i.e., filtered outlier) if it is outside the decision boundary of all
classifiers M_i ∈ M.
Intuitively, all novel class instances should be F outliers. This is because, if a test instance xj is not
an F outlier, then it must be inside the decision boundary of some classifier M_i. Therefore, it must
be inside RE(h′) of some pseudopoint h′. This implies that xj is closer to the centroid of h′ than at
least one training instance in h′ (the one at the farthest distance from the centroid of h′), which leads
to the conclusion that xj is most likely an existing class instance having the same class label as the
instances in h′. So, if xj is not an F outlier, we classify it immediately using ensemble voting
(line 3).
An instance is removed from buf by the Filter() operation (line 6) if any of the following conditions
holds:

1. Age > S: the front of buf contains the oldest element in buf. It is removed if its age is greater than S,
the chunk size. Therefore, at any moment in time, there can be at most S instances in buf.
2. Ensemble update: the ensemble may be updated while an instance xk is waiting inside buf.
As a result, xk may no longer be an F outlier for the new ensemble of models, and it must
be removed if so. If xk is no longer an F outlier and is not removed, it could be falsely
identified as a novel class instance, and it could also interfere with other valid novel class
instances, misleading the detection process.
3. Existing class: any instance is removed from buf if it has been labeled and it belongs to one
of the existing classes. If it is not removed, it will also mislead novel class detection.
When an instance is removed from buf, it is classified immediately using the current ensemble
(if not classified already).
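These removal conditions can be sketched as follows, with an assumed BufEntry type carrying the buffered instance, its arrival time, and (once available) its label; the Ensemble methods stand in for the operations described above and are illustrative, not the actual implementation.

```java
import java.util.Deque;

// Sketch of the Filter(buf) step: purge buffered F outliers that should no longer
// participate in novel class detection (assumed BufEntry/Ensemble types).
public class BufferFilter {

    static class BufEntry {
        double[] x;
        long arrivalTime;
        Integer label;          // null until the true label becomes available
        boolean classified;
    }

    interface Ensemble {
        boolean isFOutlier(double[] x);          // re-check against the (possibly updated) ensemble
        boolean isExistingClass(int label);      // label belongs to a class some model was trained on
        int classify(double[] x);
    }

    static void filter(Deque<BufEntry> buf, Ensemble M, long now, int chunkSize) {
        buf.removeIf(e -> {
            boolean tooOld     = (now - e.arrivalTime) > chunkSize;             // condition 1: age > S
            boolean notOutlier = !M.isFOutlier(e.x);                            // condition 2: ensemble updated
            boolean existing   = e.label != null && M.isExistingClass(e.label); // condition 3: labeled as existing
            boolean remove = tooOld || notOutlier || existing;
            if (remove && !e.classified) {       // classify immediately on removal, if not done already
                M.classify(e.x);
                e.classified = true;
            }
            return remove;
        });
    }
}
```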
Lines 7–14 are executed only if xj is an F outlier. At first, xj is enqueued into buf (line 8). Then we
check whether buf.length, that is, the size of buf is at least q, and the last check on buf for detecting
novel class had been executed (i.e., last trial) at least q time units earlier (line 9). Since novel class
detection is more expensive than simple classification, this operation is performed at most once in
every q time units.
In line 11, Algorithm 11.3 (DetectNovelClass) is called, which returns true if a novel class is
found. Finally, if a novel class is found, all instances that are identified as novel class are removed
from buf (line 12).
Next, we examine Algorithm 11.3 to understand how buf is analyzed to detect the presence of a novel
class. First, we define the q-neighborhood silhouette coefficient, or q-NSC, as follows:
Let D_{c_out,q}(x) be the mean distance from an F outlier x to λ_{c_out,q}(x), defined by Equation 11.1, where
λ_{c_out,q}(x) is the set of q nearest neighbors of x within the F outlier instances. Also, let D_{c_min,q}(x) be
the minimum among all D_{c,q}(x), where c is an existing class. Then the q-NSC of x is given by

q\text{-NSC}(x) = \frac{D_{c_{min},q}(x) - D_{c_{out},q}(x)}{\max\bigl(D_{c_{min},q}(x),\, D_{c_{out},q}(x)\bigr)} \quad (11.3)

q-NSC, which is a unified measure of cohesion and separation, yields a value between −1 and +1.
A positive value indicates that x is closer to the F outlier instances (more cohesion) and farther away
from existing class instances (more separation), and vice versa. Note that q-NSC(x) of an F outlier x
must be computed separately for each classifier M_i ∈ M. We declare a new class if there are at least
q′ (> q) F outliers having positive q-NSC for all classifiers M_i ∈ M. The justification behind this
decision is discussed in the next subsection.
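A direct (unoptimized) sketch of Equation 11.3 follows: for an F outlier x we compute the mean distance to its q nearest neighbors among the other F outliers and the minimum mean distance to any existing class's q-neighborhood, and combine them into q-NSC. Names and structure are illustrative, and existing classes are represented as raw instance lists here for clarity; the pseudopoint-based speedup described next is omitted.

```java
import java.util.*;

// Direct sketch of q-NSC (Equation 11.3) for a single F outlier, before the speedup.
public class QNsc {

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Mean distance from x to its q nearest neighbors within a given set (Equation 11.1);
    // x itself is skipped if it happens to be in the set.
    static double meanQDistance(double[] x, List<double[]> set, int q) {
        List<Double> d = new ArrayList<>();
        for (double[] p : set) if (p != x) d.add(euclidean(x, p));
        Collections.sort(d);
        int k = Math.min(q, d.size());
        double sum = 0;
        for (int i = 0; i < k; i++) sum += d.get(i);
        return sum / k;
    }

    static double qNsc(double[] x, List<double[]> fOutliers,
                       Collection<List<double[]>> existingClasses, int q) {
        double dOut = meanQDistance(x, fOutliers, q);            // cohesion with the F outliers
        double dMin = Double.POSITIVE_INFINITY;                   // separation from existing classes
        for (List<double[]> cls : existingClasses)
            dMin = Math.min(dMin, meanQDistance(x, cls, q));
        return (dMin - dOut) / Math.max(dMin, dOut);              // value in [-1, +1]
    }
}
```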
Speeding up the computation of q-NSC: For each classifier M_i ∈ M, computing q-NSC for all F
outlier instances takes quadratic time in the number of F outliers. Let B = buf.length. In order to
compute q-NSC for one element x in buf, we need O(B) time to compute the distances from x to all other
elements in buf, and O(K) time to compute the distances from x to all existing class pseudopoints
h ∈ M_i. Therefore, the total time to compute q-NSC of all elements in buf is O(B(B + K)) = O(B²),
since B ≫ K. In order to make the computation faster, we create K_o (= (B/S) * K) pseudopoints from
the F outliers using K-means clustering and perform the computations on these pseudopoints (referred to
as F pseudopoints), where S is the chunk size. The time required to apply K-means clustering on B
instances is O(K_oB). The time complexity to compute q-NSC of all of the F pseudopoints is
O(K_o(K_o + K)), which is constant, since both K_o and K are independent of the input size. Therefore,
the overall complexity for computing q-NSC, including the overhead for clustering, becomes
O(K_o(K_o + K) + K_oB) = O(K_o(B + K_o + K)) = O(K_oB), since B ≫ K ≥ K_o. So, the running time to
compute q-NSC after the speedup is linear in B, compared to quadratic in B before the speedup. The q-NSC of
an F pseudopoint computed in this way is actually an approximate average of the q-NSC of each F
outlier in that F pseudopoint. By using this approximation, although we gain speed, we also lose some
precision. However, this drop in precision is negligible, as shown in the analysis to be presented
shortly. This approximate q-NSC of an F pseudopoint h is denoted by q-NSC′(h).
In line 1 of Algorithm 11.3, we create F pseudopoints using the F outliers as explained earlier. For
each classifier Mi ∈ M, we compute q-NSC′ (h) of every F pseudopoint h (line 4). If the total weight
of the F pseudopoints having positive q-NSC′ () is >q, then Mi votes for novel class (line 7). If all
classifiers vote for novel class, then we decide that a novel class has really appeared (line 9). Once
novel class is declared, we need to find the instances of the novel class. This is done as follows:
suppose h is an F pseudopoint having positive q-NSC′ (h) with respect to all classifiers Mi ∈ M
(note that q-NSC′ (h) is computed with respect to each classifier separately). Therefore, all F outlier
instances belonging to h are identified as novel class instances.
This algorithm can detect one or more novel classes concurrently as long as each novel class
follows property 1 and contains at least q instances. This is true even if the class distributions
are skewed. However, if more than one such novel class appears concurrently, our algorithm will
identify the instances belonging to those classes as novel, without imposing any distinction between
dissimilar novel class instances (i.e., it will treat them simply as “novel”). But the distinction will be
learned by our model as soon as the true labels of those novel class instances arrive, and a classifier
is trained with those instances.
It should be noted that the larger the value of q, the greater the confidence with which we can
decide whether a novel class has arrived. However, if q is too large, then we may also fail to detect
a new class if the total number of instances belonging to the novel class is ≤q. An optimal value of
q is obtained empirically (Section 11.4).
Impact of evolving class labels on ensemble classification: As the reader might have realized
already, the arrival of novel classes in the stream causes the classifiers in the ensemble to have
different sets of class labels. There are two scenarios to consider. Scenario (a): suppose an older
(earlier) classifier M_i in the ensemble has been trained with classes c0 and c1, and a younger (later)
classifier M_j has been trained with classes c1 and c2, where c2 is a new class that appeared after M_i
had been trained. This has a negative effect on the voting decision, since M_i obviously misclassifies
instances of c2. So, rather than counting the votes from each classifier, we selectively count their
votes as follows. If a younger classifier M_j classifies a test instance x as class c, but an older classifier
M_i had not been trained with training data of c, then the vote of M_i is ignored if x is found to
be an outlier for M_i. Scenario (b): the opposite situation may also arise, where the oldest classifier is
trained with some class c′, but none of the newer classifiers are trained with that class. This means
class c′ has become outdated, and in that case, we remove M_i from the ensemble.
Figure 11.6a illustrates scenario (a). The classifiers in the ensemble are sorted according to their age,
with M1 being the oldest and M4 being the youngest. Each classifier M_i is marked with the classes
FIGURE 11.6 Illustration of scenario (a) and scenario (b) of ensemble voting with evolving class labels.
with which it has been trained. For example, M1 has been trained with classes c1, c2, and c3, and so on.
Note that class c4 appears only in the two youngest classifiers. x appears as an outlier to M1. Therefore,
M1’s vote is not counted, since x is classified as c4 by a younger classifier M3 and M1 does not contain
class c4. Figure 11.6b illustrates scenario (b). Here, M1 contains class c1, which is not contained in any
younger classifier in the ensemble. Therefore, c1 has become outdated, and M1 is removed from the
ensemble. In this way, we ensure that older classifiers have less impact on the voting process. If class c1
later reappears in the stream, it will be automatically detected again as a novel class (see Definition 11.3).
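A sketch of this selective vote counting follows: a classifier's vote is ignored when a younger classifier predicts a class the older one was never trained on and the test instance is an outlier for the older classifier. The Model interface and helper names are illustrative assumptions; classifiers whose classes have all become outdated are assumed to have been removed beforehand.

```java
import java.util.*;

// Sketch of selective ensemble voting with evolving class labels (scenario (a)).
public class SelectiveVoting {

    interface Model {
        int predict(double[] x);
        boolean isOutlier(double[] x);        // x falls outside this model's decision boundary
        Set<Integer> trainedClasses();        // classes present in this model's training data
    }

    // The ensemble is ordered oldest first (index 0) to youngest last.
    static int vote(List<Model> ensemble, double[] x) {
        int n = ensemble.size();
        int[] prediction = new int[n];
        for (int i = 0; i < n; i++) prediction[i] = ensemble.get(i).predict(x);

        Map<Integer, Integer> tally = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Model mi = ensemble.get(i);
            boolean ignore = false;
            for (int j = i + 1; j < n && !ignore; j++) {          // j is younger than i
                int c = prediction[j];
                if (!mi.trainedClasses().contains(c) && mi.isOutlier(x)) ignore = true;
            }
            if (!ignore) tally.merge(prediction[i], 1, Integer::sum);
        }
        if (tally.isEmpty()) return prediction[n - 1];            // fallback: youngest model's prediction
        return Collections.max(tally.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```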
Let ℱ be the set of F outliers having positive q-NSC. For each x ∈ ℱ, a positive q-NSC implies, by
Equation 11.3, that D_{c_min,q}(x) > D_{c_out,q}(x). Summing over all x ∈ ℱ, we get

\begin{aligned}
&\sum_{x\in\mathcal{F}} D_{c_{min},q}(x) > \sum_{x\in\mathcal{F}} D_{c_{out},q}(x)\\
\Rightarrow\ &\sum_{x\in\mathcal{F}}\frac{1}{q}\sum_{x_i\in\lambda_{c_{min},q}(x)} D(x, x_i) > \sum_{x\in\mathcal{F}}\frac{1}{q}\sum_{x_j\in\lambda_{c_{out},q}(x)} D(x, x_j) \quad (11.4)\\
\Rightarrow\ &\frac{1}{mq}\sum_{x\in\mathcal{F}}\sum_{x_i\in\lambda_{c_{min},q}(x)} D(x, x_i) > \frac{1}{mq}\sum_{x\in\mathcal{F}}\sum_{x_j\in\lambda_{c_{out},q}(x)} D(x, x_j) \quad (\text{letting } m = |\mathcal{F}|)
\end{aligned}
Therefore, the mean pairwise distance between any pair (x, xj) of F outliers (such that x is an
F outlier with positive q-NSC and xj is an F outlier in λCout , q ( x )) is less than the mean pairwise
distance between an F outlier x and any existing class instance xi. In other words, an F outlier with
positive q-NSC is more likely to have its k-nearest neighbors (k-NN) within the F outlier instances
(for k ≤ q). So, each of the F outliers x ∈ ℱ should have the same class label as the other F outlier
instances, and should have a different class label than any of the existing classes. This implies that
the F outliers should belong to a novel class. The higher the value of q, the larger the support we
have in favor of the arrival of a new class. Furthermore, when all the classifiers unanimously agree
on the arrival of a novel class, we have very little choice other than announcing the appearance of
a novel class. The q-NH rule can be thought of as a variation of the k-NN rule and is applicable to any
dataset irrespective of its data distribution and shape of classes (e.g., convex and nonconvex).
We now quantify the deviation of the approximate value q-NSC′(φ_i) of an F pseudopoint φ_i from its
exact value. Let φ_j be the existing-class pseudopoint nearest to φ_i. The approximate q-NSC of φ_i is

q\text{-NSC}'(\varphi_i) = \frac{D(\mu_i, \mu_j) - \bar{D}_i}{\max\bigl(D(\mu_i, \mu_j),\, \bar{D}_i\bigr)} \quad (11.5)

where µ_i is the centroid of φ_i, µ_j is the centroid of φ_j, and \bar{D}_i is the mean distance from centroid µ_i
to the instances in φ_i. The exact value of q-NSC follows from Equation 11.3:

q\text{-NSC}(\varphi_i) = \frac{1}{q_1}\sum_{x\in\varphi_i}\frac{\frac{1}{q}\sum_{x_j\in\lambda_{c_{min},q}(x)} D(x, x_j) - \frac{1}{q}\sum_{x_i\in\lambda_{c_{out},q}(x)} D(x, x_i)}{\max\Bigl(\frac{1}{q}\sum_{x_j\in\lambda_{c_{min},q}(x)} D(x, x_j),\ \frac{1}{q}\sum_{x_i\in\lambda_{c_{out},q}(x)} D(x, x_i)\Bigr)} \quad (11.6)

where λ_{c_out,q}(x) is the set of q nearest neighbors of x within the F pseudopoint φ_i, and λ_{c_min,q}(x) is the set
of q nearest neighbors of x within the pseudopoint φ_j, for each x ∈ φ_i. Therefore, the deviation from the
exact value is ε_{qnsc} = q-NSC(φ_i) − q-NSC′(φ_i). Applying Equations 11.5 and 11.6, we get
\varepsilon_{qnsc} = \frac{1}{q_1}\sum_{x\in\varphi_i}\frac{\frac{1}{q}\sum_{x_j\in\lambda_{c_{min},q}(x)} D(x, x_j) - \frac{1}{q}\sum_{x_i\in\lambda_{c_{out},q}(x)} D(x, x_i)}{\max\Bigl(\frac{1}{q}\sum_{x_j\in\lambda_{c_{min},q}(x)} D(x, x_j),\ \frac{1}{q}\sum_{x_i\in\lambda_{c_{out},q}(x)} D(x, x_i)\Bigr)} \;-\; \frac{D(\mu_i, \mu_j) - \bar{D}_i}{\max\bigl(D(\mu_i, \mu_j),\, \bar{D}_i\bigr)} \quad (11.7)
In order to simplify the equations, we assume that q_1 = q_2 = q and that q-NSC is positive for every
x ∈ φ_i. Therefore, λ_{c_out,q}(x) = φ_i and λ_{c_min,q}(x) = φ_j. Also, we consider the square of the Euclidean distance as
the distance metric, that is, D(x, y) = (x − y)². Since q-NSC is positive for every x ∈ φ_i, we can deduce
the following relationships:
R1: \max\bigl(D(\mu_i, \mu_j),\, \bar{D}_i\bigr) = D(\mu_i, \mu_j) — as the q-NSC for each x ∈ φ_i is positive, the overall
q-NSC of φ_i (i.e., q-NSC′(φ_i)) is also positive. Therefore, this relationship follows from
Equation 11.5.
R2: \max\Bigl(\frac{1}{q}\sum_{x_j\in\lambda_{c_{min},q}(x)} D(x, x_j),\ \frac{1}{q}\sum_{x_i\in\lambda_{c_{out},q}(x)} D(x, x_i)\Bigr) = \frac{1}{q}\sum_{x_j\in\lambda_{c_{min},q}(x)} D(x, x_j), which
follows, since the mean q-NSC of the instances in φ_i is positive.
Also, \bar{D}_i = \frac{1}{q}\sum_{x\in\varphi_i}(x-\mu_i)^2 = \sigma_i^2, the mean distance of the instances in φ_i from the centroid.
Therefore, q-NSC′(φ_i) can be rewritten as

q\text{-NSC}'(\varphi_i) = \frac{(\mu_i-\mu_j)^2 - \sigma_i^2}{(\mu_i-\mu_j)^2} = \frac{1}{q}\sum_{x\in\varphi_i}\frac{(\mu_i-\mu_j)^2 - (x-\mu_i)^2}{(\mu_i-\mu_j)^2} = \frac{1}{q}\sum_{x\in\varphi_i} q\text{-NSC}'(x) \quad (11.8)
where q-NSC′(x) is an approximate value of q-NSC(x). Now we can deduce the following
inequalities:
\begin{aligned}
\varepsilon_{qnsc} &= \frac{1}{q}\sum_{x\in\varphi_i}\left[\frac{\frac{1}{q}\sum_{x_j\in\varphi_j}(x-x_j)^2 - \frac{1}{q}\sum_{x_i\in\varphi_i}(x-x_i)^2}{\frac{1}{q}\sum_{x_j\in\varphi_j}(x-x_j)^2}\right] - \frac{(\mu_i-\mu_j)^2 - \sigma_i^2}{(\mu_i-\mu_j)^2}\\
&= \frac{1}{q}\sum_{x\in\varphi_i}\left[\frac{\frac{1}{q}\sum_{x_j\in\varphi_j}(x-x_j)^2 - \frac{1}{q}\sum_{x_i\in\varphi_i}(x-x_i)^2}{\frac{1}{q}\sum_{x_j\in\varphi_j}(x-x_j)^2} - \frac{(\mu_i-\mu_j)^2 - (x-\mu_i)^2}{(\mu_i-\mu_j)^2}\right]
\end{aligned}
\begin{aligned}
\varepsilon_{qnsc} &= \frac{1}{q}\sum_{x\in\varphi_i}\left[\frac{\sigma_j^2 + (x-\mu_j)^2 - \sigma_i^2 - (x-\mu_i)^2}{\sigma_j^2 + (x-\mu_j)^2} - \frac{(\mu_i-\mu_j)^2 - (x-\mu_i)^2}{(\mu_i-\mu_j)^2}\right]\\
&= \frac{1}{q}\sum_{x\in\varphi_i}\left[1 - \frac{\sigma_i^2 + (x-\mu_i)^2}{\sigma_j^2 + (x-\mu_j)^2} - 1 + \frac{(x-\mu_i)^2}{(\mu_i-\mu_j)^2}\right]\\
&= \frac{1}{q}\sum_{x\in\varphi_i}\left[\frac{(x-\mu_i)^2}{(\mu_i-\mu_j)^2} - \frac{\sigma_i^2 + (x-\mu_i)^2}{\sigma_j^2 + (x-\mu_j)^2}\right]\\
&= \frac{\sigma_i^2}{(\mu_i-\mu_j)^2} - \frac{1}{q}\sum_{x\in\varphi_i}\frac{\sigma_i^2}{\sigma_j^2 + (x-\mu_j)^2} - \frac{1}{q}\sum_{x\in\varphi_i}\frac{(x-\mu_i)^2}{\sigma_j^2 + (x-\mu_j)^2}\\
&\le \frac{\sigma_i^2}{(\mu_i-\mu_j)^2} - \frac{\sigma_i^2}{\sigma_i^2 + \sigma_j^2 + (\mu_i-\mu_j)^2} - \frac{1}{q}\sum_{x\in\varphi_i}\frac{(x-\mu_i)^2}{\sigma_j^2 + (x-\mu_j)^2}
\end{aligned} \quad (11.9)
The last line follows since, using the relationship between the harmonic mean and the arithmetic mean,
it can be shown that

\frac{1}{q}\sum_{x\in\varphi_i}\frac{\sigma_i^2}{\sigma_j^2 + (x-\mu_j)^2} \;\ge\; \frac{\sigma_i^2}{\sigma_i^2 + \sigma_j^2 + (\mu_i-\mu_j)^2}

because \frac{1}{q}\sum_{x\in\varphi_i}\bigl(\sigma_j^2 + (x-\mu_j)^2\bigr) = \sigma_i^2 + \sigma_j^2 + (\mu_i-\mu_j)^2.
Usually, if φ_i belongs to a novel class, it is empirically observed that q-NSC′(φ_i) ≥ 0.9. Putting
this value into Equation 11.8 and solving, we obtain σ_i² ≤ (1 − 0.9)(µ_i − µ_j)². Therefore, from Equation
11.10, we obtain ε_{qnsc} ≤ 0.1/3 ≈ 0.03. Since the range of q-NSC is −1 to +1, a deviation of 0.03 (3%)
from the exact value is negligible and does not affect the outcome of the algorithm. Similar
reasoning can be carried out for the cases where the q-NSC of the instances in φ_i is negative.
becomes O(KLS + LS + mS), since m ≫ KL. Finally, the overall complexity of Algorithm 11.2
(ECSMiner) is O(mS + f_t(S)) per chunk, where f_t(S) is the time to train a classifier with S training
instances, and m ≫ S.
ECSMiner keeps three buffers: buf, the training buffer ℒ, and the unlabeled data buffer U. Both
buf and ℒ hold at most S instances, whereas U holds at most Tl instances. Therefore, the space
required to store all three buffers is O(max(S, Tl)). The space required to store a classifier (along with
the pseudopoints) is much less than S. So, the overall space complexity remains O(max(S, Tl)).
11.4 EXPERIMENTS
In this section, we describe the datasets and the experimental environment, and then discuss and analyze
the results.
11.4.1 Datasets
11.4.1.2 Synthetic Data with Concept Drift and Novel Class (SynCN)
This synthetic data simulates both concept drift and novel class. Data points belonging to each class
are generated using Gaussian distribution having different means (−5.0 to + 5.0) and variances
(0.5–6) for different classes. Besides, in order to simulate the evolving nature of data streams, the
probability distributions of different classes are varied with time. This caused some classes to
appear and some other classes to disappear at different times. In order to introduce concept drift,
the mean values of a certain percentage of attributes have been shifted at a constant rate. As done
in the SynC dataset, this rate of change is also controlled by the parameters m, t, s, and N in a
similar way. The dataset is normalized so that all attribute values fall within the range [0,1]. We
generate the SynCN dataset with 20 classes and 40 real-valued attributes, having a total of 400K
data points.
disappear frequently, making the new class detection challenging. This dataset contains TCP con-
nection records extracted from LAN network traffic at MIT Lincoln Labs over a period of 2 weeks.
Each record refers either to a normal connection or to an attack. There are 22 types of attacks, such
as buffer-overflow, portsweep, guess-passwd, neptune, rootkit, smurf, spy, and so on. So, there are
23 different classes of data. Most of the data points belong to the normal class. Each record
consists of 42 attributes, such as the connection duration, the number of bytes transmitted, the number of root
accesses, and so on. We use only the 34 continuous attributes and remove the categorical attributes.
This dataset is also normalized to keep the attribute values within [0,1].
11.4.2 Experimental Set-Up
We implemented our algorithm in Java. The code for the decision tree has been adapted from the Weka
machine learning open source repository (http://www.cs.waikato.ac.nz/ml/weka/). The experiments
were run on an Intel Pentium IV machine with 2 GB memory and a 3 GHz dual-processor CPU.
Our parameter settings are as follows, unless mentioned otherwise: (i) K (number of pseudopoints
per classifier) = 50, (ii) q (minimum number of instances required to declare novel class) = 50,
(iii) L (ensemble size) = 6, and (iv) S (chunk size) = 2000. These values of parameters are tuned to
achieve an overall satisfactory performance.
11.4.3 Baseline Approach
To the best of our knowledge, there is no existing approach that can both classify data streams and detect
novel classes. So, we compare ECSMiner with a combination of two baseline techniques: OLINDDA
[SPIN08] and the weighted classifier ensemble (WCE) [WANG03], where the former works as the novel
class detector and the latter performs classification. This is done as follows. For each
test instance, we delay its classification for Tc time units. That is, OLINDDA is given Tc time units
to determine whether the instance is novel. If by that time the test instance is identified as a novel
class instance, then it is considered novel and is not classified using WCE. Otherwise, the instance is
assumed to be an existing class instance and its class is predicted using WCE. We use OLINDDA
as the novelty detector since it is a recently proposed algorithm that has been shown to outperform
other novelty detection techniques in data streams [SPIN08].
However, OLI N DDA assumes that there is only one “normal” class, and all other classes are
“novel.” So, it is not directly applicable to the multiclass novelty detection problem, where any
combination of classes can be considered as the “existing” classes. Therefore, we propose two alter-
native solutions. First, we build parallel OLI N DDA models, one for each class, which evolve
simultaneously. Whenever the instances of a novel class appear, we create a new OLI N DDA model
for that class. A test instance is declared as novel, if all the existing class models identify this
instance as novel. We will refer to this baseline method as WCE-OLINDDA PARALLEL. Second,
we initially build an OLI N DDA model using all the available classes with the first init number
instances. Whenever a novel class is found, the class is absorbed into the existing OLI N DDA
model. Thus, only one “normal” model is maintained throughout the stream. This will be referred
to as WCE-OLINDDA SINGLE. In all experiments, the ensemble size and chunk size are kept
the same for all three baseline techniques. Besides, the same base learner is used for WCE and
ECSMiner. The parameter settings for OLI N DDA are (i) number of clusters built in the initial
model, K = 30, (ii) least number of normal instances needed to update the existing model = 100,
(iii) least number of instances needed to build the initial model = 100, and (iv) maximum size of
the “unknown memory” = 200. These parameters are chosen either according to the default values
used in [SPIN08] or by trial and error to get an overall satisfactory performance. We will henceforth
use the acronyms XM for ECSMiner, W-OP for WCE-OLINDDA PARALLEL and W-OS for WCE-
OLINDDA SINGLE.
We use the following performance metrics to evaluate our technique:

Mnew = % of novel class instances misclassified as an existing class, that is,

M_{new} = \frac{F_n \times 100}{N_c}

Fnew = % of existing class instances falsely identified as a novel class, that is,

F_{new} = \frac{F_p \times 100}{N - N_c}

ERR = total misclassification error (%) (including Mnew and Fnew), that is,

ERR = \frac{(F_p + F_n + F_e) \times 100}{N}

where F_n is the number of novel class instances misclassified as existing classes, F_p is the number of
existing class instances falsely identified as novel, F_e is the number of existing class instances
misclassified as another existing class, N_c is the total number of novel class instances in the stream,
and N is the total number of stream instances. From the definition of the error metrics, it is clear
that ERR is not necessarily equal to the sum of Mnew and Fnew.
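For concreteness, a small Python helper (hypothetical, not from the book) that computes the three metrics from the raw counts; the argument names mirror the symbols defined above.

```python
def stream_error_metrics(f_p, f_n, f_e, n_total, n_novel):
    """Compute Mnew, Fnew, and ERR from the error counts defined above.

    f_p     : existing class instances falsely identified as novel
    f_n     : novel class instances misclassified as existing classes
    f_e     : existing class instances misclassified as another existing class
    n_total : total number of instances (N)
    n_novel : total number of novel class instances (Nc)
    """
    m_new = 100.0 * f_n / n_novel if n_novel else 0.0
    f_new = 100.0 * f_p / (n_total - n_novel) if n_total > n_novel else 0.0
    err = 100.0 * (f_p + f_n + f_e) / n_total
    return m_new, f_new, err

# Example with made-up counts: ERR is not simply Mnew + Fnew.
print(stream_error_metrics(f_p=30, f_n=15, f_e=900, n_total=100_000, n_novel=12_226))
```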
Evaluation is done as follows: we build the initial models in each method with the first init_number
instances. In our experiments, we set init_number = 3S (the first three chunks). From the
fourth chunk onward, we evaluate the performance of each method on each data point using the
time constraints. We update the models with a new chunk whenever all data points in that chunk
have been labeled.
11.4.4.2 Results
Figure 11.8a through c shows the total number of novel class instances missed (i.e., misclassified
as existing class) and Figure 11.8d through f shows the overall error rates (ERR) of each of the
techniques for decision tree classifier up to a certain point in the stream in different datasets. We
omit SynC from the figures since it does not have any novel class. k-NN classifier also has similar
results. For example, in Figure 11.8a at the X-axis = 100, the Y values show the total number of
novel class instances missed by each approach in the first 100 K data points in the stream (Forest).
At this point, XM misses only 15 novel class instances, whereas W-OP and W-OS miss 1937 and
7053 instances, respectively. The total number of novel class instances that have appeared in the stream by this
point of time is shown by the corresponding Y value of the curve “Total,” which is 12,226. Likewise,
in Figure 11.8d, the ERR rates are shown throughout the stream history. In this figure, at the same
position (X = 100), the Y values show the ERR of each of the three techniques up to the first 100K data
points in the stream. The ERR rates of XM, W-OP, and W-OS at this point are 9.2%, 14.0%, and
15.5%, respectively.

FIGURE 11.8 Novel class instances missed by each method (top row) and overall error of each method
(Tl = 1000, Tc = 400) (bottom row) on the different datasets; the X-axes show the stream position in
thousands of data points.
Table 11.1 summarizes the error metrics for each of the techniques in each dataset for decision
tree and KNN. The columns headed by ERR, Mnew, and Fnew report the value of the corresponding
metric on an entire dataset. For example, while using decision tree in KDD dataset, XM, W-OP,
and W-OS have 1.0%, 5.8%, and 6.7% ERR, respectively. Also, their corresponding Mnew are 1.0%,
13.2%, and 96.9%, respectively. Note that there is no novel class in SynC, and so there is no Mnew for any approach.
TABLE 11.1
Performance Comparison in All Datasets

                           ERR                     Mnew                    Fnew
Classifier      Dataset    XM     W-OP   W-OS      XM     W-OP   W-OS      XM     W-OP   W-OS
Decision tree   SynC       6.9    14.1   12.8      –      –      –         0.0    2.4    1.1
                SynCN      1.2    8.9    13.9      0.0    26.5   96.2      0.02   1.6    0.1
                KDD        1.0    5.8    6.7       1.0    13.2   96.9      0.9    4.3    0.03
                Forest     4.7    7.9    8.5       0.2    30.7   70.1      3.0    1.1    0.2
k-NN            SynC       0.0    2.4    1.1       –      –      –         0.0    2.4    1.1
                SynCN      0.01   8.9    13.9      0.0    26.5   96.2      0.0    1.6    0.1
                KDD        1.2    4.9    5.2       5.9    12.9   96.5      0.9    4.4    0.03
                Forest     3.6    4.1    4.6       8.4    32.0   70.1      1.3    1.1    0.2

Note: Bold numbers in this table represent the best results.
Both W-OP and W-OS have some Fnew in the SynC dataset, which appears because W-OP
and W-OS are less sensitive to concept drift than XM. Therefore, some existing class instances are
misclassified as novel class instances because of concept drift. In general, XM outperforms the baseline
techniques in overall classification accuracy and novel class detection. The main reason behind the
poorer performance of W-OP in detecting novel classes is the way OLINDDA detects novel classes.
OLINDDA makes two strong assumptions about a novel class and normal classes. First, it assumes
a spherical boundary (or, convex shape) of the normal model. It updates the radius and centroid
of the sphere periodically and declares anything outside the sphere as a novel class if there is evi-
dence of sufficient cohesion among the instances outside the boundary. The assumption that a data
class must have a convex/spherical shape is too strict to be maintained for a real-world problem.
Second, it assumes that the data density of a novel class must be at least that of the normal class. If
a novel class is sparser than the normal class, the instances of that class would never be recognized
as a novel class. But in a real-world problem, two different classes may have different data densi-
ties. OLINDDA would fail in those cases where any of the assumptions are violated. On the other
hand, XM does not require that an existing class must have convex shape, or that the data density
of a novel class should match that of the existing classes. Therefore, XM can detect novel classes
much more efficiently. Besides, OLINDDA is less sensitive to concept drift, which results in falsely
declaring novel classes when drift occurs in the existing class data. On the other hand, XM correctly
distinguishes between concept drift and concept evolution, avoiding false detection of novel classes
in the event of concept drift. W-OS performs worse than W-OP since W-OS “assimilates” the novel
classes into the normal model, making the normal model too generalized. Therefore, it considers
most of the future novel classes as normal (nonnovel) data, yielding a very high false negative rate.
Figure 11.9a and b shows how XM and W-OP respond to the constraints Tl and Tc in Forest
dataset. Similar characteristics are observed for other datasets and W-OS. From Figure 11.9a, it
is evident that increasing Tl increases error rates. This is because of the higher delay involved in
labeling, which makes the newly trained models more outdated. Naturally, the Mnew rate decreases with
increasing Tc, as shown in Figure 11.9b, because higher values of Tc mean more time to detect novel
classes. As a result, the ERR rates also decrease.
Figure 11.10a through d illustrates how the error rates of XM change for different parameter settings
on the Forest dataset with the decision tree classifier. These parameters have similar effects on the other
datasets and on the k-NN classifier. Figure 11.10a shows the effect of chunk size on the ERR, Fnew, and Mnew
rates for default values of the other parameters. We note that the ERR and Fnew rates decrease up to a certain
point (S = 2000) and then increase. The initial decrease occurs because a larger chunk size means more
training data for the classifiers, which leads to lower error rates. However, if chunk size is increased
too much, then we have to wait much longer to build the next classifier. As a result, the ensemble is
updated less frequently than desired, meaning the ensemble remains outdated for a longer period of
time. This causes increased error rates.

FIGURE 11.9 Mnew and overall error (ERR) rates on the Forest dataset for (a) Tc = 400 and different values of Tl
and (b) Tl = 2000 and different values of Tc.

FIGURE 11.10 Error rates (ERR, Mnew, and Fnew) of XM on the Forest dataset for different values of (a) chunk
size (S), (b) ensemble size (L), (c) number of clusters (K), and (d) neighborhood size (q).
Figure 11.10b shows the effect of ensemble size (L) on error rates. We observe that the ERR and
Fnew rates keep decreasing with increasing L. This is because when L is increased, the classification
error naturally decreases because of the reduction of error variance [TUME96]. But the rate of decrease
gradually diminishes. However, the Mnew rate starts increasing after some point (L = 6),
because a larger ensemble means more restriction on declaration of the arrival of novel classes.
Therefore, we choose a value where the overall error (ERR) is considerably low and also the Mnew
is low. Figure 11.10c shows the effect of number of clusters (K) on error. The x-axis in this chart is
drawn on a logarithmic scale. Although the overall error is not much sensitive on K, the Mnew rate
is. Increasing K reduces the Mnew rate, because outliers are more correctly detected. Figure 11.10d
shows the effect of q (Minimum neighborhood size to declare a novel class) on error rates. The
x-axis in this chart is also drawn on a logarithmic scale. Naturally, increasing q up to a certain point
(e.g., 200) helps to reduce Fnew and ERR, since a higher value of q gives us greater confidence
(i.e., reduces the possibility of false detection) in declaring a new class (see Section 11.2). But too large
a value of q increases Mnew and ERR rates (which is observed in the chart), since a novel class is
missed by the algorithm if there are < q instances of the novel class in a window of S instances. We
have found that any value between 20 and 100 is the best choice for q.
Finally, we compare the running times of all three competing methods on each dataset for the decision
tree in Table 11.2. k-NN shows similar performance.
TABLE 11.2
Running Time Comparison in All Datasets

           Time (second)/1K            Points/second              Speed Gain
Dataset    XM     W-OP    W-OS         XM      W-OP    W-OS       XM over W-OP   XM over W-OS
SynC       0.33   0.41    0.2          2,960   2,427   5,062      1.2            0.6
SynCN      1.7    14.2    2.3          605     71      426        8.5            1.4
KDD        1.1    30.6    0.5          888     33      1,964      26.9           0.45
Forest     0.93   8.3     0.36         1,068   120     2,792      8.9            0.4

Note: Bold numbers in this table represent the best run times.
The columns headed by “Time (second)/1K” show the average running times (train and test) in seconds per
1000 points, the columns headed by “Points/second” show how many points have been processed (train and
test) per second on average, and the columns headed by “Speed Gain” show the ratio of the speed of XM to that of W-OP and
W-OS, respectively.
For example, XM is 26.9 times faster than W-OP on KDD dataset. Also, XM is 1.2, 8.5, and 8.9
times faster than W-OP in SynC, SynCN, and Forest datasets, respectively.
In general, W-OP is roughly C times slower than XM in a dataset having C classes. This is
because W-OP needs to maintain C parallel models, one for each class. Besides, the OLINDDA model
creates clusters using the “unknown memory” every time a new instance is identified as unknown
and tries to validate the clusters. As a result, the processing speed diminishes when novel
classes occur frequently, as observed in the KDD dataset. However, W-OS seems to run a bit faster than
XM in three datasets, although W-OS shows much poorer performance in detecting novel classes
and in overall error rates (see Table 11.1). For example, W-OS fails to detect 70% or more novel
class instances in all datasets, but XM correctly detects 91% or more novel class instances in any
dataset. Therefore, W-OS is virtually incomparable to XM for the novel class detection task. Thus,
XM outperforms W-OP both in speed and accuracy and dominates W-OS in accuracy.
We also test the scalability of XM on higher dimensional data having a larger number of classes.
Figure 11.11 shows the results. The tests are done on synthetically generated data, having differ-
ent dimensions (20–60) and number of classes (10–40). Each dataset has 250,000 instances. It is
evident from the results that the time complexity of XM increases linearly with the total number of
dimensions in the data, as well as with the total number of classes. Therefore, XM is scalable to
high-dimensional data.
FIGURE 11.11 Scalability of XM: running time (in milliseconds per 1000 instances) versus the number of
classes (for data dimensions D = 20, 40, and 60) and versus the data dimensions (for C = 10, 20, 30, and 40 classes).
REFERENCES
[COVE67]. T.M. Cover and P.E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Transactions on
Information Theory, 13(1), 21–27, January 1967.
[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, and B.M. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM ’08: Proceedings
of the 2008 International Conference on Data Mining, December 15–19, Pisa, Italy, pp. 929–934, IEEE
Computer Society, 2008.
[MASU09]. M.M. Masud, J. Gao, L. Khan, J. Han, and B.M. Thuraisingham, “Integrating Novel Class
Detection with Classification for Concept-Drifting Data Streams,” In ECML PKDD ’09: Proceedings
of the 2009 European Conference on Machine Learning and Principles and Practice of Knowledge
Discovery in Databases, Part II, September 7–11, Bled, Slovenia, pp. 79–94, Springer-Verlag, 2009.
[PANG04]. B. Pang and L. Lee, “A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts,” In ACL ’04: Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics, July 21–26, Barcelona, Spain, pp. 271–278, 2004.
[SPIN08]. E.J. Spinosa, A. Ponce de L.F. de Carvalho, and J. Gama, “Cluster-Based Novel Concept Detection
in Data Streams Applied to Intrusion Detection in Computer Networks,” In SAC ’08: Proceedings of the
23rd ACM Symposium on Applied Computing, March 16–20, Ceara, Brazil, pp. 976–980, 2008.
[TUME96]. K. Tumer and J. Ghosh, “Error Correlation and Error Reduction in Ensemble Classifiers,”
Connection Science, 8(3–4), 385–403, 1996.
[WANG03]. H. Wang, W. Fan, P.S. Yu, and J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” In KDD ’03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, August 24–27, Washington, DC, pp. 226–235, ACM, 2003.
[ZHU08]. X. Zhu, Semi-Supervised Learning Literature Survey, University of Wisconsin–Madison Technical
Report No. TR 1530, July 2008.
12 Data Stream Classification with
Limited Labeled Training Data
12.1 INTRODUCTION
As stated in [MASU08], recent approaches in classifying evolving data streams are based on super-
vised learning algorithms, which can be trained with labeled data only. Manual labeling of data is
both costly and time-consuming. Therefore, in a real streaming environment, where huge volumes
of data appear at a high speed, labeled data may be very scarce. Thus, only a limited amount of
training data may be available for building the classification models, leading to poorly trained clas-
sifiers. We apply a novel technique to overcome this problem by building a classification model from
a training set having both unlabeled and a small number of labeled instances. This model is built
as microclusters using a semisupervised clustering technique, and classification is performed with
the k-nearest neighbor algorithm. An ensemble of these models is used to classify the unlabeled data.
Empirical evaluation on both synthetic data and real botnet traffic reveals that our approach, using
only a small amount of labeled data for training, outperforms state-of-the-art stream classification
algorithms that use 20 times more labeled data than our approach. In this chapter, we describe
our proposed solution to the limited labeled training data problem. It is based on our work discussed in
[MASU08].
The organization of this chapter is as follows. A description of our techniques is given in
Section 12.2. Training with limited labeled data is discussed in Section 12.3. Ensemble classifica-
tion is discussed in Section 12.4. Our experiments are discussed in Section 12.5. This chapter is
summarized in Section 12.6.
The data stream is divided into equal-sized chunks:

D_1 = \{x_1, \ldots, x_S\}, \quad D_2 = \{x_{S+1}, \ldots, x_{2S}\}, \quad \ldots, \quad D_n = \{x_{(n-1)S+1}, \ldots, x_{nS}\}

where x_i is the ith instance in the stream, S is the chunk size, D_i is the ith data chunk, and D_n is the
latest data chunk. Assuming that the class labels of all the instances in Dn are unknown, the problem
is to predict their class labels. Let yi and ŷi be the actual and predicted class labels of xi, respectively.
If ŷi = yi, then the prediction is correct, otherwise it is incorrect. The goal is to minimize the predic-
tion error. Figure 12.1 shows the top level architecture of ReaSC.
FIGURE 12.1 Top-level architecture of ReaSC: (1) classification, (2) training, (3) ensemble refinement, and
(4) ensemble update.

We train a classification model from a data chunk Di as soon as P% (P ≪ 100) randomly chosen
instances from the chunk have been correctly labeled by an independent labeling mechanism
(e.g., human experts). Note that this assumption is less strict than that of other stream classification
techniques such as [WANG03], which assume that all the instances of Di must have been labeled before
it can be used to train a model. We build the initial ensemble M = {M1, …, ML} of L models from
the first L data chunks, where Mi is trained from chunk Di. Then the following algorithm is applied
for each subsequent chunk (a sketch of this per-chunk loop is given after the list).
1. Classification: The existing ensemble is used to predict the labels of each instance in Dn
using nearest neighbor (NN) classification, and majority voting (Section 12.4.1). As soon
as Dn has been partially labeled, the following steps are performed.
2. Training: Training is done by applying semisupervised clustering on the partially labeled
training data to build K clusters (Section 12.3). The semisupervised clustering is based on
the expectation-maximization (E-M) algorithm that locally minimizes an objective func-
tion. The objective function takes into account the dispersion between each point and its
corresponding cluster centroid, as well as the impurity measure of each cluster. Then we
extract a statistical summary from the data points of each cluster, save the summary as a
microcluster, and remove the raw data points (Section 12.3.5). In this way, we get a new
classification model M′ that can be used to classify unlabeled data using the NN algorithm.
3. Ensemble refinement: In this step, M′ is used to refine the existing ensemble of models if
required (Section 12.4.2). Refinement is required if M′ contains some data of a particular
class c, but no model in the ensemble M contains any data of that class. This situation may
occur because of concept evolution. In this case, the existing ensemble M does not have any
knowledge of class c, and so, it must be refined so that it learns to classify instances of this
class. Refinement is done by injecting microclusters of M′, which contain labeled instances
of class c, into the existing models of the ensemble.
4. Ensemble update: In this step, we select the best L models from the L + 1 models: {M∪M′},
based on their accuracies on the labeled instances of Dn (Section 12.3.3). These L best
models construct the new ensemble M. The ensemble technique helps the system to cope
with concept drift.
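As referenced above, here is a minimal Python sketch of the per-chunk loop, assuming placeholder callables (train_model, classify, refine_ensemble, accuracy) standing in for the operations described in Sections 12.3 and 12.4; it simplifies the timing of labeling (each chunk is treated as partially labeled right after it has been classified) and is not the authors' Java implementation.

```python
def reasc_stream_loop(chunks, L, train_model, classify, refine_ensemble, accuracy):
    """Sketch of the ReaSC per-chunk loop.

    train_model(chunk)            -> microcluster model from a partially labeled chunk
    classify(ensemble, chunk)     -> predicted labels (NN classification + majority voting)
    refine_ensemble(ens, model)   -> injects evolved-class microclusters into existing models
    accuracy(model, chunk)        -> accuracy on the labeled instances of the chunk
    """
    ensemble, predictions = [], []
    for chunk in chunks:
        if len(ensemble) == L:                            # warm-up finished
            predictions.append(classify(ensemble, chunk))     # 1. classification
        # ... the chunk is now assumed to be partially labeled ...
        new_model = train_model(chunk)                    # 2. training
        refine_ensemble(ensemble, new_model)              # 3. ensemble refinement
        candidates = ensemble + [new_model]               # 4. ensemble update:
        candidates.sort(key=lambda m: accuracy(m, chunk), reverse=True)
        ensemble = candidates[:L]                         #    keep the best L models
    return predictions
```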
Table 12.1 illustrates a schematic example of ReaSC. In this example, we assume that P% data in
data chunk Di are labeled by the time chunk Di+1 arrives. The initial ensemble is built with the first
L chunks. Then the ensemble is used to classify the latest chunk (DL+1). From the (L + 2)nd
chunk onward, a sequence of operations is performed with the arrival of each new chunk. For example,
the sequence of operations at the arrival of chunk DL+i (i > 1) is as follows:
1. The previous chunk DL+i−1 has been partially labeled by now. Train a new model M′ using
DL+i−1.
2. Refine the existing ensemble M using the new model M′.
3. Update the ensemble M by choosing the best L models from M∪M′.
4. Classify each instance in DL+i using ensemble M.
TABLE 12.1
An Example of ReaSC Actions with Stream Progression

Arrival of Chunk     Action(s)
D1                   —
D2                   M1 ← Train(D1)
…                    …
…                    …
DL+2                 M′ ← Train(DL+1)
DL+i                 M′ ← Train(DL+i−1)
Let the partially labeled training chunk be X = {x1, …, xS}, where the first l instances are labeled,
that is, yi ∈ {1, …, C}, i ≤ l, and the remaining instances are unlabeled, C being the total number of
classes. We assign the class label yi = 0 to every unlabeled instance xi, i > l. We are to create K clusters maintaining the constraint that all points in
the same cluster have the same class label. We restrict the value of parameter K to be greater
than C since intuitively there should be at least one cluster for each class of data. We will first re-
examine the unsupervised K-means clustering in Section 12.3.2 and then propose a new semisu-
pervised clustering technique using cluster-impurity minimization in Section 12.3.3.
The standard unsupervised K-means algorithm partitions the data into K clusters X_1, …, X_K by minimizing the total intracluster dispersion:

O_{Kmeans} = \sum_{i=1}^{K} \sum_{x \in X_i} \| x - u_i \|^2    (12.1)

where u_i is the centroid of cluster i and \| x - u_i \| is the Euclidean distance between x and u_i.
Our semisupervised clustering augments this objective with a cluster-impurity penalty, yielding the MCI-Kmeans objective:

O_{MCI-Kmeans} = \sum_{i=1}^{K} \sum_{x \in X_i} \| x - u_i \|^2 + \sum_{i=1}^{K} W_i \cdot Imp_i    (12.2)

where W_i is the weight associated with cluster i and Imp_i is the impurity of cluster i.
In order to ensure that both the intracluster dispersion and cluster impurity are given the same
importance, the weight associated with each cluster should be adjusted properly. Besides, we would
want to penalize each data point that contributes to the impurity of the cluster. So, the weight associ-
ated with each cluster is chosen to be

W_i = |X_i| \cdot \bar{D}_i    (12.3)

where X_i is the set of data points in cluster i and \bar{D}_i is the average dispersion of these points
from the cluster centroid. Thus, each instance has a contribution to the total penalty that is equal
to the cluster impurity multiplied by the average dispersion of the data points from the centroid.
We observe that Equation 12.3 is equivalent to the sum of dispersions of all the instances from the
cluster centroid. That is, we may rewrite Equation 12.3 as
W_i = \sum_{x \in X_i} \| x - u_i \|^2

Substituting this weight into Equation 12.2, we obtain

O_{MCI-Kmeans} = \sum_{i=1}^{K} \sum_{x \in X_i} \| x - u_i \|^2 + \sum_{i=1}^{K} \sum_{x \in X_i} \| x - u_i \|^2 \cdot Imp_i
              = \sum_{i=1}^{K} \sum_{x \in X_i} \| x - u_i \|^2 \, (1 + Imp_i)    (12.4)
Impurity measures: Equation 12.4 should be applicable to any impurity measure in general.
Entropy and the Gini index are the most commonly used impurity measures. We use the following impurity
measure: Impi = ADCi*Enti, where ADCi is the “aggregated dissimilarity count” of cluster i and
Enti is the entropy of cluster i. The reason for using this impurity measure will be explained shortly.
In order to understand ADCi, we first need to define “Dissimilarity count.”
Dissimilarity count DCi(x, y) of a data point x in cluster i having class label y is the total number of
instances in that cluster having class label other than y.
In other words,

DC_i(x, y) = |X_i| - |X_i(y)|    (12.5)

where X_i(y) is the set of instances in cluster i having class label y. Recall that unlabeled
instances are assumed to have class label 0. Note that DC_i(x, y) can be computed in constant time
if we keep an integer vector storing the counts |X_i(c)|, c ∈ {0, 1, …, C}. The “aggregated dissimilarity
count” ADC_i is the sum of the dissimilarity counts of all the points in cluster i:
ADC_i = \sum_{x \in X_i} DC_i(x, y)    (12.6)
Ent_i is the entropy of cluster i, computed from the class proportions

p_c^i = \frac{|X_i(c)|}{|X_i|}, \quad c \in \{0, 1, \ldots, C\}    (12.7)
The use of Enti in the objective function ensures that clusters with higher entropy get higher
penalties. However, if only Enti had been used as the impurity measure, then each point in the same
cluster would have received the same penalty. But we would like to favor the points belonging to the
majority class in a cluster, and disfavor the points belonging to the minority classes. Doing so would
force more points of the majority class to be moved into the cluster, and more points of the minority
classes to be moved out of the cluster, thus making the clusters purer. This is ensured by introducing
ADCi to the equation. We call the combination of ADCi and Enti the “compound impurity measure,”
since it can be shown that ADCi is proportional to the Gini index of cluster i. Following from
Equation 12.6, we obtain
ADC_i = \sum_{x \in X_i} DC_i(x, y) = \sum_{c=0}^{C} \sum_{x \in X_i(c)} DC_i(x, c)
      = \sum_{c=0}^{C} |X_i(c)| \left( |X_i| - |X_i(c)| \right)
      = |X_i|^2 \sum_{c=0}^{C} \frac{|X_i(c)|}{|X_i|} \left( 1 - \frac{|X_i(c)|}{|X_i|} \right)
      = |X_i|^2 \sum_{c=0}^{C} p_c^i (1 - p_c^i) \quad \text{(using Equation 12.7)}
      = |X_i|^2 \left( 1 - \sum_{c=0}^{C} (p_c^i)^2 \right) = |X_i|^2 \cdot Gini_i

where Gini_i is the Gini index of cluster i.
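A small Python sketch (hypothetical helper, not from the book) that computes ADC_i, Ent_i, and the compound impurity Imp_i = ADC_i · Ent_i from a cluster's per-class counts, and numerically checks the relation ADC_i = |X_i|^2 · Gini_i derived above.

```python
import math

def compound_impurity(class_counts):
    """class_counts[c] = number of points in the cluster with label c
    (label 0 is reserved for unlabeled points). Returns (ADC, Ent, Imp)."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0, 0.0, 0.0
    # Aggregated dissimilarity count: each point of class c disagrees with n - n_c points.
    adc = sum(n_c * (n - n_c) for n_c in class_counts.values())
    # Entropy over the class proportions p_c = n_c / n.
    ent = -sum((n_c / n) * math.log(n_c / n, 2) for n_c in class_counts.values() if n_c)
    return adc, ent, adc * ent

counts = {0: 4, 1: 10, 2: 2}                 # 4 unlabeled, 10 of class 1, 2 of class 2
adc, ent, imp = compound_impurity(counts)
n = sum(counts.values())
gini = 1.0 - sum((c / n) ** 2 for c in counts.values())
assert abs(adc - n * n * gini) < 1e-9        # ADC_i = |X_i|^2 * Gini_i
print(adc, ent, imp)
```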
Cluster centroid initialization: The K initial cluster centroids are distributed among the classes in
proportion to the labeled data, that is, the number of centroids initialized for class c is

k_c = K \frac{|L(c)|}{|L|}, \quad c \in \{1, \ldots, C\}

where L is the set of all labeled points in X, and L(c) is the subset of points in L belonging to class c.
We observed in our experiments that this initialization works better than initializing an equal number
of centroids for each class. This is because if we initialize the same number of centroids from
each class, then larger classes (i.e., classes having more instances) tend to create larger and sparser
clusters, which leads to poorer classification accuracy for the nearest neighbor classification.
Let there be ηc labeled points of class c in the dataset. If ηc > kc, then we choose kc centroids
from ηc points using the farthest-first traversal heuristic [HOCH85]. To apply this heuristic, we
first initialize a “visited set” of points with a randomly chosen point having class label c. At each
iteration, we find a point xj of class c that maximizes the minimum distance from all points in the
visited set, and add it to the visited set. This process continues until we have kc points in the set. If
ηc < kc, then we choose remaining centroids randomly from the unlabeled points. After initializa-
tion, E-step and M-step are iterated until the convergence condition is fulfilled.
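A minimal Python sketch of the farthest-first traversal seed selection described above; the function name and NumPy-based implementation are illustrative assumptions.

```python
import random
import numpy as np

def farthest_first_centroids(points, k, rng=random):
    """Pick k seed centroids from `points` (2-D array) with the farthest-first
    traversal heuristic [HOCH85]: start from a random point, then repeatedly
    add the point whose minimum distance to the chosen set is largest."""
    points = np.asarray(points, dtype=float)
    chosen = [rng.randrange(len(points))]
    # Distance of every point to the nearest chosen centroid so far.
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    while len(chosen) < min(k, len(points)):
        nxt = int(np.argmax(dist))            # farthest from the visited set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

# Example: pick 3 seeds from 50 labeled points of one class.
seeds = farthest_first_centroids(np.random.rand(50, 4), k=3)
print(seeds.shape)  # (3, 4)
```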
E-Step: In the E-step, we assign each data point x to the cluster i that minimizes the point’s contribution
to the global objective function O_MCI-Kmeans(x).
Note that the value of the global objective function O_MCI-Kmeans depends on the order in which the
labeled points are assigned to clusters. It is computationally intractable to try all possible orderings
and choose the best one. However, there are some heuristic approaches that approximate the optimal
solution. We follow the iterated conditional modes (ICM) algorithm [BESA86]. This is implemented
as follows: at each iteration of ICM, we first randomly order the points. Then we assign the points (in
that order) to the cluster i that minimizes O_MCI-Kmeans(x). This is continued until no point changes its clus-
ter in successive iterations, which indicates convergence. According to [BESA86], ICM is guaranteed
to converge. The E-step completes after termination of ICM, and the program moves to the M-step.
M-Step: In the M-Step, we recompute each cluster centroid by averaging all the points in that
cluster:
u_i = \frac{\sum_{x \in X_i} x}{|X_i|}    (12.9)
After performing this step, the convergence condition is checked. If fulfilled, the procedure
terminates, otherwise another iteration of E-step and M-step is performed.
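Below is a compact Python sketch of the E-M procedure with an ICM-style E-step, using the simplified per-point cost ||x − u_i||^2 (1 + Imp_i) from Equation 12.4; impurities are recomputed once per iteration rather than after every reassignment, and initialization is random here rather than the proportionate farthest-first scheme, so this is an illustration, not the authors' implementation.

```python
import numpy as np

def mci_kmeans(X, y, K, max_em_iters=20, rng=np.random):
    """Semisupervised clustering sketch: y[i] in {1..C} for labeled points, 0 for
    unlabeled. Returns cluster centroids and point-to-cluster assignments."""
    n, _ = X.shape
    centroids = X[rng.choice(n, K, replace=False)]
    assign = np.zeros(n, dtype=int)

    def impurity(members):
        labels = y[members]
        labels = labels[labels > 0]                    # ignore unlabeled points here
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        ent = -(p * np.log2(p)).sum()                  # entropy of the cluster
        adc = (counts * (counts.sum() - counts)).sum()  # aggregated dissimilarity count
        return adc * ent                               # compound impurity

    for _ in range(max_em_iters):
        imps = np.array([impurity(np.where(assign == i)[0]) for i in range(K)])
        changed = False
        # E-step (one ICM pass): visit points in random order, pick the cheapest cluster.
        for j in rng.permutation(n):
            d2 = ((centroids - X[j]) ** 2).sum(axis=1)
            best = int(np.argmin(d2 * (1.0 + imps)))
            if best != assign[j]:
                assign[j] = best
                changed = True
        # M-step: recompute centroids as the mean of their members.
        for i in range(K):
            members = X[assign == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
        if not changed:                                # convergence: no reassignment
            break
    return centroids, assign
```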
classification model. Note that the number of microclusters in the model will become less than K if
any such deletions take place.
12.4.2 Ensemble Refinement
After a new model M′ has been trained with a partially labeled data chunk, the existing ensemble M
is refined with this model (line 3, Algorithm 12.1). Refinement is done if the latest partially labeled
data chunk Dn contains a class c, which is absent in all models of the ensemble M. This is possible
if either a completely new class appears in the stream or an old class reappears that has been absent
in the stream for a long time. Both of these happen because of concept evolution, and the class c is
denoted as an evolved class. Note that there may be more than one evolved class in the stream.
If there is any evolved class, M must be refined so that it can correctly classify future instances of
that class. Algorithm 12.2 describes how the existing model is refined.
Description of Refine-Ensemble (Algorithm 12.2): The algorithm starts (line 1) by checking whether
ensemble refinement is needed. This can be done in constant time by keeping a Boolean vector V
of size C per model, and setting V[c] = true during training if there is any labeled training instance
from class c. The function Need-to-refine (M) checks whether there is any class c such that V [c]
is false for all models Mi ∈ M, but true for M′. If there is such a class c, then c is an evolved class.
Refinement is needed only if there is an evolved class. Then the algorithm looks into each microclu-
ster m of the new model M′ (line 2). If the majority class of m is an evolved class (line 3), then we
do the following: for each model Mi ∈ M, we inject the microcluster m in Mi (line 7). Before inject-
ing a microcluster, we try to merge the closest pair of microclusters in Mi having the same majority
class (line 6). This is done to keep the number of microclusters constant (= K). However, merging
is done only if such a closest pair is found and |Mi|, the total number of microclusters in Mi, equals
K. Note that the first condition may fail (i.e., no such closest pair may be found) if |Mi| < C. In this case,
|Mi| is incremented after the injection. This ensures that if C, the number of classes, increases due
to concept evolution, the number of microclusters in each model also increases. In the extreme case
(not shown in the algorithm) when C exceeds K due to evolution, K is also incremented to ensure that
the relation K > C remains valid. The reasoning behind the refinement is as follows. Since no model
in ensemble M has knowledge of an evolved class c, the models will certainly misclassify any data
belonging to the class. By injecting microclusters of the class c, we introduce some data from this
class into the models, which reduces their misclassification rate. Figure 12.2 illustrates the ensemble
refinement process. The existing ensemble M consists of three models = {M1, M2, M3}. The circles
inside each model represent the microclusters. The numbers inside each microcluster represent the
class label of its majority class. None of the existing models contains any microcluster of class 3, but
the new model M′ has such a microcluster. Therefore, class 3 is an evolved class. All microclusters
belonging to class 3 are injected into the existing models. In order to keep the total number of clus-
ters constant, two nearest (same class) microclusters in each model are merged before the injection.
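A Python sketch of the refinement step, under the assumption that each model is a list of microcluster objects exposing majority_class, centroid, and merge(); these attribute names and the helper closest_same_class_pair are illustrative, not from the book.

```python
from itertools import combinations

def closest_same_class_pair(model):
    """Return the pair of microclusters in `model` with the same majority class
    whose centroids are closest, or None if no such pair exists."""
    best, best_d = None, float("inf")
    for a, b in combinations(model, 2):
        if a.majority_class != b.majority_class:
            continue
        d = sum((x - y) ** 2 for x, y in zip(a.centroid, b.centroid))
        if d < best_d:
            best, best_d = (a, b), d
    return best

def refine_ensemble(ensemble, new_model, K):
    """Sketch of Refine-Ensemble: inject evolved-class microclusters of the new
    model into every existing model, merging a closest same-class pair first so
    that a full model keeps K microclusters."""
    existing = {mc.majority_class for model in ensemble for mc in model}
    evolved = {mc.majority_class for mc in new_model} - existing
    if not evolved:
        return                                        # no evolved class -> nothing to do
    for mc in new_model:
        if mc.majority_class not in evolved:
            continue
        for model in ensemble:
            pair = closest_same_class_pair(model)
            if pair is not None and len(model) == K:  # merge to keep |model| = K
                a, b = pair
                model.remove(b)
                a.merge(b)
            model.append(mc)                          # inject the evolved-class microcluster
```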
It is obvious that when more training instances are provided to a model, its classification error is
more likely to reduce. However, if the same set of microclusters are injected in all the models, the
correlation among the models may increase, resulting in reduced prediction accuracy of the ensem-
ble. According to [TUME96], if the errors of the models in an L-model ensemble are independent,
then the added error (i.e., the error in addition to Bayes error) of the ensemble is 1/L times the added
error of a single model. However, the ensemble error may be higher if there is correlation among the
errors of the models. But even if correlation is introduced by injecting the microclusters, according
to Lemma 12.1, under certain conditions the overall added error of the ensemble is reduced after
injection. The lemma is based on the assumption that after injection, single model error monotoni-
cally decreases with increasing prior probability of class c. In other words, we assume that there is
a continuous monotonic decreasing function f (x), f (x) ∈ [0, 1] and x ∈ [0, 1], such that
E = f(\gamma_c) \cdot E^0    (12.11)

where E^0 and E are the single-model errors before and after injection, respectively, and \gamma_c is the
prior probability of class c. This function has the following special property: f(0) = 1, since γc = 0
means class c has not appeared at all, and no injection has been made. Lemma 12.1 quantifies an
upper bound of the function, that is, necessary for ensemble error reduction.
Lemma 12.1:
Let c be the evolved class, E_M^0 and E_M be the added errors of the ensemble before and after injection;
E^0 and E be the added errors of a single model before and after injection, respectively, and \gamma_c be the
prior probability of class c. Then the injection process will reduce the added error of the ensemble
provided that
f(\gamma_c) \le \frac{1}{1 + \gamma_c^2 (L - 1)}
Proof: According to [TUME96], the added error of the ensemble is

E_M = E \cdot \frac{1 + \delta (L - 1)}{L}    (12.12)
where L is the total number of models in the ensemble, and δ is the mean correlation among the
models, given by
\delta = \sum_{i=1}^{C} \gamma_i \delta_i    (12.13)
where γi is the prior probability of class i and δi is the mean correlation associated with class i, given
by [TUME96]
\delta_i = \frac{1}{L(L - 1)} \sum_{m=1}^{L} \sum_{l \ne m} Corr(\eta_i^m, \eta_i^l)    (12.14)
where Corr(\eta_i^m, \eta_i^l) is the correlation between \eta_i^m, the error of model m, and \eta_i^l, the error of model
l. For simplicity, we assume that the correlation between two models is proportional to the number
of instances that are common to both these models. That is, the correlation is 1 if they have all
instances in common, and 0 if they have no instances in common. So, before injection, the correla-
tion between any pair of models is zero (since the models are trained using disjoint training data).
As a result
E_M^0 = \frac{E^0}{L}    (12.15)
After injection, some instances of class c may be common among a pair of models, leading to
δc ≥ 0, where c is the evolved class.
Consider a pair of models m and l whose prior probabilities of class c after injection are \gamma_c^m and \gamma_c^l,
respectively. The correlation between m and l then becomes

Corr(\eta_c^m, \eta_c^l) = \frac{1}{2} (\gamma_c^m + \gamma_c^l)
\delta_c = \frac{1}{L(L - 1)} \sum_{m=1}^{L} \sum_{l \ne m} \frac{1}{2} (\gamma_c^m + \gamma_c^l)
        = \frac{1}{L(L - 1)} \cdot \frac{1}{2} \cdot 2(L - 1) \sum_{m=1}^{L} \gamma_c^m
        = \frac{1}{L} \sum_{m=1}^{L} \gamma_c^m = \bar{\gamma}_c    (12.16)
where \bar{\gamma}_c is the mean prior probability of class c over the models. Note that the mean prior probability
\bar{\gamma}_c represents the actual prior probability \gamma_c, so they can be used interchangeably. Substituting this
value of \delta_c in Equation 12.13,
\delta = \sum_{i=1}^{C} \gamma_i \delta_i = \gamma_c \delta_c + \sum_{i=1, i \ne c}^{C} \gamma_i \delta_i = (\gamma_c)^2 + 0 = (\gamma_c)^2
since \delta_i = 0 for every nonevolved class, as no instance of those classes is common between any pair of
models. Now, substituting this value of δ in Equation 12.12, we obtain
E_M = E \cdot \frac{1 + \gamma_c^2 (L - 1)}{L}
    = f(\gamma_c) \cdot E^0 \cdot \frac{1 + \gamma_c^2 (L - 1)}{L} \quad \text{(using Equation 12.11)}
    = \frac{E^0}{L} \cdot f(\gamma_c) \left( 1 + \gamma_c^2 (L - 1) \right)    (12.17)
    = E_M^0 \cdot f(\gamma_c) \left( 1 + \gamma_c^2 (L - 1) \right) \quad \text{(using Equation 12.15)}
Now, we will have an error reduction provided that E_M \le E_M^0, which leads to

f(\gamma_c) \left( 1 + \gamma_c^2 (L - 1) \right) \le 1, \quad \text{that is,} \quad f(\gamma_c) \le \frac{1}{1 + \gamma_c^2 (L - 1)}
From Lemma 12.1, we can infer that the function f(·) becomes more restricted as the values of
γc and/or L increase. For example, for γc = 0.5, if L = 10, then f(γc) must be ≤ 0.31, meaning that
E ≤ 0.31*E0 is required for error reduction. For the same value of γc, if L = 2, then E ≤ 0.8*E0 is
required for error reduction. However, in our experiments, we have always observed error reduction
after injection, that is, inequality (12.17) has always been satisfied. Still, we recommend that the
value of L be kept within 10 for minimizing the risk of violating inequality (12.17).
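The quoted numbers can be checked with a few lines of Python; f_upper_bound is a hypothetical helper implementing the right-hand side of the lemma's inequality.

```python
def f_upper_bound(gamma_c, L):
    """Upper bound on f(gamma_c) from Lemma 12.1 that guarantees the ensemble's
    added error does not increase after injection."""
    return 1.0 / (1.0 + gamma_c ** 2 * (L - 1))

# Reproduces the numbers quoted above for gamma_c = 0.5.
print(round(f_upper_bound(0.5, 10), 2))  # 0.31
print(round(f_upper_bound(0.5, 2), 2))   # 0.8

# The bound tightens as gamma_c or L grows.
for L in (2, 6, 10):
    print(L, [round(f_upper_bound(g, L), 3) for g in (0.1, 0.3, 0.5)])
```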
12.4.3 Ensemble Update
After the refinement, the ensemble is updated to adapt to the concept drift in the stream. This is
done as follows. We have now L + 1 models: L models from the ensemble and the newly trained
model M′. One of these L + 1 models is discarded, and the rest of them construct the new ensemble.
The victim is chosen by evaluating the accuracy of each of these L + 1 models on the labeled
instances in the training data Dn. The model having the worst accuracy is discarded.
12.4.4 Time Complexity
The ensemble training process consists of three main steps: (1) creating clusters using E–M; (2)
refining the ensemble; and (3) updating the ensemble. Step (2) requires O(KL) time, and step (3)
requires O(KLPS) time, where P is the proportion of labeled data (P ≤ 1) in the chunk and S is
the chunk size. Step (1) (E-M) requires O(KS·Iicm·Iem) time, where Iicm is the average number of
ICM iterations per E-step and Iem is the total number of E-M iterations. Although it is not possible
to find the exact values of Iicm and Iem analytically, we obtain an approximation by observation.
We observe from our experiments that Iem depends only on the chunk-size S, and Iicm is constant
(≈2) for any dataset. On average, a data chunk having 1000 instances requires 10 E-M iterations to
converge. This increases sublinearly with chunk size. For example, a 2000 instance chunk requires
14 E-M iterations and so on. There are several reasons for this fast convergence of E-M, such as:
(1) proportionate initial seed selection from the labeled data using the farthest-first traversal and
(2) using the compound impurity measure in the objective function. Therefore, the overall time
complexity of the ensemble training process of SmSCluster is O(KS(LP + g(S))), where g(·) is a
sublinear function. This complexity is almost linear in S for a moderate chunk size. The time
complexity of ensemble classification is O(KLS), which is also linear in S for fixed values of K and L.
12.5 EXPERIMENTS
In this section, we discuss the datasets used in the experiments, the system setup, and the results.
12.5.1 Dataset
We apply our technique on two synthetic and two real datasets. We generate two different kinds of
synthetic datasets: concept drifting and concept drifting with concept evolving. The former dataset
simulates only concept drift, whereas the latter simulates both concept drift and concept evolution.
One of the two real datasets is the 10% version of the KDD cup 1999 intrusion detection dataset
Data Stream Classification with Limited Labeled Training Data 161
[KDD99]. The other one is the Aviation Safety Reporting Systems (ASRS) dataset obtained from
NASA [NASA]. All of these datasets are discussed in the following paragraphs.
Concept-drifting synthetic dataset (SynD): We use this dataset in order to show that our approach
can handle concept drift. SynD data are generated using a moving hyperplane technique. The equa-
tion of a hyperplane is as follows:
\sum_{i=1}^{d} a_i x_i = a_0
where d is the total number of dimensions, a_i is the weight associated with dimension i, and x_i is the
value of the ith dimension of a data point x. If \sum_{i=1}^{d} a_i x_i \le a_0, then an example is considered negative;
otherwise, it is considered positive. Each instance is a randomly generated d-dimensional vector {x1,
…, xd}, where xi ∈ [0, 1]. Weights {a1, …, ad} are also randomly initialized with a real number in
the range [0, 1]. The value of a0 is adjusted so that roughly the same number of positive and nega-
tive examples is generated. This can be done by choosing a_0 = \frac{1}{2} \sum_{i=1}^{d} a_i. We also introduce noise
randomly by switching the labels of p% of the examples, where p = 5 is set in our experiments.
There are several parameters that simulate concept drift. The parameter m specifies the percent of
total dimensions whose weights are involved in changing, and it is set to 20%. The parameter t speci-
fies the magnitude of the change in every N examples. In our experiments, t is varied from 0.1 to 1.0,
and N is set to 1000. si, i ∈ {1, …, d} specifies the direction of change for each weight. Weights change
continuously, that is, a_i is adjusted by s_i \cdot t/N after each example is generated. There is an r% probability
that the change will reverse direction after every N examples are generated. In our experiments,
r is set to 10%. We generate a total of 250,000 instances and divide them into equal-sized chunks.
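A rough Python sketch of the SynD generator under the parameterization described above (m, t, N, r, p); for brevity it reverses the drift direction of all drifting weights together, and the default values are illustrative.

```python
import random

def synd_stream(n_points, d=10, m=0.2, t=0.1, N=1000, r=10, p=5, seed=0):
    """Moving-hyperplane stream sketch; yields (x, label) pairs."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(d)]           # weights a_1..a_d in [0, 1]
    s = [1] * d                                     # drift direction of each weight
    drifting = rng.sample(range(d), int(m * d))     # m% of the weights drift
    for i in range(n_points):
        a0 = 0.5 * sum(a)                           # keeps the classes roughly balanced
        x = [rng.random() for _ in range(d)]
        label = 1 if sum(ai * xi for ai, xi in zip(a, x)) > a0 else 0
        if rng.random() < p / 100.0:                # p% label noise
            label = 1 - label
        yield x, label
        for j in drifting:                          # continuous concept drift
            a[j] += s[j] * t / N
        if (i + 1) % N == 0 and rng.random() < r / 100.0:
            s = [-sj for sj in s]                   # reverse the drift direction

# Example: 250,000 instances consumed chunk by chunk.
stream = synd_stream(250_000)
```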
Concept drifting with concept-evolving synthetic dataset (SynDE): SynDE dataset simulates
both concept drift and concept evolution. That is, new classes appear in the stream as well as old
classes disappear, and at the same time, the concept for each class gradually changes over time. The
dataset size is varied from 100 to 1000K points. The number of class labels is varied from 5 to 40,
and data dimensions are varied from 20 to 80. Data points belonging to each class are generated
by following a normal distribution having different mean (−5.0 to +5.0) and variance (0.5–6) for
different classes. In order to simulate the evolving nature of data streams, the prior probabilities
of different classes are varied with time. This has caused some classes to appear and some other
classes to disappear at different times in the stream history. In order to simulate the drifting nature
of the concepts, the class mean of each class is gradually changed in a way similar to the SynD
dataset. Different synthetic datasets are identified by an abbreviation: <size>C<#of classes>D<#of
dimensions>. For example, 300KC5D20 denotes a dataset having 300K points, 5 classes, and
20 dimensions.
Real dataset-KDDC up 99 network intrusion detection (KDD): This dataset contains TCP con-
nection records extracted from LAN network traffic at MIT Lincoln Labs over a period of two
weeks. We have used the 10% version of the dataset, which is more concentrated than the full ver-
sion. Here different classes appear and disappear frequently. Each instance in the dataset refers to
either to a normal connection or an attack. There are 22 types of attacks, such as buffer-overflow,
port sweep, guess-passwd, neptune, rootkit, smurf, spy, etc. So, there are 23 different classes of
data, most which are normal. Each record consists of 42 attributes, such as connection duration, the
number bytes transmitted, number of root accesses, etc. We use only the 34 continuous attributes,
and remove the categorical attributes.
Real dataset-Aviation Safety Reporting Systems (ASRS): This dataset contains around 150,000
text documents. Each document is actually a report corresponding to a flight anomaly. There are a
total of 55 anomalies, such as “aircraft equipment problem: critical,” “aircraft equipment problem:
less severe,” “inflight encounter: birds,” “inflight encounter: skydivers,” “maintenance problem:
improper documentation,” etc. Each of these anomalies is considered as a “class.” These documents
represent a data stream since it contains the reports in order of their creation time, and new reports
are being added to the dataset on a regular basis.
We perform several preprocessing steps on this dataset. First, we discard the classes that con-
tain very few (less than 100) documents. We choose 21 classes among the 55, which reduced
the total number of selected documents to 125,799. Second, each text report is “normalized”
by removing capitalization, expanding some abbreviations, and so on. Third, we extract word
features from this corpus, and select the best 1000 features based on information gain. Then
each document is transformed into a binary feature vector, where the value corresponding to a
feature is “one” if the feature (i.e., word) is present or “zero” if it is not present in the document.
The instances in the dataset are multilabel, meaning that an instance may have more than one class
label. We transform the multilabel classification problem into 21 separate binary classification
problems by generating 21 different datasets from the original dataset, one for each class. The
dataset for ith class is generated by marking the instances belonging to class i as positive, and all
other instances as negative. When reporting the accuracy, we report the average accuracy of the
21 datasets.
An example of a normalized text report is as follows:
cleared direct private very high frequency omnidirectional radio range after takeoff bos. using right
navigation flight management system and omega bos center advised we missed private very high fre-
quency omnidirectional radio range by 20 miles. upon checking found both flight management system
and omega in gross error. advised center of same and requested airways flight plan. received same.
malfunction recorded in aircraft log for maintenance action.
12.5.2 Experimental Setup
Hardware and software: We implement the algorithms in Java. The experiments were run on a
Windows-based Intel P-IV machine with 2 GB memory and 3 GHz dual processor CPU.
Parameter settings: The default parameter settings are as follows, unless mentioned otherwise:
Baseline method: We compare our algorithm with “On Demand Stream,” proposed by Aggarwal
et al. [AGGA06]. We will refer to this approach as “OnDS.” We run our own implementation of
the OnDS and report the results. For the OnDS, we use all the default values of its parameters, and
set buffer size = 1600 and stream speed = 80 for real datasets, and buffer size = 1000 and stream
speed = 200 for synthetic datasets, as proposed by the authors. However, in order to ensure a fair
comparison, we make a small modification to the original OnDS algorithm. The original algorithm
assumed that in each data chunk, 50% of the instances are labeled and the rest are unlabeled.
The labeled instances were used for training, and the unlabeled instances were used for testing
and validation. As mentioned earlier, this assumption is even more impractical than assuming that
a single stream contains both training and test instances. Therefore, in the modified algorithm, we
assume that all the instances in a new data chunk are unlabeled, and test all of them using the exist-
ing model. After testing, the data chunk is assumed to be completely labeled, and all the instances
are used for training.
FIGURE 12.3 Cumulative accuracy (a) and ROC curve (b) for SynD dataset.
When training ReaSC, we consider that only 20% randomly chosen instances in a chunk have labels
(i.e., P = 20), whereas for training OnDS, 100% instances in the chunk are assumed to have labels.
So, if there are 100 data points in a chunk, then OnDS has 100 labeled training data points, but ReaSC
has only 20 labeled and80 unlabeled training instances. Also, for a fair comparison, the chunk size of
ReaSC is always kept equal to the buffer size of OnDS. Note that P is not a parameter of ReaSC, rather,
it is a threshold assigned by the user based on the available system resources to label data points.
Evaluation: For each competing approach, we use the first three chunks to build the initial classi-
fication model, which can be thought of as a warm-up period. From the fourth chunk onward, we first
evaluate the classification accuracy of the model on that chunk, then use the chunk as training data
to update the model. Each method is run 20 times on each dataset, and the average result is reported.
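A schematic Python sketch of this test-then-train protocol; build_initial_model, test, and update are placeholders for the competing methods' operations, not APIs from the book.

```python
def evaluate_stream(chunks, build_initial_model, test, update, warmup=3, runs=20):
    """Warm up on the first `warmup` chunks, then test each later chunk before
    using it for training; results are averaged over `runs` repetitions."""
    run_accuracies = []
    for _ in range(runs):
        model = build_initial_model(chunks[:warmup])
        accs = []
        for chunk in chunks[warmup:]:
            accs.append(test(model, chunk))   # evaluate on the unseen chunk first
            update(model, chunk)              # then use it as training data
        run_accuracies.append(sum(accs) / len(accs))
    return sum(run_accuracies) / len(run_accuracies)
```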
FIGURE 12.4 Cumulative accuracy (a) and ROC curve (b) for SynDE dataset.
FIGURE 12.5 Cumulative accuracy (a) and ROC curve (b) for KDD dataset.
FIGURE 12.6 Cumulative accuracy (a) and ROC curve (b) for ASRS dataset.
Figure 12.5 shows the chunk number (No) vs. cumulative accuracy and ROC curves for KDD
dataset. The KDD dataset has a lot of concept evolution, almost all of which occurs within the
first 120 chunks. The accuracy of OnDS is 2%–12% lower than ReaSC in this region. So, ReaSC
handles concept evolution better than OnDS in real data too. However, in the KDD dataset, most of
the instances belong to the “normal” class. As a result, the class distribution is skewed, and simple
accuracy does not reflect the true difference in performances. The ROC curves shown in Figure
12.5b reflect the performances of these two methods more precisely. The AUC of ReaSC is found to
be 10% higher than OnDS, which is a great improvement. Finally, Figure 12.6 shows the accuracy
and ROC curves for ASRS dataset. Recall that these graphs are generated by averaging the accura-
cies and ROC curves from 21 individual binary classification results.
Again, here ReaSC achieves accuracy at least 3% higher than OnDS at all stream positions. Besides,
the AUC of ReaSC on this dataset is 8% higher than that of OnDS. OnDS performs comparatively better
here because this dataset does not have any concept drift.
Again, recall that in all these experiments OnDS uses five times more labeled data for training
than ReaSC; still, ReaSC outperforms OnDS on all datasets, both in accuracy and in AUC.
TABLE 12.2
Comparison of Running Time (Excluding Labeling Time) and Classification Speed between
OnDS (with 100% Labeled Data) and ReaSC (with 20% Labeled Data)

          Time (s/1000 pts)                           Classification Speed (pts/s)
Dataset   OnDS (100% Labeled)   ReaSC (20% Labeled)   OnDS (100% Labeled)   ReaSC (20% Labeled)
SynD      0.88                  1.34                  1,222                 6,248
SynDE     1.57                  1.72                  710                   4,033
KDD       1.54                  1.32                  704                   3,677
ASRS      30.90                 10.66                 38                    369

Note: Bold numbers in this table represent the best run times.
TABLE 12.3
Comparison of Running Time Including Labeling Time on Real Datasets

          Labeling Time (s/1000 pts)                  Total Time (s/1000 pts)
Dataset   OnDS (100% Labeled)   ReaSC (20% Labeled)   OnDS (100% Labeled)   ReaSC (20% Labeled)
KDD       1,000                 200                   1,001.54              201.32
ASRS      60,000                12,000                60,030.92             12,010.66

Note: Bold numbers in this table represent the best run times.
FIGURE 12.7 Running times on different datasets having higher dimensions (D) and number of classes (C).
In Figure 12.7, we report the scalability of ReaSC on high-dimensional and multiclass SynDE
data. This graph reports the running times of ReaSC for different dimensions (20–60) of synthetic
data with different number of classes (10–40). Each of these synthetic datasets has 250K points. For
example, for C = 10, and D = 20, the running time is 431 s, and it increases linearly with the number
of classes in the data. On the other hand, for a particular value of C (e.g., C = 10), the running time
increases very slowly (linearly) with an increasing number of dimensions in the data. For example,
for C = 10, the running times for 20, 40, and 60 dimensions are 431, 472, and 522 s, respectively.
Thus, we may conclude that ReaSC scales linearly with dimensionality and the number of class labels.
The memory requirement for ReaSC is O(D ∗ K ∗ L), whereas that of OnDS is O(D ∗ microclus-
ter ratio ∗ max capacity ∗ C ∗ log(N)), where N is the total length of the stream. Thus, the memory
requirement of ReaSC is constant, whereas that of OnDS grows with stream length. For high-
dimensional datasets, this requirement may not be practical. For example, for the ASRS dataset,
ReaSC requires less than 10 MB memory, whereas OnDS requires approximately 700 MB memory.
12.5.5 Sensitivity to Parameters
All the following results are obtained using a SynDE dataset (B250K, C10, D20). Figure 12.8 shows
how accuracy varies with chunk size (S) and the percentage of labeled instances in each chunk (P). It is
obvious that higher values of P lead to better classification accuracy since each model is better trained.
For any particular chunk size, the improvement gradually diminishes as P approaches 100. For
example, a stream with P = 10 has five times more labeled data than the one with P = 2. As a result,
the accuracy improvement is also rapid from P = 2 to P = 10. But a stream with P = 75 has only 1.5
times more labeled data than a stream with P = 50, so the accuracy improvement in this case is much
smaller than in the former case. We also observe higher accuracy for larger chunk sizes. This is because,
as the chunk size is increased, each model gets trained with more data, which leads to better classification
accuracy. This improvement also diminishes gradually because of concept drift. According to
[WANG03], if there is concept drift in the data, then a larger chunk contains more outdated points,
canceling out any improvement expected to be gained by increasing the training set size.

FIGURE 12.8 Sensitivity to chunk size (S) for different percentages of labeled data (P).
Figure 12.9a shows how classification accuracy varies for ReaSC with the number of microclu-
sters (K). We observe that higher values of K lead to better classification accuracies. This happens
because when K is larger, smaller and more compact clusters are formed, leading to a finer-grained
classification model for the nearest neighbor classifier. However, there is no significant improvement
after K = 50 for this dataset, where C = 10. It should be noted that K should always be much larger
than C. Experimental results suggest that K should be between 2C and 5C for best performance.
Figure 12.9b shows the effect of ensemble size (L) on accuracy. Intuitively, increasing the ensem-
ble size helps to reduce error. Significant improvement is achieved by increasing the ensemble size
from 1 (i.e., single classifier) to 2. After that, the improvement diminishes gradually. Increasing
the ensemble size also increases the classification time. Besides, correlation among the classifiers
increases in the event of concept evolution, which diminishes the improvement intended by the
ensemble. So, a reasonable value is chosen depending on the specific requirements of a system.
FIGURE 12.9 Sensitivity to number of clusters (K) (a) and ensemble size (L) (b).
From the results shown in this section, we can conclude that ReaSC outperforms OnDS in all
datasets. There are two main reasons behind this. First, ReaSC considers both the dispersion and
impurity measures in building clusters, but OnDS considers only purity, since it applies the K-means
algorithm to each class separately. Besides, ReaSC uses proportionate initialization, so that more
clusters are formed for the larger classes (i.e., classes having more instances). But OnDS builds
equal number of clusters for each class, so clusters belonging to larger classes tend to be bigger (and
more sparse). Thus, the clusters of ReaSC are likely to be more compact than those of the OnDS.
As a result, the nearest neighbor classification gives better prediction accuracy in ReaSC. Second,
ReaSC applies ensemble classification, rather than the “horizon fitting” technique used in OnDS.
Horizon fitting selects a horizon of training data from the stream that corresponds to a variable-
length window of the most recent (contiguous) data chunks. It is possible that one or more chunks
in that window have become outdated, resulting in a less accurate classification model. This is because
the training data that best represent the current concept are not necessarily contiguous.
But ReaSC always keeps the best training data (or models), which are not necessarily con-
tiguous. So, the ensemble approach is more flexible in retaining the most up-to-date set of training
data, resulting in a more accurate classification model.
It would be interesting to compare ReaSC with some other baseline approaches. First, consider
a single combined model that contains all the K ∗ L clusters in the ensemble M. We argue that this
combined model is no better than the ensemble of models because our analysis shows that increas-
ing the number of clusters beyond a certain threshold (e.g., 100) does not improve classification
accuracy. Since K is chosen to be close to this threshold, it is most likely that we would not get a
better model out of the K∗ L clusters. Second, consider a single model having K clusters (not exceed-
ing the threshold) built from L data chunks. Increasing the training set size would most likely
improve classification accuracy. However, in the presence of concept drift, it can be shown that a
single model built from L consecutive data chunks has a prediction error no less than an ensemble
of L models, each built on a single data chunk [WANG03]. This also follows from our experimental
results that a single model built on L chunks has 5%–10% worse accuracy than ReaSC and is at
least L times slower than ReaSC.
REFERENCES
[AGGA06]. C.C. Aggarwal, J. Han, J. Wang, P.S. Yu, “A Framework for On-Demand Classification of Evolving
Data Streams,” IEEE Transactions on Knowledge and Data Engineering, 18(5), 577–589, 2006.
[BESA86]. J. Besag, “On the Statistical Analysis of Dirty Pictures,” Journal of the Royal Statistical Society,
Series B (Methodological), 48(3), 259–302, 1986.
[DEMP77]. A.P. Dempster, N.M. Laird, D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM
Algorithm,” Journal of the Royal Statistical Society B, 39, 1–38, 1977.
[HOCH85]. D. Hochbaum and D. Shmoys, “A Best Possible Heuristic for the K-Center Problem,” Mathematics
of Operations Research, 10(2), 180–184, 1985.
[HUYS07]. G.B. van Huyssteen, M.J. Puttkammer, S. Pilon, H.J. Groenewald, “Using Machine Learning
to Annotate Data for NLP Tasks Semi-Automatically,” In CALP ’07: Proceedings of the RANLP-07
Workshop on Computer-Aided Language Processing, September 27–29, Borovets, Bulgaria, 2007.
http://rgcl.wlv.ac.uk/events/CALP07/papers/3.pdf.
[KDD99]. KDD Cup 1999 Intrusion Detection Dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.
html.
[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM ’08: Proceedings of
the 2008 International Conference on Data Mining, Pisa, Italy, 15–19 Dec, pp. 929–934, 2008, IEEE
Computer Society.
[NASA]. NASA Aviation Safety Reporting System. http://akama.arc.nasa.gov/ASRSDBOnline/QueryWizard_
Begin.aspx.
[TUME96]. K. Tumer and J. Ghosh, “Error Correlation and Error Reduction in Ensemble Classifiers,”
Connection Science, 8(3–4), 385–403, 1996.
[WANG03]. H. Wang, W. Fan, P.S. Yu, J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” In KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Washington, DC, USA, Aug 24–27, pp. 226–235, 2003, ACM.
13 Directions in Data
Stream Classification
13.1 INTRODUCTION
We have discussed three major approaches for stream analytics in Chapters 10 through 12.
In Chapter 10, we described our innovative technique for classifying concept-drifting data streams
using a novel ensemble classifier originally discussed in [MASU09a]. It is a multiple partition,
multiple chunk (MPC) ensemble classifier-based data mining technique to classify concept-drifting
data streams. Existing ensemble techniques in classifying concept-drifting data streams follow a
single-partition, single-chunk approach in which a single data chunk is used to train one classifier.
In our approach, we train a collection of v classifiers from r consecutive data chunks using v-fold
partitioning of the data, and build an ensemble of such classifiers. By introducing this MPC ensem-
ble technique, we significantly reduce classification error compared to the single-partition, single-
chunk ensemble approaches. We have theoretically justified the usefulness of our algorithm and
empirically demonstrated its effectiveness over other state-of-the-art stream classification techniques on
synthetic data and real botnet traffic.
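As a minimal sketch of the MPC idea (assuming scikit-learn decision trees as the base learner; the real implementation and its parameter choices are described in Chapter 10), the r most recent chunks are merged, split into v folds, and one classifier is trained on each combination of v − 1 folds:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def train_mpc(chunks_X, chunks_y, v=5):
    # chunks_X, chunks_y hold the r most recent data chunks.
    X, y = np.vstack(chunks_X), np.concatenate(chunks_y)
    classifiers = []
    for train_idx, _ in KFold(n_splits=v, shuffle=True).split(X):
        # each classifier sees (v - 1)/v of the r-chunk training data
        classifiers.append(DecisionTreeClassifier().fit(X[train_idx], y[train_idx]))
    return classifiers   # v new models to be considered for the ensemble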
In Chapter 11, we described a novel and efficient technique that can automatically detect the
emergence of a novel class in the presence of concept drift by quantifying cohesion among unla-
beled test instances and separation of the test instances from training instances. Our approach is
nonparametric, meaning it does not assume any underlying distribution of the data. Comparison with
state-of-the-art stream classification techniques demonstrates the superiority of our approach. That
chapter presented our proposed framework for classifying data streams with an automatic novel class
detection mechanism, based on our previous work [MASU09a].
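A simplified version of the cohesion/separation test is sketched below (the neighborhood size q and the use of raw Euclidean distances are our assumptions; the actual criterion is developed in Chapter 11):

import numpy as np

def novel_class_score(x, unlabeled_buffer, training_centroids, q=5):
    # cohesion: mean distance from x to its q nearest unlabeled test instances
    d_test = np.sort(np.linalg.norm(unlabeled_buffer - x, axis=1))[:q].mean()
    # separation: distance from x to the closest training microcluster centroid
    d_train = np.linalg.norm(training_centroids - x, axis=1).min()
    # a positive score suggests x is closer to the other unlabeled instances than
    # to any existing class, i.e., it may belong to a novel class
    return (d_train - d_test) / max(d_train, d_test)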
In Chapter 12, we discussed the building of a classification model from a training set having
both unlabeled and a small number of labeled instances. The model is built as a set of microclusters
using a semisupervised clustering technique, and classification is performed with the k-nearest neighbor
algorithm. An ensemble of these models is used to classify the unlabeled data. Empirical evaluation
on both synthetic data and real botnet traffic reveals that our approach, using only a small amount
of labeled data for training, outperforms state-of-the-art stream classification algorithms that use
20 times more labeled data. That chapter described our proposed solution to the limited labeled
training data problem, based on our work in [MASU08].
In this chapter, we compare the three data stream classification approaches developed in the previous
chapters and discuss directions for possible extensions to them. The organization of this chapter is as follows. A summary
of the three approaches is provided in Section 13.2. Some extensions are discussed in Section 13.3.
Summary and directions are provided in Section 13.4.
13.3 EXTENSIONS
Our proposed data stream classification techniques can be extended in various ways. The first and
most obvious extension would be to develop a unified framework that integrates all three proposed
techniques. At present, we treat the novel class detection and limited labeled data problems separately.
In the unified framework, the classification model should be able to detect novel classes even if there
are only a few labeled instances per chunk. It would be interesting to see how the scarcity of labeled data
affects the proposed novel class detection technique. All of the following extensions are applicable
to this unified framework.
Dynamic Feature Vector: We would like to address the data stream classification problem under
dynamic feature vector. Currently, we assume that the feature vector is fixed. We would like to
relax this assumption, and provide a more general framework for data stream classification where
the feature vector may change over time. This would be useful for classifying text streams and other
similar data streams in which new features evolve over time. For example, suppose the data stream is
a stream of text documents where the feature vector consists of a list of all words that have appeared so
far in the stream. Since the vocabulary is expected to grow with the arrival of new documents,
the feature vector also grows over time. As a result, each classifier in the ensemble would be built
on a different feature vector. Also, a test instance would likely have a different feature vector from
the training instances of a classifier. It would be a challenging task to classify the test instances and
detect novel classes in these scenarios.
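One simple way to reconcile differing feature vectors is shown below as a hedged sketch (the dictionary-based document representation and the zero-default rule are illustrative choices, not the method proposed in this book): each test document is projected onto the feature list a given classifier was trained with.

def project(document, model_features):
    # Keep only the features the model knows; features the model expects but the
    # document lacks default to zero, and unseen new words are simply ignored.
    return [document.get(f, 0.0) for f in model_features]

doc = {"login": 2, "failed": 1, "sudo": 1}           # "sudo" appeared later in the stream
older_model_features = ["login", "failed", "passwd"]
print(project(doc, older_model_features))            # [2, 1, 0.0]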
Multilabel Instances: In some classification problems, an instance may have multiple class labels,
which are known as multilabel instances. The multilabel classification problem is more generalized
than the traditional classification problem where each instance belongs to only one class. An example
of a multilabel instance is the NASA Aviation Safety Reporting System dataset [NASA], which has
been discussed in detail in Section 12.5. Although there are many existing solutions to the multilabel
classification problem ([CHEN07], [CLAR01], [GAOS04], [LUO05], [MCDO05], [STRE08],
[THAB04], [VELO07], [ZHAN07]), none of them are applicable to data streams. Moreover, the
problem becomes more complicated in the presence of concept drift and novel classes. It will be
interesting to investigate the effects of multilabel instances on our proposed stream classification
technique and extend it to cope with them.
Cloud Computing and Big Data: With so much streaming data being generated for various
applications, we need to develop scalable stream data mining techniques that can handle massive
amounts of data. Therefore, we need to extend our techniques to operate in a cloud computing
framework. As data streams grow larger and faster, it would be difficult to mine them using a
stand-alone computing machine and limited data storage. Therefore, in the future we would need to
utilize the computing power and storage capacities from several computing machines. This would
necessitate adapting our proposed technique on the cloud computing infrastructure. One such infra-
structure is the Hadoop Distributed File System ([CHU07], [DEAN08], [HUSA09]). In order to
facilitate classification, raw data will be distributed among the different nodes of the system. Stream
data will be stored in this file system by partitioning each data chunk into a number of blocks and
storing each block on a separate node, as illustrated in Figure 13.1. For example, without loss of
generality, let us assume that a node can hold at least 128 MB of data on disk and that the block size is
64 MB. If the chunk size is 256 MB, the chunk will be partitioned into four data blocks, and these four
blocks will be distributed across two nodes (without replication). On the other hand, if the chunk size is
64 MB, the whole chunk can be stored on a single node. Hence, each node can process its portion of a
chunk in parallel and independently, which can speed up query processing and classification. Each node
will train its own classification model from the raw data stored on that node; after training, the raw data
will be discarded. However, there must be a way to combine the classification outputs of the individual
nodes into a global output, for example, by majority voting.
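The following sketch mirrors the example above (a 256 MB chunk split into 64 MB blocks); the map and reduce functions and the decision tree base model are illustrative assumptions rather than the book's actual Hadoop jobs:

from collections import Counter
from sklearn.tree import DecisionTreeClassifier

CHUNK_MB, BLOCK_MB = 256, 64
NUM_BLOCKS = CHUNK_MB // BLOCK_MB        # 4 blocks, spread over the available nodes

def map_train(block_X, block_y):
    # Runs on each node: train a local model from the block stored on that node;
    # the raw block can be discarded afterward.
    return DecisionTreeClassifier().fit(block_X, block_y)

def reduce_vote(local_models, x):
    # Combine the per-node outputs into a single global decision by majority voting.
    votes = [m.predict([x])[0] for m in local_models]
    return Counter(votes).most_common(1)[0][0]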
In addition to using a cloud computing framework, we also need to explore the use of the big data
management and analytics (BDMA) technologies discussed in Chapter 7. For example, the stream-
ing data may need to be stored in systems such as HBase and CouchDB. Furthermore, the enhanced
Weka techniques for handling big data, also discussed in Chapter 7, have to be examined for big data
stream mining.
Dynamic Chunk Size and Ensemble Size: In the proposed approach, we use fixed chunk size
and ensemble size. In the future, we would like to make the system more adaptive to concept drift
and concept evolution by adapting both the chunk size and the ensemble size. It has been shown in the
past that if the chunk size is increased when concept drift is slow and decreased when it is faster, it is
possible to improve classification performance ([KLIN00], [WIDM96]). Also, keeping older classifiers
in the ensemble hurts performance when concept drift and concept evolution occur too fast, and
improves performance when they are slow. Therefore, dynamically changing the ensemble size also
contributes to the overall improvement in classification accuracy. However, the main challenge is to
determine whether concept drift and concept evolution are occurring, and at what pace ([GARC06],
[DRIE09], [GAMA04], [HIDO08]).
FIGURE 13.1 Architecture for stream data classification using Hadoop Distributed File System.
Parameter Reduction: We also have several other parameters that need to be tuned to achieve optimum
performance, for example, the number of clusters, K, and the minimum neighborhood size, q. In the
future, we would like to make these parameters adapt to the dynamic nature of the data streams and
change their values accordingly.
Real-Time Classification: We would like to extend our system to perform real-time stream clas-
sification. Real-time data stream classification is important in many applications, such as intrusion
detection systems. We need to optimize both classification and training times for the system to
be applicable to a real-time environment. In order to get a fast and accurate classification decision,
we need to consider a number of issues. First, an instance must be classified using a fast classifica-
tion model. Second, the classification model must be updated quickly, and also the update should
not delay the classification of a test instance. Third, the system should be able to extract features
from raw data quickly, and present the feature vector to the classification model.
Feature Weighting: We would like to incorporate feature weighting and distance learning in the
semisupervised clustering, which should lead to a better classification model. Feature weighting and
distance learning have been used in many semisupervised clustering tasks in the past ([BASU04],
[BASU06], [BILE04]). However, the learning process would be more challenging in a streaming
environment in the presence of concept drift and novel classes.
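As a minimal illustration of feature weighting in the clustering distance (the weights shown are hypothetical; in practice they would be learned, for example from pairwise constraints as in [BILE04]):

import numpy as np

def weighted_distance(x, centroid, w):
    # Weighted Euclidean distance: more influential features receive larger weights.
    return float(np.sqrt(np.sum(w * (x - centroid) ** 2)))

w = np.array([2.0, 0.5, 1.0])   # e.g., learned feature weights
print(weighted_distance(np.array([1.0, 4.0, 0.0]), np.array([0.0, 4.0, 1.0]), w))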
REFERENCES
[BASU04]. S. Basu, A. Banerjee, R.J. Mooney, “Active Semi-Supervision for Pairwise Constrained Clustering,”
SDM ’04: Proceedings of the 2004 SIAM International Conference on Data Mining, April 22–24, Lake
Buena Vista, FL, pp. 333–344, SIAM, 2004.
[BASU06]. S. Basu, M. Bilenko, A. Banerjee, R.J. Mooney, “Probabilistic Semi-Supervised Clustering
with Constraints,” Semi-Supervised Learning, O. Chapelle, B. Schoelkopf, A. Zien, editors, MIT
Press, Cambridge, MA, 73–102, 2006.
[BILE04]. M. Bilenko, S. Basu, R.J. Mooney, “Integrating Constraints and Metric Learning in Semi-
Supervised Clustering,” ICML ’04: Proceedings of the Twenty-First International Conference on
Machine Learning, July 4–8, Banff, Canada, pp. 81–88, Morgan Kaufmann, 2004.
[CHEN07]. W. Chen, J. Yan, B. Zhang, Z. Chen, Q. Yang, “Document Transformation for Multi-Label Feature
Selection in Text Categorization,” ICDM ’07: Proceedings of the 2007 International Conference on
Data Mining, October 28–31, Omaha, NE, pp. 451–456, IEEE Computer Society, 2007.
[CHU07]. C.-T. Chu, S.K. Kim, Y.-A. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, “Map-Reduce for Machine Learning
on Multicore,” NIPS ’07: Proceedings of the 21st Annual Conference on Neural Information Processing
Systems, December 3–7, Vancouver, B.C., Canada, pp. 281–288, MIT Press, 2007.
[CLAR01]. A. Clare and R.D. King, “Knowledge Discovery in Multi-Label Phenotype Data,” PKDD ’01:
Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery,
September 3–7, Freiburg, Germany, pp. 42–53, Springer-Verlag, 2001.
[DEAN08]. J. Dean and S. Ghemawat, “Mapreduce: Simplified Data Processing on Large Clusters,”
Communications of the ACM, 51(1):107–113, January 2008.
[DRIE09]. A. Dries and U. Rückert, “Adaptive Concept Drift Detection,” SDM 09: Proceedings of the 2009
SIAM International Conference on Data Mining, April 30 to May 2, Sparks, NV, pp. 233–244, SIAM,
2009.
[GAMA04]. J. Gama, P. Medas, G. Castillo, P.P. Rodrigues, “Learning with Drift Detection,” SBIA ’04:
Proceedings of the 17th Brazilian Symposium on Artificial Intelligence (SBIA), September 29 to October
1, Sao Luis, Maranhao, Brazil, pp. 286–295, Springer, 2004.
[GAOS04]. S. Gao, W. Wu, C.-H. Lee, “A MFoM Learning Approach to Robust Multiclass Multi-
Label Text Categorization,” ICML ’04: Proceedings of the 21st International Conference on Machine
Learning, July 4–8, Banff, Canada, pp. 329–336, Morgan Kaufmann, 2004.
[GARC06]. M. Baena-Garcia, J. del Campo-Avila, R. Fidalgo, A. Bifet, R. Gavalda, R. Morales-Bueno, “Early
Drift Detection Method,” ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams,
September 18, Berlin, Germany, Springer-Verlag, 2006.
[HIDO08]. S. Hido, T. Ide, H. Kashima, H. Kubo, H. Matsuzawa, “Unsupervised Change Analysis Using
Supervised Learning,” Advances in Knowledge Discovery and Data Mining, 148–159, 2008.
[HUSA09]. M.F. Husain, P. Doshi, L. Khan, B. Thuraisingham, “Storage and Retrieval of Large RDF Graph
Using Hadoop and Mapreduce,” Technical Report No. UTDCS-40-09, Computer Science Department,
University of Texas, Dallas, TX, 2009.
[KLIN00]. R. Klinkenberg and T. Joachims, “Detecting Concept Drift with Support Vector Machines,”
ICML ’00: Proceedings of the 17th International Conference on Machine Learning, June 29 to July 2,
Stanford University, CA, pp. 487–494, Morgan Kaufmann, 2000.
[LUO05]. X. Luo and A. Nur Zincir-Heywood, “Evaluation of Two Systems on Multi-Class Multi-
Label Document Classification,” ISMIS ’05: Proceedings of the 15th International Symposium on
Methodologies for Intelligent Systems, Saratoga Springs, New York, May 25–28, pp. 161–169, Springer,
2005.
[MCDO05]. R. Mcdonald, K. Crammer, F. Pereira, “Flexible Text Segmentation with Structured Multilabel
Classification,” HLT-EMNLP ’05: Proceedings of the 2005 Human Language Technology Conference
and Conference on Empirical Methods in Natural Language Processing, October 6–8, Vancouver,
B.C., Canada, pp. 987–994, 2005.
[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” ICDM ’08: Proceedings of
the 2008 International Conference on Data Mining, December 15–19, Pisa, Italy, pp. 929–934, IEEE
Computer Society, 2008.
[MASU09]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Integrating Novel Class Detection
with Classification for Concept-Drifting Data Streams,” ECML PKDD ’09: Proceedings of the 2009
European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in
Databases, Vol. II, September 7–11, Bled, Slovenia, pp. 79–94, Springer-Verlag, 2009.
[NASA]. NASA Aviation Safety Reporting System, http://akama.arc.nasa.gov/ASRSDBOnline/QueryWizard_
Begin.aspx.
[STRE08]. A.P. Streich and J.M. Buhmann, “Classification of Multi-Labeled Data: A Generative Approach,”
ECML PKDD ’08: Proceedings of the 2008 European Conference on Machine Learning and Principles
and Practice in Knowledge Discovery in Databases, Vol. II, September 15–19, Antwerp, Belgium,
pp. 390–405, Springer, 2008.
[THAB04]. F.A. Thabtah, P. Cowling, Y. Peng, “MMAC: A New Multi-Class, Multi-Label Associative
Classification Approach,” ICDM ’04: Proceedings of the 4th IEEE International Conference on Data
Mining, November 1–4, Brighton, UK, pp. 217–224, IEEE Computer Society, 2004.
[VELO07]. A. Veloso, W. Meira Jr, M. Goncalves, M. Zaki, “Multi-Label Lazy Associative Classification,”
ECML PKDD ’07: Proceedings of the 2007 European Conference on Machine Learning and Principles
and Practice in Knowledge Discovery in Databases, September 17–21, Warsaw, Poland, pp. 605–612,
Springer-Verlag, 2007.
[WIDM96]. G. Widmer and M. Kubat, “Learning in the Presence of Concept Drift and Hidden Contexts,”
Machine Learning, 23(1):69–101, 1996.
[ZHAN07]. M.-L. Zhang and Z.-H. Zhou, “Multi-Label Learning by Instance Differentiation,” AAAI-07:
Proceedings of the 22nd Conference on Artificial Intelligence, July 22–26, Vancouver, British Columbia,
Canada, pp. 669–674, 2007.
Conclusion to Part II
Part II, consisting of six chapters, described our approach to stream data analytics, which we also
called stream data mining. In particular, we discussed various techniques for detecting novel classes
in data streams.
Chapter 8 stressed the need for mining data streams and discussed the challenges. The challenges
include infinite length, concept drift, concept evolution, and limited labeled data. We also provided
an overview of our approach to mining data streams. Specifically, our approach determines whether
an item belongs to a pre-existing class or whether it is a novel class. Chapter 9 described prior
approaches as well as our approach to stream analytics. For example, in the single-model classification
approach, incremental learning techniques are used. Ensemble-based techniques can be built more
efficiently than the single-model approach. Our novel class detection approach integrates both data
stream classification and novelty detection, and our data stream classification technique with limited
labeled data uses a semisupervised technique. Chapter 10 introduced a multiple partition, multiple
chunk (MPC) ensemble method for classifying concept-drifting data streams. Our ensemble approach
is a generalization of previous ensemble approaches, which train a single classifier from a single data
chunk. By introducing this MPC ensemble, we have reduced error significantly over
the single-partition, single-chunk approach. In Chapter 11, we presented a novel technique to detect
new classes in concept-drifting data streams. Our approach is capable of detecting novel classes in
the presence of concept drift, even when the model consists of multiple “existing” classes. In addition,
our novel class detection technique is nonparametric, meaning it does not assume any specific
distribution of the data. In Chapter 12, we addressed a more realistic stream mining problem: training
with a limited amount of labeled data. We designed and implemented a semisupervised,
clustering-based stream classification algorithm to solve this limited labeled data problem. Finally,
in Chapter 13, we examined the three data stream classification approaches developed in Chapters 10
through 12 and provided directions for further work. In particular, we
discussed the need to use cloud computing and BDMA techniques to scale our stream mining
techniques.
Now that we have discussed our techniques for stream data analytics, in Part III, we will show
how we can apply our techniques for the insider threat detection problem.
Part III
Stream Data Analytics for
Insider Threat Detection
Introduction to Part III
Part III, consisting of nine chapters, describes big data analytics techniques for insider threat detec-
tion. In particular, both supervised and unsupervised learning methods for insider threat detection
are discussed.
Chapter 14 provides a discussion of the problem addressed and the solutions provided by big
data analysis. In particular, stream data mining that addresses the big data issues for insider threat
detection is discussed. Chapter 15 describes related work. Both insider threat detection and stream
mining aspects are discussed, along with issues in handling big data.
Chapter 16 describes ensemble-based classification and details both unsupervised and supervised
learning techniques for insider threat detection. Chapter 17 describes supervised and unsupervised
learning methods for nonsequence data. Chapter 18 describes our experiments and testing meth-
odology and presents our results and findings on insider threat detection for nonsequence data.
Chapter 19 describes both supervised and unsupervised learning algorithms for insider threat detec-
tion for sequence data. Chapter 20 presents our experiments and results on insider threat detection
for sequence data. Chapter 21 describes scalability issues using the Hadoop/MapReduce framework
and solutions for quantized dictionary construction. Finally, Chapter 22 concludes with an assess-
ment of the viability of stream mining for real-world insider threat detection and the relevance to
big data aspects.
14 Insider Threat Detection as
a Stream Mining Problem
14.1 INTRODUCTION
There is a growing consensus within the intelligence community that malicious insiders are per-
haps the most potent threats to information assurance in many or most organizations ([BRAC04],
[HAMP99], [MATZ04], [SALE11]). One traditional approach to the insider threat detection problem
is supervised learning, which builds data classification models from training data. Unfortunately,
the training process for supervised learning methods tends to be time-consuming and expensive,
and generally requires large amounts of well-balanced training data to be effective. In our experi-
ments, we observe that less than 3% of the data in realistic datasets for this problem is associated with
insider threats (the minority class) and over 97% is associated with nonthreats (the majority class).
Hence, traditional support vector machines (SVMs) ([CHAN11], [MANE02]) trained from such
imbalanced data are likely to perform poorly on test datasets.
One-class SVMs (OCSVM) [MANE02] address the rare-class issue by building a model that
considers only normal data (i.e., nonthreat data). During the testing phase, test data is classified
as normal or anomalous based on geometric deviations from the model. However, this approach
is only applicable to static data of bounded length. In contrast, insider threat-related data is
typically continuous and threat patterns evolve over time. In other words, the data is a stream of
unbounded length. Hence, effective classification models must be adaptive (i.e., able to cope with
evolving concepts) and highly efficient in order to build the model from large amounts of evolving
data. Data associated with insider threat detection and classification is often continuous.
In these systems, the patterns of average users and insider threats can gradually evolve. A novice
programmer can develop his skills to become an expert programmer over time. An insider threat
can change his actions to more closely mimic legitimate user processes. In either case, the patterns
at either end of these developments can look drastically different when compared directly to each
other. These natural changes will not be treated as anomalies in our approach. Instead, we classify
them as natural concept drift. The traditional static supervised and unsupervised methods raise
unnecessary false alarms with these cases because they are unable to handle them when they arise
in the system. These traditional methods encounter high false positive rates (FPR). Learning models
must be adept at coping with evolving concepts and highly efficient at building models from large
amounts of data in order to rapidly detect real threats. For these reasons, the insider threat problem can
be conceptualized as a stream mining problem that applies to continuous data streams. Whether
using a supervised or unsupervised learning algorithm, the method chosen must be highly adap-
tive to correctly deal with concept drifts under these conditions. Incremental learning and ensem-
ble-based learning ([MASU10a], [MASU10b] [MASU11a], [MASU11b], [MASU08], [MASU13],
[MASU11c], [ALKH12a], [MASU11d], [ALKH12b]) are two adaptive approaches for overcoming
this hindrance. An ensemble of K models that collectively vote on the final classification
can reduce the false negatives and false positives for a test set. As new models are created and old
ones are updated to be more precise, the least accurate models are discarded to always maintain an
ensemble of exactly K current models. An alternative approach to supervised learning is unsuper-
vised learning, which can be effectively applied to purely unlabeled data—that is, data in which
no points are explicitly identified as anomalous or nonanomalous. Graph-based anomaly detection
(GBAD) is one important form of unsupervised learning ([COOK07], [EBER07], [COOK00]) but
has traditionally been limited to static, finite-length datasets. This limits its application to streams
related to insider threats which tend to have unbounded length and threat patterns that evolve over
time. Applying GBAD to the insider threat problem therefore requires that the models used be adap-
tive and efficient. Adding these qualities allows effective models to be built from vast amounts of
evolving data.
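To illustrate the one-class formulation, the following is a hedged sketch using scikit-learn's OneClassSVM on synthetic data (not the book's own implementation or datasets): the model is trained on normal data only and flags geometric outliers in a test chunk as potential anomalies.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(500, 10))          # nonthreat data only
test_chunk = np.vstack([rng.normal(0.0, 1.0, size=(95, 10)),  # mostly normal ...
                        rng.normal(5.0, 1.0, size=(5, 10))])  # ... plus a few outliers

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_train)
predictions = ocsvm.predict(test_chunk)                       # +1 normal, -1 anomalous
print(int((predictions == -1).sum()), "test instances flagged as anomalous")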
In this book, we cast insider threat detection as a stream mining problem and propose two
methods (supervised and unsupervised learning) for efficiently detecting anomalies in stream data
[PARV13]. To cope with concept evolution, our supervised approach maintains an evolving ensem-
ble of multiple OCSVM models [PARV11a]. Our unsupervised approach combines multiple GBAD
models in an ensemble of classifiers [PARV11b]. The ensemble updating process is designed in both
cases to keep the ensemble current as the stream evolves. This evolutionary capability improves the
classifier’s survival of concept drift as the behavior of both legitimate and illegitimate agents varies
over time. In experiments, we use test data that records system call data for a large, Unix-based,
multiuser system.
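The ensemble updating process mentioned above can be summarized by the following sketch (the function name and scoring interface are our assumptions; the OCSVM and GBAD model details appear in later chapters):

def update_ensemble(ensemble, new_model, recent_X, recent_y, K, score_fn):
    # Add the model trained on the newest chunk, score every model on the most
    # recent labeled data, and retain only the K best so the ensemble stays current.
    candidates = ensemble + [new_model]
    candidates.sort(key=lambda m: score_fn(m, recent_X, recent_y), reverse=True)
    return candidates[:K]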
This chapter describes our approach to insider threat detection using stream data mining.
In Section 14.2, we discuss sequence stream data. Big data issues are discussed in Section 14.3. Our
contributions are discussed in Section 14.4. This chapter is summarized in Section 14.5.
14.4 CONTRIBUTIONS
The main contributions of this work can be summarized as follows (see Figure 14.1).
1. We show how stream mining can be effectively applied to detect insider threats.
2. With regard to nonsequence data:
a. We propose a supervised learning solution that copes with evolving concepts using
one-class SVMs.
b. We increase the accuracy of the supervised approach by weighting the cost of false
negatives.
c. We propose an unsupervised learning algorithm that copes with changes based on
GBAD.
d. We effectively address the challenge of limited labeled training data (rare instance
issues).
e. We exploit the power of stream mining and graph-based mining by effectively com-
bining the two in a unified manner. This is the first work to our knowledge to harness
these two approaches for insider threat detection.
f. We compare one and two class SVMs on how well they handle stream insider threat
problems.
g. We compare supervised and unsupervised stream learning approaches and show which
has superior effectiveness using real-world data.
3. With regard to sequence data:
a. For sequence data, we propose a framework that exploits unsupervised stream-based
sequence learning (USSL) to find pattern sequences from successive user actions or
commands.
b. We effectively integrate multiple USSL models in an ensemble of classifiers to exploit
the power of ensemble-based stream mining and sequence mining.
c. We compare our approach with the supervised model for stream mining and show the
effectiveness of our approach in terms of true positive rate (TPR) and FPR on a bench-
mark dataset.
4. With regard to big data:
a. Scalability is an issue when constructing benign pattern sequences for the quantized
dictionary. For this, we exploit a MapReduce-based framework and show the
effectiveness of our work.
REFERENCES
[ALKH12a]. T. Al-Khateeb, M.M. Masud, L. Khan, C.C. Aggarwal, J. Han, B.M. Thuraisingham, “Stream
Classification with Recurring and Novel Class Detection Using Class-Based Ensemble,” ICDM,
Brussels, Belgium, pp. 31–40, 2012.
[ALKH12b]. T. Al-Khateeb, M.M. Masud, L. Khan, B.M. Thuraisingham, “Cloud Guided-Stream
Classification Using Class-Based Ensemble,” IEEE CLOUD, Honolulu, Hawaii, pp. 694–701, 2012.
[BRAC04]. R.C. Brackney and R.H. Anderson (editors). Understanding the Insider Threat. RAND
Corporation, Santa Monica, CA, 2004.
[CHAN11]. C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Transactions
on Intelligent Systems and Technology, 2(3), 2011, Article #27.
[COOK00]. D.J. Cook and L.B. Holder, “Graph-Based Data Mining,” IEEE Intelligent Systems, 15(2), 32–41,
2000.
[COOK07]. D.J. Cook and L.B. Holder, (Eds.). Mining Graph Data. John Wiley & Sons, Inc., Hoboken, NJ,
2007.
[EBER07]. W. Eberle and L.B. Holder, “Mining for Structural Anomalies in Graph-Based Data,” In Proceedings
of International Conference on Data Mining (DMIN), Las Vegas, NV, pp. 376–389, 2007.
[HAMP99]. M.P. Hampton and M. Levi, “Fast Spinning into Oblivion? Recent Developments in Money-
Laundering Policies and Offshore Finance Centres,” Third World Quarterly, 20(3), 645–656, 1999.
[MANE02]. L.M. Manevitz and M. Yousef, “One-Class SVMs for Document Classification,” The Journal of
Machine Learning Research, 2, 139–154, 2002.
[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In Proceedings of IEEE
International Conference on Data Mining (ICDM), Pisa, Italy, pp. 929–934, 2008.
[MASU10a]. M.M. Masud, Q. Chen, J. Gao, L. Khan, C. Aggarwal, J. Han, B. Thuraisingham, “Addressing
Concept-Evolution in Concept-Drifting Data Streams,” In Proceedings of IEEE International
Conference on Data Mining (ICDM), pp. 929–934, 2010.
[MASU10b]. M.M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “Classification and Novel
Class Detection of Data Streams in a Dynamic Feature Space,” ECML PKDD (2), pp. 337–352, 2010.
[MASU11a]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel
Class Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on
Knowledge and Data Engineering, 23(6), 859–874, 2011.
[MASU11b]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class
Detection in Concept-Drifting Data Streams under Time Constraints,” IEEE Transactions on Knowledge
and Data Engineering, 23(6), 859–874, 2011.
[MASU11c]. M.M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K.W. Hamlen, N.C. Oza, “Facing The Reality
of Data Stream Classification: Coping with Scarcity of Labeled Data,” Knowledge and Information
Systems, 33(1), 213–244, 2011.
[MASU11d]. M.M. Masud, T. Al-Khateeb, L. Khan, C.C. Aggarwal, J. Gao, J. Han, B.M. Thuraisingham,
“Detecting Recurring and Novel Classes in Concept-Drifting Data Streams,” ICDM, pp. 1176–1181,
2011.
[MASU13]. M.M. Masud, Q. Chen, L. Khan, C.C. Aggarwal, J. Gao, J. Han, A.N. Srivastava, N.C. Oza,
“Classification and Adaptive Novel Class Detection of Feature-Evolving Data Streams,” IEEE
Transactions on Knowledge and Data Engineering, 25(7), 1484–1497, 2013.
[MATZ04]. S. Matzner and T. Hetherington, “Detecting Early Indications of A Malicious Insider,”
IA Newsletter, 7(2), 42–45, 2004.
[PARV11a]. P. Parveen, J. Evans, B. Thuraisingham, K.W. Hamlen, L. Khan, “Insider Threat Detection Using
Stream Mining and Graph Mining,” In Proceedings of the 3rd IEEE Conference on Privacy, Security,
Risk and Trust (PASSAT) MIT, October, Boston, MA. (acceptance rate 8%) (Nominated for Best Paper
Award), pp. 1102–1110, 2011.
[PARV11b]. P. Parveen, Z.R. Weger, B. Thuraisingham, K.W. Hamlen, L. Khan, “Supervised Learning
for Insider Threat Detection Using Stream Mining,” In Proceedings of the 23rd IEEE International
Conference on Tools with Artificial Intelligence, November 7–9, Boca Raton, FL (acceptance rate 30%)
(Best Paper Award), pp. 1032–1039, 2011.
[PARV12a]. P. Parveen, N. McDaniel, B. Thuraisingham, L. Khan, “Unsupervised Ensemble Based Learning
for Insider Threat Detection,” In Proceedings of 4th IEEE International Conference on Information
Privacy, Security, Risk and Trust (PASSAT), September, Amsterdam, the Netherlands, pp. 718–727,
2012.
[PARV12b]. P. Parveen and B. Thuraisingham, “Unsupervised Incremental Sequence Learning for Insider
Threat Detection,” In Proceedings of IEEE International Conference on Intelligence and Security (ISI),
June, Washington DC, pp. 141–143, 2012.
[PARV13]. P. Parveen, N. McDaniel, J. Evans, B. Thuraisingham, K.W. Hamlen, L. Khan, “Evolving Insider
Threat Detection Stream Mining Perspective,” International Journal on Artificial Intelligence Tools
(World Scientific Publishing), 22(5), 1360013-1–1360013-24, 2013.
[SALE11]. M.B. Salem and S.J. Stolfo, “Modeling User Search Behavior for Masquerade Detection,”
In Proceedings of Recent Advances in Intrusion Detection (RAID), pp. 181–200, 2011.
15 Survey of Insider Threat
and Stream Mining
15.1 INTRODUCTION
As we have discussed in Chapter 7, the effective detection of insider threats requires monitoring
mechanisms that are far more fine-grained than for external threat detection. These monitors must
be efficiently and reliably deployable in the software environments where actions endemic to mali-
cious insider missions can be caught in a timely manner. Such environments typically include user-level
applications, such as word processors, email clients, and web browsers for which reliable monitor-
ing of internal events by conventional means is difficult.
To be able to detect the insider threats, we need to capture as accurately as possible not only the
attributes of such insiders but also their behavior and communication. In Chapter 14, we argued
that the data about the insiders arrive continuously and therefore could be modeled as data streams.
Therefore, insider threat detection amounts to a stream data mining problem.
In this chapter, first, we will present related work with regard to insider threat and stream min-
ing. Next, we will present related work with regard to big data and analytics perspective. The orga-
nization of this chapter is as follows. Related work on insider threat detection will be discussed in
Section 15.2. Related work in stream mining will be discussed in Section 15.3. Big data issues will
be discussed in Section 15.4. This chapter is summarized in Section 15.5.
evolving streams. In other words, the stream characteristics of the data are not explored further, so
the performance of a static learner may degrade over time. Our proposed work, in contrast, is based
on supervised learning that handles dynamic or stream data well by learning from the evolving
stream. In anomaly detection, a one-
class SVM (OCSVM) algorithm is used [STOL05]. OCSVM builds a model by training on normal
data, and then it classifies test data as benign or anomalous based on geometric deviations from that
normal training data. For masquerade detection, OCSVM training is as effective as two-class train-
ing [STOL05]. Investigations have been made into SVMs using binary features and frequency-based
features. The OCSVM algorithm with binary features performed the best.
Recursive mining has been proposed to find frequent patterns [SZYM04]. OCSVM classi-
fiers were used for masquerade detection after the patterns were encoded with unique symbols
and all sequences rewritten with this new coding. To the best of our knowledge, there is no work
that extends this OCSVM in a stream domain. Although our approach relies on OCSVM, it is
extended to the stream domain so that it can cope with changes ([PARV11b], [PARV13]). Other works
have explored unsupervised learning for insider threat detection, but, to our knowledge, only for
static data ([LIU05], [ESKI02]). Static graph-based anomaly detection (GBAD) approaches
([COOK07], [EBER07], [COOK00], [YAN02]) represent threat and nonthreat data as a graph and
apply unsupervised learning to detect anomalies. The minimum description length (MDL) approach
to GBAD has been applied to email, cell phone traffic, business processes, and cybercrime datasets
([STAN96], [KOWA08]). Our work builds upon GBAD and MDL to support dynamic, evolving
streams ([PARV11a], [PARV13]).
Stream mining is a relatively new category of data mining research that applies to continu-
ous data streams [FAN04]. In such settings, both supervised and unsupervised learning must be
adaptive in order to cope with data whose characteristics change over time. There are two main
approaches to adaptation: incremental learning ([DOMI01], [DAVI98]) and ensemble-based learn-
ing ([MASU10a], [MASU11a], [FAN04]). The past work has demonstrated that ensemble-based
approaches are the more effective of the two, thus motivating our approach.
Ensembles have been used in the past to bolster the effectiveness of positive/negative classifica-
tion ([MASU08], [MASU11a]). By maintaining an ensemble of K models that collectively vote on
the final classification, the number of false negatives (FN) and false positives (FP) for a test set can
be reduced. As better models are created, poorer models are discarded to maintain an ensemble of
size exactly K. This helps the ensemble evolve with the changing characteristics of the stream and
keeps the classification task tractable. A comparison of the above related works is summarized in
Table 15.1. A more complete survey is available in [SALE08].
Insider threat detection work has utilized ideas from intrusion detection or external threat detec-
tion areas ([SCHO01], [WANG03]). For example, supervised learning has been applied to detect
insider threats. System call traces from normal activity and anomaly data are gathered [HOFM98];
TABLE 15.1
Capabilities and Focuses of Various Approaches for Nonsequence Data
Approach    Learning    Concept Drift    Insider Threat    Sequence-Based
[JU01] S ✓ ✓
[MAXI03] S ✓
[LIU05] U ✓ ✓
[WANG03] S ✓
[MASU11a] S ✓
(Parveen, Weger et al., 2011b) U ✓ ✓
(Parveen, McDaniel et al., 2012) U ✓ ✓ ✓
features are extracted from these data using n-grams and, finally, classifiers are trained on them. The
authors of [LIAO02] exploit the text classification idea in the insider threat domain, where each system
call is treated as a word in a bag-of-words model. System calls and related attributes (arguments, object
path, return value, and error status of each call) serve as features in various supervised methods
([KRUG03], [TAND03]). A supervised model based on a hybrid high-order Markov chain model was
adopted by researchers in [JU01]: a signature behavior for a particular user, based on the command
sequences that the user executed, is identified, and then anomalies are detected.
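As a small illustration of the n-gram feature extraction mentioned above (the trace, the value of n, and the frequency representation are illustrative, not taken from the cited systems):

from collections import Counter

def ngram_features(trace, n=3):
    # Slide a window of length n over the system call trace and count each n-gram.
    grams = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
    return Counter(grams)

trace = ["open", "read", "write", "open", "read", "write", "close"]
print(ngram_features(trace))   # n-gram frequencies usable as a feature vector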
Schonlau et al. [SCHO01] applied a number of detection methods to a dataset of “truncated”
UNIX shell commands for 70 users. Commands were collected using the UNIX acct auditing
mechanism. For each user, a number of commands were gathered over a period of time. The detec-
tion methods are supervised based on the multistep Markovian model and the combination of the
Bayes and Markov approaches. Maxion et al. [MAXI03] argued that the Schonlau dataset was not
appropriate for the masquerade detection task and created a new dataset using the Calgary dataset
and applying the static supervised model.
These approaches differ from our work in the following ways. They are static in nature and do not
learn over evolving streams; in other words, the stream characteristics of the data are not explored
further. Hence, the performance of a static learner may degrade over time. Our approach, on the other
hand, learns from the evolving data stream. We show that our approach is unsupervised and is as
effective as a supervised (incremental) model. Researchers have explored unsupervised learning
[LIU05] for insider threat detection; however, that learning algorithm is static in nature. Although our
approach is unsupervised, it learns from the evolving stream over time, so more and more data are used
for unsupervised learning. In anomaly detection, an OCSVM algorithm is used. OCSVM builds a model
by training on normal data and then classifies test data as benign or anomalous based on geometric
deviations from the normal training data. Wang et al.
[WANG03] showed for masquerade detection that OCSVM training is as effective as two-class
training. The authors have investigated SVMs using binary features and frequency-based features.
The one-class SVM algorithm with binary features performed the best. To find frequent patterns,
Szymanski et al. [SZYM04] proposed recursive mining, encoded the patterns with unique symbols,
and rewrote the sequences using this new coding. They used an OCSVM classifier for masquerade
detection. These learning approaches are static in nature and do not learn over evolving streams.
TABLE 15.2
Capabilities and Focuses of Various Approaches for Sequence Data
Approach    Learning    Concept Drift    Insider Threat    Sequence-Based
[JU01] S ✓ ✓
[MAXI03] S ✓
[LIU05] U ✓ ✓
[WANG03] S ✓
[MASU11a] S ✓
(Parveen, Weger et al., 2011b) U ✓ ✓
(Parveen, McDaniel et al., 2012) U ✓ ✓ ✓
Users’ repetitive daily or weekly activities may constitute user profiles. For example, a user’s
frequent command sequences may represent a normative pattern of that user. Finding normative
patterns over dynamic data streams of unbounded length is challenging due to the requirement
of a one-pass algorithm. For this, an unsupervised learning approach is used by exploiting a com-
pressed/quantized dictionary to model common behavior sequences. This unsupervised approach
needs to identify a normal user’s behavior in a single pass ([PARV12a], [PARV12b], [CHUA11]).
One major challenge with these repetitive sequences is their variability in length. To combat this
problem, we generate a dictionary that will contain any combination of possible normative patterns
existing in the gathered data stream. In addition, we have incorporated the power of stream mining
to cope with gradual changes. Our experiments show that our USSL approach works well in the
context of concept drift and anomaly detection.
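A greatly simplified, single-pass sketch of the dictionary idea is shown below; it is LZW-inspired and is not the quantized dictionary construction described in later chapters. Recurring command subsequences of growing length become dictionary entries.

def build_pattern_dictionary(command_stream):
    dictionary, current = set(), []
    for cmd in command_stream:
        candidate = tuple(current + [cmd])
        if candidate in dictionary:
            current.append(cmd)           # keep extending an already-known pattern
        else:
            dictionary.add(candidate)     # record the new, longer pattern
            current = [cmd]               # start the next pattern from this command
    return dictionary

print(build_pattern_dictionary(["ls", "cd", "ls", "cd", "ls", "cd", "vi"]))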
Our work ([PARV12a], [PARV12b]) differs from that of [CHUA11] in the following ways. First, the
work in [CHUA11] focuses on dictionary construction to generate normal profiles. In other words, their
work does not address the insider threat issue, which is our focus. Second, [CHUA11] does not consider
ensemble-based techniques; our work exploits the ensemble-based technique with the combination of
unsupervised learning (i.e., a dictionary of benign sequences). Finally, when the number of users
grows, dictionary construction becomes a bottleneck. The work of [CHUA11] does not consider this
scalability issue; in our case, we address it using a MapReduce framework.
In [PARV12a], an incremental approach is used and ensemble-based techniques are not incorporated,
but the literature shows that ensemble-based techniques are more effective than incremental ones for
stream mining ([MASU10a], [MASU11a], [FAN04]). Therefore, our approach focuses on ensemble-based
techniques [PARV12b].
Table 15.2 indicates, for each related approach, whether it is supervised or unsupervised and whether
it focuses on concept drift, insider threat detection, and sequence data in a stream mining setting.
The Google File System ([CHAN06], [DEAN08]) is a scalable distributed file system that utilizes
clusters of commodity hardware to facilitate data-intensive applications. The system is fault tolerant:
machine failure is considered normal because commodity hardware is used. To cope with failures, data
are replicated on multiple nodes, so if one node fails, the system uses another node where a replica of
the data exists.
MapReduce ([CHAN06], [DEAN08]) is a programming model that supports data-intensive
applications in a parallel manner. The MapReduce paradigm supports map and reduce functions.
The map function generates a set of intermediate key−value pairs, and the reduce function then merges
the values associated with each key to produce the final results. The map/reduce paradigm can solve
many real-world problems, as shown in ([CHAN06], [DEAN08]).
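A toy illustration of the map and reduce functions described above (generic names, counting command frequencies across log partitions; this is not one of the MapReduce jobs used later in the book):

from collections import defaultdict

def map_phase(partition):
    return [(cmd, 1) for cmd in partition]          # emit (key, value) pairs

def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:                        # combine values per key
        counts[key] += value
    return dict(counts)

pairs = map_phase(["ls", "cd", "ls"]) + map_phase(["vi", "ls"])
print(reduce_phase(pairs))                          # {'ls': 3, 'cd': 1, 'vi': 1}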
Hadoop ([BU10], [XU10], [ABOU09]) is an open-source Apache project that supports a Google File
System-like distributed file system and the MapReduce paradigm. Hadoop is widely used to address the scalability issue
along with MapReduce. For example, with the huge amount of semantic web datasets, Husain et al.
([HUSA09], [HUSA10], [HUSA11]) showed that Hadoop can be used to provide scalable queries.
In addition, MapReduce technology has been exploited by the BioMANTA project [DING05] and
SHARD (see also [BIOM] and [SHAR]).
Amazon developed Dynamo [DECA07], a distributed key-value store. Dynamo does not support
the master−slave architecture that Hadoop uses; nodes in Dynamo communicate via a gossip network.
To achieve high availability and performance, Dynamo supports a model called eventual consistency,
sacrificing strict consistency. Under eventual consistency, updates are propagated to the nodes in the
cluster asynchronously, and a new version of the data is produced
for each update.
Google developed BigTable ([CHAN06], [CHAN08]), a column-oriented data storage system.
BigTable utilizes the Google File System and Chubby [BURR06], a distributed lock service. BigTable
is a distributed multidimensional sparse map based on row keys, column names, and time stamps.
Researchers [ABOU09] exploited the combined power of MapReduce and relational database
technology. With regard to big data analytics, there are a handful of related works. On the one hand,
some researchers focus on generic analytics tools to address the scalability issue; on the other hand,
other researchers focus on specific analytics problems.
With regard to tools, Mahout is an open-source big data analytics tool that supports classification,
clustering, and recommendation for big data. In [CHU06], researchers customized well-
known machine learning algorithms to take advantage of multicore machines and the MapReduce
programming paradigm. MapReduce has been widely used for mining petabytes of data [MORE08].
With regard to specific problems, Al-Khateeb et al. [ALKH12b] and Haque et al. ([HAQU13a],
[HAQU13b]) proposed scalable classification over evolving streams by exploiting the MapReduce
and Hadoop frameworks. There are some research works on parallel boosting with MapReduce.
Palit et al. [PALI12] proposed two parallel boosting algorithms, ADABOOST.PL and
LOGITBOOST.PL.
REFERENCES
[ABOU09]. A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, A. Silberschatz, “HadoopDB: An
Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads,” In Proceedings
of the VLDB Endowment 2 (1), 922–933, 2009.
[ALKH12a]. T. Al-Khateeb, M. M. Masud, L. Khan, C. C. Aggarwal, J. Han, B. M. Thuraisingham, “Stream
Classification with Recurring and Novel Class Detection Using Class-Based Ensemble,” In ICDM’2012:
Proceedings of the 12th IEEE Conference on Data Mining, December 10−13, 2012, Brussels, Belgium,
pp. 31–40, 2012.
[ALKH12b]. T. Al-Khateeb, M. M. Masud, L. Khan, B. M. Thuraisingham, “Cloud Guided Stream
Classification Using Class-Based Ensemble.” In CLOUD’2012: Proceedings of the 5th IEEE Conference
on Cloud Computing, June 24−29, Honolulu, HI, USA, pp. 694–701, 2012.
[BIOM]. http://www.itee.uq.edu.au/eresearch/projects/biomanta.
[BU10]. Y. Bu, B. Howe, M. Balazinska, M. Ernst, “Haloop: Efficient Iterative Data Processing on Large
Clusters,” Proceedings of the VLDB Endowment 3 (1), 285–296, 2010.
[BURR06]. M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” In OSDI’06:
Proceedings of the 7th Symposium on Operating Systems Design and Implementation, November 6−8,
Seattle, WA, pp. 335–350, 2006.
[CHAN06]. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes,
R. Gruber, “Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper),” In
OSDI’06: 7th USENIX Symposium on Operating Systems Design and Implementation, November 6–8,
Seattle, WA, pp. 205–218, 2006.
[CHAN08]. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes,
R. E. Gruber, “BigTable: A Distributed Storage System for Structured Data,” ACM Transactions on
Computer Systems 26 (2), Article #4, 2008.
[CHU06]. C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, K. Olukotun, “Map-Reduce
for Machine Learning on Multicore,” B. Schölkopf, J. C. Platt, T. Hofmann (eds.), Neural Information
Processing Systems, MIT Press, Cambridge, MA, pp. 281–288, 2006.
[CHUA11]. S.-L. Chua, S. Marsland, H. W. Guesgen, “Unsupervised Learning of Patterns in Data Streams
Using Compression and Edit Distance,” In IJCAI’2011: Proceedings of the 22nd International Joint
Conference on Artificial Intelligence, July 16–22, Catalonia, Spain, pp. 1231–1236, 2011.
[COOK00]. D. J. Cook and L. B. Holder, “Graph-Based Data Mining,” IEEE Intelligent Systems 15 (2), 32–41,
2000.
[COOK07]. D. J. Cook and L. B. Holder, editors. Mining Graph Data, John Wiley & Sons, Inc., Hoboken,
NJ, 2007.
[DAVI98]. B. D. Davison and H. Hirsh, “Predicting Sequences of User Actions. In Working Notes of the Joint
Workshop on Predicting the Future: AI Approaches to Time Series Analysis.” 15th National Conference
on Artificial Intelligence and Machine, AAAI Press, Madison, WI, pp. 5–12, 1998.
[DEAN08]. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,”
Communications of the ACM, 51(1), 107–113, 2008.
[DECA07]. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S.
Sivasubramanian, P. Vosshall, W. Vogels, “Dynamo: Amazon’s Highly Available Key-Value Store,” T.
C. Bressoud, M. F. Kaashoek (eds.), In SOSP’07: Proceedings of the 21st ACM Symposium on Operating
Systems Principles, October 14−17, Stevenson, WA, pp. 205–220, 2007.
[DING05]. L. Ding, T. Finin, Y. Peng, P. P. da Silva, D. L. Mcguinness, “Tracking RDF Graph Provenance
Using RDF Molecules,” Technical Report (TR-S-05-06), University of Maryland Baltimore County,
2005. http://ebiquity.umbc.edu/paper/html/id/240/.
[DOMI01]. P. Domingos and G. Hulten, “Catching Up with the Data: Research Issues in Mining Data Streams,”
In DMKD’01: 2001 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, May 20, Santa Barbara, CA, USA, 2001.
[EBER07]. W. Eberle and L. B. Holder, “Mining for Structural Anomalies in Graph-Based Data,” In DMIN’07:
Proceedings of International Conference on Data Mining, Las Vegas, NV, pp. 376–389, 2007.
[ESKI02]. E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, “A Geometric Framework for Unsupervised
Anomaly Detection: Detecting Intrusions in Unlabeled Data,” D. Barbará, S. Jajodia (eds.), Applications
of Data Mining in Computer Security, Chapter 4. Springer, New York, NY, 2002.
[FAN04]. W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In Proceedings of
ACM SIGKDD, Seattle, WA, pp. 128–137, 2004.
[FORR96]. S. Forrest, S. A. Hofmeyr, A. Somayaji, T. A. Longstaff, “A Sense of Self for Unix Processes,”
In Proceedings of the IEEE Symposium on Computer Security and Privacy (S&P), Oakland, CA,
pp. 120–128, 1996.
[GAO04]. D. Gao, M. K. Reiter, D. Song, “On Gray-Box Program Tracking for Anomaly Detection,” In
Proceedings of the USENIX Security Symposium, pp. 103–118, 2004.
[HAQU13a]. A. Haque, B. Parker, L. Khan, “Intelligent MapReduce Based Frameworks for Labeling Instances
in Evolving Data Stream” In CloudCom’2013: Proceedings of the 5th International Conference on
Cloud Computing Technology and Science, December 2−5, Bristol, UK, pp. 299–304, 2013.
[HAQU13b]. A. Haque, B. Parker, L. Khan, “Labeling Instances in Evolving Data Streams with Mapreduce,”
BigData, Santa Clara, CA, pp. 387–394, 2013.
[HOFM98]. S. A. Hofmeyr, S. Forrest, A. Somayaji, “Intrusion Detection Using Sequences of System Calls,”
Journal of Computer Security 6 (3), 151–180, 1998.
[HUSA09]. M. Husain, P. Doshi, L. Khan, B. Thuraisingham, “Storage and Retrieval of Large RDF Graph
Using Hadoop and MapReduce,” In CloudCom’09: Proceedings of the 1st International Conference on
Cloud Computing, pp. 680–686. Springer-Verlag, Berlin, 2009.
[HUSA10]. M. F. Husain, L. Khan, M. Kantarcioglu, B. Thuraisingham, “Data Intensive Query Processing for
Large RDF Graphs Using Cloud Computing Tools,” In CLOUD’10: Proceedings of the 2010 IEEE 3rd
International Conference on Cloud Computing, Washington, DC, pp. 1–10, 2010.
[HUSA11]. M. F. Husain, J. P. McGlothlin, M. M. Masud, L. R. Khan, B. M. Thuraisingham, “Heuristics-
Based Query Processing for Large RDF Graphs Using Cloud Computing,” IEEE Transactions on
Knowledge and Data Engineering 23 (9), 1312–1327, 2011.
[JU01]. W.-H. Ju and Y. Vardi, “A Hybrid High-Order Markov Chain Model for Computer Intrusion Detection,”
Journal of Computational and Graphical Statistics 10 (2), 277–295, 2001.
[KOWA08]. E. Kowalski, T. Conway, S. Keverline, M. Williams, D. Cappelli, B. Willke, A. Moore, “Insider
Threat Study: Illicit Cyber Activity in the Government Sector,” Technical Report, U.S. Department
of Homeland Security, U.S. Secret Service, CERT, and the Software Engineering Institute (Carnegie
Mellon University), 2008. http://resources.sei.cmu.edu/library/asset-view.cfm?assetID=52227.
[KRUG03]. C. Krugel, D. Mutz, F. Valeur, G. Vigna, “On the Detection of Anomalous System Call
Arguments,” In ESORICS’03: Proceedings of the 8th European Symposium on Research in Computer
Security, Gjovik, Norway, pp. 326–343, 2003.
[LIAO02]. Y. Liao and V. R. Vemuri, “Using Text Categorization Techniques for Intrusion Detection,” In
Proceedings of the 11th USENIX Security Symposium, Berkeley, CA, pp. 51–59, 2002.
[LIU05]. A. Liu, C. Martin, T. Hetherington, S. Matzner, “A Comparison of System Call Feature Representations
for Insider Threat Detection,” In IAW’05: Proceedings of the IEEE Information Assurance Workshop,
West Point, NY, pp. 340–347, 2005.
[MASU08]. M. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM’08: Proceedings of
the IEEE International Conference on Data Mining, Pisa, Italy, pp. 929–934, 2008.
[MASU10a]. M. M. Masud, Q. Chen, J. Gao, L. Khan, C. Aggarwal, J. Han, B. Thuraisingham, “Addressing
Concept-Evolution in Concept-Drifting Data Streams,” In ICDM’10: Proceedings of the IEEE
International Conference on Data Mining, Sydney, New South Wales, pp. 929–934, 2010.
[MASU10b]. M. M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “Classification and Novel Class
Detection in Data Streams with Active Mining,” In PKDD’10: Advances in Knowledge Discovery and
Data Mining (Lecture Notes in Computer Science Series, vol. 6119, part 2), Springer, New York, NY,
pp. 311–324, 2010.
[MASU11a]. M. M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “Classification and Novel Class
Detection in Concept-drifting Data Streams under Time Constraints,” IEEE Transactions on Knowledge
and Data Engineering 23 (6), 859–874, 2011.
[MASU11b]. M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. W. Hamlen, N. C. Oza, “Facing the Reality
of Data Stream Classification: Coping with Scarcity of Labeled Data,” Knowledge and Information
Systems 33 (1), 213–244, 2011.
[MASU11c]. M. M. Masud, T. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, B. M. Thuraisingham,
“Detecting Recurring and Novel Classes in Concept-Drifting Data Streams,” In ICDM’2011: Proceedings
of the 11th IEEE Conference on Data Mining, December 11−14, Vancouver, BC, Canada, pp. 1176–1181,
2011.
[MASU13]. M. M. Masud, Q. Chen, L. Khan, C. C. Aggarwal, J. Gao, J. Han, A. N. Srivastava, N. C.
Oza, “Classification and Adaptive Novel Class Detection of Feature-evolving Data Streams,” IEEE
Transactions on Knowledge and Data Engineering 25 (7), 1484–1497, 2013.
[MAXI03]. R. A. Maxion, “Masquerade Detection Using Enriched Command Lines,” In DSN’03: Proceedings
of the IEEE International Conference on Dependable Systems and Networks, San Francisco, CA, pp.
5–14, 2003.
[MORE08]. C. Moretti, K. Steinhaeuser, D. Thain, N. V. Chawla, “Scaling Up Classifiers to Cloud Computers,”
In Proceedings of the 2008 8th IEEE International Conference on Data Mining, Washington, D.C., pp.
472–481, 2008.
[NGUY03]. N. Nguyen, P. Reiher, and G. H. Kuenning, “Detecting Insider Threats by Monitoring System Call
Activity,” In IAW’03: Proceedings of the IEEE Information Assurance Workshop, West Point, NY, pp.
45–52, 2003.
[PALI12]. I. Palit and C. K. Reddy, “Scalable and Parallel Boosting with Mapreduce,” IEEE Transactions on
Knowledge and Data Engineering 24 (10), 1904–1916, 2012.
[PARV11a]. P. Parveen, J. Evans, B. Thuraisingham, K. W. Hamlen, L. Khan, “Insider Threat Detection Using
Stream Mining and Graph Mining,” In PASSAT’2011: Proceedings of the 3rd IEEE Conference on
Privacy, Security, Risk and Trust, MIT, Boston, MA, USA, pp. 1102–1110, 2011.
[PARV11b]. P. Parveen, Z. R. Weger, B. Thuraisingham, K. W. Hamlen, L. Khan, “Supervised Learning
for Insider Threat Detection Using Stream Mining,” In Proceedings of the 23rd IEEE International
Conference on Tools with Artificial Intelligence, November 7–9, Boca Raton, FL, pp. 1032–1039, 2011.
[PARV12a]. P. Parveen and B. Thuraisingham, “Unsupervised Incremental Sequence Learning for Insider
Threat Detection,” In ISI’2012: Proceedings of the. IEEE International Conference on Intelligence and
Security, June, Washington, DC, pp. 141–143, 2012.
[PARV12b]. P. Parveen, N. McDaniel, B. Thuraisingham, L. Khan, “Unsupervised Ensemble Based Learning
for Insider Threat Detection,” In PASSAT’2012: Proceedings of the 4th IEEE International Conference
on Information Privacy, Security, Risk and Trust, September, Amsterdam, The Netherlands, pp. 718–
727, 2012.
[PARV13]. P. Parveen, N. McDaniel, J. Evans, B. Thuraisingham, K. W. Hamlen, L. Khan, “Evolving Insider
Threat Detection Stream Mining Perspective,” International Journal on Artificial Intelligence Tools
22 (5), 1360013, 2013.
[SALE08]. M. B. Salem, S. Hershkop, S. J. Stolfo, “A Survey of Insider Attack Detection Research,” Insider
Attack and Cyber Security 39, 69–90, 2008.
[SCHO01]. M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, Y. Vardi, “Computer Intrusion:
Detecting Masquerades,” Statistical Science 16 (1), 1–17, 2001.
[SCHU02]. E. E. Schultz, “A Framework for Understanding and Predicting Insider Attacks,” Computers and
Security 21 (6), 526–531, 2002.
[SHAR]. http://www.cloudera.com/blog/2010/03/how-raytheon-esearchers-are-using-hadoop-to-build-a-
scalable-distributed-triple-store.
[STAN96]. S. Staniford-Chen, S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagland, K. Levitt, C. Wee,
R. Yip, D. Zerkle, “GrIDS—A Graph Based Intrusion Detection System for Large Networks,” In
Proceedings of the 19th National Information Systems Security Conference, Baltimore, MD, pp. 361–
370, 1996.
[STOL05]. S. J. Stolfo, F. Apap, E. Eskin, K. Heller, S. Hershkop, A. Honig, K. Svore, “A Comparative
Evaluation of Two Algorithms for Windows Registry Anomaly Detection,” Journal of Computer
Security 13 (4), 659–693, 2005.
[SZYM04]. B. K. Szymanski and Y. Zhang, “Recursive Data Mining for Masquerade Detection and Author
Identification,” 13th Annual IEEE Information Assurance Workshop, Washington, DC, pp. 424–431,
2004.
[TAND03]. G. Tandon and P. Chan, “Learning Rules from System Call Arguments and Sequences for
Anomaly Detection,” In DMSEC’03: Proceedings of the ICDM Workshop on Data Mining for Computer
Security, Melbourne, FL, pp. 20–29, 2003.
[WANG03]. H. Wang, W. Fan, P. S. Yu, J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” In Proceedings of SIGKDD, Washington, DC, pp. 226–235, 2003.
[XU10]. Y. Xu, P. Kostamaa, L. Gao, “Integrating Hadoop and Parallel DBMS,” In SIGMOD’2010: Proceedings
of the 2010 International Conference on Management of Data, New York, NY, pp. 969–974, 2010.
[YAN02]. X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining,” In ICDM’02: Proceedings
of the International Conference on Data Mining, Maebashi City, Japan, pp. 721–724, 2002.
16 Ensemble-Based Insider Threat Detection
16.1 INTRODUCTION
Data relevant to insider threats is typically accumulated over many years of organization and system
operations, and is therefore best characterized as an unbounded data stream. Such a stream can be
partitioned into a sequence of discrete chunks; for example, each chunk might comprise a week’s
worth of data. Figure 16.1 illustrates how a classifier’s decision boundary changes when such a stream exhibits concept drift. Each circle in the figure denotes a data point, with unfilled
circles representing true negatives (TNs) (i.e., nonanomalies) and solid circles representing true
positives (TPs) (i.e., anomalies). The solid line in each chunk represents the decision boundary
for that chunk, whereas the dashed line represents the decision boundary for the previous chunk.
Shaded circles are those that embody a new concept that has drifted relative to the previous chunk.
In order to classify these properly, the decision boundary must be adjusted to account for the new
concept. There are two possible varieties of misapprehension (false detection):
1. The decision boundary of chunk 2 moves upward relative to chunk 1. As a result, some
nonanomalous data is incorrectly classified as anomalous, causing the false positive (FP)
rate to rise.
2. The decision boundary of chunk 3 moves downward relative to chunk 2. As a result, some
anomalous data is incorrectly classified as nonanomalous, causing the false negative (FN)
rate to rise.
In general, the old and new decision boundaries can intersect, causing both of the above cases
to occur simultaneously for the same chunk. Therefore, both FP and FN counts may increase.
These observations suggest that a model built from a single chunk or any finite prefix of chunks is
inadequate to properly classify all data in the stream. This motivates the adoption of our ensemble
approach, which classifies data using an evolving set of K models.
The organization of this chapter is as follows. Ensemble learning will be discussed in Section 16.2. Ensembles for unsupervised learning will be discussed in Section 16.3. Ensemble learning for supervised learning will be discussed in Section 16.4. This chapter is summarized in Section 16.5.
FIGURE 16.2 Classifying an unlabeled instance x by weighted majority voting: models M1, M3, and M7 cast votes weighted λ^6, λ^4, and λ^0, respectively.
When a new chunk arrives, one updating strategy is to evaluate all K + 1 candidate models on the most recent chunk and discard the poorest predictor. This requires the ground truth
to be immediately available for the most recent chunk so that the prediction error can be accurately
measured. If the ground truth is not available, we instead rely on majority voting; the model with
least agreement with the majority decision is discarded. This results in an ensemble of the K models
that best match the current concept.
WA(E, a) = \frac{\sum_{\{i \mid M_i \in E,\; a \in A_{M_i}\}} \lambda^{\ell-i}}{\sum_{\{i \mid M_i \in E\}} \lambda^{\ell-i}}    (16.1)
where Mi ∈ E is a model in ensemble E that was trained from chunk i, A_Mi is the set of anomalies reported by model Mi, λ ∈ [0, 1] is a constant fading factor [CHEN09], and ℓ is the index of the most recent chunk. Model Mi’s vote therefore receives weight λ^(ℓ−i), with the most recently constructed model receiving weight λ^0 = 1, the model trained from the previous chunk receiving weight λ^1 (if it still exists in the ensemble), and so on. This has the effect of weighting the votes of more recent models above those of potentially outdated ones when λ < 1. Weighted average WA(E, a) is then rounded to the nearest integer (0 or 1) in Line 15 to obtain the weighted majority vote. For example, in Figure 16.2, models M1, M3, and M7 vote positive, positive, and negative, respectively, for the input sample x. If ℓ = 7 is the most recent chunk, these votes are weighted λ^6, λ^4, and λ^0 = 1, respectively. The weighted average is therefore WA(E, x) = (λ^6 + λ^4)/(λ^6 + λ^4 + 1). If λ ≤ 0.86, the newest model’s dissenting negative vote dominates and x is classified as normal; however, if λ ≥ 0.87, the two older models’ positive votes outweigh the newer dissenting opinion, and the result is a positive classification. The parameter λ can thus be
older dissenting opinions, and the result is a positive classification. The parameter λ can thus be
tuned to balance the importance of large amounts of older information against smaller amounts of
newer information. Our approach uses the results from the previous iterations of GBAD to identify
anomalies in subsequent data chunks. That is, normative substructures found in the previous GBAD
iterations may persist in each model. This allows each model to consider all data since the model’s
introduction to the ensemble, not just that of the current chunk. When streams exhibit concept drift, this can be a significant advantage because the ensemble can identify patterns that are normative over the entire data stream or a significant number of chunks but not in the current chunk. Thus,
insiders whose malicious behavior is infrequent can still be detected.
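A minimal sketch of this weighted voting scheme in Python is given below. It is illustrative only: representing each model as a chunk index together with the set of anomalies it reports is a simplifying assumption, but the fragment reproduces the worked example above.

def weighted_average(ensemble, a, lam, ell):
    # Equation 16.1: weighted fraction of ensemble models that report
    # candidate 'a' as anomalous. 'ensemble' is a list of
    # (chunk_index, reported_anomalies) pairs, 'lam' is the fading factor,
    # and 'ell' is the index of the most recent chunk.
    num = sum(lam ** (ell - i) for i, anomalies in ensemble if a in anomalies)
    den = sum(lam ** (ell - i) for i, _ in ensemble)
    return num / den

# Worked example from the text: models trained from chunks 1, 3, and 7 vote
# positive, positive, and negative on sample "x"; chunk 7 is the most recent.
ensemble = [(1, {"x"}), (3, {"x"}), (7, set())]
for lam in (0.86, 0.87):
    wa = weighted_average(ensemble, "x", lam, ell=7)
    print(lam, round(wa, 3), "anomalous" if round(wa) == 1 else "normal")
    # 0.86 -> 0.488 (normal); 0.87 -> 0.502 (anomalous)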
It is important to note that the size of the ensemble remains fixed over time. Outdated models
that are performing poorly are replaced by better performing, newer models that are more suited to
the current concept. This keeps each round of classification tractable, even though the total amount
of data in the stream is potentially unbounded.
REFERENCES
[CHEN09]. L. Chen, S. Zhang, L. Tu, “An Algorithm for Mining Frequent Items on Data Stream Using
Fading Factor,” In COMPSAC’09: Proceedings of the IEEE International Computer Software and
Applications Conference, Seattle, WA, pp. 172–177, 2009.
[PARV11a]. P. Parveen, Z. R. Weger, B. Thuraisingham, K. W. Hamlen, L. Khan, “Supervised Learning
for Insider Threat Detection Using Stream Mining,” In Proceedings of the 23rd IEEE International
Conference on Tools with Artificial Intelligence, November 7–9, Boca Raton, FL, 2011.
[PARV11b]. P. Parveen, J. Evans, B. Thuraisingham, K. W. Hamlen, L. Khan, “Insider Threat Detection Using
Stream Mining and Graph Mining,” In PASSAT’2011: Proceedings of the 3rd IEEE Conference on
Privacy, Security, Risk and Trust, MIT, October Boston, MA, pp. 1102–1110, 2011.
[PARV13]. P. Parveen, N. McDaniel, J. Evans, B. Thuraisingham, K. W. Hamlen, L. Khan, “Evolving Insider
Threat Detection Stream Mining Perspective,” International Journal on Artificial Intelligence Tools
22 (5), 1360013, 2013.
17 Details of Learning Classes
17.1 INTRODUCTION
Insider threats are veritable needles within the haystack. Their occurrence is rare, and when they do occur, they are usually well masked within normal operation. The detection of these threats requires
identifying these rare anomalous needles in a contextualized setting where behaviors are constantly
evolving over time. To support this refined search, we have designed approaches based on both supervised and unsupervised ensemble-based learning algorithms that maintain a compressed dictionary
of repetitive sequences found throughout dynamic data streams of unbounded length to identify
anomalies. For example, in unsupervised learning, compression-based techniques are used to model
common behavior sequences. This results in a classifier exhibiting a substantial increase in classification accuracy for data streams containing insider threat anomalies. This ensemble of classifiers
allows the unsupervised approach to outperform traditional static learning approaches and boosts
the effectiveness over supervised learning approaches.
This chapter will describe the different classes of learning techniques for nonsequence data
([PARV11a], [PARV13], [PARV11b]). It serves the purpose of providing more detail as to exactly
how each method arrives at detecting insider threats and how ensemble models are built, modified,
and discarded. The first subsection focuses on supervised learning in detail, and the second subsec-
tion focuses on unsupervised learning. Both contain the formulas necessary to understand the inner
workings of each class of learning.
The organization of this chapter is as follows. Supervised learning will be discussed in Section
17.2, while unsupervised learning will be discussed in Section 17.3. Section 17.4 will provide a sum-
mary of this chapter.
f(x) = \langle w, x \rangle + b    (17.1)
where w is the normal vector and b is a bias term. The OCSVM solves an optimization problem to
find the rule with the maximal geometric margin. This classification rule will be used to assign a
label to a test example x. If f (x) < 0, we label x as an anomaly; otherwise, it is labeled normal. In real-
ity, there is a trade-off between maximizing the distance of the hyper-plane from the origin and the
number of training data points contained in the region separated from the origin by the hyper-plane.
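As a concrete illustration, a one-class SVM of this form can be trained with an off-the-shelf library. The sketch below uses scikit-learn's OneClassSVM on placeholder data; the feature matrix and the parameter values are hypothetical and are not the settings used in our experiments.

import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder feature matrix: each row stands for a feature vector extracted
# from a (mostly benign) system call token.
rng = np.random.default_rng(0)
X_train = rng.random((200, 5))
X_test = rng.random((10, 5))

# nu bounds the fraction of training points treated as outliers, reflecting
# the trade-off described above; the value here is illustrative only.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
ocsvm.fit(X_train)

# decision_function returns f(x); scores below zero fall on the anomalous
# side of the hyperplane, mirroring the labeling rule f(x) < 0 above.
scores = ocsvm.decision_function(X_test)
labels = ["anomaly" if s < 0 else "normal" for s in scores]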
FIGURE 17.1 A graph with a normative substructure (boxed) and anomalies (shaded).
Our unsupervised approach applies graph-based anomaly detection (GBAD), which discovers anomalies by searching for three factors: modifications, insertions, and deletions of vertices and edges. Each
unique factor runs its own algorithm that finds a normative substructure and attempts to find the
substructures that are similar but not completely identical to the discovered normative substructure.
A normative substructure is a recurring subgraph of vertices and edges that, when coalesced into a
single vertex, most compresses the overall graph. The rectangle in Figure 17.1 identifies an example
of normative substructure for the depicted graph.
Our implementation uses SUBDUE [KETK05] to find normative substructures. The best norma-
tive substructure can be characterized as the one with minimal description length (MDL):
L(S, G) = DL(G | S) + DL(S)    (17.2)
where G is the entire graph, S is the substructure being analyzed, DL(G | S) is the description length of
G after being compressed by S, and DL(S) is the description length of the substructure being analyzed.
Description length DL(G) is the minimum number of bits necessary to describe graph G [EBER11].
Insider threats appear as small percentage differences from the normative substructures. This
is because insider threats attempt to closely mimic legitimate system operations except for small
variations embodied by illegitimate behavior. We apply three different approaches for identifying
such anomalies, discussed as follows.
17.3.1 GBAD-MDL
Upon finding the best compressing normative substructure, GBAD-MDL searches for deviations
from that normative substructure in subsequent substructures. By analyzing substructures of the
same size as the normative one, differences in the edges and vertices’ labels and in the direction or
endpoints of edges are identified. The most anomalous of these are those substructures for which
the fewest modifications are required to produce a substructure isomorphic to the normative one. In
Figure 17.1, the shaded vertex labeled E is an anomaly discovered by GBAD-MDL.
17.3.2 GBAD-P
In contrast, GBAD-P searches for insertions that, if deleted, yield the normative substructure.
Insertions made to a graph are viewed as extensions of the normative substructure. GBAD-P calcu-
lates the probability of each extension based on edge and vertex labels, and therefore exploits label
information to discover anomalies. The probability is given by
P(A = v) = P(A = v | A)P(A)    (17.3)
where A represents an edge or vertex attribute and v represents its value. The probability P(A = v | A)
can be generated by a Gaussian distribution:
\rho(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)    (17.4)
where µ is the mean and σ is the standard deviation. Lower values of ρ(x), that is, less probable extensions, correspond to more anomalous substructures.
Using GBAD-P therefore ensures that malicious insider behavior that is reflected by the actual
data in the graph (rather than merely its structure) can be reliably identified as anomalous by our
algorithm. In Figure 17.1, the shaded vertex labeled C is an anomaly discovered by GBAD-P.
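For illustration, the density of Equation 17.4 can be evaluated directly once µ and σ are estimated from the attribute values already present in the graph. The snippet below is a generic sketch of that computation only; it is not the GBAD-P implementation, and the sample values are hypothetical.

import math

def gaussian_density(x, observed_values):
    # Equation 17.4, with mu and sigma estimated from previously observed
    # values of the same attribute.
    mu = sum(observed_values) / len(observed_values)
    var = sum((v - mu) ** 2 for v in observed_values) / len(observed_values)
    sigma = math.sqrt(var)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical example: density of a 2:00 a.m. login time (in seconds) given
# past login times clustered around 9:00 a.m.
past_logins = [32400, 33000, 31800, 34200, 32700]
print(gaussian_density(7200, past_logins))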
17.3.3 GBAD-MPS
Finally, GBAD-MPS considers deletions that, if re-inserted, yield the normative substructure. To
discover these, GBAD-MPS examines the parent structure. Changes in size and orientation in the
parent signify deletions amongst the subgraphs. The most anomalous substructures are those with
the smallest transformation cost required to make the parent substructures identical. In Figure 17.1,
the last substructure of A−B−C−D vertices is identified as anomalous by GBAD-MPS because
of the missing edge between B and D marked by the shaded rectangle.
REFERENCES
[COOK00]. D. J. Cook and L. B. Holder, “Graph-Based Data Mining,” IEEE Intelligent Systems, 15 (2),
32–41, 2000.
[COOK07]. D. J. Cook and L. B. Holder, editors. Mining Graph Data. John Wiley & Sons, Inc., Hoboken, NJ, 2007.
[EBER07]. W. Eberle and L. B. Holder, “Mining for Structural Anomalies in Graph-Based Data,” In DMIN’07:
Proceedings of the International Conference on Data Mining, pp. 376–389, 2007.
[EBER11]. W. Eberle, J. Graves, and L. Holder, “Insider Threat Detection Using a Graph-Based Approach,”
Journal of Applied Security Research, 6 (1), 32–81, 2011.
[KETK05]. N. S. Ketkar, L. B. Holder, and D. J. Cook, “Subdue: Compression-Based Frequent Pattern Discovery
in Graph Data,” In Proceedings of the ACM KDD Workshop on Open-Source Data Mining, Chicago, IL,
pp. 71–76, 2005.
[MANE02]. L. M. Manevitz and M. Yousef, “One-class SVMs for Document Classification,” The Journal of
Machine Learning Research, 2, 139–154, 2002.
[MASU09]. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, A Multi-Partition Multi-Chunk
Ensemble Technique to Classify Concept-Drifting Data Streams, Advances in Knowledge Discovery and
Data Mining, Springer, Berlin, pp. 363–375, 2009.
[PARV11a]. P. Parveen, J. Evans, B. Thuraisingham, K. W. Hamlen, and L. Khan, “Insider Threat Detection
Using Stream Mining and Graph Mining,” In PASSAT’2011: Proceedings of the 3rd IEEE Conference on
Privacy, Security, Risk and Trust, MIT Press, Boston, MA, pp. 1102–1110, 2011.
[PARV11b]. P. Parveen, Z. R. Weger, B. Thuraisingham, K. W. Hamlen, and L. Khan, “Supervised Learning
for Insider Threat Detection Using Stream Mining,” In Proceedings of the 23rd IEEE International
Conference on Tools with Artificial Intelligence, November 7–9, Boca Raton, FL, pp. 1032–1039, 2011.
[PARV13]. P. Parveen, N. McDaniel, J. Evans, B. Thuraisingham, K. W. Hamlen, and L. Khan, “Evolving
Insider Threat Detection Stream Mining Perspective,” International Journal on Artificial Intelligence
Tools, 22 (5), 1360013, 2013.
[YAN02]. X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining,” In ICDM’02: Proceedings
of the International Conference on Data Mining, pp. 721–724, 2002.
18 Experiments and Results for Nonsequence Data
18.1 INTRODUCTION
Chapters 16 and 17 described our stream mining techniques for insider threat detection. In par-
ticular, ensemble-based techniques for nonsequence data were discussed, covering both supervised and unsupervised learning methods as well as stream mining for nonsequence
data. We have argued that we need scalable stream mining techniques as massive amounts of data
streams have to be analyzed for insider threat detection.
In this chapter, we will discuss our testing methodology and experimental results. The organiza-
tion of this chapter is as follows. The dataset we used is discussed in Section 18.2. Experimental
setup is discussed in Section 18.3. Results are presented in Section 18.4. This chapter is summarized
in Section 18.5.
18.2 DATASET
We tested both of our algorithms on the 1998 Lincoln Laboratory Intrusion Detection dataset
[KEND98]. This dataset consists of daily system logs, containing all system calls performed by
all processes over a 7-week period. It was created using the Basic Security Module (BSM) auditing
program. Each log consists of tokens that represent system calls using the syntax exemplified in
Figure 18.1.
Each token begins with a header line and ends with a trailer line. The header line reports
the size of the token in bytes, a version number, the system call, and the date and time of execu-
tion in milliseconds. The second line reports the full path name of the executing process. The
optional attribute line identifies the user and group of the owner, the file system and node, and the
device. The next line reports the number of arguments to the system call, followed by the arguments
themselves on the following line. The subject line reports the audit ID, effective user and group
IDs, real user and group IDs, process ID, session ID, and terminal port, and address, respectively.
Finally, the return line reports the outcome and return value of the system call.
Many system calls are the result of automatic processes not initiated by any particular user and are therefore not pertinent to the detection of insider threats. We limit our attention to user-
affiliated system calls. These include calls for exec, execve, utime, login, logout, su, setegid, seteuid,
setuid, rsh, rexecd, passwd, rexd, and ftp. All of these correspond to logging in/out or file operations
initiated by users, and are, therefore, relevant to insider threat detection. Restricting our attention to
such operations helps to reduce extraneous noise in the dataset. Further, some tokens contain calls
made by users from the outside, via web servers, and are not pertinent to the detection of insider
threats. Six such users in this dataset were removed. Table 18.1 reports statistics
for the dataset after all irrelevant tokens have been filtered out and the attribute data in Figure 18.2
has been extracted. Preprocessing extracted 62K tokens spanning 500K vertices. This reflected the
activity of all users over nine weeks.
Figure 18.3 shows the features extracted from the output data in Figure 18.1 for our supervised
approach and Figure 18.2 depicts the subgraph structure yielded for our unsupervised approach.
The first number in Figure 18.3 is the classification of the token as either anomalous (−1) or normal (1).
The classification is used by a two-class support vector machine (SVM) for training the model,
but is unused (although required) for the one-class SVM (OCSVM). The rest of the line is a list
FIGURE 18.1 A sample system call record from the MIT Lincoln dataset.
TABLE 18.1
Dataset Statistics after Filtering and
Attribute Extraction
Statistic Value
No. of vertices 500,000
No. of tokens 62,000
No. of normative substructures 5
No. of users All
Duration 9 weeks
of index−value pairs that are separated by a colon (:). The index represents the dimension for use
by SVM, and the value is the value of the token along that dimension. The value must be numeric,
and the list must be ascending by the index. Indices that are missing are assumed to have a value
of 0. Attributes that are categorical in nature (and can take the value of any one of N categories)
are represented by N dimensions. In Figure 18.3, 1:29669 means that the time of day (in seconds)
is 29669. 6:1 means that the user’s ID (which is categorical) is 2142, 8:1 means that the machine IP
address (also categorical) is 135.13.216.191, 21:1 means that the command (categorical) is execve,
32:1 means that the path begins with /opt, and 36:0 means that the return value is 0. The mappings
between the data values and the indices were set internally by a configuration file.
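The index:value representation described above is the sparse input format used by SVM tools such as LIBSVM [CHAN11]. The short sketch below shows how one token could be converted into such a line; the index assignments in the example are hypothetical stand-ins, since the real mapping is set by the configuration file as noted above.

def to_feature_line(label, features):
    # Build one sparse feature line: '<label> idx:val idx:val ...' with
    # indices in ascending order. label is -1 (anomalous) or 1 (normal).
    pairs = " ".join(f"{idx}:{features[idx]}" for idx in sorted(features))
    return f"{label} {pairs}"

# Hypothetical index assignments: 1 = time of day in seconds, 6 = one-hot
# flag for user 2142, 8 = one-hot flag for IP 135.13.216.191, 21 = one-hot
# flag for the execve command, 32 = path begins with /opt, 36 = return value.
token = {1: 29669, 6: 1, 8: 1, 21: 1, 32: 1, 36: 0}
print(to_feature_line(1, token))   # "1 1:29669 6:1 8:1 21:1 32:1 36:0"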
All of these features are important for different reasons. The time of day could indicate that
the user is making system calls during normal business hours, or, alternatively, is logging in late
at night, which could be anomalous. The path could indicate the security level of the system call
being made; for instance, a path beginning with /sbin could indicate the use of important system files, while a path like /bin/mail could indicate something more benign, like sending mail. The
user ID is important to distinguish events; what is anomalous for one user may not be anomalous
FIGURE 18.2 Subgraph structure of a system call token, with vertices for the system call, its arguments, 〈data〉, 〈user ID〉, and 〈audit ID〉.
for another. A programmer that normally works from 9 a.m. to 5 p.m. would not be expected to
login at midnight, but a maintenance technician (who performs maintenance on server equipment
during off hours, at night) would. Frequent changes in machine IP addresses or changes that are not
frequent enough could indicate something anomalous. Finally, the system call itself could indicate
an anomaly; most users would be expected to log in and log out, but only administrators would be
expected to invoke super-user privileges with a command such as su.
TABLE 18.2
Exp. A: One Class vs. Two Class SVM
One Class SVM Two Class SVM
False positives 3706 0
True negatives 25,701 29,407
False negatives 1 5
True positives 4 0
Accuracy 0.87 0.99
False positive rate 0.13 0.0
False negative rate 0.2 1.0
TABLE 18.3
Exp. B: Updating vs. Nonupdating Stream Approach
Updating Stream Nonupdating Stream
False positives 13,774 24,426
True negatives 44,362 33,710
False negatives 1 1
True positives 9 9
Accuracy 0.76 0.58
False positive rate 0.24 0.42
False negative rate 0.1 0.1
TABLE 18.4
Summary of Data Subset A
(Selected/Partial)
Statistic Dataset A
User Donaldh
No. of vertices 269
No. of edges 556
Week 2–8
Weekday Friday
day and weighted according to the accuracy of the model’s previous decisions. For each test token,
the ensemble reports the majority vote of its models.
The aforementioned stream approach is more practical for detecting insider threats because insider threat data are streaming in nature and arrive in real time. A situation like that in the first experiment above is not one that will occur in the real world. In the real world, insider threats must be detected as they occur, not after months of data have accumulated. Therefore, it is reasonable to compare
our updating stream ensemble with a simple OCSVM model constructed once and tested (but not
updated) as a stream of new data becomes available (see Table 18.3).
18.3.2 Unsupervised Learning
For our unsupervised approach (based on graph-based anomaly detection), we needed to accurately depict the effects of two variables: K, the ensemble size (the number of models maintained), and q, the number of normative substructures maintained for each model in the ensemble. We used a subset of the data for this wide variety of experiments, as depicted in Table 18.4, in order to complete them in a manageable time. We chose the small subset because the cost of checking subgraph isomorphism grows exponentially.
Each ensemble iteration was run with q values between 1 and 8. Iterations were made with
ensemble sizes of K values between 1 and 6.
18.4 RESULTS
18.4.1 Supervised Learning
Performance and accuracy were measured in terms of total FPs and FNs throughout 7 weeks of test
data as discussed in Table 18.4 (weeks 2–8). The Lincoln Laboratory dataset was chosen because of
both its large size and its well-known set of anomalies, facilitating an accurate performance assess-
ment via misapprehension counts. Table 18.2 shows the results for the first experiment using our
supervised method. The OCSVM outperforms the two-class SVM in the first experiment. Simply put, the two-class SVM is unable to detect any of the positive cases correctly. Although the two-class
SVM does achieve a higher accuracy, it is at the cost of having a 100% FN rate. By varying the
parameters for the two-class SVM, we found it possible to increase the FP rate (the SVM made an
attempt to discriminate between anomaly and normal data), but in no case could the two-class SVM
predict even one of the truly anomalous cases correctly. The OCSVM, on the other hand, achieves
a moderately low FN rate (20%), while maintaining a high accuracy (87.40%). This demonstrates
the superiority of the OCSVM over the two-class SVM for insider threat detection. The superior-
ity of the OCSVM over two-class SVM for insider threat detection further justifies our decision to
use OCSVM for our test of stream data. Table 18.3 gives a summary of our results for the second
experiment using our supervised method. The updating stream achieves much higher accuracy than
the nonupdating stream, while maintaining an equivalent and minimal FN rate (10%). The accuracy
of the updating stream is 76%, while that of the nonupdating stream is 58%.
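For reference, the accuracy and error rates quoted in Tables 18.2 and 18.3 follow the usual confusion-matrix definitions, as the small sketch below illustrates using the updating-stream counts from Table 18.3.

def rates(tp, tn, fp, fn):
    # Standard confusion-matrix metrics used in Tables 18.2 and 18.3.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fp_rate = fp / (fp + tn)   # fraction of normal data flagged as anomalous
    fn_rate = fn / (fn + tp)   # fraction of anomalies that were missed
    return accuracy, fp_rate, fn_rate

# Updating-stream column of Table 18.3: roughly (0.76, 0.24, 0.1).
print(rates(tp=9, tn=44362, fp=13774, fn=1))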
The superiority of updating stream over nonupdating stream for insider threat detection further
justifies our decision to use updating stream for our test of stream data. By using labeled data, we
establish a ground truth for our supervised learning algorithm. This ground truth allows us to place
higher weights on FNs or FPs. By weighting one more heavily than the other, we penalize a model more for producing the type of error whose weight we have increased. When detecting insider threats, it is more
important that we do not miss a threat (FN) than identify a false threat (FP). Therefore, we weigh
FN more heavily, that is, we add an FN cost. Figures 18.4 and 18.5 show the results of weighting the
FIGURE 18.4 Ensemble accuracy as a function of the FN cost.
FIGURE 18.5 Total cost (Equation 18.1) as a function of the FN cost.
TABLE 18.5
Impact of FN Cost
Accuracy F2 Measure
w/ FN cost 0.55682 0.00159
w/o FN cost 0.45195 0.00141
FNs more heavily than FPs with this established ground truth. This is to say that at an FN cost of
50, an FN that is produced will count against a model 50 times more than an FP will. Increasing the
FN cost also increases the accuracy of our OCSVM updating stream approach. We can see that increasing the FN cost up to 30 only increases the total cost without affecting the accuracy, but
after this, the accuracy climbs and the total cost comes down. Total cost, as calculated by Equation
18.1, represents the total number of FPs and FNs after they have been modified by the increased FN
cost. We see this trend peak at an FN cost of 80 where accuracy reaches nearly 56% and the total
cost is at a low of 25,229.
Total Cost = Total False Positives + (Total False Negatives * FN Cost ) (18.1)
The FNs are weighted by cost more heavily than FPs because it is more important to catch all
insider threats. FPs are acceptable in some cases, but an insider threat detection system is useless if
it does not catch all positive instances of insider threat activity. This is why models that fail to catch
positive cases and produce these FNs are punished, in our best-case result, 80 times more heavily than those that produce FPs.
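Equation 18.1 amounts to a one-line function; the sketch below restates it in code, with arbitrary illustrative counts rather than experimental results.

def total_cost(false_positives, false_negatives, fn_cost):
    # Equation 18.1: each FN counts fn_cost times as much as an FP.
    return false_positives + false_negatives * fn_cost

# In the experiments above, accuracy peaked and the total cost bottomed out
# at an FN cost of 80; the call below merely exercises the formula.
print(total_cost(false_positives=100, false_negatives=2, fn_cost=80))   # 260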
Table 18.5 reinforces our decision to include FN cost during model elimination that heavily
punishes models that produce FNs over those that produce FPs. Including FN cost increases the
accuracy of the ensemble and provides a better F2 measure.
18.4.2 Unsupervised Learning
We next investigate the impact of parameters K (the ensemble size) and q (the number of normative
substructures per model) on the classification accuracy and running times for our unsupervised
approach. To more easily perform the larger number of experiments necessary to chart these rela-
tionships, we employ the smaller datasets summarized in Table 18.4 for these experiments. Dataset
A consists of activity associated with user Donaldh during weeks 2–8. This user displays mali-
cious insider activity during the respective time period. This dataset evidences similar trends for all
relationships discussed henceforth; therefore, we report only the details for dataset A throughout the
remainder of this section. Figure 18.6 shows the relationship between the cutoff q for the number of
normative substructures and the running time in dataset A. Times increase approximately linearly
FIGURE 18.6 Running time versus the normative substructure limit q (dataset A).
until q = 5 because there are only four normative structures in dataset A. The search for a fifth
structure therefore fails (but contributes running time), and higher values of q have no further effect.
Figure 18.7 shows the impact of ensemble size K and runtimes for dataset A. As expected, run-
times increase approximately linearly with the number of models (2 s per model on average in this
dataset). Increasing q and K also tends to aid in the discovery of TPs. Figures 18.8 and 18.9 illustrate
the positive relationships of q and K, respectively, with TP. Once q = 4, normative substructures are
considered per model and K = 4 models are consulted per ensemble, the classifier reliably detects
all seven TPs in dataset A. These values of q and K therefore strike the best balance between the
coverage of all insider threats and the efficient runtimes necessary for high responsiveness.
Increasing q to 4 does come at the price of raising more false alarms, however. Figure 18.10
shows that the FP rate increases along with the TP rate until q = 4. Dataset A has only four norma-
tive structures, so increasing q beyond this point has no effect. This is supported with q = 4, 5, 6
showing no increase in TP.
FIGURE 18.7 Running time versus ensemble size K (dataset A).
FIGURE 18.8 True positives versus the normative substructure limit q (dataset A).
FIGURE 18.9 True positives versus ensemble size K (dataset A).
FIGURE 18.10 False positives versus the normative substructure limit q (dataset A).
Table 18.6 considers the impact of weighted versus unweighted majority voting on the classifi-
cation accuracy. The unweighted columns are those for λ = 1 and the weighted columns use the
fading factor λ = 0.9. The dataset consists of all tokens associated with the user ID 2143. Weighted
majority voting has no effect in these experiments except when K = 4, where it reduces the FP count from 124 (unweighted) to 85 (weighted) and increases the TN count from 51 (unweighted) to
90 (weighted). However, since these results can be obtained for K = 3 without weighted voting, we
conclude that weighted voting merely serves to mitigate a poor choice of K; weighted voting has
little or no impact when K is chosen wisely.
Table 18.7 gives a summary of our results comparing our supervised and unsupervised learn-
ing approaches. For example, on dataset A, the supervised learning achieves much higher accu-
racy (71%) than the unsupervised learning (56%), while maintaining a lower FP (31%) and FN
rate (0%). On the other hand, unsupervised learning achieves 56% accuracy, 54% FP rate, and
42% FN rate.
TABLE 18.6
Impact of Fading Factor λ (Weighted Voting)
K=2 K=3 K=4
TABLE 18.7
Supervised vs. Unsupervised Learning Approach on Dataset A
Supervised Learning Unsupervised Learning
False positives 55 95
True negatives 122 82
False negatives 0 5
True positives 12 7
Accuracy 0.71 0.56
False positive rate 0.31 0.54
False negative rate 0 0.42
REFERENCES
[CHAN11]. C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Transactions
on Intelligent Systems and Technology 2 (3), 27:1–27:27, 2011. Software available at http://www.csie.
ntu.edu.tw/cjlin/libsvm.
[KEND98]. K. Kendall, “A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems,”
Master’s thesis, Massachusetts Institute of Technology, 1998.
19 Insider Threat Detection for Sequence Data
19.1 INTRODUCTION
In this chapter, we will discuss insider threat detection for sequence data. A sequence is an ordered list of objects (or events). A sequence contains members (also called elements or terms). In a set, ele-
ment order does not matter. On the other hand, in a sequence, order matters, and, hence, exactly
the same elements can appear multiple times at different positions in the sequence [QUMR13]. For
example, (U, T, D) is a sequence of letters with the letter “U” first and “D” last. This sequence dif-
fers from (D, T, U).
The sequence (U, T, D, A, L, L, A, S), which contains the letter “A” at two different positions, is a valid sequence. Figure 19.1 illustrates some movement pattern sequences of a user. The first
row represents a particular user’s one movement pattern sequence: student center, office, and media
lab (ml). In this sequence, the user was first at student center and ml (media lab) last [EAGL06].
The organization of this chapter is as follows. Classifying sequence data will be discussed in
Section 19.2. Unsupervised stream-based sequence learning will be discussed in Section 19.3.
Anomaly detection aspects will be discussed in Section 19.4. Complexity analysis will be provided
in Section 19.4. This chapter is summarized in Section 19.5.
Case 1: The decision boundary of the second chunk moves upward compared to that of the
first chunk. As a result, more normal data will be classified as anomalous by the decision
boundary of the first chunk; thus, FP will go up. Recall that a test point having a true
benign (normal) category classified as anomalous by a classifier is known as an FP.
Case 2: The decision boundary of the third chunk moves downward compared to that of the
first chunk. So, more anomalous data will be classified as normal data by the decision
boundary of the first chunk; thus, the FN will go up. Recall that a test point having a true
malicious category classified as benign by a classifier is known as an FN.
Movement Pattern
(student center)(office)(ml)
(maqs ave)(ml)(tang)(ml)(sloan)(ml)
(100 memorial)(ml)(tang)(black sheep restaurant)(ml)(sloan)(ml)
(off phm)(ml)(starbucks)(ml)
(hamshire&broadway)(off phm)(ml)(starbucks)(ml)
(ml)(100 memorial)(ml)(tang)(black sheep restaurant)(ml)(sloan)(ml)
In the more general case, the decision boundary of the current chunk can vary, which causes the
decision boundary of the previous chunk to misclassify both normal and anomalous data. Therefore,
both FP and FN may go up at the same time.
This suggests that a model built from a single chunk will not suffice. This motivates the adoption
of adaptive learning. In particular, we will exploit two approaches as follows:
(Figure: from a user’s data, unsupervised sequence learning generates an LZW dictionary (D) containing all possible patterns using the Lempel–Ziv–Welch algorithm and compresses it into a quantized dictionary (QD); the framework combines incremental-based stream mining and online learning.)
Recall that a model will declare the test data as anomalous based on how much the test differs
from the model’s normative patterns. Once all models cast their vote, we will apply majority voting
to make the final decision as to whether the test point is anomalous or not (as shown in Figure 16.2).
Model Update: We always keep an ensemble of a fixed number of models (K in this case). Hence, when a
new chunk is processed, we already have K models in the ensemble, and the (K + 1)st model will be
created from the current chunk. We need to update the ensemble by replacing a victim model with
(Figure: sessions in the previous chunk are compressed with LZW into the old quantized dictionary (OQD), and sessions in the new chunk are compressed into the new quantized dictionary (NQD).)
(Figure: system calls gathered from chunk i are indexed with unicode, an indexed LZW dictionary (D) containing all possible patterns is generated by unsupervised sequence learning, compressed into a quantized dictionary (QD), and used to update the ensemble of models.)
this new model. Victim selection can be done in a number of ways. One approach is to calculate
the prediction error of each model on the most recent chunk relative to the majority vote. Here, we
assume that ground truth on the most recent chunk is not available. If ground truth is available, we
can exploit this knowledge for training. The new model will replace the existing model from the
ensemble that gives the maximum prediction error.
FIGURE 19.6 Unsupervised stream-based sequence learning (USSL) from a chunk in ensemble-based case.
w_i = \frac{f_i}{\sum_{i=1}^{n} f_i}    (19.1)
where wi is the weight of a particular pattern pi in the current chunk, fi is the number of times the
pattern pi appears in the current chunk, and n is the total number of distinct patterns found in that
chunk.
Next, we compress the dictionary by keeping only the longest, frequent unique patterns accord-
ing to their associated weight and length, while discarding other subsumed patterns. This technique
is called the compression method (CM), and the new dictionary is a QD. The QD has a set of pat-
terns and their corresponding weights. Here, we use the edit distance to find the longest pattern.
The edit distance is a measure of similarity between pairs of strings ([BORG11], [VLAD66]). It is
the minimum number of actions required to transform one string into another, where an action can be
substitution, addition, or deletion of a character into the string. As in the case of the earlier example,
the best normative pattern in the QD would be lift, come, etc.
This process is a lossy compression, but is sufficient to extract the meaningful normative
patterns. The reason behind this is that the patterns that we extract are the superset of the subsumed
patterns. Moreover, as frequency is another control parameter in our experiment, the patterns which
do not appear often cannot be regular user patterns.
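A minimal sketch of this quantization step is shown below. It assumes the LZW dictionary is available as a pattern-to-frequency map, uses a hypothetical min_freq threshold to stand in for the frequency control parameter mentioned above, and simplifies the edit-distance-based selection to a plain containment test for subsumed patterns.

def quantize(lzw_dict, min_freq=2):
    # Build a quantized dictionary (QD): keep the longest frequent patterns,
    # discard patterns subsumed by an already retained pattern, and attach
    # the normalized weight of Equation 19.1 (w_i = f_i / sum of frequencies).
    total = sum(lzw_dict.values())
    qd = {}
    for pat in sorted(lzw_dict, key=lambda p: (-len(p), -lzw_dict[p])):
        if lzw_dict[pat] < min_freq:
            continue
        if any(pat in kept for kept in qd):   # subsumed by a longer pattern
            continue
        qd[pat] = lzw_dict[pat] / total
    return qd

# Toy LZW dictionary for a stream of repeated "lift" commands; only the
# longest frequent pattern survives.
print(quantize({"li": 3, "if": 3, "ft": 3, "lif": 2, "ift": 2, "lift": 2}))
# {'lift': 0.13333333333333333}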
Data relevant to insider threat is typically accumulated over many years of organization and
system operations, and is therefore best characterized as an unbounded data stream. As our data is a
continuous stream of data, we use ensemble-based learning to continuously update our compressed
dictionary. This continuous data stream is partitioned into a sequence of discrete chunks. For exam-
ple, each chunk might comprise a day’s or a week’s worth of data and may contain several user
sessions. We generate our QD and their associated weight from each chunk. Weight is measured as
the normalized frequency of a pattern within that chunk.
When a new chunk arrives, we generate a new QD model and update the ensemble as mentioned
earlier. Figure 19.6 shows the flow diagram of our dynamic, ensemble-based, unsupervised stream
sequence learning method. Algorithm 19.1 shows the basic building block for updating the ensem-
ble. It takes the most recent data chunk S, ensemble E, and test chunk T. Lines 3−4 generate a new
QD model from the most recent chunk S and temporarily add it to the ensemble E. Lines 5−9 test
chunk T for anomalies for each model in the ensemble. Lines 13−24 find and label the anomalous
patterns in test chunk T according to the majority voting of the models in the ensemble. Finally, line
29 updates the ensemble by discarding the model with the lowest accuracy. An arbitrary model is
discarded in the case of multiple models having the same low performance.
19.3.1 Construct the LZW Dictionary by Selecting the Patterns in the Data Stream
At the beginning, we consider that our data is not annotated (i.e., unsupervised). In other words, we
do not know the possible sequence of future operations by the user. So, we use the LZW algorithm
[ZIV77] to extract the possible sequences that we can add to our dictionary. These can also be
commands like liftliftliftliftliftcomecomecomecomecomecome where each unique letter represents
(Figure: LZW dictionary entries such as li, lif, lift, lf, lft, lftl, ft, ftl, ftli, tl, tli, and tlif are reduced by lossy compression to a quantized dictionary containing lift.)
a unique system call or command. We use a unicode character to index each command. For example, ls, cp, and find are indexed as l, c, and f, respectively. The possible patterns or sequences added to our dictionary would be li, if, ft, tl, lif, ift, ftl, lift, iftl, ftli, tc, co, om, mc, com, come, and so on. When
the sequence li is seen in the data stream for the second time, in order to avoid repetition, it will not
be included in the LZW dictionary. Instead, we increase the frequency by 1 and extend the pattern
by concatenating it with the next character in the data stream, thus turning up a new pattern lif. We
will continue the process until we reach the end of the current chunk. Figure 19.7 demonstrates how
we generate an LZW dictionary from the data stream.
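A simplified sketch of this dictionary construction is given below. It follows the LZW phrase-building rule with the frequency-counting twist described above; the input string is a toy example in which each letter stands for one unicode-indexed command.

def build_lzw_dictionary(stream):
    # LZW-style dictionary construction with frequency counting: when a
    # phrase is seen again, its frequency is incremented and the phrase is
    # extended by the next symbol, producing a longer candidate pattern.
    dictionary = {}            # pattern -> frequency
    w = stream[0]
    for c in stream[1:]:
        wc = w + c
        if wc in dictionary:
            dictionary[wc] += 1
            w = wc             # extend the repeated pattern
        else:
            dictionary[wc] = 1
            w = c              # restart from the current symbol
    return dictionary

print(build_lzw_dictionary("liftliftlift"))
# {'li': 3, 'if': 1, 'ft': 2, 'tl': 1, 'lif': 2, 'ftl': 1, 'lift': 1}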
9: AM = AM − x
10: end if
11: end for
12: end for
13: for each candidate a in ∪M∈E AM do
14: if round(WeightedAverage(E, a)) = 1 then
15: A ← A ∪ {a}
16: for each model M in ensemble E do
17: if a ∈ AM then
18: cM ← cM + 1
19: end if
20: end for
21: else
22: for each model M in ensemble E do
23: if a ∉ AM then
24: cM ← cM + 1
25: end if
26: end for
27: end if
28: end for
29: E ← E − {choose(arg minM (cM))}
A pattern in the test data is considered anomalous if it differs from all the patterns qdij in E by more than X% (say > 30%). In order to find the anomalies, we
need to first find the matching patterns and delete those from the stream S. In particular, we find
the pattern from the data stream S that is an exact match or α edit distance away from any pattern,
qdij, in E. This pattern will be considered as the matching pattern. α can be half, one-third, or one-
fourth of the length of that particular pattern in qdij. Next, remaining patterns in the stream will be
considered as anomalies.
In order to identify the nonmatching patterns in the data stream S, we compute a distance matrix
L that contains the edit distance between each pattern, qdij in E, and the data stream S. If we have
a perfect match, that is, the edit distance 0 between a pattern qdij and S, we can move backward
exactly the length of qdij in order to find the starting point of that pattern in S, and then delete it
from the data stream. On the other hand, if there is an error in the match that is greater than 0 but
less than α, in order to find the starting point of that pattern in the data stream, we need to traverse
either left, diagonal, or up within the matrix according to which one among the values
(L[i,j−1], L[i−1,j−1], L[i−1, j]) gives the minimum, respectively. Finally, once we find the starting
point, we can delete that pattern from the data stream. The remaining patterns in the data stream
will be considered as anomalous.
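The matching-and-deletion step can be sketched as follows. This is a simplified illustration under stated assumptions (one best match per dictionary pattern, α fixed at one-third of the pattern length); the full procedure removes every sufficiently close occurrence.

def find_match(pattern, stream, alpha):
    # Approximate substring search: fill an edit-distance matrix L in which a
    # match may start anywhere in the stream (L[0][j] = 0), pick the end
    # column with the smallest distance, and recover the start by
    # backtracking through L[i][j-1], L[i-1][j-1], and L[i-1][j].
    m, n = len(pattern), len(stream)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        L[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == stream[j - 1] else 1
            L[i][j] = min(L[i][j - 1] + 1,          # insertion
                          L[i - 1][j] + 1,          # deletion
                          L[i - 1][j - 1] + cost)   # match or substitution
    end = min(range(n + 1), key=lambda j: L[m][j])
    if L[m][end] > alpha:
        return None                                 # no close enough match
    i, j = m, end
    while i > 0:                                    # backtrack to the start
        if j > 0 and L[i][j] == L[i - 1][j - 1] + (0 if pattern[i - 1] == stream[j - 1] else 1):
            i, j = i - 1, j - 1
        elif L[i][j] == L[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return j, end                                   # matched span stream[j:end]

# Delete matched normative patterns; whatever remains is treated as anomalous.
stream = "abclifxtcomez"
for pat in {"lift": 0.4, "come": 0.3}:
    span = find_match(pat, stream, alpha=len(pat) // 3)
    if span:
        stream = stream[:span[0]] + stream[span[1]:]
print(stream)   # "abcxtz": the leftover symbols are candidate anomalies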
TABLE 19.1
Time Complexity of Quantization
Dictionary Construction
Description Time Complexity
Pair of patterns O(n² × K²)
u number of users O(u × n² × K²)
Future work should examine how our techniques can be enhanced to provide better accuracy and
fewer false positives and negatives. Scalability of the techniques to handle massive data streams also
needs to be investigated.
REFERENCES
[BORG11]. E. N. Borges, M. G. de Carvalho, R. Galante, M. A. Gonçalves, A. H. F. Laender, “An Unsupervised
Heuristic-Based Approach for Bibliographic Metadata Deduplication,” Information Processing and
Management 47 (5), 706–718, 2011.
[EAGL06]. N. Eagle and A. (Sandy) Pentland, “Reality Mining: Sensing Complex Social Systems,” Personal
and Ubiquitous Computing 10 (4), 255–268, 2006.
[MASU08]. M. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM’08: Proceedings of
the IEEE International Conference on Data Mining, Pisa, Italy, pp. 929–934, 2008.
[MASU09]. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, A Multi-Partition Multi-Chunk Ensemble
Technique to Classify Concept-Drifting Data Streams, Advances in Knowledge Discovery and Data
Mining, Springer, Berlin, pp. 363–375, 2009.
[MASU10]. M. M. Masud, Q. Chen, J. Gao, L. Khan, C. Aggarwal, J. Han, B. Thuraisingham, “Addressing
Concept-Evolution in Concept-Drifting Data Streams,” In ICDM’10: Proceedings of the IEEE
International Conference on Data Mining, Sydney, Australia, pp. 929–934, 2010.
[MASU11]. M. M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “Classification and Novel
Class Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on
Knowledge and Data Engineering 23 (6), 859–874, 2011.
[PARV12a]. P. Parveen and B. Thuraisingham, “Unsupervised Incremental Sequence Learning for Insider
Threat Detection,” In ISI’2012: Proceedings of the IEEE International Conference on Intelligence and
Security, June, Washington, D.C., 2012.
[PARV12b]. P. Parveen, N. McDaniel, B. Thuraisingham, L. Khan, “Unsupervised Ensemble Based Learning
for Insider Threat Detection,” In PASSAT’2012: Proceedings of the 4th IEEE International Conference
on Information Privacy, Security, Risk and Trust, September, Amsterdam, The Netherlands, 2012.
[QUMR13]. S. M. Qumruzzaman, L. Khan, B. M. Thuraisingham, “Behavioral Sequence Prediction for
Evolving Data Stream,” In IRI’2013: Proceedings of the 14th IEEE Conference on Information Reuse
and Integration, August 14−16, San Francisco, CA, pp. 482–488, 2013.
[VLAD66]. V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet
Physics—Doklady 10 (8), 707–710, 1966.
[ZIV77]. J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions
on Information Theory 23 (3), 337–343, 1977.
20 Experiments and Results for Sequence Data
20.1 INTRODUCTION
Chapter 19 described in detail our approach to insider threat detection for sequence data. In particu-
lar, both supervised and unsupervised learning techniques for streaming data were discussed. In
this chapter, we will provide an overview of the testing methodology and the experimental results.
First, we will present the sequence dataset that we used for our experiments. Second, we present how we inject concept drift into the dataset. Finally, we present results showing the anomaly detection rate in the presence of concept drift.
The organization of this chapter is as follows. The dataset used is discussed in Section 20.2.
Concept-drift aspects are discussed in Section 20.3. Results are presented in Section 20.4. This
chapter is summarized in Section 20.5.
20.2 DATASET
The datasets used for training and testing have been created from Trace Files received from the
University of Calgary project [GREE88]. As a part of that, 168 trace files were collected from 168
different users of Unix csh. There were four groups of people, namely novice programmers, expe-
rienced programmers, computer scientists, and nonprogrammers. The model that we have tried to
construct is that of a novice programmer who has been gaining experience over the weeks and is
gradually using more and more command sequences similar to that of an experienced user. These
gradual normal behavior changes will be known here as concept drift. Anomaly detection in the
presence of concept drift is difficult to achieve. Hence, our scenario is more realistic. This is a slow
process that takes place over several weeks.
The Calgary dataset [GREE88] as described above was modified by Maxion [MAXI03] for mas-
querade detection. Here, we followed the same guidelines to inject masquerade commands. From
the given list of users, those users who have executed >2400 commands (that did not result in an
error) were filtered out to form the valid user pool. This list had 37 users. The remaining users were
part of the invalid user pool. Out of the list of invalid users, 25 of them were chosen at random and a
block of 100 commands from the commands that they had executed were extracted and put together
to form a list of 2500 (25×100) commands. The 2500 commands were brought together as 250
blocks of 10 commands each. Out of these 250 blocks, 30 blocks were chosen at random as the list
of masquerade commands (300 commands). For each user in the valid users list, the total number
of commands was truncated to 2400. These 2400 commands were split into eight chunks of 300
commands each. The first chunk was kept aside as the training chunk (this contains no masquerade
data). The other seven chunks are the testing chunks. In the testing chunks, a number of masquerade
data blocks (each block comprising 10 commands) were inserted at random positions. As a result,
for each user, we have one training chunk with 300 commands (no masquerade data) and seven test-
ing chunks which together have 2100 nonanomalous commands and 300 masquerade commands
(see Table 20.1 and Figure 20.1).
TABLE 20.1
Description of the Dataset
Description Number
No. of valid users 37
No. of invalid users 131
No. of valid commands per user 2400
No. of anomalous commands in testing chunks 300
FIGURE 20.1 A user from the valid user pool: 300 commands/chunk × 8 chunks = 2400 benign commands.
drift = \sqrt{\frac{\log\frac{1}{\delta}}{2 \times d \times n}}    (20.1)
where δ is the variation constant, d is the current distribution of the command over the current num-
ber of individual observations, and n is the number of current observations made. A good value for
the variation constant is 1 × 10⁻⁵. The variation constant shares an inverse relation to the overall drift
values. The expected range in distribution among produced variations is calculated by adding and
subtracting the calculated drift from the current distribution.
Upon the processing of a sample of user commands, predicted variations can be produced by the
framework upon request. The request can be made of any designated size and the concept drift will
provide new distributions that fall within the range of the calculated drift for a set of commands
of this size. The produced set of commands or new ones can be used to update the concept drift
and provide for a constantly evolving command distribution that represents an individual. Sudden
changes that do not fit within a calculated concept drift can be flagged as suspicious and therefore
possibly representative of an insider threat.
Algorithm 20.1 shows how the distribution is calculated for a set of commands and how the pre-
dicted variation is produced. As an example we take 10 commands, instead of 1000 per chunk, such
that we have [C1, C2, C1, C2, C3, C1, C4, C5, C1, C1]. The distributions will be [C1 = 0.5, C2 = 0.2,
C3 = 0.1, C4 = 0.1, C5 = 0.1]. With a value of δ = 1 × 10⁻⁵, the predicted variance for each command comes out to

C_1 = \sqrt{\frac{\log\frac{1}{1\times10^{-5}}}{2 \times 0.5 \times 10}} \approx \frac{0.7071}{\#\text{ of occurrences}}    (20.2)
We divide by the number of observation occurrences, in this case 10, because we want
the concept drift per occurrence, not just for a sample of 10. Our predicted variation (PV) val-
ues are [C1 ≈ 0.07071, C2 ≈ 0.11180, C3 ≈ 0.15811, C4 ≈ 0.15811, C5 ≈ 0.15811]. The adjusted
min/max drift comes out to be [C1 = 0.42929/0.57071, C2 = 0.08820/0.31180, C3 = 0/0.25811,
C4 = 0/0.25811, C5 = 0/0.25811]. From these drift values, we can produce a requested predicted
variation for another 10, or any number of, user command values. We look at the original sequence
and assemble at least the minimum drift value worth of commands in the variation. In this example,
the first value is C1 with a minimum of 0.42929 distribution. So we add it, bringing its current dis-
tribution to 1/10 or 0.1. This is the case for the first three values resulting in [C1, C2, C1]. The fourth
value is C2, which has already met its minimum distribution drift. Adding it will not exceed the maximum
distribution, so it is either randomly added or not. The new predicted variation could look like [C1,
C2, C1, C2, C1, C1, C1, C2, C3, C4] or [C1, C2, C1, C3, C1, C4, C1, C5, C3, C1] of which both have com-
mand distributions similar to the original that fall within the new variance bounds.
12: PV ← PV ∪ C
13: else
14:   if DC + 1 < MaxDriftC then
15:     flipCoin(PV ← PV ∪ C)
16:   else
17:     discard(C)
18:   end if
19: end if
20: end if
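To make the procedure concrete, the following is a minimal Python sketch of the drift and predicted-variation steps described above. It is a hedged illustration, not the book's Algorithm 20.1: the coin-flip choice and the base-10 logarithm are assumptions chosen so that the output matches the worked example's numbers.

```python
import math
import random

def command_distribution(commands):
    """Relative frequency of each command in the sample."""
    counts = {}
    for c in commands:
        counts[c] = counts.get(c, 0) + 1
    n = len(commands)
    return {c: cnt / n for c, cnt in counts.items()}

def per_command_drift(dist, n, delta=1e-5):
    """Drift bound per command (Equation 20.1), divided by the number of
    observations so that it is expressed per occurrence.  Base-10 log is an
    assumption that reproduces the example values (e.g., 0.7071 for C1)."""
    drift = {}
    for c, d in dist.items():
        bound = math.sqrt(math.log10(1.0 / delta) / (2.0 * d * n))
        drift[c] = bound / n
    return drift

def predicted_variation(commands, size=None, delta=1e-5):
    """Produce a predicted variation whose command distribution stays within
    [distribution - drift, distribution + drift]."""
    size = size or len(commands)
    dist = command_distribution(commands)
    drift = per_command_drift(dist, len(commands), delta)
    min_d = {c: max(0.0, dist[c] - drift[c]) for c in dist}
    max_d = {c: min(1.0, dist[c] + drift[c]) for c in dist}
    pv, counts = [], {c: 0 for c in dist}
    for c in commands:                      # walk the original sequence
        if len(pv) == size:
            break
        if counts[c] / size < min_d[c]:     # below the minimum drift: must add
            pv.append(c); counts[c] += 1
        elif (counts[c] + 1) / size <= max_d[c] and random.random() < 0.5:
            pv.append(c); counts[c] += 1    # within bounds: "flip a coin"
    return pv

sample = ["C1", "C2", "C1", "C2", "C3", "C1", "C4", "C5", "C1", "C1"]
print(per_command_drift(command_distribution(sample), len(sample)))
# C1 ~= 0.0707, C2 ~= 0.1118, C3/C4/C5 ~= 0.1581 (as in the worked example)
print(predicted_variation(sample))
```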
We compare our approach, unsupervised stream-based sequence learning (USSL), with a super-
vised method modified from the baseline approach suggested by Maxion [MAXI03] which uses
naive Bayes (NB) classifier. The modified version compared in this book is more incremental in its
model training, and thus will be referred to as NB-INC. All instances of both algorithms were tested
using eight chunks of data. At every test chunk, the NB-INC method uses all previously seen chunks
to build the model and train for the current test chunk. For test chunk 3, NB-INC builds the classification
model by training on chunks 1 and 2; for test chunk 5, it builds a model by collectively training
on chunks 1 through 4. The USSL statistics, in contrast, are gathered in a "Grow as you Go" (GG) fashion.
That is, as the ensemble size grows larger, the number of testing chunks to go
through decreases. For an ensemble size of 1, models are built from chunk 1 and chunks 2 through
8 are considered testing chunks. For an ensemble size of 3, chunks 4 through 8 are considered to
be testing chunks. Every new chunk, starting at 4 in this case, is used to update the models in each
ensemble after a majority vote is reached and the test data is classified as an anomaly or not. To
clarify, in the example of ensemble size of three after the new model is built, all models vote on
the possible anomaly and contribute to deciding which model is considered the least accurate and
thus discarded. However, only models that have survived an elimination round (i.e., deemed not to
be least accurate at least once) are used to measure the false positive (FP), true positive (TP), false
negative (FN), and true negative (TN) for an ensemble. For example, at chunk 4 the ensemble has
three models built from chunks 1 through 3, and a new model is built from chunk 4. If, during the
elimination round, the model built from chunk 3 is found least accurate, it is eliminated; the ensemble
at chunk 4 then holds models created from chunks 1, 2, and 4 but not 3. The model updating process
makes use of compression and model replacement. After the models are updated, the least accurate one
is discarded; since a new model is created before every update and anomaly classification step, this
keeps the ensemble size constant while maintaining the highest accuracy. There are other ways of updating models and selecting
chunks for testing during the classification process not explored here. For this purpose, the limited
USSL method used for the shown results will be referred to as USSL-GG.
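The chunk-by-chunk ensemble maintenance just described can be sketched as follows. This is hypothetical Python with a toy stand-in model: the real USSL models are quantized dictionaries, and the accuracy scoring here is simplified to agreement with the majority vote.

```python
from collections import Counter

def build_model(chunk):
    """Toy model: remember the commands seen in the chunk (a stand-in for a
    quantized dictionary); anything unseen is flagged as an anomaly."""
    return set(chunk)

def classify(model, command):
    return command not in model            # True -> anomaly

def gg_step(ensemble, new_chunk, ensemble_size=3):
    """One 'Grow as you Go' step: the current models vote on each instance,
    the majority is taken as ground truth, a model is built from the new
    chunk, and the least accurate model is discarded."""
    decisions = []
    for cmd in new_chunk:
        votes = [classify(m, cmd) for m in ensemble]
        decisions.append(Counter(votes).most_common(1)[0][0])
    candidates = ensemble + [build_model(new_chunk)]
    def agreement(m):                      # accuracy against the majority vote
        return sum(classify(m, c) == d for c, d in zip(new_chunk, decisions))
    candidates.sort(key=agreement)
    return candidates[-ensemble_size:], decisions   # drop the weakest model

# Usage: ensemble built from the first three chunks, then updated at chunk 4.
chunks = [["ls", "cd", "vi"], ["ls", "cat"], ["cd", "vi"], ["ls", "rm", "cd"]]
ensemble = [build_model(c) for c in chunks[:3]]
ensemble, decisions = gg_step(ensemble, chunks[3])
print(decisions)   # per-command anomaly decisions for chunk 4
```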
20.4 RESULTS
We have compared the NB-INC and USSL-GG algorithms on the basis of true positive rate (TPR), false
positive rate (FPR), execution time, accuracy, F1 measure, and F2 measure. TPR and FPR, as measured by

\[ \text{TPR} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (20.3) \]

\[ \text{FPR} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}} \qquad (20.4) \]
are the rates at which the algorithm correctly identifies actual insider threats (TPR) and incorrectly
flags normal data as threats (FPR). For these calculations, a TP is an instance identified as an anomaly
that actually is an anomaly; a TN is an instance identified as normal that is indeed not an anomaly; a
FP is an instance identified as an anomaly that is actually normal data; and a FN is an instance
identified as normal that is actually an anomaly.
Accuracy measures the algorithm's overall ability to classify insider threat and normal instances
correctly. The F1 and F2 measures are weighted metrics calculated from the generic Equation 20.6;
they penalize incorrectly identified insider threats (FP) and actual threats that were not identified
(FN). The F1 measure penalizes FP and FN equally, while the F2 measure penalizes FN much more.
These values are calculated by

\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \qquad (20.5) \]

\[ F_n = \frac{(1 + n^2)\,\text{TP}}{(1 + n^2)\,\text{TP} + n^2\,\text{FN} + \text{FP}} \qquad (20.6) \]
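These metrics are straightforward to compute once the confusion counts have been tallied; a small sketch follows (the counts in the example call are hypothetical, not results from the book's experiments).

```python
def evaluation_metrics(tp, fp, tn, fn):
    """TPR, FPR, accuracy, and F-measures as defined in Equations 20.3-20.6."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    def f_measure(n):
        return ((1 + n**2) * tp) / ((1 + n**2) * tp + (n**2) * fn + fp)
    return {"TPR": tpr, "FPR": fpr, "ACC": accuracy,
            "F1": f_measure(1), "F2": f_measure(2)}

# Example with hypothetical confusion counts.
print(evaluation_metrics(tp=17, fp=20, tn=190, fn=13))
```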
Tables 20.2 through 20.4 show the details of the value comparisons between NB-INC and
USSL-GG for various drift values. Across the board, USSL-GG has lower FPR and runtime and higher
values for every other metric than NB-INC. In other words, USSL-GG runs faster with fewer false threat
identifications while maintaining higher success rates at catching real threats.
The USSL-GG data in Tables 20.2 through 20.4 are for its optimum results at ensemble size 3; why this
size is optimal is shown later in this section. With these optimization results for USSL-GG, we make our
final comparison with NB-INC.
Figures 20.2 and 20.3 show the TPR and FPR, respectively. USSL-GG maintains higher TPR and
lower FPR than NB-INC.
TABLE 20.2
NB-INC versus USSL-GG for Various Drift Values on TPR and FPR
Drift TPR for NB-INC TPR for USSL-GG FPR for NB-INC FPR for USSL-GG
0.000001 0.34 0.49 0.12 0.10
0.00001 0.36 0.58 0.12 0.09
0.0001 0.37 0.51 0.11 0.10
0.001 0.38 0.50 0.11 0.10
Note: Bold numbers in this table represent the best values.
TABLE 20.3
NB-INC versus USSL-GG for Various Drift Values on Accuracy and
Runtime
Drift ACC for NB-INC ACC for USSL-GG Time for NB-INC Time for USSL-GG
0.000001 0.80 0.85 52.0 3.60
0.00001 0.79 0.87 50.8 3.54
0.0001 0.82 0.86 51.0 3.55
0.001 0.81 0.85 53.4 3.60
Note: Bold numbers in this table represent the best run times.
TABLE 20.4
NB-INC versus USSL-GG for Various Drift Values on F1 and F2 Measure
Drift F1 Msr for NB-INC F1 Msr for USSL-GG F2 Msr for NB-INC F2 Msr for USSL-GG
0.000001 0.34 0.44 0.34 0.47
0.00001 0.36 0.50 0.36 0.54
0.0001 0.37 0.45 0.37 0.49
0.001 0.38 0.44 0.38 0.47
Note: Bold numbers in this table represent the best values.
FIGURE 20.2 Comparison between NB-INC and our optimized model USSL-GG in terms of TPR (x-axis: drift; y-axis: TPR).
FIGURE 20.3 Comparison between NB-INC and our optimized model USSL-GG in terms of FPR (x-axis: drift; y-axis: FPR).
FIGURE 20.4 Comparison of USSL-GG across multiple drifts and ensemble sizes in terms of TPR (x-axis: ensemble size; y-axis: TPR; one curve per drift value).
FIGURE 20.5 Comparison of USSL-GG across multiple drifts and ensemble sizes in terms of FPR (x-axis: ensemble size; y-axis: FPR; one curve per drift value).
FIGURE 20.6 Comparison of USSL-GG across multiple drifts and ensemble sizes in terms of accuracy (x-axis: drift; y-axis: accuracy; one curve per ensemble size).
FIGURE 20.7 Comparison of USSL-GG across multiple drifts and ensemble sizes in terms of F1 measure (x-axis: ensemble size; y-axis: F1 measure; one curve per drift value).
FIGURE 20.8 Comparison of USSL-GG across multiple drifts and ensemble sizes in terms of F2 measure (x-axis: ensemble size; y-axis: F2 measure; one curve per drift value).
The F measure values penalize missed insider threats and harmless instances wrongly classified as
threats; they give no direct bonus for how many threats are correctly identified. This is important in
insider threat detection because missing even one insider threat can be very detrimental to the system
being protected. Wrongly classifying harmless instances is less critical, but it adds the burden of
having to double-check detected instances. In Figure 20.8, we show a decrease in F2 measure after
ensemble size 3. This makes USSL-GG most effective at ensemble size 3, because the F2 measure
penalizes FN more than FP, and a low FN is, as stated previously, the most important part of insider
threat detection. Our final optimization of USSL-GG therefore uses ensemble size 3. It is also worth
noting that, with an ensemble size of 3, runtimes are slower than with ensemble sizes 1 or 2, and for
ensemble sizes greater than 3 the runtimes grow exponentially. These greatly increased times are not
desirable, and therefore ensemble sizes such as 4 or 5 are not optimal.
REFERENCES
[GREE88]. S. Greenberg, “Using Unix: Collected Traces of 168 Users,” Research Report 88/333/45,
Department of Computer Science, University of Calgary, Calgary, Canada, 1988. http://grouplab.cpsc.
ucalgary.ca/papers/.
[KRAN12]. P. Kranen, H. Kremer, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer, J. Read, “Stream
Data Mining Using The Moa Framework,” DASFAA, 2, 309–313, 2012.
[MAXI03]. R.A. Maxion, “Masquerade Detection Using Enriched Command Lines,” In Proceedings of IEEE
International Conference on Dependable Systems & Networks (DSN), San Francisco, CA, pp. 5–14, 2003.
21 Scalability Using Big
Data Technologies
21.1 INTRODUCTION
Several of the techniques we have discussed in Part III are computationally intensive. For exam-
ple, the construction of the Lempel–Ziv–Welch (LZW) dictionary and quantized dictionary
(QD) is time-consuming. Therefore, we need to address scalability issues of these algorithms.
One possible solution is to adopt parallel/distributed computing. Here, we would like to exploit
cloud computing based on commodity hardware. Cloud computing is a distributed parallel solu-
tion. For our approach, we utilize a Hadoop- and MapReduce-based framework to facilitate
parallel computing.
This chapter is organized as follows. First, we discuss Hadoop/MapReduce in Section 21.2. Second,
we describe scalable LZW and QD construction algorithms using MapReduce (MR) in Section 21.3.
Finally, we present details of the results of various MR algorithms for the construction of the QD in
Section 21.4. The chapter is summarized in Section 21.5.
FIGURE 21.1 Word count with the MapReduce framework: each input line is passed to an individual mapper, which splits it and emits (word, 1) key–value pairs; after sort and shuffle, reducers aggregate the pairs per key to produce the final counts (Apple, 4; Grapes, 1; Mango, 2; Orange, 2; Plum, 3).
Here, we have a simple word count program using the MapReduce framework (see Algorithm 21.1).
The mapper takes the document ID as the key and the content of the document as the value. For each
term appearing in the document, the mapper emits (term, 1) as an intermediate key–value pair (see
line 5 in Algorithm 21.1). The outputs are sorted by key and then partitioned per reducer. The reducer
emits each distinct word as the key and its frequency count in the documents as the value (see line 12
in Algorithm 21.1).
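For illustration, the same word count can be written as a pair of Hadoop Streaming scripts in Python. This is a hedged sketch rather than the book's Algorithm 21.1; it relies only on the fact that Hadoop Streaming feeds the mapper output to the reducer sorted by key.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- intended to serve as both the mapper and the
# reducer script of a Hadoop Streaming job (mode selected by the argument).
import sys

def mapper():
    """Emit an intermediate (term, 1) pair for every term on every input line."""
    for line in sys.stdin:
        for term in line.split():
            print(f"{term}\t1")

def reducer():
    """Sum the counts per term; input arrives sorted by key."""
    current, total = None, 0
    for line in sys.stdin:
        term, count = line.rsplit("\t", 1)
        if term != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = term, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if mode == "map" else reducer()
```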
FIGURE 21.2 Approaches for scalable LZW and quantized dictionary construction using MapReduce jobs.
FIGURE 21.3 First MapReduce job for scalable LZW construction in the 2MRJ approach.
FIGURE 21.4 Second MapReduce job for quantized dictionary construction in the 2MRJ approach.
In the 2MRJ approach, the first MapReduce job constructs the LZW dictionary (Figure 21.3) and the
second is dedicated to QD construction (Figure 21.4). In the first MR job, the mapper takes a user ID
along with a command sequence as input and generates intermediate (key, value) pairs having the form
((userid, css), 1). Note that css is a pattern, that is, a command subsequence. In the reduce phase,
the intermediate key (userid, css) is the input. Here, keys are grouped together and the values for
the same key (pattern counts) are added. For example, suppose a particular user 1 has the command
sequence “liftlift.” The map phase emits ((u1, li), 1), ((u1, lif), 1), and so on, as the intermediate
key–value pairs (see the middle portion of Figure 21.3). Recall that the same intermediate key will go
to a particular reducer. Hence, a particular user ID along with a pattern/css, that is, a key, will arrive
at the same reducer. The reducer then emits (userid, css) as the key, and the value is the aggregated
count of how many times the pattern appears in the command sequence for that user (see the bottom
portion of Figure 21.3).
Algorithm 21.2 presents pseudocode for LZW dictionary construction using a MapReduce job.
In Algorithm 21.2, the input file is processed line by line. Each line has two entries, namely gname
(userid) and a command sequence (cseq). Next, the mapper takes gname (userid) as the key and the values
will be command sequences for that user. In mapper, we will look for patterns having length 2, 3, and
so on. Here, we will check whether patterns exist in the dictionary (line 6). If the pattern does not
exist in the dictionary, we simply add that in the dictionary (line 7), and emit intermediate key–value
pairs (line 8). Here, the keys will be composite having gname and pattern. The value will be the fre-
quency count 1. At lines 9 and 10, we increment pointer so that we can look for patterns in new com-
mand sequences (cseq). If the pattern is in the dictionary, we simply emit at line 12 and cseq’s end
pointer is incremented. By not incrementing cseq’s start pointer, we will look for superset patterns.
The combiner runs from line 15 to 20 in Algorithm 21.2. The combiner is an optional step that acts as
a “mini reducer.” For the same user, the same pattern may be emitted multiple times with the frequency
count 1; in the combiner, we aggregate these at line 18. Finally, at line 20, we emit the composite key
(gname and pattern) and the aggregated frequency count. The combiner helps us avoid unnecessary
communication costs and therefore improves processing time. Recall that the combiner is optional; it
may run zero, one, or many times. Hence, the signature of the combiner method (its input/output
parameters) needs to match the output signature of the mappers and the input signature of the reducers.
At the reducer, from line 21 to 26, aggregation is carried out as in the combiner.
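A hedged Python sketch of the mapper and combiner just described follows. It is modeled on Algorithm 21.2 but is not a verbatim transcription; the input format gname<TAB>cseq and the treatment of each command as a single encoded character are assumptions.

```python
import sys
from collections import defaultdict

def lzw_mapper(lines):
    """Emit ((gname, pattern), 1) for every pattern discovered LZW-style:
    extend the current pattern while it is already in the per-user dictionary,
    and add it to the dictionary the first time it is seen."""
    for line in lines:
        gname, cseq = line.rstrip("\n").split("\t", 1)
        dictionary, start, end = set(), 0, 2
        while end <= len(cseq):
            pattern = cseq[start:end]
            if pattern not in dictionary:
                dictionary.add(pattern)
                yield (gname, pattern), 1
                start, end = end, end + 2    # begin a fresh subsequence
            else:
                yield (gname, pattern), 1
                end += 1                      # look for a superset pattern

def combiner(pairs):
    """Local aggregation ("mini reducer"): sum the counts of identical keys."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts.items()

if __name__ == "__main__":
    for (gname, pattern), count in combiner(lzw_mapper(sys.stdin)):
        print(f"{gname},{pattern}\t{count}")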
In the second MR job, quantization of the dictionary is carried out (see Figure 21.4). The mapper
performs a simple transformation by generating a key based on the user ID and a value based on the
pattern and its frequency. In Algorithm 21.3, the mapper takes each line as input from the file produced
by the first MapReduce job. Here, mapper will take input (userid, pattern) as key and frequency as
value. Mapper emits the intermediate key–value pair where the key will be user ID and the value
will be concatenation of pattern and frequency (see middle portion of Figure 21.4 and see from line
4 to 7 in Algorithm 21.3). All patterns of a particular user and corresponding frequency will arrive
at the same reducer. The reducer will conduct all pairwise edit distance calculation among all pat-
terns for a particular user. Finally, user ID as key and longest frequent patterns as the value will be
emitted by the reducer (see bottom portion of Figure 21.4).
At the reducer, each user (gname) is the input key, and the list of values contains the user's patterns
and their frequency counts. Here, compression of the patterns is carried out for that user. Recall that
some patterns will be pruned using edit distance. For a user, each pattern is stored in a Hashmap, H.
Each new entry in H has the pattern as the key and the frequency count 1 as the value. For an existing
pattern in the dictionary, we simply update the frequency count (line 11). At line 13, the dictionary is
quantized and H is updated accordingly. Finally, from the QD, the patterns along with their frequency
counts are emitted as values with gname as the key (at line 14).
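The reducer of this second job can be sketched as follows. This is hypothetical Python: the edit-distance threshold and the rule for folding a pruned pattern into a longer one are illustrative stand-ins, not the exact quantization procedure of Chapter 19.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def quantize_reducer(gname, pattern_counts, threshold=1):
    """For one user, fold each pattern's count into a longer pattern within
    the edit-distance threshold, keeping the longest frequent patterns."""
    kept = {}
    for pattern, count in sorted(pattern_counts, key=lambda pc: -len(pc[0])):
        for rep in kept:
            if edit_distance(pattern, rep) <= threshold:
                kept[rep] += count           # pruned into an existing pattern
                break
        else:
            kept[pattern] = count
    for pattern, count in kept.items():
        yield gname, (pattern, count)

# Hypothetical usage for user U1:
for key, value in quantize_reducer("U1", [("lif", 2), ("lift", 2), ("li", 2)]):
    print(key, value)
```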
Each MapReduce job involves the following steps, several of which incur expensive disk and network operations:
1. The executable file is moved from the client machine to the Hadoop JobTracker [JOB].
2. The JobTracker determines TaskTrackers [TASK] that will execute the job.
3. The Executable file is distributed to the TaskTrackers over the network.
4. Map processes initiate reading data from HDFS.
5. Map outputs are written to local disks.
6. Map outputs are read from the disks and shuffled (transferred over the network to the TaskTrackers).
7. Reduce processes initiate reading the input from local disks.
8. Reduce outputs are written to disks.
Therefore, if we can reduce the number of jobs, we can avoid expensive disk operations and net-
work transfers. That is the reason we prefer 1MRJ over 2MRJ.
FIGURE 21.5 1MRJ: one MR job approach for scalable LZW and quantized dictionary construction.
The mapper will emit the user ID as the key and a pattern as the value (see Algorithm 21.1). Recall that
in the mapper, a partial LZW operation is completed. The same user ID will arrive at the same reducer
since the user ID is the intermediate key. For that user ID, a reducer will have a list of patterns, and the
incomplete LZW operation is completed in the reducer. In addition, the full quantization operation will
be implemented as described in Section 19.3. Parallelization will be achieved at the user level (interuser parallelization)
instead of within users (intrauser parallelization). In mapper, parallelization will be carried out by
dividing large files into a number of chunks and processing a certain number of files in parallel.
Algorithm 21.1 illustrates the idea. The input file is processed line by line. Each line has two
entries, namely gname (userid) and a command sequence (cseq). Next, the mapper will take gname
(userid) as the key, and values will be command sequences for that user. In mapper, we will look for
patterns having length 2, 3, and so on. Here, we will check whether patterns exist in the dictionary
(line 6). If the pattern does not exist in the dictionary, we simply add that in the dictionary (line 7)
and emit intermediate key–value pairs (line 8) having keys as gname and values as patterns with
length 2, 3, and so on. At lines 9 and 10, we increment pointer so that we can look for patterns in new
command sequences (cseq). If the pattern is in the dictionary, we simply emit at line 12 and cseq’s
end pointer is incremented so that we can look for superset command sequence.
At the reducer, each user (gname) will be the input key and the list of values will be patterns. Here,
compression of the patterns will be carried out for that user. Recall that some patterns will be pruned
using edit distance. For a user, each pattern will be stored in a Hashmap, H. Each new entry in H will
have the pattern as the key and a frequency count as the value. For an existing pattern in the dictionary,
we will simply update the frequency count (line 18). At line 20, the dictionary will be quantized, and
H will be updated accordingly. Now, from the QD, all distinct patterns from H will be emitted as
values along with the key gname.
Each file contains the commands used by one user over several weeks. To obtain big data, we first
replicated the user files randomly so that we had a total of 4328 users with one command file per
user, for a total size of 430 MB. Next, we gave these files as input to our program (written in Python),
which assigned a unique Unicode character to each distinct command provided by all users. The
resulting output file for all users is 15.5 MB; we dubbed this the original data (OD).
Finally, we replicated this data 12 times for each user and ended up with a 187-MB input file, which
was given as input to the MapReduce job for LZW and compression. We dubbed this the duplicate
big data (DBD).
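The encoding step can be sketched as follows. This is a hypothetical reconstruction of the Python preprocessing program mentioned above; the file layout and the choice of starting code point are assumptions.

```python
import glob

def encode_command_files(paths, start_codepoint=0x4E00):
    """Assign one Unicode character to every distinct command across all users
    and rewrite each user's command sequence as a compact encoded string."""
    code = {}                      # command -> single Unicode character
    encoded = {}                   # user file -> encoded command string
    for path in paths:
        with open(path, encoding="utf-8") as f:
            commands = f.read().split()
        chars = []
        for cmd in commands:
            if cmd not in code:
                code[cmd] = chr(start_codepoint + len(code))
            chars.append(code[cmd])
        encoded[path] = "".join(chars)
    return encoded, code

# Hypothetical layout: one command file per user under users/.
encoded, code = encode_command_files(glob.glob("users/*.txt"))
```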
21.4.3.1 On OD Dataset
We have compared our approaches, namely 2MRJ and 1MRJ, on the OD dataset. Here, we vary the
number of reducers and fix the number of mappers (e.g., HDFS block size equal to 64 MB). In the case
of 2MRJ, we vary the number of reducers only in the second job's reduce phase, not in the first
MapReduce job. 1MRJ outperforms 2MRJ in terms of processing time for a fixed number of reducers,
except in the first case (number of reducers equal to 1), where parallelization is limited at the reducer
phase. Table 21.1 illustrates this. For example, with nine reducers, the total time taken is 3:47 and 2:54
(min:s) for the 2MRJ and 1MRJ approaches, respectively.
TABLE 21.1
Time Performance of 2MRJ vs. 1MRJ for Varying Number of Reducers

# of Reducers   Time for 2MRJ (M:S)   Time for 1MRJ (M:S)
1 13.5 16.5
2 9.25 9.00
3 6.3 5.37
4 5.45 5.25
5 5.21 4.47
6 4.5 4.20
7 4.09 3.37
8 3.52 3.04
9 3.47 2.54
10 3.38 2.47
11 3.24 2.48
12 3.15 2.46
TABLE 21.2
Details of LZW Dictionary Construction and Quantization Using MapReduce in 2MRJ on OD Dataset

Description     Size/Entries in Second Job   Size/Entries in First Job
Map Input       95.75 MB (size)              15.498 MB (size)
Map Output      4,575,120 (entries)          6,537,040 (entries)
Reduce Input    1,753,590 (entries)          4,575,120 (entries)
Reduce Output   37.48 MB (size)              95.75 MB (size)
With regard to 2MRJ case, Table 21.2 presents input/output statistics of both MapReduce jobs. For
example, for first map reduce job mapper emits 6,537,040 intermediate key–value pairs and reducer
emits 95.75-MB output. This 95.75 MB will be the input for mapper for the second MapReduce job.
Here, we will show how the HDFS block size will have an impact on the LZW dictionary con-
struction in the 2MRJ case. First, we vary the HDFS block size that will control the number of map-
pers. With a 64-MB HDFS block size and a 15.5-MB input size, the number of mappers equals 1.
For a 4-MB HDFS block size, the number of mappers equals 4. Here, we assume that the input file
split size equals the HDFS block size. The smaller HDFS block size (smaller file split size) increases
performance (reduces time). More mappers will be run in various nodes in parallel.
Table 21.3 presents total time taken by mapper (part of first MapReduce job) in 2MRJ case on the
OD dataset. Here, we have varied partition size for LZW dictionary construction. For a 15.498-MB
input file size with an 8-MB partition block size, MapReduce execution framework used two mappers.
TABLE 21.3
Time Performance of Mapper for LZW Dictionary
Construction with Varying Partition Size in 2MRJ
Partition Block Size (MB)   Map Time (s)   No. of Mappers
1 31.3 15
2 35.09 8
3 38.06 5
4 36.06 4
5 41.01 3
6 41.03 3
7 41.01 3
8 55.0 2
64 53.5 1
TABLE 21.4
Time Performance of 1MRJ for Varying Reducer and HDFS
Block Size on DBD
No of Reducer 64 MB 40 MB 20 MB 10 MB
1 39:24 27:20 23:40 24:58
2 17:36 16:11 13:09 14:53
3 15:54 11:25 9:54 9:12
4 13:12 11:27 8:17 7:41
5 13:06 10:29 7:53 6:53
6 12:05 9:15 6:47 6:05
7 11:18 8:00 6:05 6:04
8 10:29 7:58 5:58 5:04
9 10:08 7:41 5:29 4:38
10 11:15 7:43 5:30 4:42
11 10:40 7:30 4:58 4:41
12 11:04 8:21 4:55 3:46
Table 21.4 validates our claim: with more reducers running in parallel, we can run the quantization/compression
algorithms for various users in parallel. Recall that in 1MRJ reducer will get each distinct user as
key and values will be LZW dictionary pattern. Let us assume that we have 10 distinct users and
their corresponding patterns. For compression with one reducer, compression for 10 user patterns
will be carried out in a single reducer. On the other hand for five reducers, it is expected that each
reducer will get two users’ patterns. Consequently, 5 reducers will run in parallel and each reducer
will execute the compression algorithm for 2 users serially instead of 10. Therefore, with an increasing
number of reducers, performance improves (time decreases).
Now, we will show how the number of mappers will affect the total time taken in 1MRJ case.
The number of mappers is usually controlled by the number of HDFS blocks (dfs.block.size) in
the input files. The number of HDFS blocks in the input file is determined by HDFS block size.
Therefore, people adjust their HDFS block size to adjust the number of maps.
FIGURE 21.6 Time taken for varying HDFS block sizes (10, 20, 40, 64 MB) in 1MRJ, with one curve per number of reducers (x-axis: HDFS block size; y-axis: time).
FIGURE 21.7 Time taken for a varying number of reducers in 1MRJ, with one curve per HDFS block size (10, 20, 40, 64 MB) (x-axis: number of reducers; y-axis: time).
Setting the number of map tasks is not as simple as setting the number of reduce tasks. Here, first we
determine whether the input file is splittable (isSplitable). Next, three variables, mapred.min.split.size,
mapred.max.split.size, and dfs.block.size, determine the actual split size. By default, the min split size
is 0, the max split size is Long.MAX, and the block size is 64 MB. For the actual split size, minSplitSize
and blockSize set the lower bound, and blockSize and maxSplitSize together set the upper bound. The
actual split size is computed from these three values as follows.
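A Python rendering of this computation is shown below for illustration; the real implementation is the Java method inside Hadoop's FileInputFormat, and expressing it in Python here is purely a convenience.

```python
def compute_split_size(block_size, min_split_size=0, max_split_size=2**63 - 1):
    """Standard Hadoop input-split computation: the block size bounded below
    by mapred.min.split.size and above by mapred.max.split.size."""
    return max(min_split_size, min(max_split_size, block_size))

# With the defaults used here (min = 0, max = Long.MAX), the split size simply
# equals the HDFS block size, e.g. 64 MB, so a 190-MB file yields three splits.
print(compute_split_size(64 * 1024 * 1024) // (1024 * 1024), "MB")
```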
In our case, we use a min split size of 0, a max split size of Long.MAX, and block sizes varying from
10 to 64 MB. Hence, the actual split size is controlled by the HDFS block size. For example, the 190-MB
input file with a DFS block size of 64 MB will be split into three splits, two of 64 MB and one of
62 MB. Finally, we will end up with three maps.
In Figure 21.7, we show the impact of HDFS block size on the total time taken for a fixed number
of reducers. Here, X axis represents HDFS block size and Y-axis represents the total time taken for
1MR approach with a fixed number of reducers. We demonstrate that with an increasing number
of HDFS block size, the total time taken will increase gradually for a fixed input file. For example,
with regard to HDFS block size 10, 20, 40, and 64 MB, the total time taken in the 1MRJ approach
was 7.41, 8.17, 11.27, and 13.12 min, respectively, for a fixed number of reducers (=4). On one hand,
when the HDFS block size is 10 MB and the input file is 190 MB, 19 maps run where each map
processes a 10 MB input split. On the other hand, for a HDFS block size = 64 MB, three maps will
be run where each map will process a 64 MB input split. In the former case (19 maps with 10 MB),
each map will process a smaller file and in the latter case (three maps with 64 MB), we process a
larger file which consequently consumes more time. In the former case, more parallelization can be
achieved. In our architecture, >10 mappers can be run in parallel. Hence, for a fixed input file and a
fixed number of reducers, the total time increases with increasing HDFS block size.
In this chapter, we discussed the scalability of our techniques and the issues in designing big data analytic techniques for
insider threat detection.
The work discussed in this chapter is the first step towards big data analytics for handling
massive data streams for insider threat detection. We need to examine the various techniques for
stream mining for insider threat detection that we have discussed and examine the scalability of the
solutions that have been proposed.
REFERENCES
[GREE88]. S. Greenberg, “Using Unix: Collected Traces of 168 Users,” Research Report 88/333/45,
Department of Computer Science, University of Calgary, Calgary, Canada, 1988. http://grouplab.cpsc.
ucalgary.ca/papers/.
[JOB]. http://wiki.apache.org/hadoop/JobTracker.
[MAXI03]. R.A. Maxion, “Masquerade Detection Using Enriched Command Lines,” In Proc. IEEE
International Conference on Dependable Systems & Networks (DSN), pp. 5–14, 2003.
[PARV13]. P. Parveen, B. Thuraisingham, L. Khan, “Map Reduce Guided Scalable Compressed Dictionary
Construction for Repetitive Sequences,” In 9th IEEE International Conference on Collaborative
Computing: Networking, Applications and Worksharing, 2013b, October.
[TASK]. http://wiki.apache.org/hadoop/TaskTracker.
22 Stream Mining and Big Data
for Insider Threat Detection
22.1 INTRODUCTION
Insider threat detection is a very important problem requiring critical attention. This chapter
presents a number of approaches to detect insider threats through augmented unsupervised and
supervised learning techniques on evolving streams. We have considered both sequence and nonse-
quence stream data.
The supervised learning approach to insider threat detection outperformed the unsupervised
learning approach. The supervised method succeeded in identifying all 12 anomalies in the 1998
Lincoln Laboratory Intrusion Detection dataset with zero false negatives (FN) and a lower false
positive (FP) rate than the unsupervised approach.
For unsupervised learning, graph-based anomaly detection (GBAD) ([COOK00], [COOK07],
[EBER07]) is used. However, applying GBAD to the insider threat problem requires an approach,
that is, sufficiently adaptive and efficient so effective models can be built from vast amounts of
evolving data.
In Section 22.2, we will provide a discussion of our approaches. Future work will be discussed
in Section 22.3. This chapter is summarized in Section 22.4.
22.2 DISCUSSION
Our technique combines the power of GBAD and one-class support vector machines (SVMs)
with the adaptiveness of stream mining to achieve effective practical insider threat detection for
unbounded evolving data streams. Increasing the weighted cost of FN increased accuracy and ulti-
mately allowed our approach to perform well. Though FP could be further reduced through more
parameter tuning, our approach accomplished the goal of detecting all insider threats.
We examined the problem of insider threat detection in the context of command sequences and
propose an unsupervised ensemble-based learning approach that can take into account concept
drift. The approach adopts advantages of both compression and incremental learning. A classifier
is typically built and trained using large amount of legitimate data. However, training a classifier is
very expensive, and furthermore, it has problems when the baseline changes as is the case in real-
life networks. We acknowledge this continuously changing feature of legitimate actions and intro-
duce the notion of concept drift to address the changes. The proposed unsupervised learning system
adapts directly to the changes in command sequence data. In addition, to improve accuracy we use
an ensemble of K classifiers instead of a single one. Voting is used, and the ensemble evolves because
classifiers built from more recent data gradually replace those that are outdated. We address an
important problem and propose a novel approach.
For sequence data, our stream-guided sequence learning performed well with a limited number
of FPs as compared to static approaches. This is because the approach adopts advantages from both
compression and ensemble-based learning. In particular, compression offered unsupervised learn-
ing in a manageable manner; on the other hand, ensemble-based learning offered adaptive learn-
ing. The approach was tested on a real command line dataset and shows effectiveness over static
approaches in terms of true positives and FPs.
Compressed/quantized dictionary construction is computationally expensive and does not scale well
with the number of users. Hence, we look for a distributed solution with parallel computing on
commodity hardware. For this, all users' quantized dictionaries are constructed using a MapReduce
framework on Hadoop. A number of approaches are suggested, experimented with on the benchmark
dataset, and discussed. We have shown that with one MapReduce job a quantized dictionary can be
constructed, and we demonstrated its effectiveness over other approaches.
22.3.2 Collusion Attack
During unsupervised learning (see Chapter 17), when we update models, a collusion attack
([ZHAO05], [WANG09]) may take place. In that case, a set of models among the K models will not be
replaced for a while: each time a victim is selected, these colluding models will survive. Recall that
“collusion” is an agreement between two or more models such that they will always agree on the
prediction. In particular, if we have K = 3 models, two models may maintain a secret agreement;
their predictions will be the same and will be used as ground truth. Therefore, the two colluding
models will always survive and never be the victim during a model update. Recall that the learning
is unsupervised and majority voting is taken as ground truth. Hence, we will not be able to catch an
insider attack. Our goal is to identify such a collusion attack. For this, during victim selection, we will
take into account the agreement of models over time. If the agreement among a set of models persists
for a long time, we will choose the victim from among those models.
Spammers post malicious links, send unsolicited messages to legitimate users, and hijack trending
topics. At least 3% of messages can be categorized as spam. We can extend our framework to detect
spam by exploiting anomaly detection.
New authors may appear in a blog. Our goal is to identify these new authors in the stream.
For this, our anomaly detection can be applied. Feature extraction needs to be changed (i.e., to
stylometric features [JAMA12]). We would like to apply our techniques to author attribution
([AKIV12], [KOPP09], [SEKE13]).
• Big Data Stream Infrastructure: Apache S4 [NEUM10], Storm, Cassandra [LAKS10], and
so on, including batch processing Hadoop.
• Big Data Mining Tool: Apache Mahout [OWEN11], MOA [ZLIO11], PEGASUS [KANG11],
GraphLab, SAMOA, and their support for Stream Mining.
While there has been a lot of work on stream mining as well as on insider threat detection, the
approaches discussed in this part are some of the early efforts to apply stream mining for insider
threat detection. We also discussed the need to develop scalable techniques to handle massive data-
sets and provided some directions.
In Part IV of this book, we will discuss our experimental systems on applying big data manage-
ment and analytics for various security applications. These systems will provide a better under-
standing on how the techniques discussed in Parts II and III could be scaled for large datasets.
REFERENCES
[AKIV12]. N. Akiva and M. Koppel, “Identifying Distinct Components of a Multi-Author Document,”
In Proceedings of EISIC, Odense, Denmark, pp. 205–209, 2012.
[ALIP10a]. N. Alipanah, P. Parveen, S. Menezes, L. Khan, S. Seida, B.M. Thuraisingham. “Ontology-Driven
Query Expansion Methods to Facilitate Federated Queries,” SOCA, Perth, Australia, pp. 1–8, 2010.
[ALIP10b]. N. Alipanah, P. Srivastava, P. Parveen, B.M. Thuraisingham, “Ranking Ontologies using Verified
Entities to Facilitate Federated Queries,” In Proceedings of Web Intelligence Conference, pp. 332–337,
2010.
[ALIP11]. N. Alipanah, P. Parveen, L. Khan, and B.M. Thuraisingham. “Ontology-Driven Query Expansion
Using Map/Reduce Framework to Facilitate Federated Queries.” In ICWS, Washington, DC, pp. 712–
713, 2011.
[BARO06]. M. Baron and A. Tartakovsky. “Asymptotic Optimality of Change-Point Detection Schemes in
General Continuous-Time Models,” Sequential Analysis, 25(3), 257–296, 2006.
[CANG06]. J.W. Cangussu and M. Baron, “Automatic Identification of Change Points for the System Testing
Process,” COMPSAC (1), 377–384, 2006.
[COOK00]. D.J. Cook and L.B. Holder “Graph-Based Data Mining,” IEEE Intelligent Systems, 15(2), 32–41,
2000.
[COOK07]. D.J. Cook and L.B. Holder (Eds.) Mining Graph Data. John Wiley & Sons, Inc., Hoboken, NJ,
2007.
[DAVI98]. B.D. Davison and H. Hirsh. “Predicting Sequences of User Actions. Working Notes of The Joint
Workshop on Predicting the Future: AI Approaches to Time Series Analysis,” 15th National Conference
on Artificial Intelligence and Machine, Madison, WI, pp. 5–12, AAAI Press, 1998.
[DOMI01]. P. Domingos and G. Hulten, “Catching up with the Data: Research Issues in Mining Data
Streams,” ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery
(DMKD), Santa Barbara, CA, May 20, 2001. http://www.cs.cornell.edu/johannes/papers/dmkd2001-
papers/p9_kollios.pdf.
[EBER07]. W. Eberle and L.B. Holder, “Mining for Structural Anomalies in Graph-Based Data,”
In Proceedings of International Conference on Data Mining (DMIN), San Jose, CA, pp. 376–389, 2007.
[FAN04]. W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In Proceedings of
ACM SIGKDD, Seattle, WA, pp. 128–137, 2004.
[JAMA12]. A. Jamak, S. Alen, M. Can, “Principal Component Analysis for Authorship Attribution,” Business
Systems Research, 3(2), 49–56, 2012.
[KANG11]. U. Kang, C.E. Tsourakakis, C. Faloutsos, “Pegasus: Mining Peta-Scale Graphs,” Knowledge and
Information Systems, 27(2), 303–325, 2011.
[KHAN02]. L. Khan and F. Luo, “Ontology Construction for Information Selection,” In Proceedings of
ICTAI, Washington, DC, pp. 122–127, 2002.
[KHAN04]. L. Khan, D. McLeod, E.H. Hovy, “Retrieval Effectiveness of an Ontology-Based Model for
Information Selection,” VLDB Journal, 13(1), 71–85, 2004.
[KOPP09]. M. Koppel, J. Schler, S. Argamon, “Computational Methods in Authorship Attribution,” JASIST,
60(1), 9–26, 2009.
[LAKS10]. A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS
Operating Systems Review 44(2), 35–40, 2010.
[MASU10a]. M.M. Masud, Q. Chen, J. Gao, L. Khan, C. Aggarwal, J. Han, B. Thuraisingham, “Addressing
Concept-Evolution in Concept-Drifting Data Streams,” In Proceedings of IEEE International
Conference on Data Mining (ICDM), Sydney, Australia, pp. 929–934, 2010a.
[MASU10b]. M.M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel
Class Detection of Data Streams in a Dynamic Feature Space,” In Proceedings of ECML/PKDD (2),
Barcelona, Spain, pp. 337–352, 2010b.
[MASU10c]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class
Detection in Data Streams with Active Mining,” In Proceedings of PAKDD (2), Barcelona, Spain,
pp. 311–324, 2010.
[MASU11]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class
Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on Knowledge
and Data Engineering (TKDE), 23(6), 859–874, 2011.
[NEUM10]. L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. “S4: Distributed Stream Computing Platform.”
In ICDM Workshops, Sydney, Australia, pp. 170–177, 2010.
[OWEN11]. S. Owen, R. Anil, T. Dunning, E. Friedman, Mahout in Action, Manning Publications, Shelter
Island, NY, 2011.
[PART11]. J. Partyka, P. Parveen, L. Khan, B.M. Thuraisingham, S. Shekhar, “Enhanced Geographically
Typed Semantic Schema Matching,” Journal of Web Semantics, 9(1), 52–70, 2011.
[PARV06]. P. Parveen and B.M. Thuraisingham, “Face Recognition Using Multiple Classifiers,” In Proceedings
of ICTAI, Arlington, VA, pp. 179–186, 2006.
[SEKE13]. S.E. Seker, K. Al-Naami, L. Khan, “Author Attribution on Streaming Data,” Information Reuse
and Integration (IRI) 2013 IEEE 14th International Conference on, San Francisco, CA, pp. 497–503,
2013.
[WANG09]. X. Wang, L. Qian, H. Jiang, “Tolerant Majority-Colluding Attacks For Secure Localization in
Wireless Sensor Networks,” In WiCom ’09: Proceedings of 5th International Conference on Wireless
Communications, Networking and Mobile Computing, 2009, Beijing, China, pp. 1–5, 2009.
[ZHAO05]. H. Zhao, M. Wu, J. Wang, K. Liu, “Forensic Analysis of Nonlinear Collusion Attacks for
Multimedia Fingerprinting,” Image Processing, IEEE Transactions on, 14(5), 646–661, 2005.
[ZLIO11]. I. Zliobaite, A. Bifet, G. Holmes, B. Pfahringer, “MOA Concept Drift Active Learning Strategies
for Streaming Data,” Journal of Machine Learning Research—Proceedings Track, 17, 48–55, 2011.
Conclusion to Part III
Part III, consisting of nine chapters, described stream data analytics techniques with emphasis on
big data for insider threat detection. In particular, both supervised and unsupervised learning meth-
ods for insider threat detection were discussed.
Chapter 14 provided a discussion of our approach to insider threat detection using stream data
analytics and discussed the big data issue with respect to the problem. That is, massive amounts of
stream data are emanating from various devices and we need to analyze this data for insider threat
detection. Chapter 15 described related work in both insider threat detection and stream data mining.
In addition, aspects of the big data issue were also discussed. Chapter 16 described ensemble-based
learning for insider threat detection. In particular, we have described techniques for both super-
vised and unsupervised learning and discussed the issues involved. We believe that ensemble-based
approaches are suited for data streams as they are unbounded. Chapter 17 described the different
classes of learning techniques for nonsequence data. It described exactly how each method arrives
at detecting insider threats and how ensemble models are built, modified, and discarded. First, we
discussed supervised learning in detail and then discussed unsupervised learning. In Chapter 18,
we discussed our testing methodology and experimental results for mining data streams consist-
ing of nonsequence data. We examined various aspects such as false positives, false negatives, and
accuracy. Our results indicate that supervised learning yields better results for certain datasets. In
Chapter 19, we described both supervised and unsupervised learning techniques for mining data
streams for sequence data. Experimental results of the techniques discussed in Chapter 19 were
presented in Chapter 20. In particular, we discussed our datasets and testing methodology as well
as our experimental results. Chapter 21 discussed how big data technologies can be used for stream
mining to handle insider threats. In particular, we examined one of the techniques we have designed
and showed how it can be redesigned using big data technologies. We also discussed our experimen-
tal results. Finally, Chapter 22 concluded with an assessment of the viability of stream mining for
real-world insider threat detection and the relevance to big data aspects.
Now that we have discussed the various aspects of stream data analytics, handling massive data
streams, as well as applying the techniques for insider threat detection, in Part IV we will describe
the various experimental systems we have designed and developed for BDMA and BDSP.
Part IV
Experimental BDMA and BDSP Systems
Introduction to Part IV
Parts II and III focused on stream data analytics with applications in insider threat detection. There
was also a special emphasis on handling massive amounts of data streams and the use of cloud
computing. We described various stream data analytics algorithms and provided our experimental
results. While Parts II and III focused on big data management and analytics (BDMA) and big data
security and privacy (BDSP) with respect to stream data analytics and insider threat detection, in
Part IV we describe several experimental systems we have designed and developed that are related
to BDMA and BDSP. While these systems have also been discussed in our previous book, in Part
IV, we will emphasize extending our systems to handle big data.
Part IV, consisting of six chapters, describes the various experimental systems we have designed
and developed. In Chapter 23, we discuss a query processing system that functions in the cloud and
manages a large number of RDF triples. These RDF triples can be used to represent big data appli-
cations such as social networks. Chapter 24 describes a cloud-based system called InXite that is
designed to detect evolving patterns and trends in streaming data for security applications. Chapter
25 describes our cloud-centric assured information sharing system that addresses the information
sharing requirements of various organizations including social media users in a secure manner. In
Chapter 26, we describe the design and implementation of a secure information integration frame-
work that uses the Intelligence Community’s data integration framework Blackbook as well as the
Amazon cloud. Chapter 27 describes one of our data mining techniques that is dedicated to the
automated generation of signatures to defend against malware attacks. Due to the need for near
real-time performance of the malware detection tools, we have implemented our data mining tool
in the cloud. This implementation technique shows how BDMA techniques can be applied for cyber
security problems. Finally, in Chapter 28, we have described the design and implementation of an
inference controller for provenance data. We have also argued that there is a need for the inference
controller to manage massive amounts of data as the knowledge base could grow rapidly as it has to
store the data, the metadata, the release data as well as real-world knowledge.
While we have designed and developed several additional systems for BDMA and BDSP, we
believe that the systems we have described in Part IV provide a representative sample of the systems
we have developed.
23 Cloud Query Processing
System for Big Data
Management
23.1 INTRODUCTION
As stated in some of the earlier chapters, cloud computing is an emerging paradigm in the informa-
tion technology and data processing communities. Enterprises utilize cloud computing services to
outsource data maintenance that can result in significant financial benefits. Businesses store and
access data at remote locations in the “cloud.” As the popularity of cloud computing grows, the
service providers face ever increasing challenges. They have to maintain huge quantities of het-
erogeneous data while providing efficient information retrieval. Thus, the key emphasis for cloud
computing solutions is scalability and query efficiency. In other words, cloud computing is a critical
technology for big data management and analytics.
Semantic web technologies are being developed to present data in a standardized way such that
such data can be retrieved and understood by both humans and machines. Historically, webpages
are published in plain HTML (Hypertext Markup Language) files that are not suitable for reason-
ing. Instead, the machine treats these HTML files as a bag of keywords. Researchers are developing
semantic web technologies that have been standardized to address such inadequacies. The most
prominent standards are Resource Description Framework (RDF) [W3b], SPARQL Protocol and
RDF Query Language [W3C] (SPARQL). RDF is the standard for storing and representing data and
SPARQL is a query language to retrieve data from an RDF store. RDF is being used extensively to
represent social networks. Cloud computing systems can utilize the power of these semantic web
technologies to represent and manage the social networks so that the users of these networks have
the capability to efficiently store and retrieve data for data-intensive applications.
Semantic web technologies could be especially useful for maintaining data in the cloud. Semantic
web-based social networks provide the ability to specify and query heterogeneous data in a stan-
dardized manner. Moreover, using the Web Ontology Language (OWL), ontologies, different sche-
mas, classes, data types, and relationships can be specified without sacrificing the standard RDF/
SPARQL interface. Conversely, cloud computing solutions could be of great benefit to the semantic
web-based big data community, such as the social network community. Semantic web datasets are
growing exponentially. In the web domain, scalability is paramount. Yet, high speed response time
is also vital in the web community. We believe that the cloud computing paradigm offers a solution
that can achieve both of these goals.
Existing commercial tools and technologies do not scale well in cloud computing settings.
Researchers have started to focus on these problems recently. They are proposing systems built
from scratch. In [WANG10], researchers propose an indexing scheme for a new distributed database
[COMP] that can be used as a cloud system. When it comes to semantic web data such as RDF, we
are faced with similar challenges. With storage becoming cheaper and the need to store and retrieve
large amounts of data, developing systems to handle billions of RDF triples requiring terabytes of
disk space is no longer a distant prospect. Researchers are already working on billions of triples
([NEWM08], [ROHL07]). Competitions are being organized to encourage researchers to build
efficient repositories [CHAL]. At present, there are just a few frameworks (e.g., RDF-3X [NEUM08],
Jena [CARR04], Sesame [OPEN], BigOWLIM [KIRY05]) for semantic web technologies, and these
frameworks have limitations for large RDF graphs. Therefore, storing a large number of RDF triples
and efficiently querying them is a challenging and important problem.

FIGURE 23.1 Concepts discussed in this chapter: the experimental cloud query processing system; our approach (Hadoop/MapReduce, RDF, and SPARQL queries); system and operational architectures; the SPARQL query optimizer; and experiments and results.
In this chapter, we discuss a query-processing system that functions in the cloud and manages a
large number of RDF triples. These RDF triples can be used to represent big data applications such
as social networks as discussed in our previous book [THUR15]. The organization of this chapter
is as follows. Our approach is discussed in Section 23.2. In Section 23.3, we discuss related work.
In Section 23.4, we discuss our system architecture. In Section 23.5, we discuss how we answer an
SPARQL query. In Section 23.6, we present the results of our experiments. In Section 23.7, we dis-
cuss our work on security policy enforcement that was built on top of our prototype system. Finally,
in Section 23.8, we draw some conclusions and discuss areas we have identified for improvement in
the future. Key concepts discussed in this chapter are illustrated in Figure 23.1. A more detailed dis-
cussion of the concepts, architectures, and experiments is provided in [HUSA11a] and [HUSA11b].
Since semantic web technologies can be used to model big data systems such as social network
systems, our query-processing system can be utilized to query social networks and related big data
systems.
Hadoop [HADOa] is a distributed file system where files can be saved with replication.
It is an ideal candidate for building a storage system. Hadoop features high fault tolerance and
great reliability. In addition, it also contains an implementation of the MapReduce [DEAN04]
programming model, a functional programming model that is suitable for the parallel process-
ing of large amounts of data. Through partitioning data into a number of independent chunks,
MapReduce processes run against these chunks, making parallelization simpler. Moreover, the
MapReduce programming model facilitates and simplifies the task of joining multiple triple
patterns.
In this chapter, we will describe a schema to store RDF data in Hadoop, and we will detail a
solution to process queries against these data. In the preprocessing stage, we process RDF data
and populate files in the distributed file system. This process includes partitioning and organizing
the data files, and executing dictionary encoding. We will then detail a query engine for informa-
tion retrieval. We will specify exactly how SPARQL queries will be satisfied using MapReduce
programming. Specifically, we must determine the Hadoop “jobs” that will be executed to solve the
query. We will present a greedy algorithm that produces a query plan with the minimal number of
Hadoop jobs. This is an approximation algorithm using heuristics, but we will prove that the worst
case has a reasonable upper bound. Finally, we will utilize two standard benchmark datasets to run
experiments. We will present results for the dataset ranging from 0.1 to over 6.6 billion triples. We
will show that our solution is exceptionally scalable. We will show that our solution outperforms
leading state-of-the-art semantic web repositories using standard benchmark queries on very large
datasets. Our contributions are listed in the following, and illustrated in Figure 23.2. More details
are given in [HUSA11a].
1. We designed a storage scheme to store RDF data in Hadoop distributed file system (HDFS)
[HADOb].
2. We developed an algorithm that is guaranteed to provide a query plan whose cost is
bounded by the log of the total number of variables in the given SPARQL query. It uses
summary statistics for estimating join selectivity to break ties.
3. We built a framework that is highly scalable and fault-tolerant and supports data-intensive
query processing.
4. We demonstrated that our approach performs better than Jena for all queries, and better than
BigOWLIM and RDF-3X for complex queries having large result sets.
MapReduce is widely used for web indexing, searches, and data mining. In this section, we will first investigate research related to
MapReduce. Next, we will discuss works related to the semantic web.
Google uses MapReduce for web indexing, data storage, and social networking [CHAN06].
Yahoo! uses MapReduce extensively in its data analysis tasks [OLST08]. IBM has successfully
experimented with a scale-up scale-out search framework using MapReduce technology [MORE07].
In [SISM10], they have reported how they integrated Hadoop and System R. Teradata did a similar
work by integrating Hadoop with a parallel DBMS [XU10].
Researchers have used MapReduce to scale up classifiers for mining petabytes of data [MORE08].
They have worked on data distribution and partitioning for data mining, and have applied three
data mining algorithms to test the performance. Data mining algorithms are being rewritten in
different forms to take advantage of MapReduce technology. In [CHU06], researchers rewrite well-
known machine learning algorithms to take advantage of multicore machines by leveraging the
MapReduce programming paradigm. Another area where this technology is successfully being
used is simulation [MCNA07]. In [ABOU09], researchers reported an interesting idea of combin-
ing MapReduce with existing relational database techniques. These works differ from our research
in that we use MapReduce for semantic web technologies. Our focus is on developing a scalable
solution for storing RDF data and retrieving them by SPARQL queries.
In the semantic web arena, there has not been much work done with MapReduce technology.
We have found two related projects: BioMANTA [ITEE] project and Scalable, High-Performance,
Robust and Distributed (SHARD) [CLOU]. BioMANTA proposes extensions to RDF molecules
[DING05] and implements a MapReduce-based molecule store [NEWM08]. They use MapReduce
to answer the queries. They have queried a maximum of four million triples. Our work differs in the
following ways: first, we have queried one billion triples. Second, we have devised a storage schema
that is tailored to improve query execution performance for RDF data. We store RDF triples in files
based on the predicate of the triple and the type of the object. Finally, we also have an algorithm
to determine a query-processing plan whose cost is bounded by the log of the total number of vari-
ables in the given SPARQL query. By using this, we can determine the input files of a job and the
order in which they should be run. To the best of our knowledge, we are the first ones to come up
with a storage schema for RDF data using flat files in HDFS and a MapReduce job determination
algorithm to answer an SPARQL query.
SHARD is an RDF triple store using the Hadoop Cloudera distribution. This project shows initial
results demonstrating Hadoop’s ability to improve scalability for RDF datasets. However, SHARD
stores its data only in a triple store schema. It currently does no query planning or reordering, and its
query processor will not minimize the number of Hadoop jobs. There has been significant research
into semantic web repositories with particular emphasis on query efficiency and scalability. In fact,
there are too many such repositories to fairly evaluate and discuss each. Therefore, we will pay
attention to semantic web repositories that are open source or available for download and which
have received favorable recognition in the semantic web and database communities.
In [ABAD09b] and [ABAD07], researchers reported a vertically partitioned DBMS for storage and
retrieval of RDF data. Their solution is a schema with a two-column table for each predicate. Their
schema is then implemented on top of a column-store relational database such as CStore [STON05]
or MonetDB [BONC06]. They observed performance improvement with their scheme over tradi-
tional relational database schemes. We have leveraged this technology in our predicate-based parti-
tioning within the MapReduce framework. However, in the vertical partitioning research, only small
databases (<100 million) were used. Several papers [SIDI08], [MCGL09], and [WEIS08] have shown
that vertical partitioning’s performance is drastically reduced as the dataset size is increased.
Jena [CARR04] is a semantic web framework. True to its framework design, it allows
integration of multiple solutions for persistence. It also supports inference through the development
of reasoners. However, Jena is limited to a triple store schema. In other words, all data are stored in
a single three column table. Jena has very poor query performance for large datasets. Furthermore,
any change to the dataset requires complete recalculation of the inferred triples.
BigOWLIM [KIRY05] is among the fastest and most scalable semantic web frameworks
available. However, it is not as scalable as our framework and requires very high-end, costly
machines: loading large datasets demands a large amount of main memory and a long loading
time. As our experiments show, it does not perform well when there is no bound object
in a query. However, the performance of our framework is not affected in such a case.
RDF-3X [NEUM08] is considered the fastest existing semantic web repository. In other words,
it has the fastest query times. RDF-3X uses histograms, summary statistics, and query optimization
to enable high performance semantic web queries. As a result, RDF-3X is generally able to outper-
form any other solution for queries with bound objects and aggregate queries. However, RDF-3X’s
performance degrades exponentially for unbound queries, and queries with even simple joins if
the selectivity factor is low. This becomes increasingly relevant for inference queries that gener-
ally require unions of subqueries with unbound objects. Our experiments show that RDF-3X is not
only slower for such queries, but also it often aborts and cannot complete the query. For example,
consider the simple query “Select all students.” This query in LUBM requires us to select all gradu-
ate students, select all undergraduate students, and union the results together. However, there are a
very large number of results in this union. While both subqueries complete easily, the union will
abort in RDF-3X for LUBM (30,000) with 3.3 billion triples.
RDF Knowledge Base (RDFKB) [MCGL10] is a semantic web repository using a relational
database schema built upon bit vectors. RDFKB achieves better query performance than RDF-3X
or vertical partitioning. However, RDFKB aims to provide knowledge base functions such as infer-
ence by forward chaining, uncertainty reasoning, and ontology alignment. RDFKB prioritizes these
goals ahead of scalability. RDFKB is not able to load LUBM (30,000) with three billion triples, so
it cannot compete with our solution for scalability.
Hexastore [WEIS08] and BitMat [ATRE08] are main memory data structures optimized for
RDF indexing. These solutions may achieve exceptional performance on hot runs, but they are not
optimized for cold runs from persistent storage. Furthermore, their scalability is directly associated
with the quantity of main memory (RAM) available. These products are not available for testing and
evaluation.
In our previous work [HUSA09], [HUSA10], we proposed a greedy and an exhaustive search
algorithm to generate a query-processing plan. However, the exhaustive search algorithm was
expensive and the greedy one was not bounded and its theoretical complexity was not defined.
In this chapter, we present a new greedy algorithm with an upper bound. Also, we did observe
scenarios in which our old greedy algorithm failed to generate the optimal plan. The new algorithm
is able to obtain the optimal plan in each of these cases.
23.4 ARCHITECTURE
Our system architecture is illustrated in Figure 23.3. It essentially consists of an SPARQL query
optimizer and an RDF data manager implemented in the cloud. The operational architecture is
illustrated in Figure 23.4. It consists of two components. The upper part of Figure 23.4 depicts
the data preprocessing component and the lower part shows the query answering one. We have
three subcomponents for data generation and preprocessing. We convert RDF/XML [W3f] to
N-triples [W3a] serialization format using our N-triples converter component. The predicate split
(PS) component takes the N-triples data and splits it into predicate files. The predicate files are then
fed into the Predicate Object Split (POS) component that splits the predicate files into smaller files
based on the type of objects. These steps are described below.
Data Generation and Storage: For our experiments, we use the LUBM [GUO05] dataset. It is
a benchmark dataset designed to enable researchers to evaluate a semantic web repository’s per-
formance [GUO04]. The LUBM data generator generates data in RDF/XML serialization format.
This format is not suitable for our purpose because we store data in HDFS as flat files and so to
retrieve even a single triple, we would need to parse the entire file. Therefore, we convert the data
to N-triples for storage because, in that format, a complete RDF triple (Subject, Predicate, and
Object) appears on one line of a file, which is very convenient for MapReduce jobs. The processing
steps required to get the data into our intended format are described in the following sections.
File Organization: We do not store the data in a single file because, in a Hadoop and MapReduce
framework, a file is the smallest unit of input to a MapReduce job and in the absence of caching, a
file is always read from the disk. If we have all the data in one file, the whole file will be input to
jobs for each query. Instead, we divide the data into multiple smaller files. The splitting is done in
two steps, which we discuss in the following sections.
Predicate Split: In the first step, we divide the data according to the predicates. This division
immediately enables us to cut down the search space for any SPARQL query that does not have a
variable predicate. For such a query, we can just pick a file for each predicate and run the query
on those files only. For simplicity, we name the files with predicates, for example, all the triples
containing a predicate p1:pred go into a file named p1-pred. However, in case we have a variable
predicate in a triple pattern [W3e] and if we cannot determine the type of the object, we have to
consider all files. If we can determine the type of the object, then we consider all files having that
type of object. We discuss more on this in Section 23.5. In real-world RDF datasets, the number of
distinct predicates is in general not a large number [STOC08]. However, there are datasets having
many predicates. Our system performance does not vary in such a case because we just select files
related to the predicates specified in an SPARQL query.
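To make the predicate split concrete, the following minimal sketch routes each N-triple to a file named after its predicate. It is only an illustration of the idea: the class name, the use of local files, and the single-pass loop are simplifications we assume here, whereas the framework itself performs this step as MapReduce jobs over HDFS.

import java.io.*;
import java.util.*;

// Illustrative sketch of the predicate split (PS) step: every triple is routed to a
// file named after its predicate, e.g. all p1:pred triples go to a file named p1-pred.
// The predicate itself is not stored, since it is implied by the file name.
public class PredicateSplit {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> out = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // N-triples line: <subject> <predicate> <object> .
                String[] parts = line.split("\\s+", 3);
                if (parts.length < 3) continue;                      // skip malformed lines
                String subject = parts[0];
                String predicate = parts[1];
                String object = parts[2].replaceAll("\\s*\\.\\s*$", "");
                // Derive the output file name from the predicate, e.g. p1:pred -> p1-pred
                String fileName = predicate.replaceAll("[<>]", "").replaceAll("[^A-Za-z0-9]", "-");
                out.computeIfAbsent(fileName, f -> open(f)).println(subject + " " + object);
            }
        }
        out.values().forEach(PrintWriter::close);
    }

    private static PrintWriter open(String name) {
        try { return new PrintWriter(new FileWriter(name)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}

Because the predicate is implied by the file name, only the subject and object need to be written, which mirrors the space saving of the storage schema.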
Split Using Explicit Type Information of Object: In the next step, we work with the explicit type
information in the rdf_type file. The predicate rdf:type is used in RDF to denote that a resource is
an instance of a class. The rdf_type file is first divided into as many files as the number of distinct
objects the rdf:type predicate has. For example, if in the ontology, the leaves of the class hierarchy
are c1, c2, …, cn, then we will create a file for each of these leaves, named type_c1, type_c2, …,
type_cn. Please note that the object values c1, c2, …, cn no longer need to be stored within the file
as they can be easily retrieved from the file name. This further reduces the
amount of space needed to store the data. We generate such a file for each distinct object value of
the predicate rdf:type.
Split Using Implicit Type Information of Object: We divide the remaining predicate files accord-
ing to the type of the objects. Not all the objects are URIs (Uniform Resource Identifier); some are
literals. The literals remain in the file named by the predicate; no further processing is required for
them. The type information of a URI object is not mentioned in these files, but it can be retrieved
from the type_* files. The URI objects move into their respective files, named predicate_type. For
example, if a triple has the predicate p and the type of the URI object is ci, then the subject and
object appear in one line in the file p_ci. To do this split, we need to join a predicate file with the
type_* files to retrieve the type information.
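The predicate object split can be sketched in the same spirit. The version below builds an in-memory URI-to-type lookup from the type_* files and then re-splits one predicate file; performing this join in memory, as well as the class and directory names, are assumptions made purely for illustration, since the framework executes the join as a MapReduce job over HDFS.

import java.io.*;
import java.util.*;

// Illustrative sketch of the predicate object split (POS) step: a triple whose URI
// object has type ci is moved to the file <predicate>_ci, while triples with literal
// objects stay in the plain predicate file.
public class PredicateObjectSplit {
    public static void main(String[] args) throws IOException {
        // args[0..n-2]: type_* files; args[n-1]: one predicate file to split
        Map<String, String> typeOf = new HashMap<>();
        for (int i = 0; i < args.length - 1; i++) {
            String type = new File(args[i]).getName().substring("type_".length());
            try (BufferedReader r = new BufferedReader(new FileReader(args[i]))) {
                String uri;
                while ((uri = r.readLine()) != null) typeOf.put(uri.trim(), type);
            }
        }
        File predicateFile = new File(args[args.length - 1]);
        String predicate = predicateFile.getName();
        File outDir = new File("pos_output");
        outDir.mkdirs();
        Map<String, PrintWriter> out = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(predicateFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\\s+", 2);              // subject object
                if (parts.length < 2) continue;
                String type = typeOf.get(parts[1].trim());
                // URI objects of a known type go to predicate_type; literals keep the predicate file name
                String name = (type != null) ? predicate + "_" + type : predicate;
                out.computeIfAbsent(name, n -> open(new File(outDir, n))).println(line);
            }
        }
        out.values().forEach(PrintWriter::close);
    }

    private static PrintWriter open(File f) {
        try { return new PrintWriter(new FileWriter(f)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}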
Our MapReduce framework, described in Section 23.5, has three subcomponents in it. It takes
the SPARQL query from the user and passes it to the Input and Plan Generator. This compo-
nent selects the input files by using our algorithm described in Section 23.5, decides how many
MapReduce jobs are needed, and passes the information to the Join Executer component that
runs the jobs using the MapReduce framework. It then relays the query answer from Hadoop to
the user.
Input Files Selection: For a given SPARQL query, we select the input files by iterating over its triple patterns and applying the following rules:
1. In a triple pattern, if the predicate is variable, we select all the files as input to the jobs and
terminate the iteration.
2. If the predicate is rdf:type and the object is concrete, we select the type file having that
particular type. For example, for LUBM query 9 (Listing 1), we could select file type_Student
as part of the input set. However, this brings up an interesting scenario. In our dataset, there
is actually no file named type_Student because Student class is not a leaf in the ontology
tree. In this case, we consult the LUBM ontology [LEHI] to determine the correct set of
input files. We add the files type_GraduateStudent, type_UndergraduateStudent, and type_
ResearchAssistant as GraduateStudent; UndergraduateStudent, and ResearchAssistant are
the leaves of the subtree rooted at node Student.
3. If the predicate is rdf:type and the object is variable, then if the type of the variable
is defined by another triple pattern, we select the type file having that particular type.
Otherwise, we select all type files.
4. If the predicate is not rdf:type and the object is variable, then we need to determine if the type
of the object is specified by another triple pattern in the query. In this case, we can rewrite the
query and eliminate some joins. For example, in LUBM Query 9 (Listing 1), the type of Y is
specified as Faculty and Z as Course and these variables are used as objects in the last three
triple patterns. If we choose files advisor_Lecturer, advisor_PostDoc, advisor_FullProfessor,
advisor_AssociateProfessor, advisor_AssistantProfessor, and advisor_ VisitingProfessor
as part of the input set, then the triple pattern in line 2 becomes unnecessary. Similarly,
triple pattern in line 3 becomes unnecessary if files takesCourse_Course and takesCourse_
GraduateCourse are chosen. Hence, we get the rewritten query shown in Listing 2. However,
if the type of the object is not specified, then we select all files for that predicate.
5. If the predicate is not rdf:type and the object is concrete, then we select all files for that
predicate.
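A compact sketch of these selection rules for a single triple pattern is shown below. The TriplePattern and Ontology types, the boundTypes map (variable-to-type bindings collected from the other triple patterns of the query), and allFiles are illustrative placeholders rather than the framework's actual interfaces; rule 1 is handled by the caller terminating the iteration once all files have been selected.

import java.util.*;

// Illustrative sketch of the input file selection rules for one triple pattern.
public class InputSelector {

    /** Returns the storage files that must be read for one triple pattern. */
    static Set<String> selectFiles(TriplePattern tp, Ontology ont,
                                   Map<String, String> boundTypes, Set<String> allFiles) {
        Set<String> files = new HashSet<>();
        if (tp.predicateIsVariable()) {
            return allFiles;                                  // rule 1: nothing can be pruned
        }
        if (tp.predicate().equals("rdf:type")) {
            if (!tp.objectIsVariable()) {                     // rule 2: concrete type
                for (String leaf : ont.leavesUnder(tp.object())) files.add("type_" + leaf);
            } else if (boundTypes.containsKey(tp.object())) { // rule 3: type fixed by another pattern
                for (String leaf : ont.leavesUnder(boundTypes.get(tp.object()))) files.add("type_" + leaf);
            } else {                                          // rule 3: type unknown
                for (String f : allFiles) if (f.startsWith("type_")) files.add(f);
            }
            return files;
        }
        // rules 4 and 5: an ordinary predicate
        String objType = tp.objectIsVariable() ? boundTypes.get(tp.object()) : null;
        if (objType != null) {                                // rule 4: object type known elsewhere
            for (String leaf : ont.leavesUnder(objType)) files.add(tp.predicate() + "_" + leaf);
        } else {                                              // rules 4 (type unknown) and 5
            for (String f : allFiles)
                if (f.equals(tp.predicate()) || f.startsWith(tp.predicate() + "_")) files.add(f);
        }
        return files;
    }

    // Placeholder types used only to keep the sketch self-contained.
    interface TriplePattern {
        boolean predicateIsVariable();
        boolean objectIsVariable();
        String predicate();
        String object();
    }
    interface Ontology {
        // Leaf classes of the subtree rooted at className (the class itself if it is a leaf)
        List<String> leavesUnder(String className);
    }
}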
Definition 23.1
Triple Pattern, TP: A triple pattern is an ordered set of subject, predicate, and object that appears
in an SPARQL query WHERE clause. The subject, predicate, and object can be either a variable
(unbounded) or a concrete value (bounded).
Definition 23.2
Triple Pattern Join, TPJ: A triple pattern join is a join between two TPs on a variable.
Definition 23.3
MapReduceJoin, MRJ: A MapReduceJoin is a join between two or more triple patterns on a variable.
Definition 23.4
Job, JB: A job JB is a Hadoop job where one or more MRJs are done. JB has a set of input files and
a set of output files.
Definition 23.5
Definition 23.6
Ideal Model
To answer an SPARQL query, we may need more than one job. Therefore, in an ideal scenario, the cost
estimation for processing a query requires the individual cost estimation of each job that is needed to
answer that query. A job contains three main tasks that are reading, sorting, and writing. We estimate
the cost of a job based on these three tasks. For each task, a unit cost is assigned to each triple pattern
it deals with. In the current model, we assume that costs for reading and writing are the same:
Cost = Σ_{i=1}^{n−1} (MI_i + MO_i + RI_i + RO_i) + MI_n + MO_n + RI_n        (23.1)

     = Σ_{i=1}^{n−1} Job_i + MI_n + MO_n + RI_n        (23.2)

where n is the number of jobs needed to answer the query; MI_i, MO_i, RI_i, and RO_i denote the costs of the map input, map output, reduce input, and reduce output phases of job i; and Job_i denotes the total cost of job i.
Equation 23.1 is the total cost of processing a query. It is the summation of the individual costs of
each intermediate job plus the cost of the final job excluding its reduce output. We do not consider the cost of the reduce output of
the final job because it would be the same for any query plan as this output is the final result that is
fixed for a query and a given dataset. A job essentially performs a MapReduce task on the file data.
Equation 23.2 shows the division of the MapReduce task into subtasks. Hence, to estimate the cost
of each job, we will combine the estimated cost of each subtask.
Map input (MI) phase: This phase reads the triple patterns from the selected input files stored in
the HDFS. Therefore, we can estimate the cost for the MI phase to be equal to the total number of
triples in each of the selected files.
Map output (MO) phase: The estimation of the MO phase depends on the type of query being
processed. If the query has no bound variable (e.g., [?X ub:worksFor ?Y]), then the output of the
Map phase is equal to the input. All of the triple patterns are transformed into key–value pairs and
given as output. Therefore, for such a query, the MO cost will be the same as the MI cost. However,
if the query involves a bound variable, (e.g., [?Y ub:subOrganizationOf <http://www.U0.edu>]),
then before making the key–value pairs, a bound component selectivity estimation can be applied.
The resulting estimate for the triple patterns will account for the cost of the Map output phase. The
selected triples are written to a local disk.
Reduce input (RI) phase: In this phase, the triples from the Map output phase are read via HTTP
and then sorted based on their key values. After sorting, the triples with identical keys are grouped
together. Therefore, the cost estimation for the RI phase is equal to the MO phase. The number of key–
value pairs that are sorted in RI is equal to the number of key–value pairs generated in the MO phase.
Reduce output (RO) phase: The RO phase deals with performing the joins. Therefore, it is in
this phase we can use the join triple pattern selectivity summary statistics to estimate the size of its
output. In the following, we talk in detail about the join triple pattern selectivity summary statistics
needed for our framework.
However, in practice, the earlier discussion is applicable for the first job only. For the subsequent
jobs, we lack both the precise knowledge and estimate of the number of triple patterns selected after
applying the join in the first job. Therefore, for these jobs, we can take the size of the RO phase of
the first job as an upper bound on the different phases of the subsequent jobs.
Job_i = MI_i + MO_i + RI_i + RO_i,    1 ≤ i ≤ n − 1        (23.3)

Equation 23.3 shows a very important point: the total cost of an intermediate job (i < n) includes the cost of its RO phase.
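The ideal model can be captured in a few lines of code. The sketch below mirrors Equations 23.1 through 23.3 using hypothetical per-phase estimates; the class and field names are ours, not the framework's.

import java.util.List;

// Illustrative sketch of the ideal cost model: each intermediate job contributes
// MI + MO + RI + RO, while the final job contributes only MI + MO + RI because its
// reduce output is the fixed query result.
public class IdealCostModel {

    static class JobCost {
        long mapInput;      // MI: triples read from the selected input files
        long mapOutput;     // MO: key-value pairs emitted by the map phase
        long reduceInput;   // RI: pairs sorted and grouped (equal to MO)
        long reduceOutput;  // RO: estimated join output size

        JobCost(long mi, long mo, long ri, long ro) {
            mapInput = mi; mapOutput = mo; reduceInput = ri; reduceOutput = ro;
        }
        long total() {      // Equation 23.3: full cost of an intermediate job
            return mapInput + mapOutput + reduceInput + reduceOutput;
        }
    }

    /** Equations 23.1 and 23.2: intermediate job costs plus MI + MO + RI of the last job. */
    static long queryCost(List<JobCost> jobs) {
        long cost = 0;
        for (int i = 0; i < jobs.size() - 1; i++) cost += jobs.get(i).total();
        JobCost last = jobs.get(jobs.size() - 1);
        return cost + last.mapInput + last.mapOutput + last.reduceInput;
    }

    public static void main(String[] args) {
        // Two hypothetical jobs; the second job's phases are bounded by the RO of the first.
        JobCost job1 = new JobCost(1_000_000, 400_000, 400_000, 50_000);
        JobCost job2 = new JobCost(50_000, 50_000, 50_000, 10_000);
        System.out.println("Estimated query cost = " + queryCost(List.of(job1, job2)));
    }
}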
Heuristic Model
In this section, we show that the ideal model is not practical or cost-effective. There are several
issues that make the ideal model less attractive in practice. First, the ideal model considers simple
abstract costs, namely, the number of triples read and written by the different phases, ignoring the
actual cost of copying, sorting, etc., of these triples, as well as the overhead of running jobs in Hadoop. But
accurately incorporating those costs in the model is a difficult task. Even making a reasonably good
estimation may be nontrivial. Second, to estimate intermediate join outputs, we need to maintain
comprehensive summary statistics. In a MapReduce job in Hadoop, all the joins on a variable are
joined together. For example, in the rewritten LUBM Query 9 (Listing 2), there are three joins on
variable X. When a job is run to do the join on X, all the joins on X between triple patterns 1, 2,
and 4 are done. If there were more than three joins on X, all will still be handled in one job. This
shows that in order to gather summary statistics to estimate join selectivity, we face an exponen-
tial number of join cases. For example, between triple patterns having p1, p2, and p3, there may be
multiple types of joins because in each triple pattern, a variable can occur either as a subject or an
object. In the case of the rewritten Query 9, it is a subject−subject−subject join between 1, 2, and
4. There can be more types of join between these three, for example, subject−object−subject and
object−subject−object. This means that between P predicates, there can be 2^P types of joins on a
single variable (ignoring the possibility that a variable may appear both as a subject and as an object
in a triple pattern). If there are P predicates in the dataset, the total number of cases for which we need
to collect summary statistics can be calculated by the formula:

Σ_{j=2}^{P} C(P, j) · 2^j

where C(P, j) is the number of ways of choosing j predicates out of P.
In the LUBM dataset, there are 17 predicates. So, in total, there are 129,140,128 cases, which is a
large number. Gathering summary statistics for such a large number of cases would be very time-
and space-consuming. Hence, we took an alternate approach.
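Assuming the counting formula above, the figure of 129,140,128 cases for 17 predicates can be reproduced with a short computation:

import java.math.BigInteger;

// Counts the join cases that would need summary statistics: every subset of at least
// two predicates, with each predicate's variable occurring as either subject or object.
public class JoinCaseCount {
    public static void main(String[] args) {
        int p = 17;                                        // number of predicates in LUBM
        BigInteger total = BigInteger.ZERO;
        for (int j = 2; j <= p; j++) {
            total = total.add(binomial(p, j).multiply(BigInteger.TWO.pow(j)));
        }
        System.out.println(total);                         // prints 129140128
    }

    static BigInteger binomial(int n, int k) {
        BigInteger r = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            r = r.multiply(BigInteger.valueOf(n - k + i)).divide(BigInteger.valueOf(i));
        }
        return r;
    }
}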
We observe that there is significant overhead for running a job in Hadoop. Therefore, if we
minimize the number of jobs to answer a query, we get the fastest plan. The overhead is incurred
by several disk I/O and network transfers that are an integral part of any Hadoop job. When a job is
submitted to a Hadoop cluster, at least the following set of actions takes place:
1. The Executable file is transferred from the client machine to the Hadoop JobTracker [WIKIa].
2. The JobTracker decides which TaskTrackers [WIKIb] will execute the job.
3. The Executable file is distributed to the TaskTrackers over the network.
4. Map processes start by reading data from HDFS.
5. Map outputs are written to disks.
6. Map outputs are read from disks, shuffled (transferred over the network to TaskTrackers,
which would run Reduce processes), sorted, and written to disks.
7. Reduce processes start by reading the input from the disks.
8. Reduce outputs are written to disks.
These disk operations and network transfers are expensive operations even for a small amount of
data. For example, in our experiments, we observed that the overhead incurred by one job is almost
equivalent to reading a billion triples. The reason is that in every job, the output of the map process
is always sorted before feeding the reduce processes. This sorting is unavoidable even if it is not
needed by the user. Therefore, it would be less costly to process several hundred million more triples
in n jobs, rather than processing several hundred million less triples in n + 1 jobs.
To further investigate, we did an experiment where we used the query shown in Listing 4. Here, the
join selectivity between TPs 2 and 3 on ?Z is the highest. Hence, a query plan generation algorithm
that uses selectivity factors to pick joins would select this join for the first job. As the other TPs 1 and
4 share variables with either TP 2 or 3, they cannot take part in any other join; moreover, they do not
share any variables, so the only possible join that can be executed in this job is the join between TPs 2
and 3 on ?Z. Once this join is done, the two joins left are between TP 1 and the join output of the first
job on variable ?X and between TP 4 and the join output of first job on variable ?Y. We found that the
selectivity of the first join is greater than the latter one. Hence, the second job will do this join and TP 4
will again not participate. In the third and last job, the join output of the second job will be joined with
TP 4 on ?Y. This is the plan generated using join selectivity estimation. But the minimum job plan is a
two job plan where the first job joins TPs 1 and 2 on ?X and TPs 3 and 4 on ?Y. The second and final job
joins the two join outputs of the first job on ?Z. The query runtimes we found are given in [HUSA11a].
Listing 4. Experiment Query
?S1 ub:advisor ?X.
?X ub:headOf ?Z.
?Z ub:subOrganizationOf ?Y.
?S2 ub:mastersDegreeFrom ?Y
For each dataset, we found that the two job plan is faster than the three job plan, even though
the three job plan produced less intermediate data because of the join selectivity order. We can
explain this by an observation we made in another small experiment. We generated files of sizes 5
and 10 MB containing random integers. We put the files in HDFS. For each file, we first read the
file by a program and recorded the time needed to do it. While reading, our program reads from one
of the three available replicas of the file. Then, we ran a MapReduce job that rewrites the file with
the numbers sorted. We utilized MapReduce sorting to have the sorted output. Please also note that
when it writes the file, it writes three replications of it. We found that the MapReduce job, which
does reading, sorting, and writing, takes 24.47 times longer to finish for 5 MB. For 10 MB, it is
42.79 times. This clearly shows how the write and data transfer operations of a MapReduce job are
more expensive than a simple read from only one replica. Because of the number of jobs, the three
job plan is doing much more disk read and write operations as well as network data transfers, and,
as a result, is slower than the two job plan, even if it is reading less input data.
Because of these reasons, we do not pursue the ideal model. We follow the practical model,
which is to generate a query plan having minimum possible jobs. However, while generating a
minimum job plan, whenever we need to choose a join to be considered in a job among more than
one joins, instead of choosing randomly, we use the summary join statistics. This is described in
Section 23.5.6. More details of our experimental results with the charts are provided in [HUSA11a].
?X rdf:type ub:GraduateStudent
?Y rdf:type ub:University
?Z ?V ub:Department
?X ub:memberOf ?Z
?X ub:undergraduateDegreeFrom ?Y}
In order to simplify the notations, we will only refer to the TPs by the variable in that pattern. For
example, the first TP (?X rdf:type ub:GraduateStudent) will be represented as simply X. Also, in the
simplified version, the whole query would be represented as follows: {X,Y,Z,XZ,XY}.
We will use the notation join(XY,X) to denote a join operation between the two TPs XY and X
on the common variable X.
Definition 23.7
The Minimum Cost Plan Generation Problem (Bestplan Problem): For a given query, the Bestplan
problem is to generate a job plan so that the total cost of the jobs is minimized. Note that Bestplan
considers the more general case where each job has some cost associated with it (i.e., the ideal
model).
Example: Given the query in our running example, two possible job plans are as follows:
Plan 1. job1 = join(X,XY,XZ), resultant TPs = {Y,Z,YZ}. job2 = join(Y,YZ), resultant TPs = {Z,Z}.
job3 = join(Z,Z). Total cost = cost(job1) + cost(job2) + cost(job3).
Plan 2. job1 = {join(Z,XZ), join(Y,XY)}, resultant TPs = {X,X,X}. job2 = join(X,X,X).
Total cost = cost(job1) + cost(job2).
The Bestplan problem is to find the least cost job plan among all possible job plans.
Definition 23.8
Joining Variable: A variable that is common in two or more triple patterns. For example, in the
r unning example query, X, Y, Z are joining variables, but V is not.
Definition 23.9
Complete Elimination: A join operation that eliminates a joining variable. For example, in the
example query, Y can be completely eliminated if we join (XY,Y).
Definition 23.10
Partial Elimination: A join operation that partially eliminates a joining variable. For example, in
the example query, if we perform join(XY,Y) and join(X,ZX) in the same job, the resultant triple
patterns would be {X,Z,Z}. Therefore, Y will be completely eliminated, but X will be partially
eliminated. So, the join(X,ZX) performs a partial elimination.
Definition 23.11
E-Count(v): E-count(v) is the number of joining variables in the resultant triple pattern after a com-
plete elimination of variable v. In the running example, join(X,XY,XZ) completely eliminates X and
the resultant triple pattern (YZ) has two joining variables Y and Z. So, E-count(X) = 2. Similarly,
E-count(Y) = 1 and E-count(Z) = 1.
Computational Complexity of Bestplan: It can be shown that generating the least cost query plan
is computationally expensive, since the search space is exponentially large. At first, we formulate
the problem, and then show its complexity.
Problem Formulation: We formulate Bestplan as a search problem. Let G = (V, E) be a weighted
directed graph, where each vertex vi ∈ V represents a state of the triple patterns, and each edge
ei = (vi1, vi2) ∈ E represents a job that makes a transition from state vi1 to state vi2. v0 is the initial
state, where no joins have been performed, that is, the given query. Also, vgoal is the goal state, which
represents a state of the triple pattern where all joins have been performed. The problem is to find
the shortest weighted path from v0 to vgoal.
For example, in our running example query, the initial state v0 = {X,Y,Z,XY,XZ}, and the goal state,
vgoal = ∅, that is, no more triple patterns left. Suppose the first job (job1) performs join(X,XY,XZ).
Then, the resultant triple patterns (new state) would be v1 = {Y,Z,YZ}, and job1 would be represented
by the edge (v0, v1). The weight of edge (v0, v1) is the cost of job1 = cost(job1), where cost is the given
cost function. Figure 23.4 shows the partial graph for the example query.
Search Space Size: Given a graph G = (V, E), Dijkstra’s shortest path algorithm can find the
shortest path from a source to all other nodes in O(|V|log|V|+|E|) time. However, for Bestplan, it
can be shown that in the worst case, |V| ≥ 2^K, where K is the total number of joining variables in the
given query. Therefore, the number of vertices in the graph is exponential, leading to an exponen-
tial search problem. In [HUSA11a], we have shown that the worst-case complexity of the Bestplan
problem is exponential in K, the number of joining variables in the given query.
Relaxed Bestplan Problem and Approximate Solution: In the Relaxed Bestplan problem, we
assume uniform cost for all jobs. Although this relaxation does not reduce the search space, the
problem is reduced to finding a job plan having the minimum number of jobs. Note that this is the
problem for the practical version of the model.
Definition 23.12
Relaxed Bestplan Problem: The Relaxed Bestplan problem is to find the job plan that has the
minimum number of jobs.
Next, we show that if joins are reasonably chosen, and no eligible join operation is left undone
in a job, then we may set an upper bound on the maximum number of jobs required for any given
query. However, it is still computationally expensive to generate all possible job plans. Therefore,
we resort to a greedy algorithm (Algorithm 23.1) that finds an approximate solution to the Relaxed
Bestplan problem, but is guaranteed to find a job plan within the upper bound.
Algorithm 23.1: Relaxed-Bestplan (Query Q)
1: Q ← remove all nonjoining variables from Q
2: J ← 1
3: while Q ≠ Empty do
4:   U ← {u1, …, uK}, the joining variables of Q sorted in nondecreasing order of E-count
5:   JobJ ← ∅
6:   tmp ← ∅
7:   for i = 1 to K do
8:     if ui can be completely or partially eliminated then
9:       tmp ← tmp ∪ {resultant TP of joining TP(Q, ui) on ui}
10:      Q ← Q − TP(Q, ui)
11:      JobJ ← JobJ ∪ {the join on ui}
12:     end if
13:   end for
14:   Q ← Q∪tmp
15:   J ← J + 1
16: end while
17: return {Job1,…,JobJ−1}
Definition 23.13
Early Elimination Heuristic: The early elimination heuristic makes as many complete eliminations
as possible in each job.
This heuristic leaves the fewest number of variables for join in the next job. In order to apply
the heuristic, we must first choose the variable in each job with the least E-count. This heuristic is
applied in Algorithm 23.1.
The algorithm starts by removing all the nonjoining variables from the query Q. In our running
example, Q = {X,Y,VZ,XY,XZ}, and removing the nonjoining variable V makes Q = {X,Y,Z,XY,XZ}.
In the while loop, the job plan is generated, starting from Job1. In line 4, we sort the variables
according to their E-count. The sorted variables are: U = {Y,Z,X}, since Y and Z have E-count = 1,
and X has E-count = 2. For each job, the list of join operations is stored in the variable JobJ, where
J is the ID of the current job. Also, a temporary variable tmp is used to store the resultant triples
of the joins to be performed in the current job (line 6). In the for loop, each variable is checked to
see if the variable can be completely or partially eliminated (line 8). If yes, we store the join result
in the temporary variable (line 9), update Q (line 10), and add this join to the current job (line 11).
In our running example, this results in the following operations. Iteration 1 of the for loop: u1 = (Y)
can be completely eliminated. Here, TP(Q,Y), the triple patterns in Q containing Y, is {Y,XY}; join(Y,XY)
is added to job1, the resultant TP X is stored in tmp, and Y and XY are removed from Q. Iteration 2 of the
for loop: u2 = (Z) can also be completely eliminated, so join(Z,XZ) is added to job1, the resultant TP X is
stored in tmp, and Z and XZ are removed from Q. Iteration 3 of the for loop: u3 = (X) cannot be completely
or partially eliminated, since there is no other TP left to join with it. Therefore, when the for loop
terminates, we have job1 = {join(Y,XY),join(Z,XZ)}, and Q = {X,X,X}. In the second iteration of the
while loop, we will have job2 = {join(X,X,X)}. Since after this
join, Q becomes Empty, the while loop is exited. Finally, {job1,job2} are returned from the algorithm.
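The following sketch reimplements the greedy heuristic in the simplified notation, where each triple pattern is reduced to its set of joining variables. It is an illustrative reimplementation rather than the framework's code; the representation of patterns as character sets and all class and method names are our own.

import java.util.*;

// Illustrative sketch of the Relaxed-Bestplan greedy heuristic.
public class RelaxedBestplan {

    /** Returns one job per list entry; each job is the set of variables joined in it. */
    static List<Set<Character>> plan(List<Set<Character>> patterns) {
        List<Set<Character>> q = new ArrayList<>();
        for (Set<Character> tp : patterns) q.add(new HashSet<>(tp));
        List<Set<Character>> jobs = new ArrayList<>();

        while (!q.isEmpty()) {
            Set<Character> job = new LinkedHashSet<>();
            List<Set<Character>> tmp = new ArrayList<>();
            // early elimination heuristic: consider variables in increasing order of E-count
            List<Character> vars = new ArrayList<>(joiningVars(q));
            vars.sort(Comparator.comparingInt(v -> eCount(q, v)));

            for (char v : vars) {
                List<Set<Character>> group = new ArrayList<>();
                for (Set<Character> tp : q) if (tp.contains(v)) group.add(tp);
                if (group.size() < 2) continue;             // v cannot be eliminated in this job
                Set<Character> result = new HashSet<>();
                for (Set<Character> tp : group) result.addAll(tp);
                result.remove(v);
                q.removeAll(group);                         // these patterns are consumed by the join
                if (!result.isEmpty()) tmp.add(result);     // the resultant pattern joins in a later job
                job.add(v);
            }
            q.addAll(tmp);
            jobs.add(job);
        }
        return jobs;
    }

    /** Joining variables: variables appearing in at least two remaining patterns. */
    static Set<Character> joiningVars(List<Set<Character>> q) {
        Map<Character, Integer> count = new HashMap<>();
        for (Set<Character> tp : q)
            for (char v : tp) count.merge(v, 1, Integer::sum);
        Set<Character> joining = new TreeSet<>();
        for (Map.Entry<Character, Integer> e : count.entrySet())
            if (e.getValue() >= 2) joining.add(e.getKey());
        return joining;
    }

    /** E-count: joining variables left in the pattern produced by completely eliminating v. */
    static int eCount(List<Set<Character>> q, char v) {
        Set<Character> result = new HashSet<>();
        for (Set<Character> tp : q) if (tp.contains(v)) result.addAll(tp);
        result.remove(v);
        result.retainAll(joiningVars(q));
        return result.size();
    }

    public static void main(String[] args) {
        // The running example query {X, Y, Z, XY, XZ} (nonjoining variable V already removed)
        List<Set<Character>> query = List.of(
            Set.of('X'), Set.of('Y'), Set.of('Z'), Set.of('X', 'Y'), Set.of('X', 'Z'));
        System.out.println(plan(query));   // prints [[Y, Z], [X]], the two-job plan derived above
    }
}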
In [HUSA11a], we have proved that for any given query Q, containing K joining variables and N
triple patterns, Algorithm Relaxed-Bestplan (Q) generates a job plan containing at most J jobs, where
J =  0                          if N = 0
     1                          if N = 1 or K = 1        (23.4)
     min(1.71 · log2 N, K)      if N, K > 1
The second triple pattern in the query makes it impossible to answer the query with
only one job. There are only two possible plans: we can join the first two triple patterns on X first
and then join its output with the last triple pattern on Y or we can join the last two patterns first on Y
and then join its output with the first pattern on X. In such a situation, instead of randomly choosing
a join variable for the first job, we use join summary statistics for a pair of predicates. We select the
join for the first job which is more selective to break the tie. The join summary statistics we use are
described in [STOC08].
Listing 7 shows LUBM Query 2, which we will use to illustrate the way we do a join using map
and reduce methods. The query has six triple patterns and nine joins between them on the variables
X, Y, and Z.
Listing 7. LUBM Query 2
SELECT ?X ?Y ?Z WHERE {
?X rdf:type ub:GraduateStudent.
?Y rdf:type ub:University.
?Z rdf:type ub:Department.
?X ub:memberOf ?Z.
?Z ub:subOrganizationOf ?Y.
?X ub:undergraduateDegreeFrom ?Y}
Our input selection algorithm selects files type_GraduateStudent, type_ University, type_
Department, all files having the prefix memberOf, all files having the prefix subOrganizationOf,
and all files having the prefix underGraduateDegreeFrom as the input to the jobs needed to answer
the query.
The query plan has two jobs. In job 1, triple patterns of lines 2, 5, and 7 are joined on X and triple
patterns of lines 3 and 6 are joined on Y. In job 2, triple pattern of line 4 is joined with the outputs
of previous two joins on Z and also the join outputs of job 1 are joined on Y.
The input files of job 1 are type_GraduateStudent, type_University, all files having the prefix
memberOf, all files having the prefix subOrganizationOf, and all files having the prefix under-
GraduateDegreeFrom. In the map phase, we first tokenize the input value which is actually a line
of the input file. Then, we check the input file name and, if input is from type_GraduateStudent, we
output a key–value pair having the subject URI prefixed with X# the key and a flag string GS# as
the value. The value serves as a flag to indicate that the key is of type GraduateStudent. The subject
URI is the first token returned by the tokenizer. Similarly, for input from file type_University output
a key–value pair having the subject URI prefixed with Y# the key and a flag string U# as the value.
If the input from any file has the prefix memberOf, we retrieve the subject and object from the input
line by the tokenizer and output a key–value pair having the subject URI prefixed with X# the key
and the object value prefixed with MO# as the value. For input from files having the prefix subOr-
ganizationOf, we output key–value pairs making the object prefixed with Y# the key and the subject
prefixed with SO# the value. For input from files having the prefix underGraduateDegreeFrom, we
output key–value pairs making the subject URI prefixed with X# the key and the object value pre-
fixed with UDF# the value. Hence, we make either the subject or the object a map output key based
on which we are joining. This is the reason why the object is made the key for the triples from files
having the prefix subOrganizationOf because the joining variable Y is an object in the triple pattern
in line 6. For all other inputs, the subject is made the key because the joining variables X and Y are
subjects in the triple patterns in lines 2, 3, 5, and 7.
In the reduce phase, Hadoop groups all the values for a single key and for each key provides the
key and an iterator to the values collection. Looking at the prefix, we can immediately tell if it is a
value for X or Y because of the prefixes we used. In either case, we output a key–value pair using the
same key and concatenating all the values to make a string value. So after this reduce phase, join on
X is complete and on Y is partially complete.
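A sketch of the map and reduce methods of job 1, following the prefixing convention just described, is given below. The flag strings (GS#, U#, MO#, SO#, UDF#) are taken from the description above, while the class names and the omitted driver code (job configuration, input paths, and output paths) are assumptions of this sketch.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative sketch of job 1 for LUBM Query 2: join on X, partial join on Y.
public class Query2Job1 {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String[] tok = value.toString().trim().split("\\s+");
            if (tok.length == 0) return;
            String subject = tok[0];
            String object = tok.length > 1 ? tok[1] : "";

            if (file.equals("type_GraduateStudent")) {
                context.write(new Text("X#" + subject), new Text("GS#"));
            } else if (file.equals("type_University")) {
                context.write(new Text("Y#" + subject), new Text("U#"));
            } else if (file.startsWith("memberOf")) {
                context.write(new Text("X#" + subject), new Text("MO#" + object));
            } else if (file.startsWith("subOrganizationOf")) {
                // Y is the object in this triple pattern, so the object becomes the key
                context.write(new Text("Y#" + object), new Text("SO#" + subject));
            } else if (file.startsWith("underGraduateDegreeFrom")) {
                context.write(new Text("X#" + subject), new Text("UDF#" + object));
            }
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate every value for this X# or Y# key: the join on X completes here,
            // while the join on Y is finished in job 2.
            StringBuilder sb = new StringBuilder();
            for (Text v : values) sb.append(v.toString()).append(' ');
            context.write(key, new Text(sb.toString().trim()));
        }
    }
}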
The input files of job 2 are type_Department file and the output file of job 1, job1.out. Like the
map phase of job 1, in the map phase of job 2, we also tokenize the input value which is actually
a line of the input file. Then, we check the input file name and if input is from type_Department,
we output a key–value pair having the subject URI prefixed with Z# the key and a flag string D# as
the value. If the input is from job1.out, we find the value having the prefix Z#. We make this value
the output key and concatenate the rest of the values to make a string and make it the output value.
Basically, we make the Z# values the keys to join on Z.
In the reduce phase, we know that the key is the value for Z. The values collection has two types
of strings. One has X values, which are URIs for graduate students and also Y values from which
they got their undergraduate degree. The Z value, that is, the key, may or may not be a subOrgani-
zationOf the Y value. The other types of strings have only Y values which are universities and of
which the Z value is a suborganization. We iterate over the values collection and then join the two
types of tuples on Y values. From the join output, we find the result tuples which have values for
X, Y, and Z.
23.6 RESULTS
23.6.1 Experimental Setup
In this section, we first present the benchmark datasets with which we experimented. Next, we
present the alternative repositories we evaluated for comparison. Then, we detail our experimental
setup. Finally, we present our evaluation results.
Datasets: In our experiments with SPARQL query processing, we use two synthetic datasets:
LUBM [GUO05] and SP2B [SCHM09]. The LUBM data generator creates data about universities by
using an ontology [LEHI]. It has 14 standard queries. Some of the queries require inference to
answer. The LUBM dataset is very good for both inference and scalability testing. For all LUBM
datasets, we used the default seed. The SP2B dataset is good for scalability testing with complex
queries and data access patterns. It has 16 queries, most of which have complex structures.
Baseline Frameworks: We compared our framework with RDF-3X [NEUM08], Jena [JENA],
and BigOWLIM [ONTO]. RDF-3X is considered the fastest semantic web framework with per-
sistent storage. Jena is an open source framework for semantic web data. It has several models
which can be used to store and retrieve RDF data. We chose Jena’s in-memory and SDB models
to compare our framework with. As the name suggests, the in-memory model stores the data in
main memory and does not persist data. The SDB model is a persistent model and can use many
off-the-shelf database management systems. We used MySQL database as SDB’s back-end in our
experiments. BigOWLIM is a proprietary, state-of-the-art framework for semantic web data and is
among the fastest available. It can act as both a persistent and a nonpersistent store. All of
these frameworks run in a single machine setup.
Hardware: We have a 10-node Hadoop cluster which we use for our framework. Each of the
nodes has the following configuration: Pentium IV 2.80 GHz processor, 4 GB main memory, and
640 GB disk space. We ran Jena, RDF-3X, and BigOWLIM frameworks on a powerful single
machine having 2.80 GHz quad core processor, 8 GB main memory, and 1 TB disk space.
Software: We used hadoop-0.20.1 for our framework. We compared our framework with Jena-
2.5.7 which used MySQL 14.12 for its SDB model. We used BigOWLIM version 3.2.6. For RDF-3X,
we utilized version 0.3.5 of the source code.
23.6.2 Evaluation
We present performance comparison between our framework, RDF-3X, Jena In-Memory and
SDB models, and BigOWLIM. More details are found in [HUSA11a]. We used three LUBM
datasets: 10,000, 20,000, and 30,000 which have more than 1.1, 2.2, and 3.3 billion triples,
respectively. Initial population time for RDF-3X took 655, 1756, and 3353 min to load the data-
sets, respectively. This shows that the RDF-3X load time is increasing exponentially. LUBM
(30,000) has three times as many triples as LUBM (10,000) yet it requires more than five times
as long to load.
For evaluation purposes, we chose LUBM Queries 1, 2, 4, 9, 12, and 13 to be reported in this
work. These queries provide a good mixture and include simple and complex structures, infer-
ence, and multiple types of joins. They are representative of the other benchmark queries, so
reporting only these covers the types of variation found in the queries we left out while saving
space. Query 1 is a simple selective query. RDF-3X is much faster than HadoopRDF for
this query. RDF-3X utilizes six indexes [NEUM08] and those six indexes actually make up the
dataset. The indexes provide RDF-3X a very fast way to look up triples, similar to a hash table.
Hence, a highly selective query is efficiently answered by RDF-3X. Query 2 is a query with com-
plex structures, low selectivity, and no bound objects. The result set is quite large. For this query,
HadoopRDF outperforms RDF-3X for all three dataset sizes. RDF-3X fails to answer the query
at all when the dataset size is 3.3 billion triples. RDF-3X returns memory segmentation fault
error messages and does not produce any query results. Query 4 is also a highly selective query,
that is, the result set size is small because of a bound object in the second triple pattern but it
needs inferencing to answer it. The first triple pattern uses the class Person which is a super-
class of many classes. No resource in the LUBM dataset is of type Person; rather, there are many
resources which are its subtypes. RDF-3X does not support inferencing so we had to convert the
query to an equivalent query having some union operations. RDF-3X outperforms HadoopRDF
for this query. Query 9 is similar in structure to Query 2 but it requires significant inferenc-
ing. The first three triple patterns of this query use classes which are not explicitly instantiated
in the dataset. However, the dataset includes many instances of the corresponding subclasses.
This is also the query that requires the largest dataset join and returns the largest result set out
of the queries we evaluated. RDF-3X is faster than HadoopRDF for 1.1 billion triples dataset
but it fails to answer the query at all for the other two datasets. Query 12 is similar to Query 4
because it is both selective and has inferencing in one triple pattern. RDF-3X beats HadoopRDF
for this query. Query 13 has only two triple patterns. Both of them involve inferencing. There is
a bound subject in the second triple pattern. It returns the second largest result set. HadoopRDF
beats RDF-3X for this query for all datasets. RDF-3X’s performance is slow because the first
triple pattern has very low selectivity and requires low selectivity joins to perform inference via
backward chaining.
These results lead us to some simple conclusions. RDF-3X achieves the best performance
for queries with high selectivity and bound objects. However, HadoopRDF outperforms RDF-3X
for queries with unbound objects, low selectivity, or large dataset joins. RDF-3X cannot execute the
two queries with unbound objects (Queries 2 and 9) for a 3.3 billion triples dataset. This demon-
strates that HadoopRDF is more scalable and handles low selectivity queries more efficiently than
RDF-3X.
We also compared our implementation with the Jena In-Memory, the SDB and BigOWLIM
models. Due to space and time limitations, we performed these tests only for LUBM Queries 2
and 9 from the LUBM dataset. We chose these queries because they have complex structures and
require inference. It is to be noted that BigOWLIM needed 7 GB of Java heap space to successfully
load the billion triples dataset. We ran BigOWLIM only for the largest three datasets as we are
interested in its performance with large datasets. For each set we obtained the results for the Jena
In-Memory model, Jena SDB model, our Hadoop implementation and BigOWLIM, respectively.
At times the query could not complete or it ran out of memory. In most of the cases, our approach
was the fastest. For Query 2, Jena In-Memory and Jena SDB models were faster than our approach,
giving results in 3.9 and 0.4 s, respectively. However, as the size of the dataset grew, the Jena
In-Memory model ran out of memory space. Our implementation was much faster than the Jena
SDB model for large datasets. For example, for 110 million triples, our approach took 143.5 s as
compared to about 5000 s for Jena SDB model. We found that the Jena SDB model could not finish
answering Query 9. Jena In-Memory model worked well for small datasets but became slower than
our implementation as the dataset size grew and eventually ran out of memory.
For Query 2, BigOWLIM was slower than ours for the 110 and 550 million datasets. For the 550
million dataset, it took 22693.4 s, which is anomalously high compared to its other timings. For the bil-
lion triple dataset, BigOWLIM was faster. It should be noted that our framework does not have any
indexing or triple cache whereas BigOWLIM exploits indexing which it loads into main memory
when it starts. It may also prefetch triples into main memory. For Query 9, our implementation is
faster than BigOWLIM in all experiments.
It should also be noted that our RDF-3X and HadoopRDF queries were tested using cold runs.
What we mean by this is that main memory and file system cache were cleared prior to execution.
However, for BigOWLIM, we were forced to execute hot runs. This is because it takes a significant
amount of time to load a database into BigOWLIM. Therefore, we will always easily outperform
BigOWLIM for cold runs. So, we actually tested BigOWLIM for hot runs against HadoopRDF for
cold runs. This gives a tremendous advantage to BigOWLIM, yet for large datasets, HadoopRDF
still produced much better results. This shows that HadoopRDF is much more scalable than
BigOWLIM, and provides more efficient queries for large datasets.
The final test we performed is an in-depth scalability test. For this, we repeated the same
queries for eight different dataset sizes, all the way up to 6.6 billion triples.
In our experiments we found that Query 1 is simple and requires only one join, thus it took the
least amount of time among all the queries. Query 2 is one of the two queries having the greatest
number of triple patterns. Even though it has three times as many triple patterns, it does not take
three times as long as Query 1 to answer because of our storage schema. Query 4 has one less triple pat-
tern than Query 2, but it requires inferencing. As we determine inferred relations on the fly, queries
requiring inference take longer times in our framework. Queries 9 and 12 also require inferencing.
Details are given in [HUSA11a].
As the size of the dataset grows, the increase in time to answer a query does not grow
proportionately. The increase in time is always less. For example, there are 10 times as many
triples in the dataset of 10,000 universities than 1000 universities, but for Query 1, the time only
increases by 3.76 times and for Query 9 by 7.49 times. The latter is the highest increase in time,
yet it is still less than the increase in the size of the datasets. Due to space limitations, we do
not report query runtimes with PS schema here. We found that PS schema is much slower than
POS schema.
Figure 23.5 shows an example ontology and knowledge base with SubClassOf and InstanceOf relations, including classes such as City Employee and instances such as Dunder Mifflin.
Access Tokens (AT) permit access to security-relevant data. An agent in possession of an AT may
view the data permitted by that AT. We denote ATs by positive integers.
Definition 23.15
Access Token Tuples (ATT) have the form ⟨AccessToken, Element, ElementType, ElementName⟩,
where Element can be Subject, Object, or Predicate, and ElementType can be described as URI,
DataType, Literal, Model, or BlankNode. Model is used to access subject models, and will be
explained later in this section.
For example, in the ontology/knowledge base in Figure 23.5, David is a subject and ⟨1, Subject,
URI, David⟩ is an ATT. Any agent having AT 1 may retrieve David’s information over all files
(subject to any other security restrictions governing access to URIs, literals, etc., associated with
David’s objects). When an ATT does not constrain a particular element, we leave the ElementName blank (_).
Based on the record organization, we support six access levels along with a few subtypes
described below. Agents may be assigned one or more of the following access levels. Access levels
with a common AT combine conjunctively, while those with different ATs combine disjunctively.
1. Predicate data access: If an ATT is defined for one particular predicate in an access
level, then an agent having that access level may read the whole predicate file (subject to
any other policy restrictions). For example, ⟨1, Predicate, isPaid, _⟩ is an ATT that permits
its possessor to read the entire predicate file isPaid.
2. Predicate and subject data access: Agents possessing a subject ATT may access
data associated with a particular subject, where the subject can be either a URI or a
DataType. Combining one of these subject ATTs with a predicate data access ATT hav-
ing the same AT, grants the agent access to a specific subject of a specific predicate. For
example:
a. Predicate and subject as URIs: Combining ATTs ⟨1, Predicate, isPaid, _⟩ and ⟨1,
Subject, URI, MichaelScott⟩ (drawn from the ontology in Figure 23.5) permits an agent
with AT 1 to access the subject with URI MichaelScott of the predicate isPaid.
b. Predicate and subject as DataTypes: Similarly, Predicate and DataType ATT’s can be
combined to permit access to subjects of a specific data type over a specific predicate file.
For brevity, we omit descriptions of the different subject and object variations of each of the
remaining access levels.
3. Predicate and object: This access level permits a principal to extract the names of subjects satis-
fying a particular predicate and object. For example, with ATTs ⟨1, Predicate, hasVitamins, _⟩
and ⟨1, Object, URI, E⟩, an agent possessing AT 1 may view the names of subjects (e.g., foods)
that have vitamin E. More generally, if X1 and X2 are the sets of triples generated by the Predicate
and Object ATTs (respectively) describing an AT, then agents possessing the AT may view
set X1∩X2 of triples. An illustration of this example is displayed in Figure 23.6.
4. Subject access: With this access level, an agent may read the subject’s information over
all the files. This is one of the less restrictive access levels. The subject can be a DataType
or BlankNode.
5. Object access: With this access level, an agent may read the object’s subjects over all the
files. Like the previous level, this is one of the less restrictive access levels. The object can
be a URI, DataType, Literal, or BlankNode.
6. Subject model level access: Model level access permits an agent to read all necessary
predicate files to obtain all objects of a given subject. Of these objects, the ones that are
URIs are next treated as subjects to extract their respective predicates and objects. This
process continues iteratively until all objects finally become literals or blank nodes. In this
manner, agents possessing model level access may generate models on a given subject.
The following example, drawn from Figure 23.5, illustrates model level access. David lives in LongIsland,
and LongIsland is a subject with an Avg_Summer_Temp predicate having the object 75°F. An agent with
model level access to David can therefore also read the average summer temperature of the place where David lives.
An Access Token List (AT-list) is an array of one or more ATs granted to a given agent, along with a time
stamp identifying the time at which each was granted. A separate AT list is maintained for each agent.
When a system administrator decides to add an AT to an agent’s AT list, the AT and time stamp
are first stored in a temporary variable. Before committing the change, the system must first detect
potential conflicts in the new AT list.
Final output of an Agent’s ATs: Each AT permits access to a set of triples. We refer to this set as
the AT’s result set. The set of triples accessible by an agent is the union of the result sets of the ATs
in the agent’s AT list. Formally, if Y1, Y2, …, Yn are the result sets of ATs AT1, AT2, …, ATn (respec-
tively) in an agent’s AT list, then the agent may access the triples in set Y1∪Y2∪…∪Yn.
Security Level Defaults: An administrator’s AT assignment burden can be considerably simpli-
fied by conservatively choosing default security levels for data in the system. In our implementation,
all items in the data store have default security levels. Personal information of individuals is kept
private by denying access to any URI of data type Person by default. This prevents agents from
making inferences about any individual to whom they have not been granted explicit permission.
However, if an agent is granted explicit access to a particular type or property, the agent is also
granted default access to the subtypes or subproperties of that type or property.
As an example, consider a predicate file Likes that lists elements that an individual likes. Assume
further that Jim is a person who likes Flying, SemanticWeb, and Jenny, which are URIs of type
Hobby, ResearchInterest, and Person, respectively, and 1 is an AT with ATTs ⟨1, Subject, URI, Jim⟩
and ⟨1, Predicate, Likes, _⟩. By default, agent Ben, having only AT 1, cannot learn that Jenny is in
Jim’s Likes list since Jenny’s data type is Person. However, if Ben also has AT 2 described by ATT
⟨2, Object, URI, Jenny⟩, then Ben will be able to see Jenny in Jim’s Likes list.
23.7.3 Conflicts
A conflict arises when the following three conditions occur: (1) An agent possesses two ATs 1 and
2, (2) the result set of AT 2 is a proper subset of the result set of AT 1, and (3) the time stamp of AT 1 is earlier than
the time stamp of AT 2. In this case the latter, more specific AT supersedes the former, so AT 1 is
discarded from the AT list to resolve the conflict. Such conflicts arise in two varieties, which we
term subset conflicts and subtype conflicts.
A subset conflict occurs when AT 2 is a conjunction of ATTs that refines those of AT 1. For
example, suppose AT 1 is defined by ATT ⟨1, Subject, URI, Sam⟩ and AT 2 is defined by ATTs
⟨2, Subject, URI, Sam⟩ and ⟨2, Predicate, HasAccounts, _⟩. In this case the result set of AT 2 is a sub-
set of the result set of AT 1. A conflict will therefore occur if an agent possessing AT 1 is later
assigned AT 2. When this occurs, AT 1 is discarded from the agent’s AT list to resolve the
conflict.
Subtype conflicts occur when the ATTs in AT 2 involve data types that are subtypes of those in
AT 1. The data types can be those of subjects, objects or both.
Conflict resolution is summarized by Algorithm 23.2. Here, Subset(AT1, AT2) is a function that
returns true if the result set of AT1 is a proper subset of the result set of AT2, and SubjectSubType(AT1,
AT2) returns true if the subject of AT1 is a subtype of the subject of AT2. Similarly, ObjectSubType
(AT1, AT2), decides subtyping relations for objects instead of subjects.
Algorithm 23.2: Adding a new AT (newAT, with time stamp TSnewAT) to an agent’s AT list (currentAT)
1 if length(currentAT) = 0 then
2   /* the agent has no ATs yet, so no conflict is possible */
3   currentAT[length(currentAT)].AT ← newAT;
4   currentAT[length(currentAT)].TS ← TSnewAT;
5 else
6   count ← 0;
7   while count < length(currentAT) do
8     tempATTS ← currentAT[count].AT;
9     tempTS ← currentAT[count].TS;
10    /* the time stamp recorded during the AT assignment */
11    if (Subset(newAT, tempATTS) AND (TSnewAT ≥ tempTS)) then
12      /* a conflict occurs */
13      currentAT[count].AT ← newAT;
14      currentAT[count].TS ← TSnewAT;
15    else if (Subset(tempATTS, newAT) AND (tempTS < TSnewAT)) then
16      currentAT[count].AT ← newAT;
17      currentAT[count].TS ← TSnewAT;
18    else if ((SubjectSubType(newAT, tempATTS) OR ObjectSubType(newAT, tempATTS))
        AND (TSnewAT ≥ tempTS)) then
19      /* a conflict occurs */
20      currentAT[count].AT ← newAT;
21      currentAT[count].TS ← TSnewAT;
22    else if ((SubjectSubType(tempATTS, newAT) OR ObjectSubType(tempATTS, newAT))
        AND (tempTS < TSnewAT)) then
23      currentAT[count].AT ← newAT;
24      currentAT[count].TS ← TSnewAT;
25    end
26    count ← count + 1;
27  end
28 end
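A sketch of the conflict tests that Algorithm 23.2 relies on is shown below. The ResultSetProvider and Ontology interfaces, and all names, are illustrative assumptions; in the actual system the result sets follow from the access levels and ATTs described earlier.

import java.util.*;

// Illustrative sketch of the subset and subtype conflict tests used by Algorithm 23.2.
public class ConflictTests {

    interface ResultSetProvider {
        Set<String> resultSet(int accessToken);        // triples (as strings) reachable via this AT
    }
    interface Ontology {
        boolean isSubTypeOf(String sub, String sup);   // true if sub is a subtype of sup
    }

    /** True if AT1's result set is a proper subset of AT2's result set. */
    static boolean subset(int at1, int at2, ResultSetProvider rs) {
        Set<String> r1 = rs.resultSet(at1), r2 = rs.resultSet(at2);
        return r2.containsAll(r1) && r2.size() > r1.size();
    }

    /** Subset conflict: a newer, more specific AT supersedes an older, more general one. */
    static boolean subsetConflict(int oldAT, long oldTS, int newAT, long newTS, ResultSetProvider rs) {
        return subset(newAT, oldAT, rs) && newTS >= oldTS;
    }

    /** Subtype conflict: the new AT's subject data type refines the old AT's subject data type. */
    static boolean subjectSubTypeConflict(String newSubjType, String oldSubjType,
                                          long oldTS, long newTS, Ontology ont) {
        return ont.isSubTypeOf(newSubjType, oldSubjType) && newTS >= oldTS;
    }
}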
systems. Finally, we will incorporate security at all levels of the system. That is, security should
not be just an add-on to the query-processing prototype. It has to be built into the SPARQL query
optimizer.
REFERENCES
[ABAD07]. D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach, “Scalable Semantic Web Data Management
Using Vertical Partitioning,” In Proceedings of 33rd International Conference of Very Large Data Bases,
Vienna, Austria, 2007.
[ABAD09a]. D. J. Abadi, “Data Management in the Cloud: Limitations and Opportunities,” IEEE Data
Engineering Bulletin, 32 (1), 3–12, 2009.
[ABAD09b]. D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach, “SW-Store: A Vertically Partitioned DBMS
for Semantic Web Data Management,” VLDB Journal, 18 (2), 385–406, 2009.
[ABOU09]. A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, A. Rasin, “HadoopDB: An
Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads,” Proceedings
of the VLDB Endowment, 2 (1), 922–933, 2009.
[ATRE08]. M. Atre, J. Srinivasan, J. A. Hendler, “BitMat: A Main-Memory Bit Matrix of RDF Triples for
Conjunctive Triple Pattern Queries,” In Proceedings of the 5th International Workshop on Semantic Web
Conference, Karlsruhe, Germany, 2008.
[BONC06]. P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, J. Teubner, “MonetDB/XQuery: A
Fast XQuery Processor Powered by a Relational Engine,” In Proceedings of ACM SIGMOD International
Conference on Management of Data, Chicago, IL, pp. 479–490, 2006.
[CARR04]. J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, K. Wilkinson, “Jena: Implementing
the Semantic Web Recommendations,” In Proceedings of 13th International World Wide Web Conference
on Alternate Track Papers and Posters, New York, NY, pp. 74–83, 2004.
[CHAL]. Semantic Web Challenge, http://challenge.semanticweb.org.
[CHAN06]. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes,
R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” In Proceedings of the 7th
USENIX Symposium on Operating System Design and Implementation, Seattle, WA, pp. 205–218, Nov.
2006.
[CHEB07]. A. Chebotko, S. Lu, F. Fotouhi, “Semantics Preserving SPARQL-to-SQL Translation,” Technical
Report, TR-DB-112007-CLF, 2007.
[CHON05]. E. I. Chong, S. Das, G. Eadon, J. Srinivasan, “An Efficient SQL- Based RDF Querying Scheme,”
In VLDB ’05: Proceedings of 31st International Conference on Very Large Data Bases, Trondheim,
Norway, pp. 1216–1227, 2005.
[CHU06]. C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. Bradski, A. Y. Ng, K. Olukotun, “Map-Reduce for Machine
Learning on Multicore,” In Proceedings of the 19th International Conference on Neural Information
Processing Systems (NIPS), Vancouver, BC, Canada, pp. 281−288, 2006.
[CLOU]. Cloudera University, http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-
hadoop-to-build-a-scalable-distributed-triple-store.
[COMP]. National University of Singapore School of Computing, http://www.comp.nus.edu.sg/∼epic/.
[CYGA05]. R. Cyganiak, “A Relational Algebra for SPARQL,” Technical Report, HPL-2005-170, 2005.
[DEAN04]. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” In
Proceedings of the 6th Conference Symposium on Operating Systems Design and Implementation,
San Francisco, CA, pp. 137–150, 2004.
[DING05]. L. Ding, T. Finin, Y. Peng, P. P. da Silva, D. L. Mcguinness, “Tracking RDF Graph Provenance
Using RDF Molecules,” In Proceedings of the 4th International Semantic Web Conference, Galway,
Ireland, 2005.
[GUO04]. Y. Guo, Z. Pan, J. Heflin, “An Evaluation of Knowledge Base Systems for Large OWL Datasets,”
In Proceedings of the International Semantic Web Conference, Hiroshima, Japan, 2004.
[GUO05]. Y. Guo, Z. Pan, J. Heflin, “LUBM: A Benchmark for OWL Knowledge Base Systems,” Web Semantics:
Science, Services and Agents on the World Wide Web, 3 (2−3), 158–182, 2005.
[HADOa]. Apache Software Foundation, http://hadoop.apache.org.
[HADOb]. Apache Software Foundation, http://hadoop.apache.org/core/docs/r0.18.3/hdfs_design.html.
[HUSA09]. M. F. Husain, P. Doshi, L. Khan, B. Thuraisingham, “Storage and Retrieval of Large RDF
Graph Using Hadoop and MapReduce,” In Proceedings of the 1st International Conference on Cloud
Computing, Beijing, China, 2009, http://www.utdallas.edu/mfh062000/techreport1.pdf.
[HUSA10]. M. F. Husain, L. Khan, M. Kantarcioglu, B. Thuraisingham, “Data Intensive Query Processing for
Large RDF Graphs Using Cloud Computing Tools,” In Proceedings of the IEEE International Conference
on Cloud Computing, Miami, FL, pp. 1–10, July 2010.
[HUSA11a]. M. F. Husain, J. P. McGlothlin, M. M. Masud, L. R. Khan, B. M. Thuraisingham, “Heuristics-
Based Query Processing for Large RDF Graphs Using Cloud Computing,” IEEE Transactions on
Knowledge and Data Engineering, 23 (9), 1312–1327, 2011.
[HUSA11b]. M. F. Husain, “Data Intensive Query Processing for Semantic Web Data Using Hadoop and
MapReduce,” PhD thesis, The University of Texas at Dallas, May 2011.
[ITEE]. The University of Queensland Australia, School of Information Technology and Electrical Engineering,
http://www.itee.uq.edu.au/eresearch/projects/biomanta.
[JENA]. Apache Software Foundation, http://jena.sourceforge.net.
[KHAL10]. A. Khaled, M. F. Husain, L. Khan, K. W. Hamlen, B. M. Thuraisingham, “A Token-Based Access
Control System for RDF Data in the Clouds,” In CloudCom: 2010 IEEE 2nd International Conference
on Cloud Computing Technology and Science, Indianapolis, IN, USA, 2010.
[KIRY05]. A. Kiryakov, D. Ognyanov, D. Manov, “OWLIM: A Pragmatic Semantic Repository for OWL,”
In SSWS’05: Proceedings of the 2005 International Workshop on Scalable Semantic Web Knowledge
Base Systems, New York, NY, 2005.
[LEHI]. Lehigh University, http://www.lehigh.edu/∼zhp2/2004/0401/univ-bench.owl.
[MCGL09]. J. P. McGlothlin and L. R. Khan, “RDFKB: Efficient Support for RDF Inference Queries and
Knowledge Management,” In IDEAS’09: Proceedings of the International Database Engineering and
Applications Symposium, Cetraro, Italy, 2009.
[MCGL10]. J. P. McGlothlin and L. Khan, “Materializing and Persisting Inferred and Uncertain Knowledge
in RDF Datasets,” In Proceedings of AAAI Conference on Artificial Intelligence, Atlanta, GA, 2010.
[MCNA07]. A. W. Mcnabb, C. K. Monson, K. D. Seppi, “MRPSO: MapReduce Particle Swarm Optimization,”
In GECCO: Proceedings of the Annual Conference on Genetic and Evolutionary Computation, London,
England, UK, 2007.
[MORE07]. J. E. Moreira, M. M. Michael, D. Da Silva, D. Shiloach, P. Dube, L. Zhang, “Scalability of
the Nutch Search Engine,” In ICS’07: Proceedings of the 21st Annual International Conference on
Supercomputing, Rotterdam, The Netherlands, pp. 3–12, June 2007.
[MORE08]. C. Moretti, K. Steinhaeuser, D. Thain, N. Chawla, “Scaling Up Classifiers to Cloud Computers,”
In ICDM’08: Proceedings of the IEEE International Conference on Data Mining, Pisa, Italy, 2008.
[NEUM08]. T. Neumann and G. Weikum, “RDF-3X: A RISC-Style Engine for RDF,” Proceedings of VLDB
Endowment, 1 (1), 647–659, 2008.
[NEWM08]. A. Newman, J. Hunter, Y. F. Li, C. Bouton, M. Davis, “A Scale-Out RDF Molecule Store for
Distributed Processing of Biomedical Data,” In Proceedings of the Semantic Web for Health Care and
Life Sciences Workshop, Karlsruhe, Germany, 2008.
[OLST08]. C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, “Pig Latin: A Not-So-Foreign Language
for Data Processing,” In Proceedings of ACM SIGMOD International Conference on the Management of
Data, Vancouver, BC, Canada, 2008.
[ONTO]. Ontotext AD, http://www.ontotext.com/owlim/big/index.html.
[OPEN]. Eclipse RDF4J, http://docs.rdf4j.org/migration/.
[ROHL07]. K. Rohloff, M. Dean, I. Emmons, D. Ryder, J. Sumner, “An Evaluation of Triple-Store Technologies
for Large Data Stores,” In Proceedings of the OTM Confederated International Conference on the Move
to Meaningful Internet Systems, Vilamoura, Portugal, 2007.
[SCHM09]. M. Schmidt, T. Hornung, G. Lausen, C. Pinkel, “SP2Bench: A SPARQL Performance Benchmark,”
In ICDE ’09: Proceedings of the 25th International Conference on Data Engineering, Shanghai, China,
2009.
[SIDI08]. L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, S. Manegold, “Column-Store Support for RDF
Data Management: Not All Swans Are White,” Proceedings of VLDB Endowment, 1 (2), 1553–1563,
2008.
[SISM10]. Y. Sismanis, S. Das, R. Gemulla, P. Haas, K. Beyer, J. McPherson, “Ricardo: Integrating R and
Hadoop,” In SIGMOD’10: Proceedings of the ACM SIGMOD International Conference Management of
Data, Indianapolis, IN, USA, 2010.
[STOC08]. M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, D. Reynolds, “SPARQL Basic Graph Pattern
Optimization Using Selectivity Estimation,” In WWW ’08: Proceedings of the 17th International
Conference on World Wide Web, Beijing, China, 2008.
[STON05]. M. Stonebraker et al. “C-Store: A Column-Oriented DBMS,” In VLDB ’05: Proceedings of the 31st
International Conference on Very Large Data Bases, Trondheim, Norway, pp. 553–564, 2005.
24 Big Data Analytics for Multipurpose Social Media Applications
24.1 INTRODUCTION
This chapter describes a cloud-based system called InXite, also called InXite-Security (Stream-
based Data Analytics for Threat Detection and Prediction), that is designed to detect evolving
patterns and trends in streaming data. InXite comprises four major modules: InXite Information
Engine, InXite Profile Generator, InXite Psychosocial Analyzer, and InXite Threat Evaluator and
Predictor, each of which is outlined in this chapter. We also describe the novel methods we have
developed for stream data analytics that are at the heart of InXite.
InXite integrates information from a variety of online social media sites, such as Twitter, Foursquare, Google+, and LinkedIn, and builds people profiles through correlation, aggregation, and analysis in order to identify persons of interest who pose a threat. Other applications include garnering user feedback on a company's products, providing inexpensive, targeted advertising, and monitoring the spread of an epidemic, among others.
InXite is designed to detect evolving patterns and trends in streaming data, including emails, blogs, sensor data, and social media data such as tweets. It is built on top of two powerful, patented data mining systems, namely Tweethood (location extraction for tweets) and stream-based novel class detection (SNOD), with the explicit aim of detecting and predicting suspicious events and people. We also designed a separate system, SNOD++, an extension of SNOD, for detecting multiple novel classes of threats for InXite. Our goal is to decipher and monitor topics in data streams as well as to detect emerging trends. This includes general changes in topics such as sports or politics, as well as new, quickly emerging trends such as hurricanes and bombings.
The problem of correctly associating data streams (e.g., Tweet messages) with trends and topics is a
challenging one. The challenge is best addressed with a streaming model due to the continuous and
large volume of incoming messages.
It should be noted that InXite is a general purpose system that can be adapted for a variety
of applications including security, marketing, law enforcement, healthcare, emergency response,
and finance. Also, the design of InXite is cloud based and can handle massive amounts of data.
In other words, InXite is essentially a big data analytics system. This chapter mainly focuses on
the adaptation of InXite for security applications which we call InXite-Security. Other adapta-
tions of InXite are called InXite-Marketing, InXite-Law, InXite-Healthcare, InXite-Emergency,
and InXite-Finance among others. That is, while InXite-Security is developed mainly for counter-
terrorism and intelligence applications, all of the features can be tailored for marketing and law
enforcement applications with some effort. We have completed the design and implementation of
InXite-Security and InXite-Marketing. We have also completed an initial design and implementa-
tion of InXite-Law. Other applications such as healthcare and finance will be part of our future
work. For convenience, we will use the term InXite to mean InXite-Security in this chapter.
The organization of this chapter is as follows. Our premise for InXite is discussed in Section
24.2. We will describe in detail the design of all the modules of InXite and the implementation
in Section 24.3. A note on InXite-Marketing is discussed in Section 24.4. Related work is dis-
cussed in Section 24.5. This chapter is concluded in Section 24.6. Figure 24.1 describes the concepts
discussed in this chapter.

FIGURE 24.1 Streaming data flows into the InXite information integration and analysis engine, and the results are delivered to the analyst.

It should be noted that while we have focused on social media systems, the
techniques we have designed and developed can be applied to various big data systems.
InXite-Marketing utilizes the various modules of InXite and gives recommendations to businesses for selling products. The design of InXite uses Tweethood to obtain demographic information about individuals, and SNOD and SNOD++ for detecting novel classes of threats and sentiments.
24.3.2 Information Engine
The first step is to extract concepts and relationships from the vast amount of data streams and
categorize the messages. Then we provide semantic representation of the knowledge buried in the
streams. This would enable an analyst to interact more directly with the hidden knowledge. The sec-
ond step is to represent the concepts as ontologies, and subsequently integrate and align the ontolo-
gies. Once the multiple graphs extracted from the streams are integrated, then our analytics tools
will analyze the graphs and subsequently predict threats.
The information engine module, illustrated in Figure 24.3, integrates the attributes of a user
from multiple data streams including social networks (e.g., Twitter, LinkedIn, Foursquare, etc.) and
performs entity resolution, ontology alignment, conflict resolution, data provenance, and reasoning
under uncertain and incomplete information. At the heart of the Information Engine is Tweethood,
a novel, patent-pending method/algorithm to determine user attributes including, but not limited
to, location, age, age group, race, ethnicity, threat, languages spoken, religion, economic status,
education level, gender, hobbies, or interests based on the attribute values for the friends of the user
[MOTO09].
While entity resolution algorithms have been around since the mid-1990s, InXite uses a com-
bination of content-based similarity matching [TUNG06] and friends-based similarity matching
[MOTO09] algorithms. The information engine module consists of two major components:
1. Entity extraction: The process of extracting (mining) and/or, in certain cases, predicting user-specific attributes, which include demographic information such as age and gender, and information about his/her social networks, such as friends, followers, people he/she is following, and so on.
2. Information integration: The process of integrating or joining two or more user profiles
from the same or different sources, such as social networks, blogs, and so on. This is done
using the information obtained from the previous step.
In the following sections, we describe our methodology for implementing these two components. For entity extraction, the steps are as follows:
1. Using text mining techniques in the literature, extract relevant information about the user
from the multiple sources of data (including social networks, databases, etc.).
2. Organize the information into (key, value) pairs.
3. Use Tweethood to predict values that are unknown for any keys.
For information integration, the steps are as follows:
1. Construct ontologies for the entities extracted using various ontology construction tech-
niques in the literature.
2. Carry out entity resolution by determining whether two entities are the same, assigning scores as to how similar the entities are (a simple scoring sketch is given after this list).
3. Apply data mining techniques to observe patterns in the entity resolution process.
4. For those entities that cannot be completely resolved, use the patterns observed in step 3 to resolve the entities.
5. Link the various ontologies constructed using the results from the entity resolution process
to form a linkage of ontologies which are essentially person of interest (i.e., user) profiles.
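As a concrete illustration of the entity resolution step, the following is a minimal Python sketch that combines a content-based similarity score with a friends-based (Jaccard) similarity score; the weights, threshold, and profile structure are our assumptions for illustration, not part of the InXite implementation.

def profile_similarity(p1, p2, w_content=0.5, w_friends=0.5):
    # p1 and p2 are dicts with an "attributes" map (key -> value) and a "friends" set.
    keys = set(p1["attributes"]) & set(p2["attributes"])
    content = (sum(p1["attributes"][k] == p2["attributes"][k] for k in keys) / len(keys)
               if keys else 0.0)                                   # content-based similarity
    union = p1["friends"] | p2["friends"]
    friends = len(p1["friends"] & p2["friends"]) / len(union) if union else 0.0  # friends-based (Jaccard)
    return w_content * content + w_friends * friends

def same_entity(p1, p2, threshold=0.7):
    # Two profiles are resolved to the same person when the combined score is high enough.
    return profile_similarity(p1, p2) >= threshold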
(Figure: the InXite profile generation and prediction module integrates attributes from Twitter (bio, Twitter link, friends), LinkedIn (education, occupation, current position, experience, industry), Google+ (gender, nickname, introduction, bragging rights, other names, places lived), and Foursquare (POIs, POI categories), together with attributes predicted via SNOD and other techniques: age, email, interests, associations, travel, and psychosocial properties.)
novel information nodes as long as information is added or discovered in the searching process.
This means that profiles are constantly edited, updated, and merged after profile generation.
4. Background check score computation: For individuals located in the USA, we run back-
ground checks using existing software/websites. The integration and prediction of user
attributes helps in successful user disambiguation and allows us to do an advanced search
of the database.
Based on the previous crimes committed by the individual, we assign a score which
reflects the likelihood of him/her being a threat in the near future.
Example: If criminal and Type_of_Crime = Violent or Federal → High Score
5. Online reputation-based score computation: InXite analyzes various online data sources such as newspapers, blogs, and social networking sites to assess the sentiment about the user and determine his/her involvement in political events such as rallies, riots, scams, frauds, and robberies, among others. As in the case of other modules, the integration and prediction of other attributes allows for successful user disambiguation.
Example: If the user received an award from the president, this could lead to a low score. On the other hand, an individual who is an active participant in rallies (e.g., covered in a New York Times article) receives a high score.
6. Social graph-based score computation: The final module is based on our patent-pending algorithm, Tweethood. We predict the threat level for all friends of the POI (based on the above-listed factors) and aggregate these to obtain a score for the central POI.
Example: If Threat (friend1) = 0.9 AND Threat (friend2) = 0.1 AND Threat (friend3) =
0.8 AND Threat (friend4) = 0.7 AND Threat (friend5) = 0.5, then Threat (POI) = 0.6
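A minimal Python sketch of this aggregation, using the standard mean of the friends' threat scores as in the example above:

def social_graph_score(friend_threat_scores):
    # The POI's social graph-based score is the average of the friends' threat scores.
    return sum(friend_threat_scores) / len(friend_threat_scores)

print(social_graph_score([0.9, 0.1, 0.8, 0.7, 0.5]))  # prints 0.6, matching the example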
Threat Assessment: Once the profiles of a user have been constructed, we then examine the
various attributes to determine whether the given user is a potential terrorist. For example, a user’s
attributes (e.g., age, location, etc.) as well as their behavioral, social and psychological properties
are extracted using our analytics algorithms. We then also apply existing algorithms (e.g., Marc Sageman's algorithms ([SAGE04], [SAGE08])) to enhance a user's psychological profile. These pro-
files will be used to determine whether the current user will carry out terrorist attacks, homicides,
and so on. For threat assessment, here are some results that we have obtained using existing algo-
rithms as well as our data analytics algorithms.
• Demographics: Up to 0.2 points are assigned for fitting into ranges for the following categories: age, education, religion, politics, and hobbies. These are then added up for the final demographics score. The ranges for these categories are based on the research of Marc Sageman [SAGE08].
• Psychology: Verb usage is categorized into traits. Four of these traits are found to be indic-
ative of low psychological stability. These four traits are measured by percentage of total
verb usage and added together to form the psychology score. The psychology submodule is
based on techniques given in [MARK03] and [SHAV92].
• NLP (Natural Language Processing): A weighted average of the sentiment expressed between verbs and high-profile nouns (e.g., White House or Pentagon). Negative or threatening verb analyses have a weight of 1, while positive or benign verb analyses have a weight of 0.1. This allows strong statements, such as a correlation of “bomb” and “Pentagon,” to produce an overwhelmingly high score (a small scoring sketch is given after this list).
• Social structure: The standard mean (average) of the friends' threat scores. A friend's threat score is the average of their other scores (demographics, psychology, NLP, social structure, background, and online reputation).
• Background checks: Represents a DoD standard background check on the individual.
• Online reputation: If no previous association is found with this person or all associations
are positive, the score will be 0. Any score higher than this directly represents the percent-
age of previous associations from mainstream media that are analyzed to have a negative
sentiment.
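To make the NLP sub-score concrete, here is a minimal Python sketch of the weighted average described above; the 1.0 and 0.1 weights come from the description, while the per-statement sentiment values are assumed inputs in [0, 1].

def nlp_score(analyses):
    # analyses: list of (sentiment, is_threatening) pairs, with sentiment in [0, 1].
    weighted, total = 0.0, 0.0
    for sentiment, is_threatening in analyses:
        w = 1.0 if is_threatening else 0.1   # threatening statements dominate the average
        weighted += w * sentiment
        total += w
    return weighted / total if total else 0.0

# Example: one strong "bomb"/"Pentagon" statement outweighs several benign ones.
print(nlp_score([(0.95, True), (0.2, False), (0.1, False)]))  # roughly 0.82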
Our design utilizes Tweethood and SNOD as well as existing algorithms to develop a compre-
hensive system for threat assessment/evaluation.
The work in [KINS11] aims to identify zip codes/locations based on the particular language
used in tweets and is thus a different approach toward location identification. We have considerably
enhanced current efforts and have developed a novel technique for microlevel location mining.
(Figure: InXite analysis outputs include word clouds of closest friends and frequent words, microlocation identification and analysis, tweet frequency plots, and social graph visualization.)
• User demographics-based: For example, if we know that 95% of all African Americans are
pro-Obama, then our system will give a positive bias to a tweet from an African American
about President Obama.
• Social factor-based (based on Tweethood): If we know that 9 out of 10 friends of a user are
pro-Obama, then we will give a positive bias to tweets from that user.
Our training dataset is a labeled dataset, with each tweet labeled with its sentiment type: positive, negative, or neutral. We obtained the labeled training dataset from a set of tweets that contain emoticons. Based on the emoticon, we determined the label of each tweet and built the training dataset. For each training tweet, we first remove the stopwords. Then we remove all the words starting with “@” or “http.” Next we convert each token of the tweet to a standard form; for example, we convert a token like “hungryyyyyy” to “hungryy.” Then from each tweet we make the list of unigrams and bigrams for that tweet with its sentiment type. We save the list of unigrams and bigrams in a HashSet and also represent each tweet as its unigrams and bigrams.
Now, for each token in the HashSet and for each tweet, we check whether the tweet contains the token or not, and we build the occurrence matrix based on presence/absence. At this point we have a dataset with a very large number of dimensions, so to reduce the dimensionality we leverage the concept of entropy: we choose the best N attributes based on information gain. With this considerably better dataset, we use WEKA for classification with a naïve Bayesian classifier and a decision tree (J48) classifier.
The pseudocode for the sentiment mining algorithm is as follows:
Input: Set of training tweets T, bag of stopwords S, set of testing tweets R, number of attributes N
Output: Labels for each tweet in R.
For every tweet in T
    Remove all the stopwords in S.
    Remove the words starting with ‘@’ or ‘http’ and convert each token to standard form.
    Make the set G of unigrams and bigrams, and convert each tweet into a set of those to make
    the set W. [W contains the unigrams and bigrams of each tweet.]
For each token g in G
    For each token w in W
        If g matches w then encode it as 1 and fill up the occurrence matrix M.
        Else encode it as 0 and fill up the occurrence matrix M.
Choose the best N attributes from M based on information gain and make the new dataset D.
Use D as the training dataset and build the classifier (NB or J48).
Use the trained classifier to classify the instances of the test set R.
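A minimal runnable sketch of this pipeline in Python, using scikit-learn in place of WEKA; the tweets, stopword bag, and N below are placeholders, and mutual information stands in for information gain.

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def normalize(tweet, stopwords):
    tokens = []
    for tok in tweet.lower().split():
        if tok in stopwords or tok.startswith("@") or tok.startswith("http"):
            continue  # drop stopwords, mentions, and URLs
        tokens.append(re.sub(r"(.)\1{2,}", r"\1\1", tok))  # "hungryyyyyy" -> "hungryy"
    return " ".join(tokens)

stopwords = {"a", "an", "the", "is", "and"}                      # assumed stopword bag S
train_tweets = ["I loove this phone :)", "great day today :)",
                "worst day ever :(", "I hate waiting :("]        # assumed labeled tweets T
train_labels = ["positive", "positive", "negative", "negative"]
test_tweets = ["loove it"]                                       # assumed test set R
N = 10                                                           # number of attributes to keep

pipeline = Pipeline([
    ("vectorize", CountVectorizer(ngram_range=(1, 2), binary=True)),  # unigram/bigram occurrence matrix M
    ("select", SelectKBest(mutual_info_classif, k=N)),                # information-gain-style selection
    ("classify", MultinomialNB()),                                    # naive Bayes (or DecisionTreeClassifier for J48)
])
pipeline.fit([normalize(t, stopwords) for t in train_tweets], train_labels)
print(pipeline.predict([normalize(t, stopwords) for t in test_tweets]))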
WORD CLOUDS: Shows frequently used words. More frequent words are shown with a larger
font.
ENTITY CLOUDS: Shows frequently used entities. Entities of interest across the profile and their friends are sized in proportion to their frequency of discussion.
TWEET FREQUENCY: Line graphs of tweets over time show useful timing information. A lack of tweets is as important as the presence of tweets; sleeper agents are known to cut contact and go silent before acting.
SOCIAL GRAPH VISUALIZATION: Visually shows the threat level of the most popular
friends that are associated with a given user online.
ASSOCIATED IMAGES: Brought to life in a slide show, all the images gathered from online
sources for the given user profile.
part of speech obtained from a tagger developed by CMU specifically for Twitter language usage rather than published text documents. The other is based on our patent-pending technology SNOD. Our integrated algorithms provide much higher accuracy than the current approaches. Figure 24.6 illustrates details of the threat prediction module (module 5 of Figure 24.2).
The threat detection algorithm works as follows: statements that cannot be grouped with similar ones (i.e., outliers) are passed to novel class detection, which determines their viability as actual threats. In this regard, SNOD is a very useful tool; a small routing sketch is given below.
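The following is an illustrative Python sketch of how outlier statements might be routed to a novel class detector such as SNOD/SNOD++; the class summaries (centroids and radii) and helper names are assumptions, not the actual algorithm.

def classify_or_defer(vector, centroids, radii, classify, novelty_buffer):
    # centroids and radii summarize the known classes; classify is an ordinary classifier.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = min(centroids, key=lambda c: dist(vector, centroids[c]))
    if dist(vector, centroids[nearest]) <= radii[nearest]:
        return classify(vector)        # the instance lies inside a known class region
    novelty_buffer.append(vector)      # outlier: defer to novel class detection
    return None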
24.3.5.1 SNOD++
InXite is based on SNOD, which has certain limitations. First, SNOD does not handle feature evolution, which occurs because of the dynamic nature of the stream. Second, if more than one novel class appears in the stream, SNOD cannot detect them. Third, SNOD does not address the problem of high-dimensional feature spaces, which may lead to higher training and classification error. Finally, SNOD does not apply any optimizations for feature extraction and classification. Therefore, we have designed a practical and robust threat detection tool for blogs and tweets called SNOD++. That is, we have enhanced SNOD into SNOD++, which is more practical and robust than SNOD. In addition to addressing the infinite-length, concept-drift, and concept-evolution problems, SNOD++ addresses feature evolution and multiple novel classes, and applies subspace clustering and other optimizations, all of which improve the robustness, power, and accuracy of the algorithm.
4. Use a separate custom bolt to construct word/entity clouds, graphs for tweet frequency,
determine the threat score for the top friends of this user and download images associated
with this user.
5. Store all information obtained in step 4a as a part of the user’s profile.
In addition to Storm and HBase, we are also exploring the use of other big data technologies such as Spark to implement InXite. NoSQL database systems such as CouchDB are also being explored to store the massive amounts of data collected from social media systems.
24.3.8 Implementation
All of the modules of InXite illustrated in Figure 24.2 have been implemented. These include entity
extraction and information integration, profile generation and threat analysis, psychosocial analy-
sis, as well as threat prediction. With respect to the cloud implementation, we have completed the
implementation of Tweethood in the cloud as well as SNOD in the cloud. The remaining modules
of InXite are yet to be implemented in the cloud.
Multiple demonstrations of InXite are available. These include canned demonstrations, real-time demonstrations (which require access to numerous tweets), as well as Tweethood in the cloud. We have taken a plug-and-play (i.e., component-based) approach to the development of InXite. This means that if a customer has a certain module of his/her own (e.g., sentiment mining) that he/she wishes to use, then he/she can replace our sentiment mining module with the module of his/her choice. This feature is a great strength of InXite: many products require an all-or-nothing approach, whereas with InXite you can select the modules you want to meet your needs. It should also be noted that while the design of InXite does not limit the data to tweets, the implementation handles only tweets. That is, the design of InXite can handle any data, whether it is structured data in databases or unstructured data in the form of social graphs or tweets.
As we already discussed in the sentiment analysis section, we can predict the sentiment of a tweet for a person, though this needs some further improvement, such as incorporating NLP techniques. Our system readily works for a particular person's tweets on a particular subject. For example, if we want to determine John Smith's sentiment about the iPhone-5, we can go through all of his tweets about the iPhone-5 and, based on the sentiment of each tweet, determine his overall sentiment about the iPhone-5. Once we know that John Smith's sentiment about the iPhone-5 is, say, positive, we can recommend to him more i-products or products related to the iPhone-5, for example, headphones or chargers. There is another factor to consider that we call the peer effect. The positive statements made by an individual's friends can be mined for individual products. Because the friends or associates of this individual talk positively about a product, especially one that the individual does not mention themselves, we can extrapolate these products for the recommender system. For example, if John Smith has ten friends and all of them have a positive sentiment about Android phones, then it is a good idea to recommend some Android products to John Smith. We can therefore consider a weighted vector of “personal sentiment” and “peer sentiment.” Based on the weights, we can decide which products, and to what extent, we should recommend to him. These weights will mostly be influenced by the individual's personal sentiment, while the sentiment of his peers will help fill in the gaps and greatly expand the recommendation options for this person.
The logic of the recommender system is sketched below.
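A minimal Python sketch of the weighted personal/peer sentiment logic described above; the weights, the [-1, 1] sentiment scale, and the helper values are assumptions for illustration.

def recommend(personal_sentiment, peer_sentiments, w_personal=0.7, w_peer=0.3):
    # Sentiment scores are assumed to lie in [-1, 1]; personal sentiment carries most of the weight.
    peer_avg = sum(peer_sentiments) / len(peer_sentiments) if peer_sentiments else 0.0
    score = w_personal * personal_sentiment + w_peer * peer_avg
    return score > 0   # recommend related products when the combined sentiment is positive

# Example: John Smith is positive about the iPhone-5, and most of his friends are mildly positive.
print(recommend(personal_sentiment=0.8, peer_sentiments=[0.6, 0.4, -0.1]))  # True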
We have a demonstration system for InXite-Marketing (as well as InXite-Law) that implements all of the features described above, including the recommender system. There are also additional features we have implemented. For example, based on the interests of a user, we can predict the products he/she will be interested in in the future; this way, a business can market these products first to this user and gain a competitive advantage.
Owing to the component-based approach we have taken to the creation of our tools, we can iteratively refine each component (novel class detection for trend analysis, entity extraction, scalability on the cloud, and ontology construction) separately. All the frameworks and tools that we have used (or are using) for the development of InXite are open source and have been extensively used in our previous research; hence, our tools will be able to accommodate any changes to the platform.
REFERENCES
[ABRO09]. S. Abrol, L. Khan, T.M. Al-Khateeb, “MapIt: Smarter Searches using Location Driven Knowledge
Discovery and Mining,” 1st SIGSPATIAL ACM GIS 2009 International Workshop on Querying and
Mining Uncertain Spatio-Temporal Data (QUeST), November 2009, Seattle, WA.
[AHME09]. M.S. Ahmed and L. Khan, “SISC: A Text Classification Approach Using Semi Supervised
Subspace Clustering,” ICDM Workshops, Miami, FL, pp. 1–6, 2009.
[AHME10]. M.S. Ahmed, L. Khan, M. Rajeswari, “Using Correlation Based Subspace Clustering for Multi-
label Text Data Classification,” ICTAI, 2, 296–303, 2010.
[BACK08]. L. Backstrom, J. Kleinberg, R. Kumar, J. Novak. “Spatial Variation in Search Engine Queries.” In
Proceedings of the 17th International Conference on WWW, New York, NY, 2008.
[CHAN11]. S. Chandra and L. Khan, “Estimating Twitter User Location Using Social Interactions: A Content-Based Approach,” The 3rd IEEE International Conference on Social Computing, Oct. 9–11, Boston, MA, 2011.
[DONG05]. X. Dong, A.Y. Halevy, J. Madhavan, “Reference Reconciliation in Complex Information Spaces,”
In SIGMOD Conference, Baltimore, MD, pp. 85–96, 2005.
[FRIG04]. H. Frigui and O. Nasraoui, “Unsupervised Learning of Prototypes and Attribute Weights,” Pattern
Recognition, 37 (3), 567–581, 2004.
[GO09]. A. Go, R. Bhayani, L. Huang, “Twitter Sentiment Classification Using Distant Supervision,” CS224N Project Report, Stanford, 2009, 1–12.
[GOYA09]. A. Goyal, H. Daumé III, S. Venkatasubramanian, “Streaming for Large Scale NLP: Language Modeling,” Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, CO, 2009.
[HUBE09]. B. Huberman, D. Romero, F. Wu, “Social Networks That Matter: Twitter Under the Microscope,” First Monday, 14, 2009.
[KATA06]. I. Katakis, G. Tsoumakas, I. Vlahavas, “Dynamic Feature Space and Incremental Feature
Selection for the Classification of Textual Data Streams,” ECML PKDD: 2006 International Workshop
on Knowledge Discovery from Data Streams, pp. 102–116, 2006.
[KHAN02]. L. Khan and F. Luo, “Ontology Construction for Information Selection,” In Proceedings of
ICTAI, Washington, DC, 2002.
[KINS11]. S. Kinsella, V. Murdock, N. O’Hare, “I’m Eating a Sandwich in Glasgow: Modeling Locations with Tweets,” In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, Glasgow, UK, pp. 61–68, 2011, ACM.
[LIN11]. J. Lin, R. Snow, W. Morgan, “Smoothing Techniques for Adaptive Online Language Models: Topic
Tracking in Tweet Streams,” In Proceedings of ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, August, San Diego, CA, 2011.
[MARK03]. M. Markou and S. Singh, “Novelty Detection: A Review. Part 1: Statistical Approaches, Part 2:
Neural Network Based Approaches,” Signal Processing, 83, 2481–2497, 2499–2521, 2003.
[MASU10]. M.M. Masud, Q. Chen, L. Khan, C.C. Aggarwal, J. Gao, J. Han, B.M. Thuraisingham, “Addressing
Concept-Evolution in Concept-Drifting Data Streams,” In Proceedings of ICDM, Sydney, Australia, 2010.
[MASU11]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class Detection
in Concept-Drifting Data Streams Under Time Constraints,” IEEE TKDE, 23 (1), 859–874, 2011.
[MOTO09]. M. Motoyama and G. Varghese, “I Seek You: Searching and Matching Individuals in Social
Networks,” In ACM WIDM, Hong Kong, 2009.
[PAK10]. A. Pak and P. Paroubek, “Twitter as a Corpus for Sentiment Analysis and Opinion Mining,” In Proceedings of LREC, Valletta, Malta, pp. 1320–1326, 2010.
[SAGE04]. M. Sageman, Understanding Terror Networks, University of Pennsylvania Press, Philadelphia, PA, 2004.
[SAGE08]. M. Sageman, Leaderless Jihad: Terror Networks in the Twenty-First Century, University of Pennsylvania Press, Philadelphia, PA, 2008.
[SHAV92]. P. R. Shaver and K. A. Brennan, “Attachment Styles and the ‘Big Five’ Personality Traits: Their Connections with Each Other and with Romantic Relationship Outcomes,” Personality and Social Psychology Bulletin, 18 (5), 536–545, 1992.
[SMIT01]. D.A. Smith and G. Crane, “Disambiguating Geographic Names in a Historical Digital Library,” 5th
European Conference on Research and Advanced Technology for Digital Libraries (ECDL01), Lecture
Notes in Computer Science, Darmstadt, September 2001.
[SPIN08]. E.J. Spinosa, A.P. de Leon, F. de Carvalho, J. Gama, “Cluster-Based Novel Concept Detection in
Data Streams Applied to Intrusion Detection in Computer Networks,” In Proceedings of ACM SAC,
Fortaleza, Ceara, Brazil, pp. 976–980, 2008.
[TUNG06]. A. Tung, R. Zhang, N. Koudas, B. Doi, “Similarity Search: A Matching Based Approach,” In
Proceedings of VLDB, Seoul, Korea, 2006.
[WENE06]. B. Wenerstrom and C. Giraud-Carrier. “Temporal Data Mining in Dynamic Feature Spaces,” In
Proceedings of ICDM, Hong Kong, pp. 1141–1145, 2006.
25 Big Data Management and Cloud for Assured Information Sharing
25.1 INTRODUCTION
The advent of cloud computing and the continuing movement toward software as a service (SaaS)
paradigms have posed an increasing need for assured information sharing (AIS) as a service in the
cloud. The urgency of this need was voiced in April 2011 by the National Security Agency (NSA)
Chief Information Officer (CIO) Lonny Anderson in describing the agency’s focus on a “cloud-
centric” approach to information sharing with other agencies [NSA11]. Likewise, the Department
of Defense (DoD) has been embracing cloud computing paradigms to more efficiently, economi-
cally, flexibly, and scalably meet its vision of “delivering the power of information to ensure mission
success through an agile enterprise with the freedom of maneuverability across the information
environment” ([DoD, DoD07]). Both agencies therefore have a tremendous need for effective AIS
technologies and tools for cloud environments. Furthermore, there is also an urgent need for those in
different social circles within agencies to share the information in the cloud securely and in a timely
manner. Therefore, extending the AIS tools to function in a cloud-centric social media environment
is becoming a need for many organizations.
Although a number of AIS tools have been developed in recent years for policy-based informa-
tion sharing ([AWAD10, FINI09, RAO08, THUR08]), to our knowledge none of these tools operates in the cloud, and hence they do not provide the scalability needed to support large numbers of users, such as social media users, and massive amounts of data, such as social media data including
text, images, and video. The early prototype systems we developed for supporting cloud-based AIS
have applied cloud-centric query engines (QEs) (e.g., the SPARQL query optimizer discussed in
Chapter 23) that query large amounts of data in relational and semantic web databases by utilizing
noncloud-based policy engines that enforce policies expressed in the eXtensible Access Control
Markup Language (XACML) ([THUR10, THUR11]). While this is a significant improvement over
prior efforts (and has given us insights into implementing cloud-based solutions), it nevertheless has
at least three significant limitations. First, XACML-based policy specifications are not expressive
enough to support many of the complex policies needed for AIS missions like those of the NSA
and DoD, as well as applications such as social networks. Second, to meet the scalability and effi-
ciency requirements of mission-critical tasks, the policy engine needs to operate in the cloud rather
than externally. Third, secured query processing based on relational technology has limitations in
representing and processing unstructured data needed for command and control applications.
To share the large amounts of data securely and efficiently, there clearly needs to be a seamless
integration of the policy and data managers for social media in the cloud. Therefore, in order to
satisfy the cloud-centric AIS needs of the DoD and NSA, we need (i) a cloud-resident policy man-
ager that enforces information-sharing policies expressed in a semantically rich language and (ii) a
cloud-resident data manager that securely stores and retrieves data and seamlessly integrates with
the policy manager. To our knowledge, no such system currently exists. Therefore, our project has
designed and developed such cloud-based AIS systems for social media users. Our policy engine as
well as data are represented using semantic web technologies and therefore can represent and reason
about social media data. That is, we have developed a cloud-centric policy manager that enforces
policies specified in the resource description framework (RDF) and a cloud-centric data manager that will store and manage data, such as social graphs and associated data, also specified in RDF. This RDF data manager is essentially a QE for SPARQL, a language widely used by the semantic web community to query RDF data. Furthermore, our policy manager and data manager will have seamless integration since they both manage RDF data.

FIGURE 25.1 Big data management and cloud for assured information sharing.
To address the AIS requirements of various organizations, including social media users, we have
designed and developed a series of cloud-based AIS systems that handle massive amounts of data.
That is, we have essentially used big data management techniques for AIS. This chapter provides
an overview of our design. The organization of this chapter is as follows. Our design philosophy is
discussed in Section 25.2. Our system design will be discussed in Section 25.3. In particular, we
will discuss the design and implementation of CAISS in Section 25.3.1 and the design of CAISS++
in Section 25.3.2. Formal policy analysis and the implementation approach for CAISS++ will be
provided in Sections 25.3.3 and 25.3.4, respectively. Related efforts are discussed in Section 25.4.
Extending our approach to social media applications is discussed in Section 25.5. This chapter is
concluded in Section 25.6. Figure 25.1 illustrates the contents of this chapter. Details of our work
can also be found in [THUR12].
While our initial CAISS design and implementation will be the first system supporting cloud-
centric AIS, it will operate only on a single trusted cloud and will therefore not support information
sharing across multiple clouds. Furthermore, while CAISS’s RDF-based, formal semantics approach
to policy specification will be significantly more expressive than XACML-based approaches, it
will not support an enhanced machine interpretability of content since RDF does not provide a
sufficiently rich vocabulary (e.g., support for classes and properties). Phase 2 will therefore develop
a fully functional and robust AIS system called CAISS++ that addresses these deficiencies. The
preliminary design for CAISS++ is completed and will be discussed later in this chapter. CAISS
is an important stepping-stone toward CAISS++ because CAISS can be used as a baseline frame-
work against which CAISS++ can be compared along several performance dimensions, such as
storage model efficiency and ontology language (OWL)-based policy expressiveness. Furthermore,
since CAISS and CAISS++ share the same core components (policy engine and query processor),
the lessons learned from the implementation and integration of these components in CAISS will be
invaluable during the development of CAISS++. Finally, the evaluation and testing of CAISS will
provide us with important insights into the shortcomings of CAISS, which can then be systemati-
cally addressed in the implementation of CAISS++.
We will also conduct a formal analysis of policy specifications and the software-level protec-
tion mechanisms that enforce them to provide exceptionally high-assurance security guarantees
for the resulting system. We envisage CAISS++ to be used in highly mission-critical applications.
Therefore, it becomes imperative to provide guarantees that the policies are enforced in a provably
correct manner. We have extensive expertise in formal policy analysis ([JONE10], [JONE11]) and
their enforcement via machine-certified, in-line reference monitors ([HAML06a], [HAML06b],
[SRID10]). Such analyses will be leveraged to model and certify security properties enforced by
core software components in the trusted computing base of CAISS++.
CAISS++ will be a breakthrough technology for information sharing due to the fact that it uses a
novel combination of cloud-centric policy specification and enforcement along with a cloud-centric
data storage and efficient query evaluation. CAISS++ will make use of ontologies, expressed in a sublanguage of the web ontology language (OWL), to build policies. A mixture of such ontologies with a semantic
web-based rule language (e.g., SWRL) facilitates distributed reasoning on the policies to enforce
security. Additionally, CAISS++ will include an RDF-processing engine that provides cost-based
optimization for evaluating SPARQL queries based on information sharing policies.
An XACML-based approach of this kind is adequate for systems with relatively few users and policies. However, for systems with a large number of users
and a substantial number of access requests, the aforementioned strategy becomes a performance
bottleneck. Finally, XACML is not sufficiently expressive to capture the semantics of information
sharing policies. Prior research has shown that semantic web-based policies are far more expressive.
This is because semantic web technologies are based on a description logic (DL) and have the power
to represent knowledge as well as reason about knowledge. Therefore, our first step is to replace the
XACML-based policy engine with a semantic web-based policy engine. Since we already have our
RDF-based policy engine for the phase 1 prototype, we will enhance this engine and integrate it
with our SPARQL query processor. Since our policy engine is based on RDF and our query proces-
sor also manages large RDF graphs, there will be no impedance mismatch between the data and
the policies.
Enhanced Policy Engine: Our current policy engine has a limitation in that it does not operate
in a cloud. Therefore, we will port our RDF policy engine to the cloud environment and integrate
it with the SPARQL QE for federated query processing in the cloud. Our policy engine will benefit
from the scalability and the distributed platform offered by Hadoop’s MapReduce framework to
answer SPARQL queries over large distributed RDF triple stores (billions of RDF triples). The
reasons for using RDF as our data model are as follows: (1) RDF allows us to achieve data interoper-
ability between the seemingly disparate sources of information that are catalogued by each agency/
organization separately; (2) the use of RDF allows participating agencies to create data-centric
applications that make use of the integrated data that is now available to them; and (3) since the RDF
does not require the use of an explicit schema for data generation, it can be easily adapted to ever-
changing user requirements. The policy engine’s flexibility is based on its accepting high-level poli-
cies and executing them as query rules over a directed RDF graph representation of the data. While
our prior work focuses on provenance data and access control policies, our CAISS prototype will
be flexible enough to handle data represented in RDF and will include information-sharing policies.
The strength of our policy engine is that it can handle any type of policy that could be represented
using RDF and horn logic rules.
The second limitation of our policy engine is that it currently addresses certain types of policies
such as confidentiality, privacy, and redaction policies. We need to incorporate information-sharing
policies into our policy engine. We have however conducted simulation studies for incentive-based
AIS as well as AIS prototypes in the cloud. We have defined a number of information-sharing poli-
cies such as “US gives information to UK provided UK does not share it with India.” We specify
such policies in RDF and incorporate them to be processed by our enhanced policy engine.
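As an illustration of how such a sharing policy might be expressed as RDF triples, here is a minimal Python sketch using rdflib; the ais vocabulary and property names are hypothetical placeholders, not the actual policy schema.

from rdflib import Graph, Namespace, RDF

AIS = Namespace("http://example.org/ais#")   # assumed policy vocabulary
g = Graph()
g.bind("ais", AIS)

# "US gives information to UK provided UK does not share it with India."
g.add((AIS.policy1, RDF.type, AIS.SharingPolicy))
g.add((AIS.policy1, AIS.source, AIS.US))
g.add((AIS.policy1, AIS.recipient, AIS.UK))
g.add((AIS.policy1, AIS.prohibitedRedistributionTo, AIS.India))

print(g.serialize(format="turtle"))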
Enhanced SPARQL Query Processor: While we have a tool that will execute SPARQL queries
over large RDF graphs on Hadoop, there is still the need for supporting path queries (i.e., SPARQL
queries that provide answers to a request for paths in an RDF graph). An RDF triple can be viewed
as an arc from the subject to object with the predicate used to label the arc. The answers to the
SPARQL query are based on reachability (i.e., the paths between a source node and a target node).
The concatenation of the labels on the arcs along a path can be thought of as a word belonging to the
answer set of the path query. Each term of a word is contributed by some predicate label of a triple
in the RDF graph. We have designed an algorithm to determine the candidate triples as an answer
set in a distributed RDF graph. First, the RDF document is converted to an N-triple file that is split
based on predicate labels. A term in a word could correspond to some predicate file. Second, we
form the word by tracing an appropriate path in the distributed RDF graph. We use MapReduce jobs
to build the word and to get the candidate RDF triples as an order set. Finally, we return all of the
set of ordered RDF triples as the answers to the corresponding SPARQL query.
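A minimal Python sketch of the first step described above, splitting an N-Triples file by predicate label (the map side of the MapReduce jobs); this is an illustration of the idea, not the actual Hadoop implementation.

from collections import defaultdict

def split_by_predicate(ntriples_lines):
    # Each N-Triples line has the form: <subject> <predicate> <object> .
    buckets = defaultdict(list)
    for line in ntriples_lines:
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue                      # skip blank or malformed lines
        subject, predicate, rest = parts
        obj = rest.rstrip(" .")           # drop the trailing " ."
        buckets[predicate].append((subject, obj))
    return buckets                        # predicate label -> list of (subject, object) pairs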
Integration Framework: Figure 25.2 provides an overview of the CAISS architecture. The
integration of the cloud-centric RDF policy engine with the enhanced SPARQL query proces-
sor must address the following. First, we need to make sure that RDF-based policies can be
stored in the existing storage schema used by the query processor. Second, we need to ensure
that the enhanced query processor is able to efficiently evaluate policies (i.e., path queries) over
the underlying RDF storage. Finally, we need to conduct a performance evaluation of CAISS to
verify that it meets the performance requirements of various participating agencies. Figure 25.3
illustrates the concept of operation of CAISS. Here, multiple agencies will share data in a single
cloud. The enhanced policy engine and the cloud-centric SPARQL query processor will enforce
the information-sharing policies. This proof of concept system will drive the detailed design and
implementation of CAISS++.
There are several benefits in developing a proof of concept prototype such as CAISS before we
embark on CAISS++. First, CAISS itself is useful to share data within a single cloud. Second,
we will have a baseline system that we can compare against with respect to efficiency and ease
of use when we implement CAISS++. Third, this will give us valuable lessons with respect to
the integration of the different pieces required for AIS in the cloud. Finally, by running different
scenarios on CAISS, we can identify potential performance bottlenecks that need to be addressed
in CAISS++.
25.3.2 Design of CAISS++
We have examined alternatives and carried out a preliminary design of CAISS++. On the basis of
the lessons learned from the CAISS prototype and the preliminary design of CAISS++, we will
carry out a detailed design of CAISS++ and subsequently implement an operational prototype of
CAISS++ during phase 2. In this section, we will first discuss the limitations of CAISS and then
discuss the design alternatives for CAISS++.
CAISS++ overcomes the limitations of CAISS. The detailed design of CAISS++ and its imple-
mentation will be carried out during phase 2. The lessons learned from CAISS will also drive the
detailed design of CAISS++. We assume that the data is encrypted with appropriate DoD encryp-
tion technologies and therefore will not conduct research on encryption in this project. The concept
of operation for CAISS++ is shown in interaction with several participating agencies in Figure 25.4
where multiple organizations share data in a single cloud.
The design of CAISS++ is based on a novel combination of an OWL-based policy engine with
an RDF-processing engine. Therefore, this design is composed of several tasks, each of which is
solved separately, after which all tasks are integrated into a single framework. (1) OWL-based
policy engine: The policy engine uses a set of agency-specific domain ontologies as well as an
upper ontology to construct policies for the task of AIS. The task of enforcing policies may require
the use of a distributed reasoner; therefore, we will evaluate existing distributed reasoners (DRs).
(2) RDF-processing engine: The processing engine requires the construction of sophisticated storage
architectures as well as an efficient query processor. (3) Integration Framework: The final task is
to combine the policy engine with the processing engine into an integrated framework. The initial
design of CAISS++ will be based on a trade-off between simplicity of design vs. its scalability
and efficiency. The first design alternative is known as centralized CAISS++ and it chooses sim-
plicity as the trade-off whereas the second design alternative (known as decentralized CAISS++)
chooses scalability and efficiency as the trade-off. Finally, we also provide a hybrid CAISS++
architecture that tries to combine the benefits of both, centralized and decentralized CAISS++.
Since CAISS++ follows a requirements-driven design, the division of tasks that we outlined above
to achieve AIS are present in each of the approaches that we present next.
Centralized CAISS++: Figure 25.5 illustrates two agencies interacting through Centralized
CAISS++. Centralized CAISS++ consists of shared cloud storage to store the shared data. All
the participating agencies store their respective knowledge bases consisting of domain ontology
with corresponding instance data. Centralized CAISS++ also consists of an upper ontology, a QE,
and a DR. The upper ontology is used to capture the domain knowledge that is common across the
domains of participating agencies, whereas domain ontology captures the knowledge specific to a
given agency or a domain. Note that the domain ontology for a given agency will be protected from
the domain ontologies of other participating agencies. Policies can either be captured in the upper
ontology or in any of the domain ontologies depending on their scope of applicability.
The design of an upper ontology as well as domain ontologies that capture the requirements of the
participating agencies is a significant research area and is the focus of the ontology engineering prob-
lem. Ontologies will be created using suitable dialects of OWL that are based on DLs that are usually
decidable fragments of the first-order logic and will be the basis for providing sound formal seman-
tics. Having represented knowledge in terms of ontologies, reasoning will be done using existing
optimized reasoning algorithms. Query answering will leverage reasoning algorithms to formulate
and answer intelligent queries. The encoding of policies in OWL will ensure that they are enforced in
a provably correct manner. Later, we present an ongoing research project at The University of Texas
at Dallas that focuses on providing a general framework for enforcing policies in a provably correct
manner using the same underlying technologies. This work can be leveraged toward modeling and
enforcement of security policies in CAISS++. The instance data can choose between several avail-
able data storage formats. The QE receives queries from the participating agencies, parses the query,
and determines whether the computation requires the use of a DR. If the query is simple and does
not require the use of a reasoner, the QE executes the query directly over the shared knowledge base.
Once the query result has been computed, the result is returned to the querying agency. If, however,
the query is complex and requires inferences over the given data, the QE uses the DR to compute
the inferences and then returns the result to the querying agency. A distributed DL reasoner differs
from a traditional DL reasoner in its ability to perform reasoning over cloud data storage using the
MapReduce framework. During the preliminary design of CAISS++ in Phase 1, we will conduct a
thorough investigation of the available DRs using existing benchmarks such as the Lehigh University
Benchmark (LUBM) [GUO05]. The goal of this investigation is to determine whether we can use one
of the existing reasoners or we need to build our own DR. In Figure 25.5, an agency is illustrated as
a stack consisting of a web browser, an applet, and HTML. An agency uses the web browser to send
the queries to CAISS++, which are handled by the query processor.
The main differences between centralized CAISS++ and CAISS are as follows: (1) CAISS will
use RDF to encode security policies, whereas centralized CAISS++ will use a suitable sublanguage
of OWL that is more expressive than RDF and can therefore capture the security policies better.
(2) The SPARQL query processor in CAISS will support a limited subset of SPARQL expressivity, that is, it will provide support only for basic graph patterns (BGPs), whereas the SPARQL query processor in centralized
CAISS++ will be designed to support the maximum expressivity of SPARQL. (3) The Hadoop stor-
age architecture used in CAISS only supports data insertion during an initialization step. However,
when data needs to be updated, the entire RDF graph is deleted and a new dataset is inserted in its
place. On the other hand, centralized CAISS++, in addition to supporting the previous feature, also
opens up HDFS’s append-only feature to users. This feature allows users to append new information
to the data that they have previously uploaded to the system.
Decentralized CAISS++: Figure 25.6 illustrates two agencies in interaction with decentralized
CAISS++. It consists of two parts, namely, global CAISS++ and local CAISS++. The global
CAISS++ consists of a shared cloud storage that is used by the participating agencies to store only
their respective domain ontologies and not the instance data unlike the centralized CAISS++. Note
that domain ontologies for various organizations will be sensitive; therefore, CAISS++ will make
use of its own domain ontology to protect a participating agency from accessing other domain ontol-
ogies. When a user from an agency queries the CAISS++ data store, global CAISS++ processes
the query in two steps. In the first step, it performs a check to verify whether the user is authorized
to perform the action specified in the query. If the result of step 1 verifies the user as an authorized
user, then it proceeds to step 2 of query processing. In the second step, global CAISS++ federates
the actual query to the participating agencies. The query is then processed by the local CAISS++
of a participating agency. The result of computation is then returned to the global CAISS++ that
aggregates the final result, and returns it to the user. Step 2 of query processing may involve query
splitting if the data required to answer a query spans multiple domains. In this case, the results of
subqueries from several agencies (their local CAISS++) will need to be combined for further query
processing. Once the results are merged and the final result is computed, the result is returned to
the user of the querying agency. The figure illustrates the agencies with a set of two stacks, one of
which corresponds to the local CAISS++ and the other consisting of a web browser, an applet, and
HTML, which is used by an agency to query global CAISS++. Table 25.1 shows the pros and cons
of the centralized CAISS++ approach, while Table 25.2 shows the pros and cons of the decentral-
ized CAISS++ approach.
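The two-step flow described above can be summarized in a short sketch. The PolicyChecker and AgencyEndpoint interfaces below are hypothetical stand-ins for the policy engine and the agencies' local CAISS++ endpoints, and query splitting is omitted; the sketch only illustrates the authorize-then-federate-and-merge structure.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the two-step federated query flow in decentralized
 *  CAISS++: authorize, then federate to each agency's local CAISS++ and merge. */
public class GlobalCaissSketch {

    interface PolicyChecker {                  // step 1: credential/policy check
        boolean isAuthorized(String user, String sparqlQuery);
    }
    interface AgencyEndpoint {                 // local CAISS++ of one agency
        List<String[]> evaluateSubquery(String sparqlSubquery);
    }

    private final PolicyChecker policies;
    private final List<AgencyEndpoint> agencies;

    GlobalCaissSketch(PolicyChecker p, List<AgencyEndpoint> a) {
        this.policies = p;
        this.agencies = a;
    }

    public List<String[]> federate(String user, String query) {
        if (!policies.isAuthorized(user, query)) {         // step 1
            throw new SecurityException("user not authorized for this query");
        }
        List<String[]> merged = new ArrayList<>();          // step 2: federate and merge
        for (AgencyEndpoint agency : agencies) {
            // In the general case the query would first be split into subqueries
            // spanning the relevant domains; here we ship the query unchanged.
            merged.addAll(agency.evaluateSubquery(query));
        }
        return merged;                                       // aggregated result
    }
}
```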
Hybrid CAISS++: Figure 25.7 illustrates an overview of hybrid CAISS++, which leverages the
benefits of centralized CAISS++ as well as decentralized CAISS++; the figure shows agencies such
as the FBI, NSA, CIA, and DHS in the USA, together with partner agencies in the UK and Australia,
connected through hybrid CAISS++, which combines decentralized and centralized CAISS++. A
hybrid CAISS++ architecture is illustrated in Figure 25.8, showing global CAISS++ (with its query
engine, distributed reasoner, upper ontology, and the agencies' domain ontologies and instance data)
together with a local CAISS++ stack at each agency. It is a flexible design alternative as the users of
the participating agencies have the freedom to choose between centralized CAISS++ and decentralized CAISS++.
TABLE 25.1
The Pros and Cons of Centralized CAISS++

Pros:
• Simple approach
• Ease of implementation
• Easier to query

Cons:
• Difficult to update data; expensive approach, as data needs to be migrated to central storage on each update or set of updates
• Leads to data duplication
• If data is available in different formats, it needs to be homogenized by translating it to RDF
TABLE 25.2
The Pros and Cons of Decentralized CAISS++

Advantages:
• No duplication of data
• Scalable and flexible
• Efficient

Disadvantages:
• Complex query processing
• Difficult to implement
• May require query rewriting and query splitting
A hybrid CAISS++ is made up of global CAISS++ and a set of local CAISS++’s located at each
of the participating agencies. Global CAISS++ consists of a shared cloud storage that is used by the
participating agencies to store the data they would like to share with other agencies.
A local CAISS++ of an agency is used to receive and process a federated query on the instance
data located at the agency. A participating group is a group composed of users from several agencies
who want to share information with each other. The members of a group arrive at a mutual agree-
ment on whether they opt for the centralized or decentralized approach. Additional users can join a
group at a later point in time if the need arises. The hybrid CAISS++ will be designed to simultane-
ously support a set of participating groups. Additionally, a user can belong to several participating
groups at the same time. We describe a few use-case scenarios that illustrate the operation.
1. The first case corresponds to the scenario where a set of users who want to securely share
information with each other opt for a centralized approach. Suppose users from Agency
1 want to share information with users of Agency 2 and vice versa; then both the agen-
cies store their knowledge bases, comprising the domain ontology and instance data, on the
shared cloud storage located at global CAISS++. The centralized CAISS++ approach
works by having the participating agencies arrive at mutual trust on using the central cloud
storage. Subsequently, information sharing proceeds as in centralized CAISS++.
2. The second case corresponds to the scenario where a set of users opts for a decentralized
approach. For example, Agencies 3, 4, and 5 wish to share information with each other and
mutually opt for the decentralized approach. All three agencies store their respective
domain ontologies at the central cloud storage, and this information is only accessible to
the members of this group. The subsequent information-sharing process proceeds in the
manner described earlier for the decentralized CAISS++ approach.
3. The third case corresponds to the scenario where a user of an agency belongs to multiple
participating groups, some of which opt for the centralized approach and others for the
decentralized approach. Since the user is a part of a group using the centralized approach
to sharing, he/she needs to make his/her data available to the group by shipping his/her
data to the central cloud storage. Additionally, since the user is also a part of a group using
the decentralized approach for sharing, he/she needs to respond to the federated query with
the help of the local CAISS++ located at his/her agency.
Table 25.3 shows the trade-offs between the different approaches, and this will enable users to
choose a suitable approach of AIS based on their application requirements. Next we describe details
of the cloud storage mechanism that makes use of Hadoop to store the knowledge bases from various
agencies and then discuss the details of distributed SPARQL query processing over the cloud storage.
TABLE 25.3
A Comparison of the Three Approaches Based on Functionality

Functionality                              Centralized CAISS++   Decentralized CAISS++   Hybrid CAISS++
No data duplication                        X                     √                       Maybe
Flexibility                                X                     X                       √
Scalability                                X                     √                       √
Efficiency                                 √                     √                       √
Simplicity—No query rewriting              √                     X                       X
Trusted centralized cloud data storage     √                     X                       X

Hadoop Storage Architecture: In Figure 25.9, we present an architectural overview of our Hadoop-based
RDF storage and retrieval framework; the figure shows a Hadoop store with two storage layouts,
Layout1—vertically partitioned and Layout2—hybrid. We use the concept of a “Store” to provide data
loading and querying capabilities on RDF graphs that are stored in the underlying HDFS. A store
represents a single RDF dataset and can therefore contain several RDF graphs, each with its own
separate layout. All operations on an RDF graph
are then implicitly converted into operations on the underlying layout including the following:
• Layout Formatter: This block performs the function of formatting a layout, which is the
process of deleting all triples in an RDF graph while preserving the directory structure
used to store that graph.
• Loader: This block performs loading of triples into a layout.
• Query Engine: This block allows a user to query a layout using an SPARQL query. Since
our framework operates on the underlying HDFS, the querying mechanism on a layout
involves translating an SPARQL query into a possible pipeline of MapReduce jobs and
then executing this pipeline on a layout.
• Connection: This block maintains the necessary connections and configurations with the
underlying HDFS.
• Config: This block maintains configuration information such as graph names for each of
the RDF graphs that make up a store.
Since RDF data will be stored under different HDFS folders in separate files as a part of our
storage schema, we need to adopt certain naming conventions for such folders and files.
Naming Conventions: A Hadoop store can be composed of several distinct RDF graphs in our
framework. Therefore, a separate folder will be created in HDFS for each such Hadoop Store.
The name of this folder will correspond to the name that has been selected for the given store.
Furthermore, an RDF graph is divided into several files in our framework depending on the storage
layout that is selected. Therefore, a separate folder will be created in HDFS for each distinct RDF
graph. The name of this folder is defined to be “default” for the default RDF graph, while for a
named RDF graph, the uniform resource identifier (URI) of the graph is used as the folder name.
We use the abstraction of a store in our framework because it simplifies the management
of data belonging to various agencies. Two of the layouts to be supported by our framework are
given below. These layouts use a varying number of HDFS files to store RDF data.
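A minimal sketch of these naming conventions is given below. The HDFS root directory and the URL-encoding of graph URIs (so that a URI can safely serve as a folder name) are assumptions made for illustration, not part of the framework's specification.

```java
import org.apache.hadoop.fs.Path;

/** Illustrative sketch of the folder-naming conventions described above:
 *  one HDFS folder per Hadoop store, one subfolder per RDF graph
 *  ("default" for the default graph, the graph URI otherwise). */
public class StorePathsSketch {

    private static final String ROOT = "/caiss/stores";   // assumed HDFS root

    /** Folder for a named store, e.g. /caiss/stores/agency1-store. */
    public static Path storeFolder(String storeName) {
        return new Path(ROOT, storeName);
    }

    /** Folder for a graph inside a store; null means the default graph. */
    public static Path graphFolder(String storeName, String graphUri) {
        String folderName = (graphUri == null)
                ? "default"
                // Encoding the URI is a practical tweak so it is a legal folder name.
                : java.net.URLEncoder.encode(graphUri, java.nio.charset.StandardCharsets.UTF_8);
        return new Path(storeFolder(storeName), folderName);
    }

    public static void main(String[] args) {
        System.out.println(graphFolder("agency1-store", null));
        System.out.println(graphFolder("agency1-store", "http://example.org/graphs/shared"));
    }
}
```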
Vertically Partitioned Layout: Figure 25.10 presents the storage schema for the vertically parti-
tioned layout. For every unique predicate contained in an RDF graph, this layout creates a separate
file using the name of the predicate as the file name, in the underlying HDFS. Note that only the
local name part of a predicate URI is used in a file name, and a
separate mapping exists between a file name and the predicate URI. A file for a given predicate
contains a separate line for every triple that contains that predicate. This line stores the subject and
object values that make up the triple. This schema will lead to significant storage space savings
since moving the predicate name to the name of a file completely eliminates the storage of this
predicate value. However, multiple occurrences of the same resource URI or literal value will be
stored multiple times across all files as well as within a file. Additionally, an SPARQL query may
need to look up multiple files to ensure that a complete result is returned to a user, for example, a
query to find all triples that belong to a specific subject or an object.
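The following sketch illustrates the vertically partitioned layout, using the local file system in place of HDFS: each triple is reduced to a subject-object line in a file named after its predicate. The file names, the tab separator, and the in-memory buffering are illustrative choices, not the framework's actual loader.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

/** Illustrative sketch of the vertically partitioned layout: every triple is
 *  appended to a file named after its predicate, and only the subject and
 *  object are stored on that line (the predicate is implicit in the file name). */
public class VerticallyPartitionedWriter {

    /** predicate local name -> lines of "subject <tab> object" */
    private final Map<String, List<String>> files = new HashMap<>();

    public void add(String subject, String predicateLocalName, String object) {
        files.computeIfAbsent(predicateLocalName, p -> new ArrayList<>())
             .add(subject + "\t" + object);
    }

    public void flush(Path graphFolder) throws IOException {
        Files.createDirectories(graphFolder);
        for (Map.Entry<String, List<String>> e : files.entrySet()) {
            Files.write(graphFolder.resolve(e.getKey()), e.getValue(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        VerticallyPartitionedWriter w = new VerticallyPartitionedWriter();
        w.add("<http://ex.org/alice>", "worksFor", "<http://ex.org/agency1>");
        w.add("<http://ex.org/bob>",   "worksFor", "<http://ex.org/agency2>");
        w.add("<http://ex.org/alice>", "name",     "\"Alice\"");
        w.flush(Paths.get("vp-layout-demo"));   // one file per predicate
    }
}
```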
Hybrid Layout: Figure 25.11 presents the storage schema for the hybrid layout. This layout is an
extension of the vertically partitioned layout, since in addition to the separate files that are created
for every unique predicate in an RDF graph, it also creates a separate triples file containing all the
triples in the SPO (subject, predicate, object) format. The advantage of having such a file is that it
directly gives us all triples belonging to a certain subject or an object. Recall that such a search
operation required scanning through multiple files in the vertically partitioned layout. The storage
space efficiency of this layout is not as good as the vertically partitioned layout due to the addition
of the triples file. However, an SPARQL query to find all triples belonging to a certain subject or
object could be performed more efficiently using this layout.
Distributed processing of SPARQL: Query processing in CAISS++ comprises several steps
(Figure 25.12). The first step is query parsing and translation where a given SPARQL query is first
parsed to verify syntactic correctness, and then a parse tree corresponding to the input query is built.
The parse tree is then translated into an SPARQL algebra expression. Since a given SPARQL query
can have multiple equivalent SPARQL algebra expressions, we annotate each such expression with
instructions on how to evaluate each operation in this expression. Such annotated SPARQL algebra
expressions correspond to query-evaluation plans that serve as the input to the optimizer. The opti-
mizer selects a query plan that minimizes the cost of query evaluation. In order to optimize a query,
an optimizer must know the cost of each operation. To compute the cost of each operation, the opti-
mizer uses a metastore that stores statistics associated with the RDF data. The cost of a given query-
evaluation plan is measured either as the number of MapReduce jobs required or as the number of
triples that will be accessed as part of query execution. Once the query plan is chosen, the query is
evaluated with that plan, and the result of the query is output. Since we use a cloud-centric framework
to store RDF data, an evaluation engine needs to convert SPARQL algebra operators into equivalent
MapReduce jobs on the underlying storage layouts (described earlier). Therefore, in CAISS++, we
will implement a MapReduce job for each of the SPARQL algebra operators. Additionally, the evalu-
ation engine uses a DR to compute inferences required for query evaluation.
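As a simplified illustration of the plan-selection step, the sketch below compares candidate query-evaluation plans by an assumed cost that combines the number of MapReduce jobs with the estimated number of triples accessed. The Plan type and the cost weighting are hypothetical and are not taken from the CAISS++ optimizer.

```java
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch of plan selection: each annotated algebra expression
 *  carries an estimated number of MapReduce jobs and an estimated number of
 *  triples read (statistics the metastore would supply); the optimizer simply
 *  picks the cheapest plan under an assumed combined cost. */
public class PlanSelectionSketch {

    /** A candidate query-evaluation plan (annotated SPARQL algebra expression). */
    record Plan(String description, int mapReduceJobs, long estimatedTriplesAccessed) {
        /** Crude combined cost: jobs dominate, triples break ties (assumption). */
        long cost() {
            return mapReduceJobs * 1_000_000L + estimatedTriplesAccessed;
        }
    }

    static Plan choosePlan(List<Plan> candidates) {
        return candidates.stream()
                .min(Comparator.comparingLong(Plan::cost))
                .orElseThrow(() -> new IllegalArgumentException("no candidate plans"));
    }

    public static void main(String[] args) {
        Plan chosen = choosePlan(List.of(
                new Plan("join order (?s worksFor ?a) . (?a locatedIn ?c)", 2, 40_000),
                new Plan("join order (?a locatedIn ?c) . (?s worksFor ?a)", 1, 55_000)));
        System.out.println("selected: " + chosen.description());
    }
}
```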
Framework Integration: The components that we have outlined that are a part of CAISS++
need to be integrated to work with one another. Furthermore, this process of integration depends
on a user’s selection of one of the three possible design choices provided with CAISS++, namely,
centralized CAISS++, decentralized CAISS++, or hybrid CAISS++. The integration of the vari-
ous pieces of CAISS++ that have been presented so far needs to take into account several issues.
First, we need to make sure that our ontology engineering process has been successful in captur-
ing an agency’s requirements and, additionally, the ontologies can be stored in the storage schema
used by the Hadoop storage architecture. Secondly, we need to ensure that the distributed SPARQL
query processor is able to efficiently evaluate queries (i.e., user-generated SPARQL queries as well
as SPARQL queries that evaluate policies) over the underlying RDF storage. Finally, we need to
conduct a performance evaluation of CAISS++ to verify that it meets the performance requirements
of various participating agencies and that it provides significant performance advantages when
compared with CAISS.
Policy Specification and Enforcement: The users of CAISS++ can use a language of their
choice (e.g., XACML, RDF, Rei, etc.) to specify their information sharing policies. These policies
will be translated into a suitable sublanguage of OWL using existing or custom-built translators. We
will extend our policy engine for CAISS to handle policies specified in OWL. In addition to RDF
policies, our current policy engine can handle policies in OWL for implementing role-based access
control, inference control, and social network analysis.
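As an illustration of how a policy encoded with OWL/RDFS vocabulary can be evaluated by inference, the sketch below uses Apache Jena's built-in OWL reasoner (rather than Pellet, which our engine uses) and an invented ex: vocabulary: a "can read" assertion together with a subproperty axiom entails the "can access" permission that is checked.

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.Reasoner;
import org.apache.jena.reasoner.ReasonerRegistry;
import org.apache.jena.vocabulary.RDFS;

/** Illustrative sketch: answer an access request by checking whether the
 *  requested permission is entailed by the policy ontology plus the asserted
 *  facts. The ex: vocabulary is invented for this example. */
public class OwlPolicyCheckSketch {

    private static final String EX = "http://example.org/policy#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource alice = model.createResource(EX + "alice");
        Resource sharedDataset = model.createResource(EX + "sharedDataset");
        Property canRead = model.createProperty(EX + "canRead");
        Property canAccess = model.createProperty(EX + "canAccess");

        // Policy ontology fragment: reading is a special case of accessing.
        canRead.addProperty(RDFS.subPropertyOf, canAccess);
        // Asserted fact coming from the agency's policy store.
        alice.addProperty(canRead, sharedDataset);

        Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
        InfModel inferred = ModelFactory.createInfModel(reasoner, model);

        // Entailed (not asserted): alice canAccess sharedDataset.
        boolean permitted = inferred.contains(alice, canAccess, sharedDataset);
        System.out.println("access permitted: " + permitted);   // true
    }
}
```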
25.3.4 Implementation Approach
The implementation of CAISS was carried out in Java, and is based on a flexible design where we
can plug and play multiple components. A service provider and/or user will have the flexibility to
use the SPARQL query processor as well as the RDF-based policy engine as separate components
or combine them. The open source components used for CAISS will include the Pellet reasoner
as well as our in-house tools such as the SPARQL query processor on the Hadoop/MapReduce
framework and the cloud-centric RDF policy engine. CAISS will allow us to demonstrate basic AIS
scenarios on our cloud-based framework.
We have also completed a preliminary implementation of CAISS++. In the implementation of
CAISS++, we have used Java as the programming language. We have used Protégé as our ontol-
ogy editor during the process of ontology engineering which includes designing domain ontologies
as well as the upper ontology. In the future, we will evaluate several existing distributed reasoning
algorithms such as WebPIE and QueryPIE to determine the best algorithm that matches an agency’s
requirements. The selected algorithm will then be used to perform reasoning over OWL-based
security policies. Additionally, the design of the Hadoop storage architecture is based on Jena’s
SPARQL database (SDB) architecture and features some of the functionalities that are available
with Jena SDB. The SPARQL QE is also written in Java; its code consists of
several modules including query parsing and translation, query optimization, and query execu-
tion. The query execution module will consist of MapReduce jobs for the various operators of the
SPARQL language. Finally, our web-based user interface makes use of several components such as
JBoss, EJB, JSF, among others. We are also exploring the use of other big data technologies such as
Storm and Spark for our cloud platform. In addition, NoSQL database systems such as HBase and
CouchDB are also being explored for integration into our AIS platform.
(Two architecture figures appear here: a Hive-based access control framework, in which query and administration requests pass through a ZQL parser and an access control layer consisting of an XACML policy builder and policy evaluator before reaching Hive tables and views (HiveQL) over files in HDFS; and an SPARQL query processor, in which a web interface accepts new data and queries, a predicate object-based splitter partitions the data, and a plan generator and plan executer produce the answer.)
Query processing is composed of two main steps: (1) the preprocessing and (2) the query optimization
and execution.
Preprocessing: In order to execute an SPARQL query on RDF data, we carried out data prepro-
cessing steps and stored the preprocessed data in HDFS. A separate MapReduce task was written
to perform the conversion of RDF/XML data into N-triples as well as for prefix generation. Our
storage strategy is based on predicate splits [HUSA11].
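A local, single-machine version of this conversion step might look like the following Jena-based sketch (in our system the same conversion runs as a MapReduce task over HDFS); the file names are placeholders.

```java
import java.io.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

/** Illustrative sketch of the preprocessing step: convert an RDF/XML file to
 *  N-Triples so that each triple sits on its own line, ready for the
 *  predicate-split storage. */
public class RdfXmlToNTriples {
    public static void main(String[] args) throws IOException {
        String in = args.length > 0 ? args[0] : "input.rdf";
        String out = args.length > 1 ? args[1] : "output.nt";

        Model model = ModelFactory.createDefaultModel();
        try (InputStream is = new FileInputStream(in)) {
            model.read(is, null, "RDF/XML");              // parse RDF/XML
        }
        try (OutputStream os = new FileOutputStream(out)) {
            model.write(os, "N-TRIPLES");                 // one triple per line
        }
        System.out.println("wrote " + model.size() + " triples to " + out);
    }
}
```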
Query Execution and Optimization: We have developed an SPARQL query execution and opti-
mization module for Hadoop. As our storage strategy is based on predicate splits, first, we exam-
ine the predicates present in the query. Second, we examine a subset of the input files that are
matched with predicates. Third, SPARQL queries generally have many joins, and it may not be possible
to perform all of these joins in a single MapReduce job. Therefore, we have developed
an algorithm that decides the number of jobs required for each kind of query. As part of optimiza-
tion, we applied a greedy strategy and cost-based optimization to reduce query-processing time.
We have also developed an XACML-based centralized policy engine that will carry out federated
RDF query processing on the cloud. Details of the enforcement strategy are given in [HAML10a],
[HUSA11], and [KHAL10].
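To make the idea of job determination concrete, the sketch below shows a greedy heuristic of the kind described above; it is not the published algorithm of [HUSA11]. Triple patterns sharing a join variable are grouped into one MapReduce job, and the variable covering the most remaining patterns is chosen at each step.

```java
import java.util.*;

/** Illustrative sketch of a greedy job-determination heuristic: patterns that
 *  share a join variable can be joined in a single MapReduce job, so we greedily
 *  pick the variable covering the most remaining patterns for each job. */
public class JobDeterminationSketch {

    /** A triple pattern, with variables written as "?x". */
    record Pattern(String s, String p, String o) {
        Set<String> variables() {
            Set<String> vars = new HashSet<>();
            for (String t : List.of(s, p, o)) if (t.startsWith("?")) vars.add(t);
            return vars;
        }
    }

    static int numberOfJobs(List<Pattern> patterns) {
        List<Pattern> remaining = new ArrayList<>(patterns);
        int jobs = 0;
        while (!remaining.isEmpty()) {
            // Count how often each join variable appears in the remaining patterns.
            Map<String, Long> counts = new HashMap<>();
            for (Pattern pt : remaining)
                for (String v : pt.variables()) counts.merge(v, 1L, Long::sum);
            String best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            // All patterns containing that variable are joined in one job.
            remaining.removeIf(pt -> pt.variables().contains(best));
            jobs++;
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<Pattern> query = List.of(
                new Pattern("?s", "worksFor", "?a"),
                new Pattern("?a", "locatedIn", "?c"),
                new Pattern("?s", "clearanceLevel", "\"secret\""));
        System.out.println("estimated MapReduce jobs: " + numberOfJobs(query));  // 2
    }
}
```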
RDF Policy Engine: In our prior work [CADE11a], we have developed a policy engine to process
RDF-based access control policies for RDF data. The policy engine is designed with the following
features in mind: scalability, efficiency, and interoperability. This framework (Figure 25.15, which depicts
an inference engine/privacy controller driven by privacy policies, ontologies, and rules, operating
through a semantic web engine over XML and RDF documents, web pages, and databases) can be
used to execute various policies, including access control policies and redaction policies. It can also
be used as a testbed for evaluating different policy sets over RDF data and to view the outcomes
graphically. Our framework presents an interface that accepts a high-level policy, which is then
translated into the required format. It takes a user’s input query and returns a response that has
been pruned using a set of user-defined policy constraints. The architecture is built using a modular
approach; therefore, it is very flexible in that most of the modules can be extended or replaced by
another application module. For example, a policy module implementing a discretionary access
control (DAC) could be replaced entirely by an RBAC module or we may decide to enforce all our
constraints based on a generalized redaction model. It should be noted that our policy engine also
handles role-based access control policies specified in OWL and SWRL [CADE10]. In addition, it
handles certain policies specified in OWL for inference control such as association-based policies
where access to collections of entities is denied and logical policies where A implies B, and if access
to B is denied, then access to A should also be denied ([CADE10], [CADE11b], [CARM09]). This
capability of our policy engine will be useful in our design and implementation of CAISS++ where
information is shared across multiple clouds.
Assured Information Sharing Prototypes: We have developed multiple systems for AIS. Under an
AFOSR-funded project (between 2005 and 2008), we developed an XACML-based policy engine
to function on top of relational databases and demonstrated the sharing of (simulated) medical data
[THUR08]. In this implementation, we specified the policies in XACML and stored the data in mul-
tiple Oracle databases. When one organization requests data from another organization, the policies
are examined and authorized data is released. In addition, we also conducted simulation studies on
the amount of data that would be lost by enforcing the policies while information sharing. Under our
Multidisciplinary University Research Initiative (MURI) project, also funded by AFOSR, we con-
ducted simulation studies for incentive-based information sharing [KANT10]. We have also exam-
ined risk-based access control in an information-sharing scenario [CELI07]. In addition to access
control policies, we have specified different types of policies including need-to-share policies and
trust policies (e.g., A shared data with B provided B does not share the data with C). Note that the
9/11 Commission Report calls for the migration from the more restrictive need-to-know to the less
restrictive need-to-share policies. These policies are key to support the specification of the directive
concerning AIS obligations.
Formal Policy Analysis: By reducing high-level security policy specifications and system models
to the level of the denotational and operational semantics of their binary-level implementations, our
past work has developed formally machine-certifiable security enforcement mechanisms of a vari-
ety of complex software systems, including those implemented in .NET [HAML06b], ActionScript
[SRID10], Java [JONE10], and native code [HAML10b]. Working at the binary level provides
extremely high formal guarantees because it permits the tool chain that produces mission-critical soft-
ware components to remain untrusted; the binary code produced by the chain can be certified directly.
This strategy is an excellent match for CAISS++ because data-security-specification languages such
as XACML and OWL can be elegantly reflected down to the binary level of bytecode languages with
XML-aware system application program interfaces (APIs), such as Java bytecode. Our past work has
applied binary instrumentation (e.g., in-lined reference monitoring) and a combination of binary type
checking [HAML06b], model checking [SRID10], and automated theorem proving (e.g., via ACL2)
to achieve fully automated machine certification of binary software in such domains.
Such development efforts are an important step toward securing cloud infrastructures but are only
in their inception stages. The goal of our system is to add another layer of security above the secu-
rity offered by Hadoop [UTD1]. Once the security offered by Hadoop becomes robust, it will only
strengthen the effectiveness of our system. Similar efforts have been undertaken by Amazon and
Microsoft for their cloud computing offerings ([AMAZ16], [MARS10]). However, that work targets
public cloud infrastructures, whereas our system is designed for a private cloud infrastructure. This distin-
guishing factor makes our infrastructure “trusted” over public infrastructures where the data must
be stored in an encrypted format.
SPARQL Query Processor: Only a handful of efforts have been reported on SPARQL query
processing. These include BioMANTA [BIOM] and SHARD [SHAR11]. BioMANTA proposes
extensions to RDF Molecules [DING05] and implements a MapReduce-based molecule store
[NEWM08]. They use MapReduce to answer the queries. They have queried a maximum of 4 mil-
lion triples. Our work differs in the following ways: first, we have queried 1 billion triples. Second,
we have devised a storage schema which is tailored to improve query execution performance for
RDF data. To our knowledge, we are the first to come up with a storage schema for RDF data using
flat files in HDFS, and a MapReduce job determination algorithm to answer an SPARQL query.
Scalable, high-performance, robust, and distributed (SHARD) is an RDF triple store using the
Hadoop Cloudera distribution. This project shows initial results, demonstrating Hadoop’s ability to
improve scalability for RDF datasets. However, SHARD stores its data only in a triple store schema.
It does no query planning or reordering, and its query processor will not minimize the number of
Hadoop jobs. None of the efforts have incorporated security policies.
RDF-Based Policy Engine: There exists prior research devoted to the study of enforcing policies
over RDF stores. These include the work in [CARM04] which uses RDF for policy specification and
enforcement. In addition, the policies are generally written in RDF. In [JAIN06], the authors propose
an access control model for RDF. Their model is based on RDF data semantics and incorporates
RDF and RDF schema (RDFS) entailments. Here, protection is provided at the resource level, which
adds granularity to their framework. Other frameworks enforcing policies over RDF/OWL include
[KAGA02] and [USZO04]. [USZO04] describes KAoS, a policy and domain services framework
that uses OWL, to represent both policies and domains. [KAGA02] introduces Rei, a policy frame-
work that is flexible and allows different kinds of policies to be stated. Extensions to Rei have been
proposed recently [KHAN10]. The policy-specification language allows users to develop declarative
policies over domain-specific ontologies in RDF, DAML+OIL, and OWL. The authors in [REDD05]
also introduced a prototype, RAP, for the implementation of an RDF store with integrated mainte-
nance capabilities and access control. These frameworks, however, do not address cases where the
RDF store can become very large or the case where the policies do not scale with the data. Under an
IARPA-funded project, we have developed techniques for very large RDF graph processing [UTD2].
Hadoop Storage Architecture: There has been significant interest in large-scale distributed
storage and retrieval techniques for RDF data. The theoretical designs of a parallel processing
framework for RDF data are presented in the work done by Castagna et al. [CAST09]. This work
advocates the use of a data-distribution model with varying levels of granularity such as the triple
level, graph level, and dataset level. A query over such a distributed model is then divided into a
set of subqueries over machines containing the distributed data. The results of all subqueries will
then be merged to return a complete result to a user application. Several implementations of this
theoretical concept exist in the research community. These efforts include the work done by Choi
et al. [CHOI09] and Abraham et al. [ABRA10]. A separate technique that has been used to store and
retrieve RDF data makes use of peer-to-peer systems ([ABER04], [CAI04], [HART07], [VALL06]).
However, there are some drawbacks with such systems, as peer-to-peer systems need to have super
peers that store information about the distribution of RDF data among the peers. Another disadvan-
tage is a need to federate an SPARQL query to every peer in the network.
Distributed Reasoning: The InteGrail system uses distributed reasoning, whose vision is to shape
the European railway organization of the future [INTE09]. In [URBA09], the authors have shown a
scalable implementation of RDFS reasoning based on MapReduce, which can infer 30 billion triples
from a real-world dataset in less than 2 hours, yielding an input and output throughput of 123,000
triples/s and 3.27 million triples/s, respectively. They have presented some nontrivial optimizations
for encoding the RDFS ruleset in MapReduce and have evaluated the scalability of their implemen-
tation on a cluster of 64 compute nodes using several real-world datasets.
Access Control and Policy Ontology Modeling: There have been some attempts to model access
control and policy models using semantic web technologies. In [CIRI07], the authors have shown how
OWL and DL can be used to build an access control system. They have developed a high-level
OWL−DL ontology that expresses the elements of a role-based access control system and have
built a domain-specific ontology that captures the features of a sample scenario. Finally, they have
joined these two artifacts to take into account attributes in the definition of the policies and in the
access control decision. In [REUL10], the authors first presented a security policy ontology based on
the DOGMA, which is a formal ontology engineering framework. This ontology covers the core ele-
ments of security policies (i.e., condition, action, resource) and can easily be extended to represent
specific security policies, such as access control policies. In [ANDE09], the authors present an onto-
logically motivated approach to multilevel access control and provenance for information systems.
25.4.3 Commercial Developments
RDF Processing Engines: Research and commercial RDF processing engines include Jena by HP
Labs, BigOWLIM, and RDF-3X. Although the storage schemas and query-processing mechanisms
for some of these tools are proprietary, they are all based on some type of indexing strategy for RDF
data. However, only a few tools exist that use a cloud-centric architecture for processing RDF data, and,
moreover, these tools are not scalable to a very large number of triples. In contrast, our query processor
in CAISS++ will be built as a planet-scale RDF processing engine that supports all SPARQL operators,
provides optimized execution strategies for SPARQL queries, and can scale to billions of triples.
Semantic Web-Based Security Policy Engines: As stated in Section 25.2, the current work on
semantic web-based policy specification and enforcement does not address the issues of policy gen-
eration and enforcement for massive amounts of data or support for a large number of users.
Cloud: To the best of our knowledge, there is no significant commercial competition for cloud-
centric AIS. Since we have taken a modular approach to the creation of our tools, we can iteratively
refine each component (policy engine, storage architecture, and query processor) separately. Due to
the component-based approach we have taken, we will be able to adapt to changes in the platforms
we use (e.g., Hadoop, RDF, OWL, and SPARQL) without having to depend on the particular fea-
tures of a given platform.
technologies we have discussed in Chapter 7 as well as in other chapters in Sections I through III
have to be explored for use in AIS between multiple social networks.
REFERENCES
[ABER04]. K. Aberer, P. Cudŕe-Mauroux, M. Hauswirth, T. Van Pelt, “GridVine: Building Internet-Scale
Semantic Overlay Networks,” In Proceedings of International Semantic Web Conference, Hiroshima,
Japan, pp. 107–121, 2004.
[ABRA10]. J. Abraham, P. Brazier, A. Chebotko, J. Navarro, A. Piazza, “Distributed Storage and Querying
Techniques for a Semantic Web of Scientific Workflow Provenance,” In Proceedings IEEE International
Conference on Services Computing (SCC), Miami, FL, pp. 178–185, 2010.
[AMAZ16]. Overview of Security Processes. Available at: https://aws.amazon.com/whitepapers/overview-of-
security-processes/, 2016.
[ANDE09]. B. Andersen and F. Neuhaus, “An Ontological Approach to Information Access Control and Provenance,”
In Proceedings of Ontology for the Intelligence Community, Fairfax, VA, USA, pp. 1–6, October 2009.
[AWAD10]. M. Awad, L. Khan, B. M. Thuraisingham. “Policy Enforcement System for Inter-Organizational
Data Sharing,” Journal of Information Security and Privacy 4 (3), 22–39, 2010.
[BIOM]. Biomanta http://www.itee.uq.edu.au/eresearch/projects/biomanta.
[CADE10]. T. Cadenhead, M. Kantarcioglu, B. Thuraisingham, “Scalable and Efficient Reasoning for
Enforcing Role-Based Access Control,” In Proceedings of Data and Applications Security and Privacy
XXIV, 24th Annual IFIP Working Group 11.3 Working Conference, Rome, Italy, pp. 209–224, 2010.
[CADE11a]. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, B. Thuraisingham, “Transforming Provenance
Using Redaction,” In SACMAT’2011: Proceedings of ACM Symposium on Access Control Models and
Technologies, Innsbruck, Austria, pp. 93–102, 2011.
[CADE11b]. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, B. Thuraisingham, “A Language for Provenance
Access Control,” In CODASPY’2011: Proceedings of ACM Conference on Data Application Security
and Privacy, San Antonio, TX, USA, pp. 125–144, 2011.
[CADE12a]. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, B. M. Thuraisingham, “A Cloud-Based RDF Policy
Engine for Assured Information Sharing,” In SACMAT’2012: Proceedings of ACM Symposium on Access
Control Models and Technologies, Newark, NJ, USA, pp. 113–116, 2012.
[CADE12b]. T. Cadenhead, M. Kantarcioglu, V. Khadilkar, B. M. Thuraisingham, “Design and Implementation
of a Cloud-based Assured Information Sharing System,” Proceedings of International Conference on
Mathematical Methods, Models and Architectures for Computer Network Security, St. Petersburg,
Russia, pp. 36–50, 2012.
[CAI04]. M. Cai and M. Frank, “RDFPeers: A Scalable Distributed RDF Repository Based on a Structured
Peer-to-Peer Network,” In Proceedings ACM World Wide Web Conference, New York, NY, USA,
pp. 650–657, 2004.
[CARM04]. B. Carminati, E. Ferrari, B.M. Thuraisingham,“Using RDF for Policy Specification and
Enforcement,” In Proceedings of International Workshop on Database and Expert Systems Applications,
Zaragoza, Spain, pp. 163–167, 2004.
[CARM09]. B. Carminati, E. Ferrari, R. Heatherly, M. Kantarcioglu, B. M. Thuraisingham, “Design and
Implementation of a Cloud-Based Assured Information Sharing System,” In Proceedings of ACM
Symposium on Access Control Models and Technologies, Stresa, Italy, pp. 177–186, 2009.
[CAST09]. P. Castagna, A. Seaborne, C. Dollin, “Parallel Processing Framework for RDF Design and Issues,”
Technical Report, HP Laboratories, HPL-2009-346, 2009.
[CELI07]. E. Celikel, M. Kantarcioglu, B. Thuraisingham, E. Bertino, “Managing Risks in RBAC Employed
Distributed Environments,” On the Move to More Meaningful Internet Systems 2007: CoopIS, DOA,
ODBASE, GADA, and IS. Volume 4804 of Lecture Notes in Computer Science, Springer, New York,
pp. 1548–1566, 2007.
[CHOI09]. H. Choi, J. Son, Y. Cho, M. Sung, Y. Chung, “SPIDER: A System for Scalable, Parallel/Distributed
Evaluation of Large-Scale RDF Data,” In CIKM’09: Proceedings of ACM Conference on Information
and Knowledge Management, Hong Kong, China, pp. 2087–2088, 2009.
[CIRI07]. L. Cirio, I. Cruz, R. Tamassia, “A Role and Attribute Based Access Control System Using Semantic
Web Technologies,” IFIP Workshop on Semantic Web and Web Semantics, Vilamoura, Algarve, Portugal,
pp. 1256–1266, 2007.
[DING05]. L. Ding, T. Finin, Y. Peng, P. da Silva, D. McGuinness, “Tracking RDF Graph Provenance using
RDF Molecules,” In Proceedings International Semantic Web Conference, Galway, Ireland, 2005,
https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2005-poster-demo/PID-87.pdf.
[DoD]. DoD Information Enterprise Strategic Plan, 2010–2012, http://dodcio.defense.gov/Portals/0/
Documents/DodIESP-r16.pdf.
[DoD07]. Department of Defense Information Sharing Strategy, 2007, http://www.defense.gov/releases/
release.aspx?releaseid=10831.
[FINI09]. T. Finin et al., “Assured Information Sharing Life Cycle,” In Proceedings of Intelligence and
Security Informatics, Dallas, TX, USA, pp. 307–309, 2009.
[GUO05]. Y. Guo, Z. Pan, J. Heflin, “LUBM: A Benchmark for OWL Knowledge Base Systems,” Web
Semantics 3 (2, 5), 158–182, 2005.
[HAML06a]. K. Hamlen, G. Morrisett, F. Schneider, “Computability Classes for Enforcement Mechanisms,”
ACM Transactions on Programming Languages and Systems 28 (1), 175–205, 2006.
[HAML06b]. K. Hamlen, G. Morrisett, F. Schneider, “Certified In-Lined Reference Monitoring on .NET,”
In Proceedings of the ACM Workshop on Programming Language and Analysis for Security, Ottawa,
Canada, pp. 7−16, 2006.
[HAML10a]. K. Hamlen, M. Kantarcioglu, L. Khan, B. Thuraisingham, “Security Issues for Cloud Computing,”
Journal of Information Security and Privacy 4 (2), 2010.
[HAML10b]. K. Hamlen, V. Mohan, R. Wartell, “Reining in Windows API Abuses with In-lined Reference
Monitors,” Technical Report UTDCS-18-10, Computer Science Department, The University of Texas at
Dallas, 2010.
[HART07]. A. Harth, J. Umbrich, A. Hogan, S. Decker, “YARS2: A Federated Repository for Searching and
Querying Graph Structured Data,” In Proceedings of International Semantic Web Conference, Busan,
Korea, pp. 211–224, 2007.
[HUSA11]. M. Husain, J. McGlothlin, M. Masud, L. Khan, B. Thuraisingham, “Heuristics-Based Query
Processing for Large RDF Graphs Using Cloud Computing,” IEEE Transactions on Knowledge and
Data Engineering 23, pp. 1312–1327, 2011.
[INTE09]. Distributed reasoning: Seamless integration and processing of distributed knowledge, http://www.
integrail.eu/documents/fs04.pdf.
[JAIN06]. A. Jain and C. Farkas, “Secure Resource Description Framework: An Access Control Model,”
In Proceedings of ACM Symposium on Access Control Models and Technologies, Lake Tahoe, CA,
pp. 121–129, 2006.
[JONE10]. M. Jones and K. Hamlen, “Disambiguating Aspect-Oriented Security Policies” In Proceedings of
9th International Conference on Aspect-Oriented Software Development, Rennes and St. Malo, France,
pp. 193–204, 2010.
[JONE11]. M. Jones and K. Hamlen, “A Service-Oriented Approach to Mobile Code Security,” In
MobiWIS’2011: Proceedings of the 8th International Conference on Mobile Web Information Systems,
Niagara Falls, ON, Canada, pp. 531–538, 2011.
[KAGA02]. L. Kagal, “Rei: A policy language for the me-centric project,” HPL-2002-270, accessible online
http://www.hpl.hp.com/techreports/2002/HPL-2002-270.html, 2002.
[KANT10]. M. Kantarcioglu, “Incentive-Based Assured Information Sharing,” AFOSR MURI Review, October
2010.
[KHAD11]. V. Khadilkar, M. Kantarcioglu, B. Thuraisingham, S. Mehrotra, “Secure Data Processing in a
Hybrid Cloud,” In CoRR’2011: Computing Research Repository, abs/1105.1982, 2011.
[KHAL10]. A. Khaled, M. Husain, L. Khan, K. Hamlen, B. Thuraisingham, “A Token-Based Access Control System
for RDF Data in the Clouds,” In Proceedings of CloudCom, Indianapolis, IN, USA, pp. 104–111, 2010.
[KHAN10]. A. Khandelwal, J. Bao, L. Kagal, I. Jacobi, L. Ding, J. Hendler, “Analyzing the AIR Language: A
Semantic Web (Production) Rule Language,” In Proceedings of International Web Reasoning and Rule
Systems, Bressanone, Brixen, Italy, pp. 58–72, 2010.
[MARS10]. A. Marshall, M. Howard, G. Bugher, B. Harden, Security Best Practices in Developing Windows
Azure Applications, Microsoft Corp., Redmond, WA, USA, 2010.
[NEWM08]. A. Newman, J. Hunter, Y. Li, C. Bouton, M. Davis, “A Scale-Out RDF Molecule Store for
Distributed Processing of Biomedical Data,” Semantic Web for Health Care and Life Sciences Workshop,
World Wide Web Conference, Beijing, China, 2008.
[NSA11]. http://www.informationweek.com/news/government/cloud-saas/229401646, 2011.
[OMAL09]. D. O’Malley, K. Zhang, S. Radia, R. Marti, C. Harrell, Hadoop Security Design. https://issues.
apache.org/jira/secure/attachment/12428537/security-design.pdf.
[RAO08]. P. Rao, D. Lin, E. Bertino, N. Li, J. Lobo, “EXAM: An Environment for Access Control Policy
Analysis and Management,” In POLICY’08: Proceedings of IEEE Workshop on Policies for Distributed
Systems and Networks, Palisades, NY, USA, pp. 238–240, 2008.
[REDD05]. P. Reddivari, T. Finin, A. Joshi, “Policy-Based Access Control for an RDF Store. Policy
Management for the Web,” In IJCAI’05: Proceedings of the International Joint Conference on Artificial
Intelligence Workshop, Edinburgh, Scotland, UK, 2005.
[REUL10]. Q. Reul, G. Zhao, R. Meersman, “Ontology-Based Access Control Policy Interoperability,” In
MISC’2010: Proceedings of the 1st Conference on Mobility, Individualisation, Socialisation and
Connectivity, London, UK, 2010.
[SHAR11]. SHARD: http://blog.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-
build-a-scalable-distributed-triple-store/.
[SRID10]. M. Sridhar and K. Hamlen, “Model-Checking In-Lined Reference Monitors,” Proceedings of the
11th International Conference on Verification, Model Checking, and Abstract Interpretation, Madrid,
Spain, pp. 312–327, 2010.
[TALB09]. D. Talbot, “How Secure is Cloud Computing?” http://www.technologyreview.com/computing/23951/.
[THUR08]. B. Thuraisingham, H. Kumar, L. Khan, “Design and Implementation of a Framework for Assured
Information Sharing Across Organizational Boundaries,” Journal of Information Security and Privacy
2 (4), 67–90, 2008.
[THUR10]. B. Thuraisingham, V. Khadilkar, A. Gupta, M. Kantarcioglu, L. Khan, Secure Data Storage and
Retrieval in the Cloud. CollaborateCom, Chicago, IL, USA, 2010.
[THUR11]. B. Thuraisingham and V. Khadilkar, “Toward the Design and Implementation of a Cloud-centric
Assured Information System,” TR# UTDCS, September 2011.
[THUR12]. B. M. Thuraisingham, V. Khadilkar, J. Rachapalli, T. Cadenhead, M. Kantarcioglu, K. W. Hamlen,
L. Khan, M. F. Husain, “Cloud-Centric Assured Information Sharing,” In PAISI’2012: Proceedings of the
Pacific Asia Workshop on Intelligence and Security Informatics, Kuala Lumpur, Malaysia, pp. 1–26, 2012.
[THUS09]. A. Thusoo, J. Sharma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, “Hive
- A Warehousing Solution Over a Map-Reduce Framework,” In Proceedings of VLDB Endowment, Lyon,
France, 2(2), 1626–1629, 2009.
[URBA09]. J. Urbani, S. Kotoulas, E. Oren, F. van Harmelen, “Scalable Distributed Reasoning using
MapReduce,” Proceedings of the International Semantic Web Conference 2009, Lecture Notes in
Computer Science, Bernstein, A., Karger, D.R., Heath, T. et al., Vol. 5823, Springer, Berlin, Heidelberg,
pp. 634–649, 2009
[USZO04]. A. Uszok, J. Bradshaw, M. Johnson, R. Jeffers, A. Tate, J. Dalton, S. Aitken, “KAoS Policy
Management for Semantic Web Services,” IEEE Intelligent Systems 19 (4), 32–41, 2004.
[UTD1]. UTD Secure Cloud Repository, http://cs.utdallas.edu/secure-cloud-repository/.
[UTD2]. UTD Semantic Web Repository, http://cs.utdallas.edu/semanticweb/.
[VALL06]. E. Valle, A. Turati, A. Ghioni. “AGE: A Distributed Infrastructure for Fostering RDF-Based
Interoperability,” In DAIS’06: Proceedings of Distributed Applications and Inter-Operable Systems,
Bologna, Italy, 2006.
[ZQL]. Zql: a Java SQL parser. http://zql.sourceforge.net/.
26 Big Data Management for
Secure Information Integration
26.1 INTRODUCTION
Cloud computing and big data services like Amazon S3 [AMAZ] are gaining a lot of popu-
larity because of factors such as cost efficiency and ease of maintenance. We have evaluated
the feasibility of using S3 storage services for storing semantic web data using the Intelligence
Community’s Blackbook system. Blackbook was an initiative by Intelligence Advanced Research
Project Activity (IARPA) toward building a semantic web-based data integration framework
[BLAC]. The main purpose of the Blackbook system is to provide intelligence analysts an easy-
to-use tool to access data from disparate data sources, make logical inferences across the data
sources, and share this knowledge with other analysts using the system. Besides providing a web
application interface, it also exposes its services by means of web services. Blackbook integrates
data from different data sources, thereby making it prudent to store the data sources in a shared
environment like the one provided by cloud computing services. Blackbook essentially uses sev-
eral semantic data sources to produce search results. But storing shared data in cloud environ-
ments in a secure manner is a big challenge. Our approach to solving this problem is discussed
in this chapter.
In our approach, we stored one of the Blackbook data sources on Amazon S3 in a secure manner,
thus leveraging cloud computing services within a semantic web-based framework. We encrypted
the data source using Advanced Encryption Standard [AES] before storing it on Amazon S3. Also,
we do not store the original key anywhere in our system. Instead, the key is generated by two separate
components, each called a “Key Server.” Then, the generated key is used to encrypt data.
To prevent replay attacks, we used the Lamport one time password (OTP) [LAMP81] scheme
to generate the passwords that are used by the client for authentication with the “Key Servers.”
We used the role-based access control (RBAC) model [SAND96] to restrict system access to
authorized users and implemented the RBAC policies using Sun’s implementation of XACML
[OASI].
In this chapter, we describe the design and implementation of a secure information integration
framework that uses Blackbook. Details of Blackbook can be found in [BLAC]. In Section 26.2,
we present a detailed description of our implementation. Section 26.3 presents our experimental
results. This chapter's summary and future directions are presented in Section 26.4. Additional
details of our work can be found in [PARI09] and [PARI12]. Figure 26.1 illustrates the contents
of this chapter.
• Integrating Blackbook with Amazon S3
• Architecture and security model
• Experiments and results
FIGURE 26.1 Big data and cloud for secure information integration.
Cloud computing is a paradigm of computing in which dynamically scalable and often virtual-
ized resources are provided as a service over the Internet [CLOU]. The concept incorporates the
following combinations:
Economic advantage is one of the main motivations behind the cloud computing paradigm,
since it promises the reduction of capital expenditure (CapEx) and operational expenditure (OpEx)
[JENS09]. Various organizations can share data and computational power using the cloud comput-
ing infrastructure. For instance, salesforce.com is an industry leader in customer relationship man-
agement (CRM) products and one of the pioneers to leverage the cloud computing infrastructure on
a massive scale. Since Blackbook is a data-integration framework, it can search and integrate data
from various data sources that may be located on local machines or remote servers. We utilized the
data storage services provided by Amazon S3 to store data sources used by Blackbook.
The reasons we chose Amazon S3 are as follows:
One of the major challenges for current cloud computing systems is privacy risk. That is, privacy
is an important concern for cloud computing services in terms of legal compliance and user trust. In
[PEAR09], the author provides some interesting insights about how privacy issues should be taken
into consideration when designing cloud computing services. The main privacy risks identified in
[PEAR09] include the following:
• For the cloud service user—being forced to be tracked or give personal information against
his/her will.
• For the organization using cloud service—noncompliance of enterprise policies, loss of
reputation, and credibility.
• For implementers of cloud platforms—exposure of sensitive information stored on the
platforms, loss of reputation, and credibility.
• For providers of applications on top of cloud platforms—legal noncompliance, loss of
reputation.
• For the data subject—exposure of personal information.
We have used Amazon S3 in our implementation. “Amazon S3 is storage for the Internet. It is
designed to make web-scale computing easier for developers. Amazon S3 provides a simple web
services interface that can be used to store and retrieve any amount of data, at any time, from any-
where on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpen-
sive data storage infrastructure that Amazon uses to run its own global network of web sites. The
service aims to maximize benefits of scale and to pass those benefits on to developers.” [AMAZ].
Many organizations use services like Amazon S3 for data storage. Some important questions that
need to be addressed include: Is the data we store on S3 secure? Is it accessible by any user outside
our organization? How do we restrict access to files for users within the organization? To keep our
data secure, we propose to encrypt the data using Advanced Encryption Standard (AES) before
uploading the data files on Amazon S3. To restrict access to files to users within the organization,
we propose to implement RBAC policies using XACML. In RBAC, permissions are associated with
roles and users are made members of appropriate roles. This simplifies management of permissions
[SAND96]. Our system architecture is illustrated in Figure 26.2.
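The encrypt-before-upload step can be sketched as follows. The AES-GCM mode, the 256-bit key size, and the bucket and file names are illustrative assumptions (the design itself only prescribes AES), and the AWS SDK for Java is used for the upload.

```java
import java.nio.file.*;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/** Illustrative sketch: encrypt a data-source file with AES before uploading it
 *  to Amazon S3. Bucket and file names are placeholders; the splitting of the
 *  key across two key servers is shown in a later sketch. */
public class EncryptAndUpload {
    public static void main(String[] args) throws Exception {
        Path plain = Paths.get("datasource.nt");
        Path encrypted = Paths.get("datasource.nt.enc");

        // Generate a fresh 256-bit AES key (in the design this key is the XOR
        // of two shares, one held by each key server).
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();

        // AES-GCM with a random 12-byte IV; the IV is stored with the ciphertext.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(Files.readAllBytes(plain));

        byte[] ivPlusCiphertext = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, ivPlusCiphertext, 0, iv.length);
        System.arraycopy(ciphertext, 0, ivPlusCiphertext, iv.length, ciphertext.length);
        Files.write(encrypted, ivPlusCiphertext);

        // Upload the encrypted file to S3 (credentials come from the default chain).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject("my-blackbook-bucket", "datasource.nt.enc", encrypted.toFile());
    }
}
```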
The data sources are stored on an Amazon S3 server in an encrypted form. The two keys used
to encrypt the data source are stored on two servers—key server 1 and key server 2. The policies
associated with the data sources for different users are also stored on these servers.
The system uses the OTP for authentication. This password is only valid for a single session or
transaction. OTPs avoid the shortcomings associated with static passwords [ONE]. Unlike static
passwords, they are not vulnerable to replay attacks. So if an intruder manages to get hold of an OTP
that was used previously to log into a service or carry a transaction, the system’s security would
not be compromised since that password will no longer be valid. The only drawback of OTPs is that
humans cannot memorize them, and hence they require additional technology in order to work.
OTP generation algorithms make use of randomness to prevent the prediction of future OTPs
based on the previously observed OTPs. Some of the approaches to generate OTPs are as follows:
• Use of a mathematical algorithm to generate a new password based on the previous passwords.
• Based on time synchronization between the authentication server and the client providing
the password.
• Use of a mathematical algorithm where the new password is based on a challenge (e.g., a
random number chosen by the authentication server or transaction details) and/or a counter.
(Figure 26.2: system architecture. The client (browser) sends a search query to Blackbook; the Blackbook search engine authenticates to two trusted servers, each hosting a policy server and a key server, by sending an OTP stack value with the user credentials; the two returned keys are XOR-ed, and the encryption/decryption service provider uses the combined key to decrypt the data source retrieved from Amazon S3, which is searched alongside the other data sources before the results are returned.)
We use Lamport’s OTP scheme for authentication. The Lamport OTP approach is based on a
mathematical algorithm for generating a sequence of “passkey” values, and each successor value
is based on the value of predecessor. The core of Lamport’s OTP scheme requires that cooperating
client/service components agree to use a common sequencing algorithm to generate a set of expir-
ing OTPs (client side) and validate client-provided passkeys included in each client-initiated request
(service side). In our case, the client is the Blackbook system and the service components are the
“Key Servers.” The client generates a finite sequence of values starting with a “seed” value and each
successor value is generated by applying some transformation algorithm (or F(S) function) to the
previous sequence value, that is, S(i+1) = F(S(i)).
We use the “password” of the user, which is salted with some randomly generated bytes (using
SHA1PRNG) as a key to generate the seed value using SHA-256 [SECU02]. The next values in the
sequence are generated using the obtained seed value using SHA-256. All these generated values
are stored in a stack on the client machine. The topmost value on the stack is stored on both the “Key
Servers” (1 and 2). If the client sends a request for the first time, the topmost value of the client stack
is compared with that on the “Key Servers” (1 and 2). If the values match, the client is authenticated
and the topmost value on the client stack is removed. For subsequent requests, the topmost value
on the client stack is used to compute the successor value using the hash function (used to build the
stack). If the generated value and the value on the “Key Servers” match, the user is authenticated;
the topmost value on the client stack is stored on the “Key Servers” and subsequently removed
from the client stack. If the client stack is exhausted, a new stack is generated and the topmost value
on the stack is stored on the “Key Servers.” Once the user is authenticated using the OTP scheme,
the user request is evaluated against the policies applicable for the resource (data source in our case)
requested by the user for access. The predefined policies are stored in the “Policy Server” compo-
nent of the “Key Servers.” If the policies for the resource are applicable for the user request, the
“Key Server” sends the keys used to encrypt the resource requested by the user.
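A compact sketch of this OTP chain is shown below: the seed is derived from the salted password with SHA-256, the chain is built by repeated hashing and kept on a client-side stack, and the server-side check rehashes the presented value and compares it with the stored one. The stack depth and the hex encoding are illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HexFormat;

/** Illustrative sketch of the Lamport-style OTP chain described above. */
public class LamportOtpSketch {

    static byte[] sha256(byte[] input) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(input);
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Build a stack of n chained values: seed, F(seed), F(F(seed)), ... */
    static Deque<String> buildStack(String password, byte[] salt, int n) throws Exception {
        byte[] value = sha256(concat(password.getBytes(StandardCharsets.UTF_8), salt));
        Deque<String> stack = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            stack.push(HexFormat.of().formatHex(value));   // topmost = latest value
            value = sha256(value);
        }
        return stack;
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);                // random salt for the seed

        Deque<String> clientStack = buildStack("user-password", salt, 100);
        String storedOnKeyServers = clientStack.pop();     // topmost value shared with servers

        // Subsequent request: the client reveals the new topmost value; the servers
        // check that hashing it reproduces the value they currently store.
        String presented = clientStack.peek();
        String rehash = HexFormat.of()
                .formatHex(sha256(HexFormat.of().parseHex(presented)));
        System.out.println("authenticated: " + rehash.equals(storedOnKeyServers));  // true
    }
}
```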
We use XACML to implement the access control using the policies defined in an XML file. After
the user is authenticated with the system, the system checks if the user is authorized to access the
requested resource. The user request is handled by the Policy Enforcement Point (PEP) that converts
a users’ request into an XACML request and sends it to the Policy Decision Point (PDP) for further
evaluation. The PDP evaluates the request and sends back a response that can be either “access
permitted” or “access denied,” with the appropriate obligations. (We are not considering obligations
for our system.) A policy is a collection of several subcomponents: target, rules, rule-combining
algorithm, and obligations.
Target: Each policy has only one target that helps in determining whether the policy is rel-
evant for the request. The policy’s relevance for the request determines if the policy is
to be evaluated for the request, which is achieved by defining attributes of three catego-
ries in the target—subject, resource, and action. For example, we have specified the value
“[email protected]” for the subject and “amazons3” for the resource.
Rules: We can associate multiple roles with the policy. Each rule consists of a condition, an
effect, and a target.
Conditions are statements about attributes that return True, False, or Indeterminate upon
evaluation.
Effect is the consequence of the satisfied rule that assumes the value Permit or Deny. We have
specified the value as “Permit.”
Target helps in determining if the rule is relevant for the request.
Rule-combining Algorithms: As a policy can have various rules, it is possible for different
rules to generate conflicting results. Rule-combining algorithms resolve such conflicts to
arrive at one outcome per policy per request. Only one rule-combining algorithm is appli-
cable to one policy.
Obligations allow the mechanism to provide a finer level of access control than mere permit
and deny decisions. They are the actions that must be performed by the PEP in conjunction
with the enforcement of an authorization decision.
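The PEP/PDP interaction can be sketched with small hypothetical types (the actual system uses Sun's XACML implementation); the example encodes one permit rule for the resource "amazons3" and applies a permit-overrides rule-combining algorithm, with conditions and obligations omitted.

```java
import java.util.List;

/** Illustrative sketch of the PEP/PDP flow with hypothetical types rather than
 *  the real Sun XACML classes: the PEP turns a user request into an
 *  attribute-based request, and the PDP matches it against policy rules using
 *  a permit-overrides rule-combining algorithm. */
public class XacmlFlowSketch {

    enum Effect { PERMIT, DENY }

    /** Subject/resource/action attributes extracted from the user request. */
    record Request(String subject, String resource, String action) {}

    /** A rule's target plus its effect; conditions are omitted for brevity. */
    record Rule(String subject, String resource, String action, Effect effect) {
        boolean matches(Request r) {
            return subject.equals(r.subject())
                    && resource.equals(r.resource())
                    && action.equals(r.action());
        }
    }

    /** PDP with a permit-overrides rule-combining algorithm and default deny. */
    static Effect decide(List<Rule> policy, Request request) {
        Effect decision = Effect.DENY;
        for (Rule rule : policy) {
            if (rule.matches(request)) {
                if (rule.effect() == Effect.PERMIT) return Effect.PERMIT;
                decision = Effect.DENY;
            }
        }
        return decision;
    }

    public static void main(String[] args) {
        List<Rule> policy = List.of(new Rule("alice", "amazons3", "read", Effect.PERMIT));
        // PEP: convert the incoming user request into an XACML-style request.
        Request req = new Request("alice", "amazons3", "read");
        System.out.println("decision: " + decide(policy, req));   // PERMIT
    }
}
```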
After successful authentication and authorization, the Amazon File Manager downloads the
requested resource from the Amazon S3 server. More specifically, Key Server – 1 sends key1 and the
Key Server – 2 sends key2 to the Amazon File Manager. These keys are XOR-ed to get keyorg, that is, keyorg = key1 XOR key2.
Then, keyorg is used to decrypt the resource by the Encryption/Decryption Service Provider.
The main motive behind using two key servers is to avoid a single point of failure. If one of the key servers is hacked, the data is not compromised, because two keys, one from each key server, are needed to decrypt the data sources. However, if one of the key servers is hacked and the keys stored on that server are lost or corrupted, we run the risk of rendering the data source stored on Amazon useless, since both keys are required to recover the original key used to encrypt the data source. To avoid this, we propose taking periodic backups of the keys on each key server.
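A minimal sketch of this key-splitting scheme follows; the share size and variable names are illustrative. Because key2 is the XOR of keyorg with a random key1, neither share alone reveals anything about keyorg.

import os

def split_key(key_org):
    key1 = os.urandom(len(key_org))                     # share stored on Key Server 1
    key2 = bytes(a ^ b for a, b in zip(key_org, key1))  # share stored on Key Server 2
    return key1, key2

def recover_key(key1, key2):
    # keyorg = key1 XOR key2; both shares are required to recover it.
    return bytes(a ^ b for a, b in zip(key1, key2))

key_org = os.urandom(32)  # for example, a 256-bit AES key
k1, k2 = split_key(key_org)
assert recover_key(k1, k2) == key_org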
Scenario: We now describe a sample scenario, depicting an interaction with the Amazon S3 stor-
age service, with respect to the Blackbook system.
1. The user U fires a search query to Blackbook (step 1 in Figure 26.2). Blackbook federates
the query across various data sources, including the data source F that is securely stored
on Amazon S3.
2. We follow the OTP scheme to authenticate the client (Blackbook in this case) for using the
AWS S3 service. The client machine sends the topmost value on the OTP stack along with
the user credentials and the request to key servers 1 & 2 (steps 2a and 2b in Figure 26.2).
3. If the value passed by the client matches that on the OTP stack of the key servers and the
policies applicable for the user are valid for the request, the key servers send the “key” used
to decrypt the data source (steps 3a and 3b in Figure 26.2).
4. The keys, key1 and key2, obtained from the key servers 1 & 2 are XOR-ed to obtain the
original key used to decrypt the data source F (step 4 in Figure 26.2).
5. The Amazon File Manager passes the Amazon account credentials and the data source
name to retrieve the data source (steps 5 and 6 in Figure 26.2).
6. The Encryption/Decryption Service Manager retrieves the encrypted data source, and
then, using the XOR-ed key, decrypts the data source (steps 7 and 8 in Figure 26.2).
7. Blackbook performs a search on the data source retrieved from Amazon along with other
data sources and returns the results to the user (step 9 in Figure 26.2).
26.3 EXPERIMENTS
In our approach, we have used the Advanced Encryption Standard to encrypt data before storing
it on the Amazon S3 server. Uploading the data on the Amazon server is a one-time process. The
data source needs to be uploaded again only when the stored data needs to be modified. But the data
source stored on Amazon S3 needs to be downloaded every time the user issues a search query to
the Blackbook system. Because the data source must also be decrypted every time a query is issued, performance may suffer, since encryption and decryption are costly operations.
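The encrypt-before-upload and decrypt-after-download cycle can be sketched as follows. The book specifies AES; the particular library (pycryptodome) and mode (EAX) used here are assumptions made for the sketch, not necessarily what the prototype used.

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key_org = get_random_bytes(16)  # AES-128 key, later split across the two key servers

def encrypt_for_s3(plaintext):
    # Encrypt the data source once before uploading it to Amazon S3.
    cipher = AES.new(key_org, AES.MODE_EAX)
    ciphertext, tag = cipher.encrypt_and_digest(plaintext)
    return cipher.nonce, tag, ciphertext

def decrypt_from_s3(nonce, tag, ciphertext):
    # Decrypt the downloaded data source on every query.
    cipher = AES.new(key_org, AES.MODE_EAX, nonce=nonce)
    return cipher.decrypt_and_verify(ciphertext, tag)

blob = encrypt_for_s3(b"<rdf:RDF> ... data source F ... </rdf:RDF>")
assert decrypt_from_s3(*blob) == b"<rdf:RDF> ... data source F ... </rdf:RDF>"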
We ran the experiments on a Dell desktop computer running Ubuntu Gutsy 7.10 with the following hardware configuration: Intel® Pentium® 4 CPU 3.00 GHz, 1 GB RAM. The network band-
width while running the experiments varied between 250 and 300 Mbps. We generated the data
files using the triple generation program provided by SP2B, the SPARQL Performance Benchmark
[SPAR]. We experimented with 30 files of different sizes, ranging from 1 to 30 MB. Details of the
experiments are given in [PARI09].
REFERENCES
[AES] Advanced Encryption Standard, http://en.wikipedia.org/wiki/Advanced_Encryption_Standard.
[AMAZ] Amazon S3, http://aws.amazon.com/s3/.
[BLAC] http://info.publicintelligence.net/IARPA_overview_UMD.pdf.
[CLOU] Cloud Computing, http://en.wikipedia.org/wiki/Cloud_computing.
[JENS09] M. Jensen, J. Schwenk, N. Gruschka, and L. L. Iacono, “On Technical Security issues in Cloud
Computing,” In Proceedings of IEEE International Conference on Cloud Computing, pp. 109–116, 2009.
[LAMP81] L. Lamport, “Password Authentication with Insecure Communication,” Communications of the
ACM 24 (11), 770–772, 1981.
[OASI] OASIS, https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml.
[ONE] One-time Password, http://en.wikipedia.org/wiki/One-time_password.
[PARI09] P. Parikh, “Secured Information Integration with a Semantic Web-Based Framework,” Master’s
thesis, The University of Texas at Dallas, 2009.
[PARI12] P. Parikh, M. Kantarcioglu, V. Khadilkar, B. M. Thuraisingham, and L Khan, “Secure Information
Integration with a Semantic Web-Based Framework,” In Proceedings of the IRI, Las Vegas, NV, USA,
pp. 659–663, 2012.
[PEAR09] S. Pearson, HP Labs, “Taking Account of Privacy when Designing Cloud Computing Services,”
In Proceedings of IEEE ICSE Cloud09, Workshop on Software Engineering Challenges in Cloud
Computing, Vancouver, IEEE, pp. 44–52, 2009.
[SAND96] R. Sandhu, E. J. Coyne, H. L. Feinstein, and C. Youman, “Role Based Access Control Models,”
IEEE Computer 29 (2), 38–47, 1996.
[SECU02] Secure Hash Standard, http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf, 2002.
[SPAR] SPARQL Performance Benchmark http://www.openlinksw.com/dataspace/vdb/weblog/vdb%27s%20
BLOG%20%5B136%5D/1423.
27 Big Data Analytics for
Malware Detection
27.1 INTRODUCTION
In the previous chapters in Part IV, we discussed sample big data management and analytics
(BDMA) and big data security and privacy (BDSP) systems. These include systems such as the
SPARQL query processor, InXite, CAISS, and the Secure Data Integration framework. In this chap-
ter, we will discuss an experimental system that uses big data analytics and cloud for malware detec-
tion. In other words, we will show how big data analytics techniques can be used for malware detection.
In fact, some of our work discussed in Part III on stream data analytics for insider threat detection
has been influenced by the system discussed in this chapter.
Malware is a potent vehicle for many successful cyber attacks every year, including data and
identity theft, system and data corruption, and denial of service; it therefore constitutes a significant
security threat to many individuals and organizations. The average direct malware cost damages
worldwide per year from 1999 to 2006 have been estimated at $14 billion USD [COMP07]. This
includes labor costs for analyzing, repairing and disinfecting systems, productivity losses, revenue
losses due to system loss or degraded performance, and other costs directly incurred as a result of
the attack. However, the direct cost does not include the prevention cost, such as antivirus software,
hardware, and IT (information technology) security staff salary. Aside from these monetary losses,
individuals and organizations also suffer identity theft, data theft, and other intangible losses due
to successful attacks.
Malware includes viruses, worms, Trojan horses, time and logic bombs, botnets, and spyware. A
number of techniques have been devised by researchers to counter these attacks; however, the more
successful the researchers become in detecting and preventing the attacks, the more sophisticated the malicious code that appears in the wild. Thus, the arms race between malware authors and malware
defenders continues to escalate. One popular technique applied by the antivirus community to
detect malicious code is signature detection. This technique matches untrusted executables against
a unique telltale string or byte pattern known as a signature, which is used as an identifier for a
particular malicious code. Although signature detection techniques are widely used, they are not
effective against zero-day attacks (new malicious code), polymorphic attacks (different encryptions
of the same binary), or metamorphic attacks (different code for the same functionality) [CRAN05].
There has therefore been a growing need for fast, automated, and efficient detection techniques that
are robust to these attacks.
This chapter describes a data mining technique that is dedicated to the automated generation of
signatures to defend against these kinds of attacks. Due to the need for the near real-time perfor-
mance of the malware detection tools, we have developed our data mining tool in the cloud. We
describe the detailed design and implementation of this cloud-based tool in the remaining sections
of this chapter.
This chapter is organized as follows. Section 27.2 discusses malware detection. Section 27.3
discusses related work. Section 27.4 discusses the classification algorithm and proves its effective-
ness analytically. Section 27.5 then describes the feature-extraction and feature-selection technique
using cloud computing for malware detection, and Section 27.6 discusses data collection, experi-
mental setup, evaluation techniques, and results. Section 27.7 discusses several issues related to our
approach, and finally, Section 27.8 summarizes our conclusions. Figure 27.1 illustrates the concepts
of this chapter.
Figure 27.1 Concepts of this chapter: cloud-based feature extraction, malware detection, stream classification, and experimental results.
Our approach trains on the r most recent labeled consecutive data chunks: we divide these r chunks into v partitions and train a classifier with each partition. Therefore, v classifiers are trained using the r consecutive chunks.
We then update the ensemble by choosing the best Kv classifiers (based on accuracy) among the
newly trained v classifiers and the existing Kv classifiers. Thus, the total number of classifiers in the
ensemble remains constant. Our approach is therefore parameterized by the number of partitions v,
the number of chunks r, and the ensemble size K.
Our approach does not assume that new data points appearing in the stream are immediately
labeled. Instead, it defers the ensemble updating process until labels for the data points in the latest
data chunk become available. In the meantime, new unlabeled data continue to be classified using
the current ensemble. Thus, the approach is well suited to applications in which misclassifications
solicit corrected labels from an expert user or other source. For example, consider the online credit
card fraud detection problem. When a new credit card transaction takes place, its class (fraud or
authentic) is predicted using the current ensemble. Suppose a fraudulent transaction is misclas-
sified as authentic. When a customer receives the bank statement, he or she identifies this error
and reports it to the authority. In this way, the actual labels of the data points are obtained and the
ensemble is updated accordingly.
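The following Python sketch captures the update and voting steps described above at a high level. The base learner (a decision tree), the scoring on the latest labeled chunk, and the random partitioning are illustrative stand-ins rather than the exact implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def update_ensemble(ensemble, recent_chunks, K, v):
    # recent_chunks: the r most recent labeled chunks, as a list of (X, y) pairs.
    X = np.vstack([c[0] for c in recent_chunks])
    y = np.concatenate([c[1] for c in recent_chunks])
    # Divide the pooled r chunks into v partitions and train one classifier per partition.
    parts = np.array_split(np.random.permutation(len(y)), v)
    new_models = [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in parts]
    # Keep the best K*v classifiers, judged by accuracy on the latest labeled chunk.
    X_last, y_last = recent_chunks[-1]
    candidates = ensemble + new_models
    candidates.sort(key=lambda m: m.score(X_last, y_last), reverse=True)
    return candidates[:K * v]

def predict(ensemble, X):
    # Unweighted (simple) majority vote across the ensemble members.
    votes = np.stack([m.predict(X) for m in ensemble])
    return (votes.mean(axis=0) >= 0.5).astype(int)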
27.2.3 Our Contributions
Our contributions can therefore be summarized as follows. We design and develop a generalized mul-
tipartition, multichunk ensemble technique that significantly reduces the expected classification error
over existing SPC ensemble methods. A theoretical analysis justifies the effectiveness of the approach.
We then formulate the malware detection problem as a data stream classification problem and identify
drawbacks of traditional malicious code detection techniques relative to our data mining approach.
We design and develop a scalable and cost-effective solution to this problem using a cloud
computing framework. Finally, we apply our technique to synthetically generated data as well as
real botnet traffic and real malicious executables, achieving better detection accuracy than other
stream data-classification techniques. The results show that our ensemble technique constitutes a
powerful tool for intrusion detection based on data stream classification.
The continual appearance of new malware introduces a dynamic component to the problem that violates the static paradigm. We therefore
argue that effective malware detection must be increasingly treated as a data stream classification
problem in order to keep pace with attacks.
Many existing data stream classification techniques target infinite-length data streams that exhibit
concept drift ([AGGA06], [WANG03], [YANG05], [KOLT05], [HULT01], [FAN04], [GAO07],
[HASH09], [ZHAN09]). All of these techniques adopt a one-pass incremental update approach, but
with differing approaches to the incremental updating mechanism. Most can be grouped into two
main classes: single-model incremental approaches and hybrid batch incremental approaches.
Single-model incremental updating involves dynamically updating a single model with each new
training instance. For example, decision tree models can be incrementally updated with incom-
ing data [HULT01]. In contrast, hybrid batch incremental approaches build each model from a
batch of training data using a traditional batch learning technique. Older models are then peri-
odically replaced by newer models as the concept drifts ([WANG03], [BIFE09], [YANG05],
[FAN04], [GAO07]). Some of these hybrid approaches use a single model to classify the unla-
beled data (e.g., [YANG05], [CHEN08]), while others use an ensemble of models (e.g., [WANG03],
[SCHO05]). Hybrid approaches have the advantage that model updates are typically far simpler
than in single-model approaches; for example, classifiers in the ensemble can simply be removed or
replaced. However, other techniques that combine the two approaches by incrementally updating
the classifiers within the ensemble can be more complex [KOLT05].
Accuracy-weighted classifier ensembles (AWEs) ([WANG03], [SCHO05]) are an important
category of hybrid incremental updating ensemble classifiers that use weighted majority voting for
classification. These divide the stream into equal-sized chunks, and each chunk is used to train a
classification model. An ensemble of K such models classifies the unlabeled data. Each time a new
data chunk is labeled, a new classifier is trained from that chunk. This classifier replaces one of the
existing classifiers in the ensemble. The replacement victim is chosen by evaluating the accuracy
of each classifier on the latest training chunk. These ensemble approaches have the advantage that
they can be built more efficiently than a continually updated single model and they observe higher
accuracy than their single-model counterparts [TUME96].
Our ensemble approach is most closely related to AWE, but with a number of significant
differences. First, we apply multipartitioning of the training data to build v classifiers from that
training data. Second, the training data consists of r consecutive data chunks (i.e., a multichunk
approach) rather than from a single chunk. We prove both analytically and empirically that both of
these enhancements, that is, multipartitioning and multichunk, significantly reduce the ensemble
classification error. Third, when we update the ensemble, v classifiers in the ensemble are replaced
by v newly trained classifiers. The v classifiers that are replaced may come from different chunks;
thus, although some classifiers from a chunk may have been removed, other classifiers from that
chunk may still remain in the ensemble. This differs from AWE in which removal of a classifier
means total removal of the knowledge obtained from one whole chunk. Our replacement strategy
also contributes to error reduction. Finally, we use simple majority voting rather than weighted vot-
ing, which is more suitable for data streams, as shown in [GAO07]. Thus, our multipartition, mul-
tichunk ensemble approach is a more generalized and efficient form of that implemented by AWE.
Our work extends our previously published work [MASU09]. Most existing data stream
classification techniques, including our previous work, assume that the feature space of the data
points in the stream is fixed. However, in some cases, such as text data, this assumption is not valid.
For example, when features are words, the feature space cannot be fully determined at the start of
the stream since new words appear frequently. In addition, it is likely that much of this large lexi-
con of words has low discriminatory power, and is therefore best omitted from the feature space.
It is therefore more effective and efficient to select a subset of the candidate features for each data
point. This feature selection must occur incrementally as newer, more discriminating candidate fea-
tures arise and older features become outdated. Therefore, feature-extraction and feature-selection
should be an integral part of data stream classification. In this chapter, we describe the design and implementation of such an approach.
(Figure: labeled and unlabeled data chunks D1 through D5 and the ensemble of classifiers used for prediction.)
We divide the training data from the r consecutive chunks into v partitions and train one model from each partition. This further reduces error because the
mean expected error of an ensemble of v classifiers is theoretically v times lower than that of a
single classifier [TUME96]. Therefore, both the multichunk and multipartition strategies contribute
to error reduction.
27.4.4 Hadoop/MapReduce Framework
We used the open-source Hadoop [APAC10] MapReduce framework to implement our experiments.
Here, we provide some of the algorithmic details of the Hadoop MapReduce feature-extraction and
feature-selection algorithm. The Map function in a MapReduce framework takes a key-value pair as input and yields a list of intermediate key-value pairs for each such input pair.
All the Map tasks are processed in parallel by each node in the cluster without sharing data with
other nodes. Hadoop collates the output of the Map tasks by grouping each set of intermediate val-
ues V ⊆ RVal that share a common intermediate key k ∈ RKey. The resulting collated pairs (k, V)
are then streamed to Reduce nodes. Each reducer in a Hadoop MapReduce framework therefore
receives a list of multiple (k, V) pairs, issued by Hadoop one at a time in an iterative fashion. Reduce
can therefore be understood as a function having a signature of roughly Reduce: list of (k, V) pairs → Val, where k ∈ RKey, V ⊆ RVal, and the co-domain Val is the type of the final results of the MapReduce cycle.
In our framework, Map keys (MKey) are binary file identifiers (e.g., filenames), and Map values
(MVal) are the file contents in bytes. Reduce keys (RKey) are n-gram features, and their correspond-
ing values (RVal) are the class labels of the file instances whence they were found. Algorithm 27.2
shows the feature extraction procedure that Map nodes use to map the former to the latter. Lines
5–10 of Algorithm 27.3 tally the class labels reported by Map to obtain positive and negative instance
counts for each n-gram. These form a basis for computing the information gain of each n-gram in line
11. Lines 12–16 use a min-heap data structure h to filter all but the best S features as evaluated by the
information gain. The final best S features encountered are returned by lines 18–20.
The q reducers in the Hadoop system therefore yield a total of qS candidate features and their
information gains. These are streamed to a second reducer that simply implements the last half of
Algorithm 27.3 to select the best S features.
Input: list F of (g, L) pairs, where g is an n-gram and L is a list of class labels; total size t of original
instance set; total number p of positive instances
Output: S pairs (g, i), where i is the information gain of n-gram g
1: heap h /* empty min-heap */
2: for all (g, L) in F do
3: t′ ← 0
4: p′ ← 0
5: for all l in L do
6: t′ ← t′ + 1
7: if l = + then
8: p′ ← p′ + 1
9: end if
10: end for
11: i ← Ĝ(p′, t′, p, t) /* see Equation 21 */
It is not always practical to use all n-gram features extracted from all the files corresponding to
the current chunk. The exponential number of such n-grams may introduce unacceptable memory
overhead, slow the training process, or confuse the classifier with large numbers of noisy, redun-
dant, or irrelevant features. To avoid these pitfalls, candidate n-gram features must be sorted accord-
ing to a selection criterion so that only the best ones are selected.
We choose the information gain as the selection criterion, because it is one of the most effective
criteria used in the literature for selecting the best features. The information gain can be defined as
a measure of the effectiveness of an attribute (i.e., feature) for classifying the training data. If we
split the training data based on the values of this attribute, then the information gain measures the
expected reduction in entropy after the split. The more an attribute reduces entropy in the training
data, the better that attribute is for classifying the data.
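Equation 21 itself is not reproduced in this chapter, but for a binary n-gram feature the information gain is typically computed from the counts used in Algorithm 27.3 along the following lines (our notation, stated here only as a standard formulation):

\hat{G}(p', t', p, t) = H\!\left(\tfrac{p}{t}\right) - \left[\tfrac{t'}{t}\,H\!\left(\tfrac{p'}{t'}\right) + \tfrac{t - t'}{t}\,H\!\left(\tfrac{p - p'}{t - t'}\right)\right], \qquad H(x) = -x \log_2 x - (1 - x)\log_2 (1 - x),

where t and p are the total and positive instance counts in the chunk, and t' and p' are the corresponding counts among the instances that contain the n-gram.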
We have shown in [MASU11] that as new features are considered, their information gains are com-
pared against the heap’s root. If the gain of the new feature is greater than that of the root, the root
is discarded and the new feature inserted into the heap. Otherwise, the new feature is discarded and
feature selection continues.
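A compact Python rendering of this heap-based selection step is shown below; info_gain stands in for the Ĝ function of Equation 21 and is assumed to be supplied by the caller.

import heapq

def select_best_features(candidates, S, info_gain):
    # candidates: iterable of (ngram, p_prime, t_prime) tallies.
    heap = []  # min-heap of (gain, ngram); the root is the worst feature kept so far
    for ngram, p_prime, t_prime in candidates:
        gain = info_gain(p_prime, t_prime)
        if len(heap) < S:
            heapq.heappush(heap, (gain, ngram))
        elif gain > heap[0][0]:
            heapq.heapreplace(heap, (gain, ngram))  # discard the root, keep the new feature
        # otherwise the new feature is discarded and selection continues
    return sorted(heap, reverse=True)  # the best S features and their gains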
• The total number of extracted n-gram features might be very large. For example, the total
number of 4-grams in one chunk is around 200 million. It might not be possible to store all
of them in main memory. One obvious solution is to store the n-grams in a disk file, but this
introduces unacceptable overhead due to the cost of disk read/write operations.
• If colliding features in the hash table T are not sorted, then a linear search is required for
each scanned n-gram during feature extraction to test whether it is already in T. If they are
sorted, then the linear search is required during insertion. In either case, the time to extract
all n-grams is worst case quadratic in the total number N of n-grams in each chunk, an
impractical amount of time when N ≈ 108. Similarly, the nondistributed feature-selection
process requires a sort of the n-grams in each chunk. In general, this requires O(N log N)
time, which is impractical when N is large.
In order to efficiently and effectively tackle the drawbacks of the nondistributed feature-
extraction and feature-selection approach, we leverage the power of cloud computing. This allows
feature extraction, n-gram sorting, and feature selection to be performed in parallel, utilizing the
Hadoop MapReduce framework.
MapReduce [DEAN08] is an increasingly popular distributed programming paradigm used in
cloud computing environments. The model processes large datasets in parallel, distributing the
workload across many nodes (machines) in a share-nothing fashion. The main focus is to simplify
the processing of large datasets using inexpensive cluster computers. Other objectives are ease of use, load balancing, and fault tolerance.
MapReduce is named for its two primary functions. The Map function breaks jobs down into
subtasks to be distributed to available nodes, whereas its dual, Reduce, aggregates the results of
completed subtasks. We will henceforth refer to nodes performing these functions as mappers and
reducers, respectively. The details of the MapReduce process for n-gram feature extraction and selec-
tion are explained in the appendix. In this section, we give a high-level overview of the approach.
Each training chunk containing N training files is used to extract the n-grams. These training
files are first distributed among m nodes (machines) by the HDFS (Figure 27.3, step 1). Quantity
m is selected by HDFS depending on system availability. Each node then independently extracts
n-grams from the subset of training files supplied to the node using the technique discussed in
Section 4.1 (Figure 27.3, step 2). When all nodes finish their jobs, the n-grams extracted from each
node are collated (Figure 27.3, step 3).
For example, suppose Node 1 observes n-gram abc in one positive instance (i.e., a malicious train-
ing file), while Node 2 observes it in a negative (i.e., benign) instance. This is denoted by pairs abc, +
and abc, − under Nodes 1 and 2 (respectively) in Figure 27.3. When the n-grams are combined, the
labels of instances containing identical n-grams are aggregated. Therefore, the aggregated pair for
abc is abc, + −. The combined n-grams are distributed to q reducers (with q chosen by HDFS based
on system availability). Each reducer first tallies the aggregated labels to obtain a positive count and
a total count. In the case of n-gram abc, we obtain tallies of pabc = 1 and tabc = 2. The reducer uses
these tallies to choose the best S n-grams (based on Equation 21) from the subset of n-grams supplied
to the node (Figure 27.3, step 5). This can be done efficiently using a min-heap of size S; the process
requires O(W log S) time, where W is the total number of n-grams supplied to each reducer. In con-
trast, the nondistributed version requires O(W log W) time. Thus, from the q reducer nodes, we obtain
qS n-grams. From these, we again select the best S by running another round of the MapReduce cycle
in which the Map phase does nothing, but the Reduce phase performs feature selection using only
one node (Figure 27.3, step 6). Each feature in a feature set is binary; its value is 1 if it is present in
a given instance (i.e., executable) and 0 otherwise. For each training or testing instance, we compute
the feature vector whose bits consist of the feature values of the corresponding feature set. These
feature vectors are used by the classifiers for training and testing.
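A simplified, single-process rendering of the Map and collate steps in this walkthrough is given below. It emits one (n-gram, label) pair per distinct n-gram in each executable and then groups the labels of identical n-grams; in the real system these steps run across Hadoop mapper nodes, and the helper names here are our own.

from collections import defaultdict

def map_ngrams(file_bytes, label, n=4):
    # Mapper: emit one (n-gram, class label) pair per distinct n-gram in the file.
    grams = {file_bytes[i:i + n] for i in range(len(file_bytes) - n + 1)}
    for g in grams:
        yield g, label

def collate(mapper_outputs):
    # Hadoop's shuffle stage: group the labels of identical n-grams across all mappers.
    grouped = defaultdict(list)
    for g, label in mapper_outputs:
        grouped[g].append(label)
    return grouped  # for example, the n-gram abc maps to ["+", "-"]

# Reducers then tally the "+" labels (p') and total labels (t') per n-gram and apply
# the information-gain selection sketched earlier.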
27.6 EXPERIMENTS
We evaluated our approach on synthetic data, botnet traffic generated in a controlled environment,
and a malware dataset. The results of the experiments are compared with several baseline methods.
27.6.1 Datasets
Synthetic Dataset. To generate synthetic data with a drifting concept, we use a moving hyperplane,
given by Σ_{i=1}^{d} a_i x_i = a_0 [WANG03]. If Σ_{i=1}^{d} a_i x_i ≤ a_0, then an example is negative; otherwise, it is
positive. Each example is a randomly generated d-dimensional vector {x_1, …, x_d}, where x_i ∈ [0, 1]. Weights {a_1, …, a_d} are also randomly initialized with a real number in the range [0, 1]. The value of a_0 is adjusted so that roughly the same number of positive and negative examples are generated.
This can be done by choosing a_0 = (1/2) Σ_{i=1}^{d} a_i. We also introduce noise randomly by switching the
labels of p percent of the examples, where p = 5 in our experiments. There are several parameters
that simulate the concept-drift. We use parameters identical to those in [WANG03]. In total, we
generate 250,000 records and four different datasets having chunk sizes 250, 500, 750, and 1000,
respectively. Each dataset has 50% positive instances and 50% negative instances.
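The generator just described can be sketched in a few lines of Python; the dimensionality, the omission of the drift parameters, and the use of NumPy are simplifications made for illustration.

import numpy as np

def generate_chunk(n, d=10, noise_p=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.random(d)               # weights a_1, ..., a_d in [0, 1]
    a0 = 0.5 * a.sum()              # keeps positives and negatives roughly balanced
    X = rng.random((n, d))          # examples x in [0, 1]^d
    y = (X @ a > a0).astype(int)    # positive iff sum_i a_i x_i > a_0
    flip = rng.random(n) < noise_p  # switch the labels of p percent of the examples
    y[flip] = 1 - y[flip]
    return X, y

X, y = generate_chunk(1000)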
Botnet Dataset: Botnets are networks of compromised hosts known as bots, all under the control
of a human attacker known as the botmaster [BARF06]. The botmaster can issue commands to the
bots to perform malicious actions, such as launching DDoS attacks, spamming, and spying. Botnets
are widely regarded as an enormous emerging threat to the Internet community. Many cutting-edge
botnets apply peer-to-peer (P2P) technology to reliably and covertly communicate as the botnet
topology evolves. These botnets are distributed and small, making them more difficult to detect and
destroy. Examples of P2P bots include Nugache [LEMO06], Sinit [STEW03], and Trojan.Peacomm
[GRIZ07].
Botnet traffic can be viewed as a data stream having both infinite length and concept-drift.
Concept drift occurs as the bot undertakes new malicious missions or adopts differing communica-
tion strategies in response to new botmaster instructions. We therefore consider our stream classifi-
cation technique to be well suited to detect the P2P botnet traffic.
We generate the real P2P botnet traffic in a controlled environment using the Nugache P2P bot
[LEMO06]. The details of the feature extraction process are discussed in Masud et al. [MASU08b].
There are 81 continuous attributes in total. The whole dataset consists of 30,000 records, represent-
ing 1 week’s worth of network traffic. We generate four different datasets having chunk sizes of
30 min, 60 min, 90 min, and 120 min, respectively. Each dataset has 25% positive (botnet traffic)
instances and 75% negative (benign traffic).
Malware Dataset: We extract a total of 38,694 benign executables from different Windows
machines, and a total of 66,694 malicious executables collected from an online malware reposi-
tory VX Heavens [VX10] that contains a large collection of malicious executables (viruses, worms,
trojans, and back-doors). The benign executables include various applications found at the Windows
installation folder, as well as other executables in the default program installation directory.
We select only the Win32 Portable Executables (PEs) in both cases. Experiments with the
ELF executables are a potential direction of future work. The collected 105,388 files (benign and
malicious) form a data stream of 130 chunks, each consisting of 2000 instances (executable files).
The stream order was chosen by sorting the malware by version and discovery date, simulating the
evolving nature of Internet malware. Each chunk has 1500 benign executables (75% negative) and
500 malicious executables (25% positive). The feature-extraction and feature-selection process for
this dataset is described in earlier sections.
Note that all these datasets are dynamic in nature. Their unbounded (potentially infinite-length)
size puts them beyond the scope of purely static classification frameworks. The synthetic data also
exhibits concept-drift. Although it is not possible to accurately determine whether the real datas-
ets have concept-drift, theoretically the stream of executables should exhibit concept drift when
observed over a long period of time. The malware data exhibits feature evolution as evidenced by
the differing set of distinguishing features identified for each chunk.
27.6.2 Baseline Methods
For classification, we use the Weka machine learning open-source package [HALL09]. We apply
two different classifiers: J48 decision tree and Ripper. We then compare each of the following
baseline techniques to our EMPC algorithm.
BestK: This is an SPC ensemble approach, where an ensemble of the best K classifiers is used.
The ensemble is created by storing all the classifiers seen so far and selecting the best K based on
the expected error on the most recent training chunk. An instance is tested using simple majority
voting.
Last: In this case, we only keep the classifier trained on the most recent training chunk. This can
be considered an SPC approach with K = 1.
AWE: This is the SPC method implemented using accuracy-weighted classifier ensembles
[WANG03]. It builds an ensemble of K models where each model is trained from one data chunk.
The ensemble is updated as follows. Let Cn be the classifier built on the most recent training chunk.
From the existing K models and the newest model Cn, the K best models are selected based on their
error on the most recent training chunk. Selection is based on weighted voting where the weight of
each model is inversely proportional to the error of the model on the most recent training chunk.
All: This SPC uses an ensemble of all the classifiers seen so far. The new data chunk is tested
with this ensemble by simple voting among the classifiers. Since this is an SPC approach, each
classifier is trained from only one data chunk.
We obtain the optimal values of r and v to be between 2 and 3, and 3 and 5, respectively, for most
datasets. Unless mentioned otherwise, we use r = 2 and v = 5 in our experiments. To obtain a fair
comparison, we use the same value for K (ensemble size) in EMPC and all baseline techniques.
Hadoop Distributed System Setup: The distributed system on which we performed our experi-
ments consists of a cluster of 10 nodes. Each node has the same hardware configuration: an Intel
Pentium IV 2.8 GHz processor, 4 GB main memory, and 640 GB hard disk space. The software
environment consists of a Ubuntu 9.10 operating system, the Hadoop-0.20.1 distributed computing
platform, the JDK 1.6 Java development platform, and a 100 Mbps LAN link.
27.7 DISCUSSION
Our work considers a feature space consisting of purely syntactic features: binary n-grams drawn
from executable code segments, static data segments, headers, and all other content of untrusted
files. Higher level structural features such as call- and control-flow graphs, and dynamic features
such as runtime traces, are beyond our current scope. Nevertheless, n-gram features have been
observed to have very high discriminatory power for malware detection, as demonstrated by a large
body of prior work as well as our experiments. This is in part because n-gram sets that span the
entire binary file content, including headers and data tables, capture important low-level structural
details that are often abstracted away by higher level representations. For example, malware often
contains handwritten assembly code that has been assembled and linked using nonstandard
tools. This allows attackers to implement binary obfuscations and low-level exploits not available
from higher level source languages and standard compilers. As a result, malware often contains
unusual instruction encodings, header structures, and link tables whose abnormalities can only be
seen at the raw binary level, not in assembly code listings, control flow graphs, or system API call
traces. Expanding the feature space to include these additional higher level features requires an
efficient and reliable method of harvesting them and assessing their relative discriminatory power
during feature selection, and is reserved as a subject of future work.
The empirical results reported in [MASU11] confirm our analysis that shows that multiparti-
tion, multichunk approaches should perform better than single-chunk, single-partition approaches.
Intuitively, a classifier trained on multiple chunks should have better prediction accuracy than a
classifier trained on a single chunk because of the larger training data. Furthermore, if more than
one classifier is trained by multipartitioning the training data, the prediction accuracy of the result-
ing ensemble of classifiers should be higher than a single classifier trained from the same training
data because of the error reduction power of an ensemble over single classifier. In addition, the
accuracy advantages of EMPC can be traced to two important differences between our work and
that of AWE. First, when a classifier is removed during ensemble updating in AWE, all information
obtained from the corresponding chunk is forgotten; but in EMPC, one or more classifiers from an
earlier chunk may survive. Thus, EMPC ensemble updating tends to retain more information than
that of AWE, leading to a better ensemble. Second, AWE requires at least Kv data chunks, whereas
EMPC requires at least K + r − 1 data chunks to obtain Kv classifiers. Thus, AWE tends to keep
much older classifiers in the ensemble than EMPC, leading to some outdated classifiers that can
have a negative effect on the classification accuracy.
However, the higher accuracy comes with an increased cost in running time. Theoretically,
EMPC is at most rv times slower than AWE, its closest competitor in accuracy. This is also evident
in the empirical evaluation, which shows that the running time of EMPC is within 5 times that of
AWE (for r = 2 and v = 5). However, some optimizations can be adopted to reduce the runtime
cost. First, the parallelization of training for each partition can be easily implemented, reducing the
training time by a factor of v. Second, classification by each model in the ensemble can also be done
in parallel, thereby reducing the classification time by a factor of Kv. Therefore, the parallelization
of training and classification should reduce the running time at least by a factor of v, making the
runtime close to that of AWE. Alternatively, if parallelization is not available, parameters v and r can be lowered to sacrifice prediction accuracy for lower runtime cost. In this case, the desired balance
between runtime and prediction accuracy can be obtained by evaluating the first few chunks of the
stream with different values of v and r, and choosing the most suitable values.
REFERENCES
[AGGA06]. C. C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for On-Demand Classification of Evolving
Data Streams,” IEEE Transactions on Knowledge and Data Engineering 18 (5), 577–589, 2006.
[AHA91]. D. W. Aha, D. Kibler, M. K. Albert, “Instance-Based Learning Algorithms,” Machine Learning 6 (1),
37–66, 1991.
[APAC10]. Hadoop. hadoop.apache.org, 2010.
[BARF06]. P. Barford and V. Yegneswaran, “An Inside Look at Botnets,” Malware Detection, Advances in In-
formation Security, M. Christodorescu, S. Jha, D. Maughan, D. Song, and C. Wang, editors, Springer,
New York, NY, USA, pp. 171–192, 2006.
[BIFE09]. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, R. Gavalda, “New Ensemble Methods for Evolving
Data Streams,” In KDD’09: Proceedings of the 15th ACM International Conference on Knowledge
Discovery and Data Mining, Paris, France, pp. 139–148, 2009.
[BOSE92]. B. E. Boser, I. M. Guyon, V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” In
Proceedings of the 5th ACM Workshop on Computational Learning Theory, Pittsburgh, PA, pp. 144–152,
1992.
[CHEN08]. S. Chen, H. Wang, S. Zhou, P. S. Yu, “Stop Chasing Trends: Discovering High Order Models
in Evolving Data,” In ICDE’08: Proceedings of the 24th IEEE International Conference on Data
Engineering, Cancun, Mexico, pp. 923–932, 2008.
[COHE96]. W. W. Cohen, “Learning Rules that Classify E-mail,” In Proceedings of the AAAI Spring Symposium
on Machine Learning in Information Access, Portland, OR, pp. 18–27, 1996.
[COMP07]. Computer Economics, INC., Malware Report: The Economic Impact of Viruses, Spyware, Adware,
Botnets, and Other Malicious Code, http://www.computereconomics.com/article.cfm?id=1227, 2007.
[CRAN05]. J. R. Crandall, Z. Su, S. F. Wu, F. T. Chong, “On Deriving Unknown Vulnerabilities from Zero-Day
Polymorphic and Metamorphic Worm Exploits,” In CCS’05: Proceedings of the 12th ACM Conference
on Computer and Communications Security, Alexandria, VA, pp. 235–248, 2005.
[DEAN08]. J. Dean, S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,”
Communications of the ACM 51 (1), 107–113, 2008.
[DOMI00]. P. Domingos, G. Hulten, “Mining High-Speed Data Streams,” In KDD’2000: Proceedings of the
6th ACM International Conference on Knowledge Discovery and Data Mining, Boston, MA, pp. 71–80,
2000.
[FAN04]. W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In KDD’04:
Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining,
Seattle, WA, pp. 128–137, 2004.
[FREU96]. Y. Freund, R. E. Schapire, “Experiments with a New Boosting Algorithm,” In Proceedings of the
13th International Conference on Machine Learning, Bari, Italy, pp. 148–156, 1996.
[GAO07]. J. Gao, W. Fan, J. Han, “On Appropriate Assumptions to Mine Data Streams: Analysis and Practice,”
In ICDM’07: Proceedings of the 7th IEEE International Conference on Data Mining, Omaha, NE,
pp. 143–152, 2007.
[GRIZ07]. J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, D. Dagon, “Peer-to-Peer Botnets: Overview and
Case Study,” In HotBots’07: Proceedings of the 1st Workshop on Hot Topics in Understanding Botnets,
pp. 1–8, 2007.
[HALL09a]. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, “The WEKA Data
Mining Software: An Update,” ACM SIGKDD Explorations Newsletter 11 (1), 10–18, 2009.
[HAML09b]. K. W. Hamlen, V. Mohan, M. M. Masud, L. Khan, B. M. Thuraisingham, “Exploiting An
Antivirus Interface,” Computer Standards and Interfaces 31 (6), 1182–1189, 2009.
[HASH09]. S. Hashemi, Y. Yang, Z. Mirzamomen, M. R. Kangavari, “Adapted One-versus-All Decision
Trees for Data Stream Classification,” IEEE Transactions on Knowledge and Data Engineering 21 (5),
624–637, 2009.
[HULT01]. G. Hulten, L. Spencer, P. Domingos, “Mining Time-Changing Data Streams,” In KDD’01:
Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining, San
Francisco, CA, pp. 97–106, 2001.
[KOLT04]. J. Kolter, M. A. Maloof, “Learning to Detect Malicious Executables in the Wild,” In KDD’04:
Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining,
Seattle, WA, pp. 470–478, 2004.
[KOLT05]. J. Z. Kolter and M. A. Maloof, “Using Additive Expert Ensembles to Cope with Concept Drift,”
In ICML’05: Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany,
pp. 449–456, 2005.
28.1 INTRODUCTION
Inference is the process of forming conclusions from premises. The inferred knowledge is harm-
ful when the user is not authorized to acquire such information from legitimate responses that he/
she receives. Providing a solution to the inference problem where users issue multiple requests and
consequently infer unauthorized knowledge is an open problem. An inference controller is a device
that is used to detect or prevent the occurrence of the inference problem. However, an inference
controller will never know in full the inferences possible from the answers to a query request since
there is always some prior knowledge available to the querying user. This prior knowledge could be
any subset of all possible knowledge available from other external sources. The inference problem
is complex and, therefore, an integrated and/or incremental domain-specific approach is necessary
for its management. For a particular domain, one could take several approaches, such as (1) building
inference controllers that act during query processing, (2) building inference controllers that enforce
constraints during the knowledge base design, and (3) building inference controllers that provide
explanations to a system security officer. Over time, the provenance data as well as the data deduced
from the provenance data combined could become massive and therefore we need big data manage-
ment techniques for handling the inference problem.
This chapter discusses the implementation of these incremental approaches for a prototype infer-
ence controller for provenance in a medical domain. The inference controller that we have designed
and developed protects the sensitive information stored in a provenance database from unauthor-
ized users. The provenance is represented as a directed acyclic graph. This graph-based structure
of provenance can be represented and stored as an RDF graph [KLYN04], thereby allowing us to
further exploit various semantic web technologies. In our work, we have built a prototype to evalu-
ate the effectiveness of the proposed inference controller. We store the provenance information as
an Web Ontology Language (OWL) knowledge base and use OWL-compliant reasoners to draw
inferences from the explicit information in the provenance knowledge base. We enforce policy
constraints at the design phase, as well as at runtime.
Provenance is metadata that captures the origin of a data source: the history or ownership of a valued object or a work of art or literature. It allows us to verify the quality of information in a data store, to repeat manipulation steps, and to discover dependencies among data items. In
addition, provenance can be used to determine the usefulness and trustworthiness of shared infor-
mation. The utility of shared information relies on: (i) the quality of the source of information and
(ii) the reliability and accuracy of the mechanisms (i.e., procedures and algorithms) used at each
step of the modification (or transformation) of the underlying data items. Furthermore, provenance
is a key component for the verification and correctness of a data item which is usually stored and
then shared with information users.
Organizations and individual users rely on information sharing as a way of conducting their
day-to-day activities. However, ease of information sharing comes with a risk of information
misuse. An electronic patient record (EPR) is a log of all activities, including patient visits to a hos-
pital, diagnoses and treatments for diseases, and processes performed by healthcare professionals
on a patient. This EPR is often shared among several stakeholders (e.g., researchers, and insurance
and pharmaceutical companies). Before this information can be made available to any third party,
the sensitive information in an EPR must be circumvented or hidden before releasing any part of
the EPR. This can be addressed by applying policies that completely or partially hide sensitive attri-
butes within the information being shared. The protection of sensitive information is often required
by regulations that are mandated by a company or by laws, such as Health Insurance Portability and
Accountability Act (HIPAA) [ANNA03].
While the technologies that we have used are mainly semantic web technologies, we believe
that the amount of data that has to be handled by the inference controller could be massive. This is
because the data not only includes the data in the database, but also previously released data as well
as real-world information. Therefore, traditional database management techniques will be inad-
equate for implementing the inference controllers. As an example, we designed and implemented
inference controllers in the 1990s, and it took us almost two years for the implementation discussed
in [THUR93] and [THUR95]. Furthermore, we could not store all of the released data as well as the
real-world data. That is, we purged the data that was least recently used from the knowledge base.
We re-implemented the inference controllers with semantic web technologies in the late 2000s and
early 2010s, and it took us just a few months for these implementations. Furthermore, our knowl-
edge base was quite large and stored much of the released data and the real-world data. However,
for the inference controller to be truly effective, it needs to process massive amounts of data, and
we believe that we need a cloud-based implementation with big data management technologies. Our
initial implementation of a policy engine in the cloud, which is a form of the inference controller,
was discussed in Chapter 25. We need to implement the complete inference controller in the cloud
using big data technologies.
The organization of this chapter is as follows. Our system architecture will be discussed in
Section 28.2. Some background on data provenance as well as semantic web technologies will be
discussed in Section 28.3. Our system design with examples is presented in Section 28.4. Details
regarding the implementation of the inference controller are provided in Section 28.5. Implementing
the inference controller using big data management techniques is discussed in Section 28.6. Finally,
this chapter is concluded in Section 28.7. Details of our inference controller are given in our prior
book [THUR15].
Furthermore, tools such as D2RQ [BIZE03] could be used to convert traditional relational data into
RDF data, thus allowing users to view both types of data as RDF graphs.
In our design, we will assume that the available information is divided into two parts: the actual
data and provenance. Both the data and provenance can be represented as RDF graphs. The reader
should note that we do not make any assumptions about how the actual information is stored. A user
may have stored data and provenance in two different triple stores or in the same store. In addition,
a user’s application can submit a query for access to the data and its associated provenance or vice
versa. Figure 28.1 presents the design of our proposed inference controller over provenance. We
next present a description of the major modules in Figure 28.1.
User-Interface Manager: A user-interface manager is responsible for processing a user’s requests,
authenticating a user and providing suitable responses back to a user. The interface manager also
provides an abstraction layer that allows a user to interact with the system. A user can therefore
pose either a data query or a provenance query to this layer. The user-interface manager also
determines whether the query should be evaluated against the data or its provenance.
The user interacts with the provenance inference controller via an interface layer. This layer
accepts a user’s credentials and authenticates the user. The interface manager hides the actual inter-
nal representation of an inference controller from a user by providing a simple question–answer
mechanism. This mechanism allows a user to pose standard provenance queries such as why a
data item was created, where in the provenance graph it was generated, and how the data item was
generated and when and where it was created. This layer also returns results after they are examined
against a set of policies. Figure 28.2 shows a more detailed view of the interface manager that allows
a user to interact with the underlying provenance store(s) via the inference controller. The interface
manager’s role is to authenticate users, process input queries, and check for errors that may occur
during query processing. In addition, it carries out other functions; for example, it performs some
preprocessing operations before submitting a query to the inference controller layer.
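As an illustration, a "how was this record generated?" question can be expressed as a SPARQL query over the provenance graph. The rdflib library, the PROV-O vocabulary, and the resource URI below are assumptions made for this sketch; the prototype's own provenance ontology and query interface may differ.

from rdflib import Graph

g = Graph()
g.parse("provenance.ttl", format="turtle")  # hypothetical export of the provenance store

HOW_QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?activity ?input WHERE {
  <http://example.org/epr/record42> prov:wasGeneratedBy ?activity .
  OPTIONAL { ?activity prov:used ?input . }
}
"""

for activity, used in g.query(HOW_QUERY):
    print(activity, used)  # the activity that produced the record and what it used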
Policy Manager: A policy manager is responsible for ensuring that the querying user is authorized
to use the system. It evaluates policies against a user’s query and associated query results to ensure
that no confidential information is released to unauthorized users. The policy manager may enforce
policies against data or its associated provenance. Each data type may have its own policy manager,
for example, data may be stored in a different format from provenance data. Hence, we may require
separate implementations of the policy manager for data and provenance. Figure 28.3 shows the
details of the policy manager. The policy manager interacts with the user via the query-processing
module. Each query passed to the policy manager from the query-processing module is evaluated
against a set of policies. As previously mentioned, these policies can be encoded as access control
rules via any access control mechanism or other suitable policy languages ([CADE11a], [CADE11b]).
They can be expressed as rules that operate directly over a directed graph or they can be encoded
as description logic (DL) [LEVY98] constraints or using web rule languages, such as the Semantic
Web Rule Language (SWRL) [HORR04]. The policy layer is responsible for enforcing any high-level
policy defined by an application user or administrator. The policies are not restricted to any particular
security policy definition, model, or mechanism. In fact, we can support different policies, for exam-
ple, role-based access control (RBAC), access control based on context, such as time (TRBAC) and
location (LBAC). The policy manager also handles RBAC policies specified in OWL and SWRL
[CADE10a]. In addition, it handles certain policies specified in OWL for inference control such as
association-based policies. Besides the traditional and well-established security models built on top
of access control mechanisms, the inference controller also supports redaction policies. Redaction
policies are based on sharing data for ongoing mutual relationships among businesses and stakehold-
ers. Redaction policies are useful when the results of a query are further sanitized. For example, the
literal value of an assertion by an RDF triple in a result graph may contain the nine-digit social secu-
rity number of employees, as 999-99-9999, but for regulatory purposes, the four-digit format 9999 is
the correct format for disclosure. The redaction policies can also be used to redact (block out or hide)
any node in an RDF triple (i.e., block out a literal value, hide a resource, etc.).
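A minimal sketch of such a redaction step over a query result graph is shown below, using rdflib. The graph, the literal pattern, and the helper function are illustrative; the prototype drives redaction from policy objects rather than hard-coded rules, and it operates on copies of the provenance graph rather than the original.

import re
from rdflib import Graph, Literal

def redact_ssn(result_graph: Graph) -> Graph:
    # Replace nine-digit SSN literals (999-99-9999) with the four-digit disclosure format (9999).
    ssn_pattern = re.compile(r"^\d{3}-\d{2}-(\d{4})$")
    for s, p, o in list(result_graph):
        if isinstance(o, Literal):
            match = ssn_pattern.match(str(o))
            if match:
                result_graph.remove((s, p, o))
                result_graph.add((s, p, Literal(match.group(1))))
    return result_graph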
Finally, the policy layer interacts with various reasoners in the inference layer, which offer fur-
ther protection against inference attacks. The inference layer enforces policies that are in the form
of DL constraints [ZHAN09], OWL restrictions [MCGU04a], or SWRL rules. Note that some of
the access control policies can be expressed as inference rules (for more expressive power) or as
queries via query rewriting ([OULM10a, OULM10b]), or in the form of view definitions [RIZV04].
The policy manager therefore has many layers equipped with security features, thus ensuring that
we offer the maximal protection over the underlying provenance store.
The Query-Processing Module: It is responsible for accepting a user’s query from the user inter-
face, parsing it, and submitting it to the policy manager. In addition, the module also evaluates
results against the user-defined set of policies and rules, after which the results are returned to the
user via the user-interface layer. The query-processing module can accept any standard provenance
query, as well as any query written using SPARQL [PRUD06]. The query-processing module also
provides feedback to the user via the user interface. This feedback includes errors due to query
syntax in addition to the responses constructed by the underlining processes of the policy and
inference controller layers.
Inference Engine: The inference engine is the heart of the inference controller. The engine is
equipped to use a variety of inference strategies, each requiring a reasoner. Since there are many
implementations of reasoners available, our inference controller offers an added feature of flexibil-
ity, whereby we can select a reasoner from amongst a variety of OWL-compliant reasoning tools
based on different reasoning tasks or domains. For example, decisions in a medical domain may
require all the facts in a triple store or only the facts related to a particular EPR in the triple store.
Therefore, one can limit a task involving inferences to the local information (available in the EPR
and the related provenance).
A modular approach [CADE10b] can improve the efficiency of the inference controller. The
approach given in [CADE10b] allows each inference process to be executed on a separate proces-
sor; therefore, we can take advantage of designs based on partitioning and parallelism. For exam-
ple, the code implementing a strategy based on heuristic reasoning [THUR90] could be executed
in parallel with the code that implements a strategy based on inference by semantic association
[THUR93]. Furthermore, an inference engine typically uses software programs that have the capa-
bility of reasoning over a relevant subset of facts that are in some data representation of a domain,
for example, a relational data model or an RDF graph representation.
Data Controller: The data controller is a suite of software programs that stores and manages
access to data. The data could be stored in any format such as in a relational database, in XML files
or in an RDF store. The controller accepts requests for information from the policy layer if a policy
allows the requesting user to access the data items in the data stores (e.g., the triple stores). This
layer then executes the request over the stored data and returns the results back to the policy layer
where they are re-evaluated before being returned to the user-interface layer.
Provenance Controller: The provenance controller is used to store and manage provenance
information that is associated with data items that are present in the data controller. In the case when
we represent provenance as an RDF graph, the provenance controller stores information in the form
of a logical graph structure in any appropriate data representation format (e.g., RDF serialization
format). The provenance controller also records the ongoing activities associated with the data items
stored in the data controller. The provenance controller takes as input a graph query and evaluates
it over the provenance data. The provenance controller then returns a resultant RDF graph back to
the policy layer where it is re-examined using a set of policies before returning it to the querying
user. A re-examination of the resulting RDF graph allows the policy designer to transform the result
graph further by applying other graph transformation techniques (e.g., redaction, sanitization, etc.)
that can alter triples in the resulting RDF graph. Note that the original provenance graphs are not
altered during the transformation; instead, copies of the original graph (created by queries) undergo
transformations. This protects the integrity of the provenance data, since any modification of the
provenance would itself alter the provenance record.
Policy Managers: An application user may wish to write policies at a high level using domain-
specific business rules. Thus, an application user can continue using his/her business policies inde-
pendent of our software implementation of the provenance inference controller. A suitable policy
parser module could handle the parsing of high-level policies and their transformation into low-level
policy objects used by the system. Therefore, modifying or extending a policy manager module is
facilitated since it is not hard-wired to any implementation of other modules in the architecture.
Inference Tools: Newly published data, when combined with existing public knowledge, allows
for complex and sometimes unintended inferences. Therefore, we need semi-automated tools for
detecting these inferences prior to releasing provenance information. These tools should give data
owners a fuller understanding of the implications of releasing the provenance information, as well
as helping them adjust the amount of information they release in order to avoid unwanted inferences.
The inference controller is a tool that implements some of the inference strategies that a user could use
to infer confidential information that is encoded into a provenance graph. Our inference controller lever-
ages existing software tools that perform inferencing, for example, Pellet [SIRI07], FaCT++ [TSAR06],
Racer [HAAR01], HermiT [SHEA08], CWM [BERN00], and third party plugins [CARR04]. A modu-
lar design also takes advantage of theories related to a modular knowledge base in order to facilitate
collaborative ontology construction, use, and reuse ([BAO06], [FARK02], [BAO04]). In addition, there
exists a trade-off between expressivity and decidability; therefore, a policy designer or an administra-
tor should have flexibility when selecting an appropriate reasoner software for a particular application.
In addition to the reasoner, the policy designer should take into consideration the expressiveness of
the representational language for the concepts in an application domain. For example, decisions in
a medical or intelligence domain may be time-critical, and one may therefore choose a reasoner
optimized for a particular representational language (e.g., RDF, RDFS, OWL-DL).
We represent provenance using RDF, in line with the open provenance model (OPM) recommendation [MORE11]. In addition, RDF allows the integra-
tion of multiple databases describing the different pieces of the lineage of a resource (or data item)
and naturally supports the directed structure of provenance. This data model has also been success-
fully applied for provenance capture and representation ([DING05], [ZHAO08]). In addition to RDF
[KLYN04], RDF Schema (RDFS) can be used for its reasoning capabilities.
Other representation schemes are the OWL [MCGU04a,b] and the SWRL [HORR04]. OWL
is an ontology language that has more expressive power and reasoning capabilities than RDF and
RDFS. It has an additional vocabulary along with a formal semantics. The formal semantics in
OWL are based on DLs, which are a decidable fragment of first-order logic. OWL consists of a
Tbox that comprises the vocabulary that defines the concepts in a domain, and an Abox that is made
up of assertions (facts about the domain). The Tbox and Abox make up an OWL knowledge base.
The SWRL extends the set of OWL axioms to include Horn-like rules, and it extends these rules
to be combined with an OWL knowledge base. Using these languages allows us to later perform
inference over the provenance graph. Therefore, we could determine the implicit information in the
provenance graph.
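SWRL rules are normally authored with ontology tooling; purely to illustrate the Horn-like form, the sketch below encodes a comparable rule in Jena's native rule syntax and applies it to a provenance graph. The URIs and file name are illustrative only.

import java.util.List;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class HornRuleExample {
    public static void main(String[] args) {
        Model provenance = ModelFactory.createDefaultModel();
        provenance.read("provenance.rdf");   // placeholder file name

        // Horn-like rule: a record generated by a process that was controlled by a
        // heart surgeon is classified as a heart-related record.
        String rule =
            "[heartRelated: (?rec <http://example.org/prov#wasGeneratedBy> ?proc) " +
            "               (?proc <http://example.org/prov#wasControlledBy> ?agent) " +
            "               (?agent <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> " +
            "                       <http://example.org/med#HeartSurgeon>) " +
            "  -> (?rec <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> " +
            "           <http://example.org/med#HeartRelatedRecord>)]";

        List<Rule> rules = Rule.parseRules(rule);
        InfModel inferred = ModelFactory.createInfModel(new GenericRuleReasoner(rules), provenance);
        System.out.println("Statements after applying the rule: " + inferred.size());
    }
}

The derived triples are exactly the implicit information referred to above; it is this information that the inference controller must examine before release.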
We need to disclose provenance information in order to ensure that the user gets high-quality
information. Provenance data has a unique characteristic that makes it different from traditional
data [BRAU08]. This characteristic is the directed acyclic graph (DAG) structure of provenance
that captures individual data items and the causal relationships between them. Additionally, the DAG
structure complicates any effort to build an inference controller over provenance data, and surprisingly
this area has remained largely unexplored by the research community. Although inference has been
applied over provenance data (in particular, the inference web has used provenance to provide proofs
as justifications for data items [MCGU04b]), it has not been considered from the point of view of
provenance security.
We express redaction policies using a graph-transformation approach (see [CADE11b] for further details). This graph
transformation technique can also be used to modify the triple patterns in an SPARQL query; this
is called query rewriting. As previously mentioned, the inference controller will therefore use a
combination of policies. When appropriate, we will: protect data by using access control policies to
limit access to the provenance data; use redaction policies to share the provenance information; and
also use the graph-transformation technique for sanitizing the initial query results.
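As a simple sketch of sanitizing an initial query result by graph transformation, the snippet below copies the result graph with a SPARQL CONSTRUCT query while replacing the identity of the controlling agent with the agent's role; the original provenance store is untouched. The vocabulary is illustrative and is not the redaction language of [CADE11b].

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class RedactionSketch {
    public static void main(String[] args) {
        // The initial query result, a copy extracted from the provenance store.
        Model resultGraph = ModelFactory.createDefaultModel();
        resultGraph.read("query-result.rdf");   // placeholder file name

        // Rebuild the graph, exposing the agent's role instead of the agent itself.
        String construct =
            "PREFIX prov: <http://example.org/prov#> " +
            "PREFIX med:  <http://example.org/med#> " +
            "CONSTRUCT { ?rec prov:wasGeneratedBy ?proc . ?proc prov:wasControlledBy ?role . } " +
            "WHERE      { ?rec prov:wasGeneratedBy ?proc . ?proc prov:wasControlledBy ?agent . " +
            "             ?agent med:hasRole ?role . }";

        try (QueryExecution qexec = QueryExecutionFactory.create(construct, resultGraph)) {
            Model redacted = qexec.execConstruct();
            // The sanitized copy is what the policy layer would return to the user.
            redacted.write(System.out, "TURTLE");
        }
    }
}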
Inferences may be obtained during two stages:
1. Data collection: This includes data in the data stores that is accessible to users and real-
world knowledge (which is not represented in the data stores) about an application domain.
2. Reasoning with the collected data: This is the act of deriving new information from the
collected data.
The data collection and the reasoning stages are performed repeatedly by the adversary (i.e.,
by a human user or an autonomous agent) until the intended inference is achieved or the adversary
gives up. Each query attempts to retrieve data from the internal data stores, but the adversary may
also collect data from external data stores as part of the background-knowledge-acquisition process.
The data that adversaries want to infer may include the existence of a certain entity in the data
stores (i.e., the knowledge base of the facts about a domain) or the associations among data (i.e., the
relationships among the facts).
In some cases, instead of inferring the exact data (facts) in a knowledge base (precise inference),
users may be content with a set of possible data values (imprecise inference or approximate infer-
ence). For instance, assume that a user wants to infer the disease of a patient from the patient’s
record, which is a part of the provenance graph. Further, assume that the provenance captures the
record’s history that records that a heart surgeon performed an operation on the patient. Revealing
the fact that a heart surgeon is part of the provenance could enable the user to infer that the patient
has some disease related to heart problems but not necessarily the exact nature of the surgery or
the exact disease of the patient.
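For instance, an adversary who cannot query the disease directly might probe the provenance with an ASK query of roughly the following form (sketched with an illustrative vocabulary); a true answer is enough to support the approximate inference described above.

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class ImpreciseInferenceProbe {
    public static void main(String[] args) {
        Model provenance = ModelFactory.createDefaultModel();
        provenance.read("provenance.rdf");   // placeholder file name

        // Was any process in this record's history controlled by a heart surgeon?
        String ask =
            "PREFIX prov: <http://example.org/prov#> " +
            "PREFIX med:  <http://example.org/med#> " +
            "ASK { <http://example.org/record/r123> " +
            "        prov:wasDerivedFrom*/prov:wasGeneratedBy/prov:wasControlledBy ?agent . " +
            "      ?agent a med:HeartSurgeon . }";

        try (QueryExecution qexec = QueryExecutionFactory.create(ask, provenance)) {
            System.out.println("Heart surgeon in the record's provenance: " + qexec.execAsk());
        }
    }
}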
the inference layer. The query is then submitted and executed over the knowledge base. We model
this as two machines (or automated tools): the user (e.g., an automated agent) and the controller.
We assume that the user builds a machine, M′, that contains the history of all the answers given
to the user, the modified background knowledge with relevant domain information and a prior set
of rules about the system. Further, the user can infer a subset of the private information using M′.
Likewise, we build a machine M″ that simulates the inferencing mechanism of the user, but with
certain modifications to compensate for any differences. This machine, M″, combines the history
of all previous answers, the current query and associated provenance and the rules that enforce the
security policies. We use M″ to determine certain inferences occurring in M′. The major difference
between M′ and M″ is the user’s background information. M′ and M″ contain different sets of rules
and M″ keeps a repository of a user’s input queries. This repository (or query log) is a rich source of
information about the context of the user. For example, if the logged queries, taken together, could
compromise the knowledge base, the user is flagged as a potential attacker.
The inferencing capabilities of M″ are best realized by a language with formal semantics. The
Resource Description Framework (RDF), RDFS and the Web Ontology Language (OWL) are
knowledge representation languages that fit this criterion; these languages all use the RDF data
model. The RDF data model is also a natural fit for a directed graph such as provenance. Also, to realize
policy rules using SWRL and DLs, the provenance is stored in an OWL knowledge base.
The queries are written in the SPARQL language. These queries are extended with regular expres-
sions in order to select graph patterns for both the user’s query and the protected resources. In order
to write the policy constraints (as rules), we use a mixture of queries, DL rules and SWRL. These
constraints (rules) specify the concepts, triples and facts that are to be protected. The concepts are
the definitions (or descriptions) of resources in a provenance graph; these are normally written for-
mally to avoid ambiguity in policy specification languages such as DLs. Each DL concept can also
be successfully defined by an SPARQL query or an SWRL rule. In some cases, the constraints may
require more expressive languages for defining the concept and so we sometimes choose SWRL
rules over DL rules.
In some cases, it may be possible to return only the answers that comply with the policy con-
straints. One approach is to replace a set of triples satisfying a query with another set of triples by
applying transformation rules over the first set of triples. Another approach may be to lie about
the contents in the knowledge base. Yet another approach is to use polyinstantiation similar to that
in multilevel secure databases, where users at different clearance levels see different versions of
reality [STAC90].
Approaches for modifying the graph patterns in an SPARQL query make use of different tech-
niques, for example, SPARQL filters and property functions, graph transformations and match/
apply pattern. In order to determine the type of triple with respect to a security classification, the
inference engine would use a domain ontology to determine the concept each data item belongs to
as well as a query modification based on an SPARQL BGP transformation.
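A minimal sketch of such a query modification follows: before execution, a sensitive triple pattern in the user's basic graph pattern is replaced by a coarser one. A production implementation would operate on the parsed query algebra rather than on strings, and the med: patterns are hypothetical.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;

public class QueryRewriteSketch {
    // Replace a sensitive pattern (the agent's specialty) with a coarser one (the agent's role).
    static String rewrite(String userQuery) {
        return userQuery.replace("?agent med:hasSpecialty ?s", "?agent med:hasRole ?s");
    }

    public static void main(String[] args) {
        String userQuery =
            "PREFIX med: <http://example.org/med#> " +
            "SELECT ?s WHERE { ?agent med:hasSpecialty ?s . }";

        String rewritten = rewrite(userQuery);
        Query parsed = QueryFactory.create(rewritten);   // validate the rewritten query
        System.out.println(parsed);
    }
}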
There is a difference between a query engine that simply queries an RDF graph but does not
handle rules and an inference engine that also handles rules. In the literature, this difference is not
always clear. The complexity of an inference engine is much higher than that of a query engine,
because rules permit chains of deductions to be made. During query execution, an inference engine
must construct these deductions, whereas a plain query engine need not. Note that there are other
examples of query engines that rely on a formal model for directed labeled graphs such as DQL
[FIKE02] and RQL [KARV12].
Rules also require a logic that is inherently more complex than the logic needed without rules; for
an RDF query engine, only the simple principles of entailment on graphs are necessary. RuleML is
an important effort to define rules that are usable on the World Wide Web. The inference web
[MCGU04a,b] defines a system for handling different inference engines on the semantic web.
procedure, etc. Since the procedures are described by actual documents on the Web, the gener-
ated workflow structures encode a set of guidelines that are also known to the users of the system.
However, most real-world hospitals follow guidelines related to a patient’s privacy, so our fictitious
hospital generates provenance workflows whose contents correspond to the confidential data found
in patients’ records. Therefore, the record (i.e., an artifact), the agent who generated a version of a
record, the time when the record was updated and the processes that contributed to the changes of
the record are part of the provenance. Furthermore, the laws governing the release of provenance
(i.e., the contents of the generated workflow) are enforced by constraints which are implemented as
semantic web rules in our prototype system. The use of a fictitious hospital here reflects the fact that
real data and provenance from real hospitals are difficult to obtain and are usually not released in
their original form, since they are protected by domain and regulatory laws.
(Figure: the patient generator draws on sources such as www.whitepages.com, and the physician generator draws on www.ratemds.com.)
www.ratemds.com provides structured information about doctors for a specified zip code. The patient
generator extracts the attributes of a person from a set of web pages. A provenance workflow gen-
erator updates the record for a patient. The recorded provenance is not publicly available and there-
fore can be treated as confidential data in the system. The intent here is to give the querying user
an opportunity to guess the information in a patient’s record and the associations between each
electronic version of a patient’s record. This information includes the patient’s disease, medications,
procedures or tests, physicians, etc. Provenance data is more challenging than the traditional data in
a database (or the multilevel databases described in [THUR93], [STAC90], and [CHEN05]) because
the inference controller must anticipate not only inferences involving the user's prior knowledge, but
also the inferences associated with the causal relationships among the provenance entities.
(Figure: an RDF policy engine/inference controller running on a cloud platform such as Hadoop/MapReduce, Spark, or Storm.)
Big data technologies such as the NoSQL databases (e.g., HBase and CouchDB) need to be utilized to design a big data
management system that can not only manage massive amounts of data but also be able to carry
out inferencing and prevent unauthorized violations due to inference. Figure 28.5 illustrates our
approach to Big Data Management and Inference Control.
REFERENCES
[ANNA03]. G. J. Annas, “HIPAA Regulations—A New Era of Medical-Record Privacy?,” The New England
Journal of Medicine, 348 (15), 1486–1490, 2003.
[BAI07]. J. Bai, J. Y. Nie, G. Cao, H. Bouchard, “Using Query Contexts in Information Retrieval,” In
SIGIR’07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, July 23–27, Amsterdam, The Netherlands, pp. 15–22, 2007.
[BAO04]. J. Bao and V. G. Honavar, “Ontology Language Extensions to Support Localized Semantics,
Modular Reasoning, and Collaborative Ontology Design and Ontology Reuse.” Technical Report,
Computer Science, Iowa State University 2004.
[BAO06]. J. Bao, D. Caragea, V. G. Honavar, “Modular Ontologies—A Formal Investigation of Semantics
and Expressivity,” In ASWC 2006: Proceedings of the 1st Semantic Web Conference, September 3–7,
Beijing, China, pp. 616–631, 2006.
[BEER97]. C. Beeri, A. Y. Levy, M. C. Rousset, “Rewriting Queries Using Views in Description Logics,”
In PODS’97: Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, May 11–15, Tucson, AZ, USA, pp. 99–108, 1997.
[BERN00]. T. Berners-Lee and others. CWM: A General Purpose Data Processor for the Semantic Web, 2000,
http://www.w3.org/2000/10/swap/doc/cwm.html.
[BIZE03]. C. Bizer, “D2R MAP—A Database to RDF Mapping Language,” WWW (Posters), May 20–24,
Budapest, Hungary, 2003.
[BRAU08]. U. Braun, A. Shinnar, M. Seltzer, “Securing Provenance,” In Proceedings of the 3rd Conference on
Hot Topics in Security, USENIX Association, 2008.
[CADE10a]. T. Cadenhead, M. Kantarcioglu, B. Thuraisingham, “Scalable and Efficient Reasoning
for Enforcing Role-Based Access Control,” In DBSec’10: Proceedings of the 24th Annual IFIP WG
11.3 Working Conference on Data and Applications Security and Privacy, June 21–23, Rome, Italy,
pp. 209–224, 2010.
[CADE10b]. T. Cadenhead, M. Kantarcioglu, B. Thuraisingham, “An Evaluation of Privacy, Risks and Utility
with Provenance,” In SKM’10: Secure Knowledge Management Workshop, New Brunswick, NJ, 2010.
[CADE11a]. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, B. Thuraisingham, “A Language for Provenance
Access Control” In CODASPY’11: Proceedings of the 1st ACM Conference on Data and Application
Security and Privacy, February 21–23, San Antonio, TX, USA, pp. 133–144, 2011.
[CADE11b]. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, B. Thuraisingham, “Transforming Provenance
Using Redaction,” In SACMAT’11: Proceedings of the 16th ACM Symposium on Access Control Models
and Technologies, June 15–17, Innsbruck, Austria, pp. 93–102, 2011.
[CADE11c]. T. Cadenhead, M. Kantarcioglu, B. Thuraisingham, “A Framework for Policies over Provenance,”
In TaPP’11: 3rd USENIX Workshop on the Theory and Practice of Provenance, Heraklio, Crete, Greece,
2011.
[CADE12]. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, B. Thuraisingham, “A Cloud-Based RDF Policy
Engine for Assured Information Sharing,” In SACMAT’12: Proceedings of the 17th ACM Symposium on
Access Control Models and Technologies, June 20–22, Newark, NJ, USA, pp. 113–116, 2012.
[CARR04]. J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, K. Wilkinson, “Jena: Implementing
the Semantic Web Recommendations,” In ACM WWW 2004, New York, NY, pp. 74–83, 2004.
[CARR05]. J. J. Carroll, C. Bizer, P. Hayes, P. Stickler, “Named Graphs, Provenance and Trust,” In ACM
WWW, Chiba, Japan, pp. 613–622, 2005.
[CHEN05]. X. Chen and R. Wei, “A Dynamic Method for Handling the Inference Problem in Multilevel Secure
Databases,” In IEEE ITCC 2005, Vol. II, Las Vegas, NV, pp. 751–756, 2005.
[CORR97]. A. Corradini, U. Montanari, F. Rossi, H. Ehrig, R. Heckel, M. Löwe, “Algebraic Approaches to
Graph Transformation, Part I: Basic Concepts and Double Pushout Approach,” Handbook of Graph
Grammars and Computing by Graph Transformation, 1, 163–245, 1997.
[CORR10]. G. Correndo, M. Salvadores, I. Millard, H. Glaser, N. Shadbolt, “SPARQL Query Rewriting for
Implementing Data Integration Over Linked Data,” In ACM EDBT, Article #4, Lausanne, Switzerland,
2010.
[DING05]. L. Ding, T. Finin, Y. Peng, P. P. Da Silva, D. L. McGuinness, “Tracking RDF Graph Provenance
Using RDF Molecules,” Technical Report, 2005. http://ebiquity.umbc.edu/paper/html/id/263/
Tracking-RDF-Graph-Provenance-using-RDF-Molecules.
[FARK02]. C. Farkas and S. Jajodia, “The Inference Problem: A Survey,” ACM SIGKDD Explorations
Newsletter, 4 (2), 6–11, 2002.
[FIKE02]. R. Fikes, P. Hayes, I. Horrocks, “DQL—A Query Language for the Semantic Web,” Knowledge
Systems Laboratory, 2002.
[FINI08]. T. Finin, A. Joshi, L. Kagal, J. Niu, R. Sandhu, W. Winsborough, B. Thuraisingham, “ROWLBAC:
Representing Role Based Access Control in OWL,” In ACM SACMAT 2008, Estes Park, CO, pp. 73–82,
2008.
[HAAR01]. V. Haarslev and R. Möller, “RACER System Description,” IJCAR 2001: Automated Reasoning,
Seattle, WA, pp. 701–705, 2001.
[HARR10]. S. Harris and A. Seaborne, “SPARQL 1.1 Query Language,” W3C Working Draft, 14, 2010.
[HOLL08]. D. A. Holland, U. Braun, D. Maclean, K. K. Muniswamy-Reddy, M. Seltzer, “Choosing a Data
Model and Query Language for Provenance,” In Provenance and Annotation of Data and Processes:
Proceedings of the 2nd International Provenance and Annotation Workshop (IPAW ‘08), June 17–18,
2008, Salt Lake City, UT, ed. Juliana Freire, David Koop, and Luc Moreau. Berlin: Springer. Special
Issue. Lecture Notes in Computer Science 5272.
[STAC90]. P. D. Stachour and B. Thuraisingham, “Design of LDV: A Multilevel Secure Relational Database
Management System,” Knowledge and Data Engineering, IEEE Transactions on, 2 (2), 190–209, 1990.
[STOU09]. P. Stouppa and T. Studer, “Data Privacy for ALC Knowledge Bases,” In Intl Symposium of the
LFCS 2009, Deerfield Beach, FL, pp. 409–421, 2009.
[SYAL09]. A. Syalim, Y. Hori, K. Sakurai, “Grouping Provenance Information to Improve Efficiency of Access
Control,” Advances in Information Security and Assurance, Third Intl Conference and Workshops, ISA,
Seoul, Korea, pp. 51–59, 2009.
[THUR90]. B. Thuraisingham, “Novel Approaches to Handle the Inference Problem,” Proceedings of the 3rd
RADC Database Security Workshop, New York, pp. 58–67, 1990.
[THUR93]. B. Thuraisingham, W. Ford, M. Collins, J. O’Keeffe, “Design and Implementation of a Database
Inference Controller,” Data & Knowledge Engineering, 11(3), 271–297, 1993.
[THUR95]. B. Thuraisingham, W. Ford. “Security Constraints in a Multilevel Secure Distributed Database
Management System.” IEEE Transactions on Knowledge and Data Engineering, 7 (2), 274–293, 1995.
[THUR15]. B. Thuraisingham, T. Cadenhead, M. Kantarcioglu, V. Khadilkar, Secure Data Provenance and
Inference Control with Semantic Web. CRC Press, Boca Raton, FL, 2015.
[TSAR06]. D. Tsarkov and I. Horrocks, “FaCT++ Description Logic Reasoner: System Description,”
International Joint Conference on Automated Reasoning, Seattle, WA, pp. 292–297, 2006.
[ZHAN09]. R. Zhang, A. Artale, F. Giunchiglia, B. Crispo, “Using Description Logics in Relation Based
Access Control,” In CEUR Workshop Proceedings, Grau, B. C., Horrocks, I., Motik, B., & Sattler, U.
editors, 2009.
[ZHAO08]. J. Zhao, C. Goble, R. Stevens, D. Turi, “Mining Taverna’s Semantic Web of Provenance,”
Concurrency and Computation: Practice and Experience, 20 (5), 463–472, 2008.
[ZHAO10]. J. Zhao, Open Provenance Model Vocabulary Specification. Latest version: http://open-biomed.
sourceforge.net/opmv/ns.html 2010.
Conclusion to Part IV
Part IV, consisting of six chapters, described some of the experimental systems we have designed
and developed that illustrate the key points of both big data management and analytics (BDMA) and
big data security and privacy (BDSP) systems.
In Chapter 23, we presented a framework capable of handling enormous amounts of resource
description framework (RDF) data that can be used to represent big data systems such as social
networks. Our framework is based on the Hadoop/MapReduce technologies and implements a
SPARQL query processor that can handle massive amounts of data. We also provided a brief over-
view of our security prototype that we built on top of the query processing system. In Chapter 24, we
described the design of the big data analytics system called InXite. InXite will be a great asset to the
analysts who have to deal with massive amounts of data streams in the form of billions of blogs and
messages among others. For example, by analyzing the behavioral history of a particular group of
individuals as well as details of concepts such as events, analysts will be able to predict behavioral
changes in the near future and take necessary measures. We also discussed the use of cloud comput-
ing and various big data tools in the implementation of InXite. Chapter 25 described our design and
implementation of a cloud-based information sharing system called CAISS. CAISS utilizes several
of the technologies we have developed as well as open source tools. We also described the design
of an ideal cloud-based assured information sharing system called CAISS++. In Chapter 26, we
described techniques to protect our data by encrypting it before storing on cloud computing servers
like Amazon S3. Our approach is novel as we propose to use two key servers to generate and store
the keys. Also, we assure more security than some of the other known approaches as we do not store
the actual key used to encrypt the data. This assures the protection of our data even if one or both
key servers are compromised. Our implementation utilizes Blackbook, a semantic web-based data
integration framework and allows data integration from various data sources. In Chapter 27, we
formulated the intrusion detection problems as classification problems for infinite-length, concept-
drifting data streams. Concept drift occurs in these streams as attackers react and adapt to defenses.
We formulated both malicious code detection and botnet traffic detection as such problems, and
introduced the extended multiple partition, multiple chunk approach, a novel ensemble learning technique for
automated classification of infinite-length, concept-drifting streams. Finally, in Chapter 28, we
described a first-of-its-kind inference controller that will control certain unauthorized inferences
for provenance data represented as RDF graphs. We also argued that inference control is an area
that will need the use of BDMA systems for managing the data as well as reasoning about the data.
Now that we have described some of the experimental BDMA and BDSP systems we have devel-
oped in Part IV and focused on the details of stream data analytics in Parts II and III, we are now
ready to describe several directions for BDMA and BDSP systems in Part V.
Part V
Next Steps for BDMA and BDSP
Introduction to Part V
Parts II and III focused on stream data analytics with applications in insider threat detection. There
was also a special emphasis on handling massive amounts of data streams and the use of cloud
computing. We described various stream data analytics algorithms and provided our experimental
results. Part IV discussed some of the experimental systems we have designed and developed on big
data management and analytics (BDMA) and big data security and privacy (BDSP). While Parts II
through IV focused on BDMA and BDSP with respect to the design and development of the systems
with applications, in Part V, we describe some of the exploratory work we have carried out as well
as plans for enhancing our work in BDMA and BDSP.
Part V, consisting of seven chapters, describes the various exploratory systems including Internet
of Things (IoT) systems and experimental infrastructures. In Chapter 29, we discuss aspects of con-
fidentiality, privacy, and trust for the semantic web and describe how they relate to big data systems
such as social media systems. In Chapter 30, we integrate the various parts of a big data system
into an automatic framework for carrying out analytics while at the same time ensuring security. In
particular, we integrate the analytics techniques with the privacy and security techniques and at the
same time preserve features such as scalability, efficiency, and interoperability in developing this
framework. In Chapter 31, we discuss our approach to designing a cyber defense framework for IoT
systems based on a layered architecture. We also discuss the use of BDMA systems for securing
IoT applications. In Chapter 32, we focus on a particular IoT system and that is a connected smart-
phone system. These connected smartphone devices generate massive amounts of data and can be
considered to be an IoT system. We discuss how big data analytics may be applied for detecting
malware in smartphones. We also discuss an experimental and education infrastructure for secur-
ing smartphones. In Chapter 33, we illustrate the key points in big data analytics and security for
a particular vertical domain and that is healthcare. In particular, we describe a planned case study
where there is a need to manage and analyze massive amounts of data securely. In Chapter 34, we
describe our planned experimental infrastructure and education programs for BDMA and BDSP.
Finally, in Chapter 35, we summarize the discussion on BDSP and the applications of BDMA for
cyber security at the NSF workshop we hosted on BDSP.
While we have explored additional systems and plans for BDMA and BDSP, we believe that the
systems and plans we have described in Part V provide a representative sample of our exploratory
work.
29 Confidentiality, Privacy, and
Trust for Big Data Systems
29.1 INTRODUCTION
Security has many dimensions including confidentiality, privacy, trust, availability, and depend-
ability among others. Our work has examined confidentiality, privacy, and trust (CPT) aspects of
security for big data systems such as social media systems and cloud data systems where the data
is represented using semantic web technologies, and has examined how these aspects relate to each other. Confidentiality
is essentially secrecy. Privacy deals with not disclosing sensitive data about the individuals. Trust is
about the assurance one can place on the data or on an individual. For example, even though John
is authorized to get salary data, can we trust John not to divulge this data to others? Even though
the website states that it will not give out social security numbers of individuals, can we trust the
website? Our prior work has designed a framework called CPT based on semantic web technologies
that provides an integrated approach to addressing CPT [THUR07]. In this chapter, we will revisit
CPT and discuss how it relates to big data such as social media data.
The organization of this chapter is as follows. Our definitions of CPT as well as the current
status on administering the semantic web will be discussed in Section 29.2. This will be followed
by a discussion, in Section 29.3, of our proposed framework, called CPT, for securing social media
data. Next, we will take each of the features of CPT and discuss various aspects as they relate
to social media in Sections 29.4 through 29.6, respectively. We have used social media systems as
an illustrative example for big data systems. An integrated architecture for CPT as well as infer-
ence and privacy control will be discussed in Section 29.7. Relationship to a big data system such as
a social media system is discussed in Section 29.8. Finally, this chapter is summarized and future
directions are given in Section 29.9.
Figure 29.1 illustrates the concepts of this chapter. It should be noted that while we have focused
on social media data for illustration purposes, the techniques can be applied to any type of big data
system. The reason is that such systems have reasoning capabilities and can learn from
experiences, and, therefore, can be prone to both privacy attacks and attacks due to security viola-
tions via inference.
consequences occur. The server's privacy policy can simply state that it will correct the problem
upon discovery, but if the user never learns of it until the data has been shared publicly, simply
correcting the record to mark the data as private will not solve the problem. Accountability should
be addressed, where it is not the server’s decision, but rather the lawmaker’s decisions. When
someone breaks a law, or does not abide by contractual agreements, we do not turn to the accused
and ask what punishment they deem necessary. Instead, we look to the law and apply each law
when applicable.
Another point of contention is trust and inference. Before beginning any discussions of privacy, a
user and a server must evaluate how much the other party can be trusted. If neither party trusts each
other, how can either party expect the other to follow a privacy policy? Currently P3P only uses tags
to define actions; it uses no web rules for inference or specific negotiations regarding confidentiality
and privacy. With inference, a user can decide if certain information should not be given because it
would allow the distrusted server to infer information that the user would prefer to remain private
or sensitive.
client’s confidentiality. One other key aspect is that all of these events must occur in a timely manner
such that security is not compromised.
29.3.2 CPT Process
Now that the needs of the client and server have been discussed, focus will be placed on the actual
process of our system CPT. First, a general overview of the process will be presented. After the
reader has garnered a simple overview, this chapter will continue to discuss two systems—Advanced
CPT and Basic CPT—based on the general process previously discussed. The general process of
CPT is to first establish a relationship of trust and then negotiate privacy and confidentiality poli-
cies. Figure 29.2 shows the general process.
Notice that both parties partake in establishing trust. The client must determine the degree to
which it can trust the server in order to decide how much trust to place in the resources supplied
by the server and also to negotiate privacy policies. The server must determine the degree to which
it can trust the client in order to determine what privileges and resources it can allow the client to
access as well as how to present the data. The server and client will base their decisions of trust on
credentials of each other. Once trust is established, the client and server must come to an agree-
ment of privacy policies to be applied to the data that the client provides the server. Privacy must
follow trust because the degree to which the client trusts the server will affect the privacy degree.
The privacy degree affects what data the client chooses to send. Once the client is comfortable with
the privacy policies negotiated, the client will then begin requesting data. Based on the initial trust
agreement, the server will determine what and when the client views these resources. Based on its
own confidentiality requirements and confidentiality degree, the client will make decisions regard-
ing confidentiality and what data can be given to the user. It is also important to note that the server
and client must make these decisions and then configure the system to act upon these decisions. The
basic CPT system will not advise the client or server in any way regarding outcomes of any deci-
sions. Figure 29.3 illustrates the communication between the different components.
(Figure 29.2: the general CPT process between client and server: establish trust, negotiate privacy, request data.)
(Figure 29.3: basic CPT communication: the client requests interaction, the server sends credential expectations, the client sends credentials, the server sets the confidentiality degree, and appropriate data is sent.)
step of sending credentials and establishing trust is the same as the basic system except that both
parties consult with their own TIE. Once each party makes a decision, the client receives the privacy
policies from the server and then uses these policies in configuration with PIE to agree, disagree, or
negotiate. Once the client and server have come to an agreement about the client’s privacy, the client
will send a request for various resources. Based on the degree of trust that the server has assigned
to a particular client, the server will determine what resources it can give to the client. However,
in this step the server will consult the CIE to determine what data is preferable to give to the client
and what data, if given, could have disastrous consequences. Once the server has made a conclu-
sion regarding data that the client can receive, it can then begin transmitting data over the network.
policies may not be acceptable to the client. It is impossible for the server to evaluate each client and
determine how to implement an individual privacy policy without first consulting the client. Thus,
the PIE is unnecessary on the server’s side. The PIE must guide the client in negotiating privacy
policies. In order to guide the client through negotiations, the inference engine must be able to
determine how the server will use data the client gives it as well as who else will have access to the
submitted data. Once this is determined, the inference engine must evaluate the data given by the
client to the server. If the inference engine determines that this data can be used to infer other data
that the client would prefer to remain private, the inference engine must warn the client and then
allow the client to choose the next appropriate measure of either sending or not sending the data.
Once the client and server have agreed on the privacy policies to be implemented, the client will
naturally begin requesting data and the server will have to determine what data to send, based on
confidentiality requirements. It is important to note that the CIE is located only on the server side.
The client has already negotiated its personal privacy issues and is ready to view the data thus leav-
ing the server to decide what the next appropriate action is. The CIE must first determine what data
will be currently available to the client, based on the current trust assignment. Once the inference
engine has determined this, the inference engine must explore what policies or data can be poten-
tially inferred if the data is given to the client. The primary objective of the CIE is to ponder how the
client might be able to use the information given to it and then guide the server through the process
of deciding a client’s access to resources.
(Figure: an inference engine/confidentiality controller, with confidentiality policies, ontologies, and rules, operating with a semantic web engine over XML/RDF documents, web pages, and databases.)
The confidentiality engine is augmented by an inference controller that examines the policies specified as ontologies and rules,
and utilizes the inference engine embedded in the web rules language, reasons about the applica-
tions and deduces the security violations via inference. In particular, we focus on the design and
implementation of an inference controller where the data is represented as RDF documents.
It should be noted that prior to the work discussed in this book, we designed and developed a
preliminary confidentiality controller in 2005. Here, we utilized two popular semantic web tech-
nologies in our prototype called Intellidimension RDF Gateway and Jena (see [INTE] and [JENA]).
RDF Gateway is a database and integrated web server, utilizing RDF and built from the ground up
rather than on top of existing web servers or databases [RDF]. It functions as a data repository for
RDF data and also as an interface to various data sources, external or internal, that can be queried.
Jena is a Java application programming package to create, modify, store, query, and perform other
processing tasks on RDF/XML documents from Java programs. RDF documents can be created
from scratch or preformatted documents can be read into memory to explore various parts. The
node-arc-node feature of RDF closely resembles how Jena accesses an RDF document. It also has
a built-in query engine designed on top of RDFQL (RDF Query Language) that allows querying
documents using standard RDFQL query statements. Our initial prototype utilized RDFQL while
our current work has focused on SPARQL queries.
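As a small illustration of Jena's node-arc-node view of an RDF document (a sketch, not our prototype code), the following reads a document into memory and walks its statements; the file name is a placeholder.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class JenaWalkExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("document.rdf");

        // Each statement is a node-arc-node triple: subject, predicate, object.
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement st = it.next();
            System.out.println(st.getSubject() + "  " + st.getPredicate() + "  " + st.getObject());
        }
    }
}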
Using these technologies, we specify the confidentiality policies. The confidentiality engine
ensures that the policies are enforced correctly. If we assume the basic framework, then the confi-
dentiality engine will enforce the policies and will not examine security violations via inference. In
the advanced approach, the confidentiality engine will include what we call an inference controller.
While our approach has been to store the data in RDF, as the amount of data to be managed could
become very large over the years, we need to apply the big data management technologies discussed
in Chapter 7. Figure 29.5 illustrates an inference/confidentiality controller for the semantic web that
has been the basis of our book.
(Figure: an inference engine/privacy controller with privacy policies, ontologies, and rules.)
For example, in order to extract information about various individuals and perhaps prevent and/or
detect potential terrorist attacks, data mining tools are being examined. We have heard much about
national security versus privacy in the media. This is mainly due to the fact that people are now
realizing that to handle terrorism, the government may need to collect data about individuals and
mine the data to extract information. Data may be in relational databases or it may be text, video,
and images. This is causing a major concern with the civil liberties union ([THUR02], [THUR05]).
From a technology point of view, a privacy controller could be considered to be identical to the
confidentiality controller we have designed and developed. The privacy controller is illustrated in
Figure 29.6. However, it is implemented at the client side. Before the client gives out information to a
website, it will check whether the website can divulge aggregated information to the third party and
subsequently result in privacy violations. For example, the website may give out medical records
without the identity so that the third party can study the patterns of flu or other infectious diseases.
Furthermore, at some other time, the website may give out the names. However, if the website gives
out the link between the names and diseases, then there could be privacy violations. The inference
engine will make such deductions and determine whether the client should give out personal data
to the website.
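A minimal sketch of such a deduction follows, assuming the Jena API [JENA]: before the client releases a patient's name to the website, the privacy engine checks whether the website already holds a disease for the same record, in which case releasing the name would create the link. The vocabulary and data are hypothetical.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class PrivacyLinkCheck {
    public static void main(String[] args) {
        String ns = "http://example.org/med#";

        // What the client believes the website already holds (placeholder data).
        Model websiteView = ModelFactory.createDefaultModel();
        Resource record = websiteView.createResource(ns + "record123");
        Property hasDisease = websiteView.createProperty(ns, "hasDisease");
        Property hasName = websiteView.createProperty(ns, "hasName");
        record.addProperty(hasDisease, "influenza");

        // Proposed release: the patient's name for the same record.
        if (websiteView.contains(record, hasDisease)) {
            System.out.println("Do not release " + hasName + " for " + record
                    + ": it would link the name to a disease.");
        } else {
            System.out.println("Release permitted.");
        }
    }
}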
As we have stated earlier, privacy violations could also result due to data mining and analysis.
In this case, the challenge is to protect the values of the sensitive attributes of an individual and
make public the results of the mining or analysis. This aspect of privacy is illustrated in Figure 29.7.
A CPT framework should handle both aspects of privacy. Our work on privacy aspects of social
networks addresses privacy violations that could occur in social networks due to data analytics.
It should be noted that the amount of data collected about the individuals might grow rapidly due
to better data collection technologies. This data may be mined and that could result in privacy
(Original data is passed through a randomizer/perturbator to produce randomized/perturbed data.)
FIGURE 29.7 Privacy control for social network mining and analysis.
breaches. Therefore, we need privacy enhanced big data analytics techniques to be integrated with
our reasoning system.
(Figure: an integrated architecture with a client-side inference engine/privacy controller (privacy policies, ontologies, rules, client engine, and client DB), a trust engine shared by client and server, and a server-side inference engine/confidentiality controller (confidentiality policies, ontologies, rules, and a semantic web engine over XML/RDF documents, web pages, and databases).)
and enforced by the trust engine. Figure 29.10 illustrates an integrated architecture for ensuring
CPT for the semantic web. The web server as well as the client have trust management modules. The
web server has a confidentiality engine, whereas the client has a privacy engine. The inference control-
ler is a first step towards an integrated CPT system with XML, RDF, and web rules technologies. Some
details of the modules are illustrated in Figure 29.11. Note that a version of an inference controller
was discussed in Chapter 28.
In Figure 29.11, ontologies, CPT policies, and credentials are given to the expert system such
that the expert system can advise the client or server who should receive access to what particular
resource and how these resources should further be regulated. The expert system will send the poli-
cies to the WCOP (web rules, credentials, ontologies, and policies) parser to check for syntax errors
and validate the inputs. The information contained within the dashed box is a part of the system that
is only included in the Advanced TP&C system. The inference engines (e.g., TIE, PIE, and CIE)
will use an inference module to determine if classified information can be inferred.
29.8 CPT WITHIN THE CONTEXT OF BIG DATA AND SOCIAL NETWORKS
CPT are crucial services that must be built into a big data system such as a social network.
Confidentiality policies will enable the members of the network to determine what information is
to be shared with their friends in the network. Privacy policies will determine what a network can
release about a member, provided these policies are accepted by the member. Trust policies will
provide a way for members of a network to assign trust values to the others. For example, a member
may not share all the data with his/her friends in the network unless he/she trusts those friends. Similarly,
a network may enforce certain privacy policies and if one does not approve of these policies or not
trust the network, he/she may not join the network. Therefore, we see that many of the concepts
discussed in the previous sections are directly applicable to social networks.
If the social networks are represented using semantic web technologies such as RDF graphs,
then the reasoning techniques inherent in technologies such as RDF and OWL can be used to rea-
son about the policies and determine whether any information should be shared with members. In
addition to CPT policies, social networks also have to deal with information-sharing policies. That
is, member John of a network may share data with member Jane, provided Jane does not share with
member Mary. We have carried out an extensive investigation of assured information sharing in
the cloud and are extending this work to social media data and other big data systems. Figure 29.12
illustrates the adaptation of the CPT framework for big data systems such as social media data
systems.
REFERENCES
[AGRA00]. R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining,” SIGMOD Conference, Dallas, TX,
pp. 439–450, 2000.
[ANTO08]. G. Antoniou and F. V. Harmelen, A Semantic Web Primer, MIT Press, Cambridge, MA, 2008.
[DENK03]. G. Denker, L. Kagal, T. Finin, M. Paolucci, and K. Sycara, “Security for DAML Web Services:
Annotation and Matchmaking,” In Proceedings of the International Semantic Web Conference, Sanibel
Island, FL, pp. 335–350, 2003.
[FINI02]. T. Finin and A. Joshi, “Agents, Trust, and Information Access on the Semantic Web,” ACM SIGMOD
Record, (4), 30–35, 2002.
[INTE]. Intellidimension, the RDF Gateway, http://www.intellidimension.com/.
[JENA]. Jena, http://jena.sourceforge.net/.
[KAGA03]. L. Kagal, T. Finin, A. Joshi, “A Policy Based Approach to Security for the Semantic Web,” In
Proceedings of the International Semantic Web Conference, Sanibel Island, FL, 2003.
[KANT03]. M. Kantarcioglu, and C. Clifton, “Assuring Privacy When Big Brother is Watching,” In Proceedings
of Data Mining and Knowledge Discovery (DMKD), San Diego, CA, pp. 829–93, 2003.
[RDF]. RDF Primer, http://www.w3.org/TR/rdf-primer/.
[SWRL]. Semantic Web Rules Language, 2004. http://www.w3.org/Submission/SWRL/.
[THUR02]. B. Thuraisingham, “Data Mining, National Security and Privacy,” ACM SIGKDD Explorations
Newsletter, 4 (2), 1–5, December 2002.
[THUR05]. M. B. Thuraisingham, “Privacy Constraint Processing in a Privacy-Enhanced Database Management
System,” Data and Knowledge Engineering, 55 (2), 159–188, 2005.
[THUR07]. B. Thuraisingham, N. Tsybulnik, A. Alam, “Administering the Semantic Web: Confidentiality,
Privacy, and Trust Management,” International Journal of Information Security and Privacy, 1 (1), 129–
134, 2007.
[W3C]. World Wide Web Consortium, www.w3c.org.
[YU03]. T. Yu, and M. Winslett, “A Unified Scheme for Resource Protection in Automated Trust Negotiation,”
In Proceedings of IEEE Symposium on Security and Privacy, Oakland, CA, pp. 110–122, 2003.
30 Unified Framework
for Secure Big Data
Management and Analytics
30.1 OVERVIEW
In this chapter, we integrate the various parts of a big data system into an automatic framework
for carrying out analytics but at the same time ensuring security. In particular, we integrate the
analytics techniques with the privacy and security techniques discussed in the previous parts. In
developing this framework, we preserve features such as scalability, efficiency, and interoperability.
This framework can be used to execute various policies, including access control policies, redaction
policies, filtering policies, and information-sharing policies, as well as inference strategies. Our
framework can also be used as a testbed for evaluating different policy sets over big data (e.g., social
media) graphs. Our recent work discussed in [THUR15] proposes new mechanisms for developing
a unifying framework of data provenance expressed as RDF graphs. Some of our design techniques
were also discussed in Chapter 28. These methods were applied to social media systems represented
using semantic web technologies, and the approach was discussed in [THUR16]. In this chapter, we
adapt the discussions in our previous works (e.g., [THUR15] and [THUR16]) for big data systems.
The framework we present in this chapter is in the design stages. Specifically, we give guidelines
for policy processing for big data systems, as well as for metadata that includes data provenance,
covering access control, inference control, and information sharing.
such as risk-based access control and inference into such a framework. In addition, we can also
incorporate privacy-aware data analytics for big data systems such as social media systems. Our
ultimate goal is to develop big data systems that not only carry out access control and inference
control but also information sharing and risk-based policy processing, as well as privacy-aware
analytics. Before we present our framework, we will describe various aspects of integrity and data
provenance for big data systems. Our framework will incorporate several features such as security,
privacy, trust, and integrity for big data systems. While confidentiality, privacy, and trust have been
discussed in Chapter 29, in this chapter, we will focus on integrity and data provenance and dis-
cuss how the various components can be put together into a unified framework for secure big data
systems.
The organization of this chapter is as follows. In Section 30.2, we discuss aspects of integrity and
data provenance for big data systems. In Section 30.3, we discuss our framework. Aspects of what
we call our global inference controller will be discussed in Section 30.4. Such an inference control-
ler will handle unauthorized inference during access control as well as during data sharing. This
chapter is summarized in Section 30.5. Figure 30.1 illustrates the contents of this chapter.
(Figure 30.1 components: a unified framework for BDMA comprising integrity management and provenance for big data, an integrated architecture for big data, and security and privacy control.)
FIGURE 30.1 Unified framework for big data management and analytics.
When transactions execute concurrently, the consistency of the data has to be ensured. When a transaction aborts, it has to be ensured that the
database is recovered from the failure into a consistent state. Integrity constraints are rules that have
to be satisfied by the data. Rules include “salary value has to be positive” and “age of an employee
cannot decrease over time.” More recently, integrity has included data quality, data provenance,
data currency, real-time processing, and fault tolerance.
Integrity management is essential for big data systems such as social media systems in order to
provide accurate and timely information to its users. For example, when users want to share infor-
mation with their friends, they may want to share a certain version or the most recent version of
the data. Furthermore, the member of the network may copy data from other sources and post it on
their social media pages. In such situations, it would be useful to provide the sources of the infor-
mation as well as from where the information was derived. In this section, we discuss aspects of
integrity for big data as well as implementing integrity management as cloud services. For example,
how do we ensure the integrity of the data and the processes? How do we ensure that data quality
is maintained?
(Figure: aspects of integrity: concurrency control and recovery; integrity of the agents; integrity of the websites; data quality and provenance.)
Data Recovery: When transactions abort before they complete execution, the database should
be recovered to a consistent state, such as the state it was in before the transaction started execution. Several
recovery techniques have been proposed to ensure the consistency of the data.
Data Authenticity: When the data is delivered to the user, its authenticity has to be ensured. That
is, the user should get accurate data and the data should not be tampered with. We have conducted
research on ensuring the authenticity of XML data during third party publishing [BERT04].
Data Completeness: Data that a user receives should not only be authentic but also be complete.
That is, everything that the user is authorized to see has to be delivered to the user.
Data Currency: Data has to be current. That is, data that is outdated has to be deleted or archived
and the data that the user sees has to be current data. Data currency is an aspect of real-time pro-
cessing. If a user wants to retrieve the temperature, he has to be given the current temperature, not
the temperature that is 24 hours old.
Data Accuracy: The question is how accurate is the data? This is also closely related to data qual-
ity and data currency. That is, accuracy depends on whether the data has been maliciously corrupted
or whether it has come from an untrusted source.
Data Quality: Is the data of high quality? This includes data authenticity, data accuracy, and
whether the data is complete or certain. If the data is uncertain, then can we reason with this uncer-
tainty to ensure that the operations that use the data are not affected? Data quality also depends on
the data source.
Data Provenance: This has to do with the history of the data, that is, from the time the data
originated (e.g., emanating from the sensors) until the present time when it is given to the general user.
The question is who has accessed the data? Who has modified the data? How has the data traveled?
This will determine whether the data has been misused.
Integrity Constraints: These are rules that the data has to satisfy, such as the age of a person can-
not be a negative number. This type of integrity has been studied extensively by the database and
the artificial intelligence communities.
Fault Tolerance: As in the case of data recovery, the processes that fail have to be recovered.
Therefore, fault tolerance deals with data recovery as well as process recovery. Techniques for fault
tolerance include check pointing and acceptance testing.
Real-time Processing: Data currency is one aspect of real-time processing where the data has to
be current. Real-time processing also has to deal with transactions meeting timing constraints. For
example, stock quotes have to be given within say 5 min. If not, it will be too late. Missing timing
constraints could cause integrity violations.
the policies. For example, at the unclassified level, we may say that the source is trustworthy, but at the
secret level, we know that the source is not trustworthy. The inference controllers that we have devel-
oped could be integrated with the theories of interceding developed for data quality to ensure security.
Next, let us examine data provenance. For many of the domains including medical and health-
care, as well as defense, where the accuracy of the data is critical, we need to have a good under-
standing as to where the data came from and who may have tampered with the data. As stated in
[SIMM05], data provenance, a kind of metadata sometimes called “lineage” or “pedigree,” is “the
description of the origin of a piece of data and the process by which it arrived in a database.” Data
provenance is information that helps determine the derivation history of a data product, starting
from its original source.
Provenance information can be applied to data quality, auditing, and ownership, among others.
By having records of who accessed the data, data misuse can be determined. Usually, annotations
are used to describe the information related to the data (e.g., who accessed the data? where did the
data come from?). The challenge is to determine whether one needs to maintain coarse-grained
provenance data or fine-grained provenance data. For example, in a coarse-grained situation, the
tables of a relation may be annotated, whereas in a fine-grained situation, every element may be
annotated. There is, of course, the storage overhead to consider for managing provenance. XML,
RDF, and OWL have been used to represent provenance data, and this way the tools developed for
the semantic web technologies may be used to manage the provenance data.
There is much interest in using data provenance for misuse detection. For example, by main-
taining the complete history of data such as who accessed the data, when and where was the data
accessed, one can answer queries such as “who accessed the data between January and May 2010.”
Therefore, if the data is corrupted, one can determine who corrupted the data or when the data
was corrupted. Figure 30.3 illustrates the aspects of data provenance. We have conducted exten-
sive research on representing and reasoning about provenance data and policies represented using
semantic web technologies ([CADE11a], [CADE11b], [THUR15]).
Integrity Constraint Policy: An example is: The age of an employee cannot be a negative number.
In the relational model, this could be represented as
EMP.AGE > 0.
In XML, this could be represented as the following:
<Condition Object="//Employee/Age">
  <Apply FunctionId="greater-than">
    <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#integer">0
    </AttributeValue>
  </Apply>
</Condition>
Data Quality Policy: The quality of the data in the employee table is LOW.
In the relational model, this could be represented as
EMP.Quality = LOW.
In XML, this policy could be represented as
<Condition Object="//Employee/Quality">
  <Apply FunctionId="equal">
    <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">LOW
    </AttributeValue>
  </Apply>
</Condition>
Data Currency: An example is: The salary value of EMP cannot be more than 365 days old. In
XML, this could be represented as
<Condition Object="//Employee/Salary">
  <Apply FunctionId="AGE">
    <Apply FunctionId="less-than-or-equal">
      <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#integer">365
      </AttributeValue>
    </Apply>
  </Apply>
</Condition>
The above examples have shown how certain integrity policies may be specified. Note that there are
many other applications of semantic web technologies to ensure integrity. For example, in order to
ensure data provenance, the history of the data has to be documented. Semantic web technologies
such as XML are being used to represent say the data annotations that are used to determine the
quality of the data or whether the data has been misused. That is, the data captured is annotated
with metadata information such as what the data is about, when it was captured, and who captured
it. Then, as the data moves from place to place or from person to person, the annotations are updated
so that, at a later time, the data may be analyzed for misuse. These annotations are typically repre-
sented in semantic web technologies, such as XML, RDF, and OWL.
Another application of semantic web technologies for integrity management is the use of ontolo-
gies to resolve semantic heterogeneity. That is, semantic heterogeneity causes integrity violations.
This happens when the same entity is considered to be different at different sites and therefore
compromises integrity and accuracy. Through the use of ontologies specified in say OWL, it can be
expressed that a ship in one site and a submarine in another are one and the same.
Semantic web technologies also have applications in making inferences and reasoning under
uncertainty or mining [THUR15]. For example, the reasoning engines based on RDF, OWL, or
say rules may be used to determine whether the integrity policies are violated. We have discussed
inference and privacy problems and building inference engines in earlier chapters. These techniques
have to be investigated for the violation of integrity policies.
[Figure: Applying integrity policies to big data services (e.g., social media data services) to develop high-integrity big data services.]
[Figure: Modules of the integrated inference controller: user interface manager, policy manager, inference engine, risk manager, other modules (e.g., adversarial miner), access control manager, information sharing manager, redaction manager, data controller, and provenance controller.]
Data Controller: The data controller stores and manages the data; it accepts a request for
the data item. This layer then executes the request over the stored data and returns results back to
the policy layer (and/or the inference engine layer) where it is re-evaluated based on a set of policies.
Provenance/Metadata Controller: The provenance/metadata controller is used to store and manage
provenance/metadata information that is associated with data items that are present in the data
controller. In the case when we select a graph representation of provenance, the provenance control-
ler stores information in the form of logical graph structures in any appropriate data representation
format. This controller also records the ongoing activities associated with the data items stored in
the data controller. This controller takes as input a graph query and evaluates it over the provenance
information. This query evaluation returns a subgraph back to the inference controller layer where
it is re-examined using a set of policies.
User-Interface Manager: The user-interface module provides a layer of abstraction that allows
a user to interact with the system. The user interacts with the system via a user-interface layer.
This layer accepts a user’s credentials and authenticates the user. Our interface module hides the
actual internal representation of our system from a user by providing a simple question−answer
mechanism. This mechanism allows the user to pose standard provenance queries such as why
a data item was created, where in the provenance graph it was generated, how the data item was
generated, and when and at what location it was created. This layer also returns results after they
have been examined against a set of policies. Essentially, the user-interface manager is responsible
for processing the user’s requests, authenticating the user, and providing suitable responses back to
the user. The interface manager also provides an abstraction layer that allows a user to interact with
the system. A user can therefore pose either a data query or a provenance/metadata query to this
layer. The user-interface manager also determines whether the query should be evaluated against
the traditional data or provenance.
Policy Manager: The policy module is responsible for enforcing any high-level policy defined by a
high-level application user or administrator. The policies are not restricted to any particular security
policy definition, model or mechanism. In fact, we can support different access control policies, for
example, role-based access control (RBAC), access control based on context such as time (TRBAC),
location (LBAC), etc. Besides the traditional and well-established security models built on top of
access control mechanisms, we also support redaction policies that are based on sharing data for
the ongoing mutual relationships among businesses and stakeholders. The policy layer also interacts
with any reasoners in the inference layer which offer further protection against inference attacks. The
inference layer enforces policies that are in the form of DL constraints, OWL restrictions or SWRL
rules. We also observe that some of the access control policies can be expressed as inference rules or
queries via query rewrite or views. Our policy module therefore has many layers equipped with secu-
rity features, thus ensuring we are enforcing the maximal protection over the underlying provenance
store. The policy module also handles the information-sharing policies.
Essentially, the policy manager is responsible for ensuring that the querying user is authorized to
use the system. It evaluates the policies against a user’s query and associated query results to ensure
that no confidential information is released to unauthorized users. The policy manager may enforce
the policies against the traditional data or against the provenance data. Each data type may have its
own policy manager; for example, the traditional data may be stored in a different format from the
provenance data. Hence, we may require different implementations for each policy manager.
Inference Engine: The inference engine is the heart of the inference controller. The engine is
equipped to use a variety of inference strategies that are supported by a particular reasoner. Since
there are many implementations of reasoners available, our inference controller offers an added
feature of flexibility, whereby we can select from among any reasoning tool for each reasoning task.
We can improve the efficiency of the inference controller since each inference strategy (or a combi-
nation of strategies) could be executed on a separate processor. An inference engine typically uses
software programs that have the capability of reasoning over some data representation, for example,
a relational data model or an RDF graph model representation.
The inference problem is an open problem, and much research has centered on implementations
based on traditional databases ([MARK96], [HINK97]). However, since provenance has
a logical graph structure, it can also be represented and stored in a graph data model; it is therefore
not limited to any particular data format. Although our focus in this chapter is on building an infer-
ence controller over the directed graph representation of provenance, our inference controller could
be used to protect the case when provenance is represented and stored in a traditional relational
database model. Also, the use of an RDF data model does not overburden our implementation with
restrictions, since other data formats are well served by an RDF data model. Furthermore, there are
tools to convert say relational data into RDF and vice versa (see e.g., [D2RQ]).
Query Manager: The query processing module is responsible for accepting a user’s query, pars-
ing it and submitting it to the provenance knowledge base. After the query results are evaluated
against a set of policies, they are returned to the user via the user-interface layer. The query processing
module can accept any standard provenance query as well as any query written in the SPARQL
format. The querying user is allowed to view the errors that are due to the syntax of a query, as well
as the responses constructed by the underlying processes of the inference controller.
Information Sharing Manager: The information-sharing manager will implement the informa-
tion-sharing policies. For example, if organization A wants to share data with organization B, then
the information-sharing controller will examine the policies via the policy manager, determine
whether there are any unauthorized inferences by communicating with the inference engine, and
determine whether data is to be given to organization B.
Access Control Manager: This access control module is responsible for determining whether the
user can access the data. The access control policies are obtained via the policy manager. The infer-
ence engine will determine whether any unauthorized information will be released by carrying out
reasoning. The results are given to the user via the user-interface manager.
Redaction Manager: This module will determine which data has to be redacted before it is given
to the user. It operates in conjunction with the access control manager. It also examines the informa-
tion that has been released previously and determines whether the new information obtained as a
result of executing the query should be given to the user.
Risk Analyzer: The risk analyzer will compute the risks of releasing the information and make
a determination as to whether the information should be released to the user. It interacts with other mod-
ules, such as the access control manager, the redaction manager, and the information-sharing man-
ager, in making this determination. The results of the risk analyzer are then given to the access
control manager, the redaction manager, and the information-sharing manager to execute the results.
Adversarial Data Miner: This module will implement the strategies to mine the adversary to see
what his/her motives are. An adversary could be a human or some malicious code. In particular, such
a data miner will determine how to thwart the adversary as well as apply game theoretic reasoning in
determining what information is to be released to the user. It will work jointly with the inference engine.
In addition, Pellet provides all the standard inference services that are traditionally provided by
DL reasoners. These are
• Consistency checking: This ensures that an ontology does not contain any contradictory
facts. The OWL 2 Direct Semantics provide the formal definition of ontology consistency
used by Pellet.
• Concept satisfiability: This determines whether it is possible for a class to have any
instances. If a class is unsatisfiable, then defining an instance of that class will cause the
whole ontology to be inconsistent.
• Classification: This computes the subclass relations between every named class to create
the complete class hierarchy. The class hierarchy can be used to answer queries such as
getting all or only the direct subclasses of a class [SIRI07].
• Realization: This finds the most specific classes that an individual belongs to; that is, real-
ization computes the direct types for each of the individuals. Realization can only be per-
formed after classification since direct types are defined with respect to a class hierarchy
[SIRI07]. Using the classification hierarchy, it is also possible to get all the types for each
individual.
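As an illustration only (not our implementation), the services listed above can be exercised from Python through a library such as owlready2, which can invoke the Pellet reasoner when a Java runtime is available; the ontology IRI, class names, and disjointness axiom below are made up for the example.
from owlready2 import get_ontology, Thing, AllDisjoint, default_world, sync_reasoner_pellet

onto = get_ontology("http://example.org/policy-onto.owl")   # hypothetical IRI

with onto:
    class Ship(Thing): pass
    class Submarine(Thing): pass
    AllDisjoint([Ship, Submarine])
    # A subclass of two disjoint classes can have no instances (unsatisfiable).
    class HybridVessel(Ship, Submarine): pass

with onto:
    sync_reasoner_pellet()   # consistency checking, classification, and realization

# Concept satisfiability: unsatisfiable classes are reported here (expect HybridVessel).
print(list(default_world.inconsistent_classes()))
# Classification: the inferred class hierarchy can then be queried.
print(list(Ship.subclasses()))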
The global inference controller has to reason with big data. Its operation has to be timely.
Therefore, we propose a cloud-based implementation of such an inference controller. Our ultimate
goal is to implement the entire inference controller in the cloud.
REFERENCES
[BERN87] P. Bernstein et al., Concurrency Control and Recovery in Database Systems. Addison-Wesley, MA,
1987.
[BERT04] E. Bertino et al., “Secure Third Party Publication of XML Documents,” IEEE Transactions on
Knowledge and Data Engineering, 16 (10), 1263–1278, 2004.
[BIZE03] C. Bizer, D2R MAP-A database to RDF mapping language. (WWW Posters) 2003.
[CADE11a] T. Cadenhead, V. Khadilkar, M. Kantarcioglu, and B.M. Thuraisingham, “A Language for
Provenance Access Control,” In CODASPY’ 2011: Proceedings of the 1st ACM Conference on Data and
Application Security and Privacy, pp. 133–144, San Antonio, TX, USA, 2011.
[CADE11b] T. Cadenhead, V. Khadilkar, M. Kantarcioglu, and B. M. Thuraisingham, “Transforming
Provenance Using Redaction,” In SACMAT’2011: Proceedings of the 16th ACM Symposium on Access
Control Models and Technologies, pp. 93–102, Innsbruck, Austria, 2011.
[CWM] Closed World Machine, http://www.w3.org/2001/sw/wiki/CWM.
[D2RQ] D2RQ: Accessing Relational Databases as Virtual RDF Graphs, http://d2rq.org/.
[HAAR01] V. Haarslev and R. Möller, “RACER System Description,” In IJCAR’01: Proceedings of the 1st
International Joint Conference on Automated Reasoning, pp. 701–706, Springer-Verlag, London, 2001.
[HINK97] T. H. Hinke, H. S. Delugach, and R. P. Wolf, “Protecting Databases from Inference Attacks,”
Computers & Security, 16 (8), 687–708, 1997.
[MARK96] D. G. Marks, “Inference in MLS Database Systems,” IEEE Transactions on Knowledge and Data
Engineering, 8 (1), 46–55, 1996.
[PON] R. K. Pon and A. F. Cárdenas, “Data Quality Inference,” In IQIS ‘05 Proceedings of the 2nd International
Workshop on Information Quality in Information Systems, Baltimore, MD, pp. 105–111, 2005.
[SHEA08] R. Shearer, B. Motik, and I. Horrocks, “HermiT: A Highly-Efficient OWL Reasoner,” OWLED’08:
Proceedings of the 5th OWLED Workshop on OWL, 432 (91), 378, 2008.
[SIMM05] Y.L. Simmhan, B. Plale, and D. Gannon, “A Survey of Data Provenance in E-Science,” Indiana
University Technical Report, ACM SIGMOD Record, 34 (3), 31−36, 2005.
[SIRI07] E. Sirin, B. Parsia, B.C. Grau, A. Kalyanpur, and Y. Katz, “Pellet: A practical Owl-Dl Reasoner,” Web
Semantics: Science, Services and Agents on the World Wide Web, 5 (2), 51–53, 2007.
[STAD07] J. Staddon, P. Golle, and B. Zimny, “Web-Based Inference Detection,” In Proceedings of 16th
USENIX Security Symposium, Article No. 6, Boston, MA, 2007.
[THUR15] B. Thuraisingham et al., Secure Data Provenance and Inference Control with Semantic Web
Technologies. CRC Press, Boca Raton, FL, 2015.
[THUR16] B. Thuraisingham et al., Analyzing and Securing Social Networks. CRC Press, Boca Raton, FL,
2016.
[TSAR06] D. Tsarkov and I. Horrocks, “FaCT++ Description Logic Reasoner: System Description,”
In IJCAR’06: Proceedings of the 3rd International Joint Conference on Automated Reasoning,
pp. 302–307, Seattle, WA, 2006.
31 Big Data, Security, and
the Internet of Things
31.1 INTRODUCTION
In this chapter, we will continue to discuss the next steps in BDMA and BDSP. In particular, we
will provide an overview of the Internet of things (IoT) and various security issues for IoT as well
as discuss the big data problem for IoT systems. IoT is one of the most rapidly growing technologies today,
with anywhere from a few to millions of devices connected through cyberspace. These devices
may include computers, controllers, smartphones, and embedded devices ranging from those used
in smart grids to smart homes. Managing the connections of these devices as well as handling the
large amounts of data generated by the devices has become a daunting challenge for corporations.
For example, AT&T has stated that “as of Q1 2016 there are 27.8 million devices on the AT&T
Network [ATT] and they expect the numbers of connected devices to grow from tens of millions to
hundreds of millions to billions.” Figure 31.1 shows a sample topology for IoT. Data is first collected
from sensors and then aggregated. Finally, this data is analyzed.
The increasing complexity of cyberspace due to the IoT with heterogeneous components, such as
different types of networks (e.g., fixed wired networks, mobile cellular networks, and mobile ad hoc
networks), diverse computing systems (e.g., sensors, embedded systems, smartphones, and smart
devices), and multiple layers of software (e.g., applications, middleware, operating systems [OSs],
and hypervisors), results in massive security vulnerabilities as any of the devices or the networks
or the data generated could be attacked. Adversaries will increasingly move into the cyberspace for
IoT and will target all cyber-based infrastructures, including energy, transportation, and financial
and health care infrastructures. Providing cyber security solutions for managing cyber conflicts
and defending against cyber attacks for IoT in such a complex landscape is thus a major challenge.
We discuss our approach of a cyber-defense framework for IoT systems based on a layered archi-
tecture. In particular, we discuss a layered security framework for IoT applications. The goals are to
(i) develop techniques for secure networks (both wired and wireless), hardware, software, and sys-
tems as well as data sources when faced with attacks and (ii) develop analytics solutions for detect-
ing the attacks. It should be noted that there are several reported efforts on IoT security. But many of
these projects focus on just one aspect (e.g., hardware, network, software, or data). We believe that
we need an integrated framework to solve the challenging security problems.
While securing each layer of our framework is important for security, we believe that the IoT
challenges are mainly data challenges. This is because the IoT devices generate data and combining
all the devices together will make it a big data problem. This data has to be analyzed and secured.
In addition, threat/attack data is also collected for IoT devices from all the layers. That is, the hard-
ware, network, and systems layers generate threat/attack data that has to be integrated and analyzed
to determine whether there are anomalies. Therefore, much of our focus in this chapter is on data-
related security issues for IoT.
The organization of this chapter is as follows. Use cases covering various systems are given in
Section 31.2. A layered security framework is discussed in Section 31.3. Data security challenges
are discussed in Section 31.4. Scalable analytics for security applications are discussed in Section
31.5. This chapter is summarized in Section 31.6. It should be noted that there is no one architecture
for an IoT. The architecture will depend on whether the IoT is for a smart home, a smart grid, or
some other critical infrastructure. Therefore, the security solutions will depend on the particular
architecture; for example, when the network is attacked, a device in a smart grid may be able to
switch to another network or the system may divert the critical resources to another location. Such
agility may not be possible say for a smart home IoT. Therefore, such considerations have to be taken
into account when devising security solutions for IoT.
like surveillance cameras and satellite data, and mobile sensors like GPS-equipped vehicles and
automatic vehicle location techniques. In order to meet the requirements of modern transportation
algorithms, these sensors are collecting data at unprecedented levels of granularity, and because
sensing is passive, users are generally unaware of the privacy risks.
Smart home: A typical home area network (HAN) can connect a set of devices such as fridges,
smart meters, printers, thermostats, streaming clients, and set top boxes. The diversity of devices
connected to these networks is expected to increase rapidly in the next couple of years. As new
devices enter the market, the proliferation of heterogeneous technologies poses serious challenges
for interactions among devices following different specifications and standards.
Smart homes will provide several new opportunities and functionalities for users, but at the same
time, they present several security challenges that need to be addressed. These challenges include
(1) secure interoperability among a wide range of manufacturers and service providers, (2) trust and
authentication, (3) usability of security solutions, and (4) access control.
Consider, for example, a home network that controls appliances, heating and cooling, and light-
ing. In most networks available to consumers, there is typically a lack of fine-grained access control;
access to the central controller implies total control of the system. However, adding such fine-
grained access control runs the danger of making the system unusable.
There have been many examples of security problems, including smart refrigerators that give
Google Calendar user passwords to untrusted devices, smartphone apps that control locks and tem-
perature in homes that allow impersonators, home cameras sharing private images with anyone,
insurance dongles in cars accepting software updates from untrusted services, and home alarm
systems that allow attackers to intercept and modify messages.
Control systems: Control systems are computer-based systems that monitor and control physical
processes. These systems represent a wide variety of networked information technology (IT) sys-
tems connected to the physical world. Depending on the application, these control systems are also
called process control systems (PCS), supervisory control and data acquisition (SCADA) systems
(in industrial control or in the control of the critical infrastructures), distributed control systems
(DCS), or cyber-physical systems (CPS; to refer to embedded sensor and actuator networks).
Control systems are usually composed of a set of networked agents, consisting of sensors, actua-
tors, control processing units such as programmable logic controllers (PLCs) and communication
devices. For example, the oil and gas industry uses integrated control systems to manage refining
operations at plant sites, remotely monitor the pressure and flow of gas pipelines, and control the
flow and pathways of gas transmission. Water utilities can remotely monitor well levels and control
the wells pumps; monitor flows, tank levels, or pressure in storage tanks; monitor pH, turbidity, and
chlorine residual; and control the addition of chemicals to the water.
Several control applications can be labeled as safety-critical; their failure can cause irreparable
harm to the physical system being controlled and to the people who depend on it. SCADA systems,
in particular, perform vital functions in national critical infrastructures, such as electric power
distribution, oil and natural gas distribution, water and wastewater treatment, and transportation
systems. They are also at the core of health care devices, weapons systems, and transportation man-
agement. The disruption of these control systems could have a significant impact on public health
and safety and lead to large economic losses.
Control systems are now at a higher risk to cyber attacks because their vulnerabilities are
increasingly becoming exposed and available to an ever-growing set of motivated and highly skilled
attackers.
Smart grid: Smart grid refers to multiple efforts around the globe to modernize aging power grid
infrastructures with new technologies, enabling a more intelligently networked automated system.
The goal of a smart grid is to deliver energy with greater efficiency, reliability, and security, and
provide more transparency and choice to electricity consumers.
This modernization leverages recent advances in IT, wireless communications, and embedded
systems (sensors and actuators with processing power, capable of communicating with each other).
These new technologies will provide real-time monitoring of the health of the power grid, collect
and analyze data for better analytics and control, and accommodate the integration of new forms
of energy supply (such as renewable sources) and delivery (energy storage and dynamic pricing).
While the smart grid promises many benefits, it raises many new security and privacy chal-
lenges. With the large-scale deployment of ubiquitous, remotely accessible networked devices to
monitor and control the grid, it will be easier for attackers to find vulnerable points, including new
smart devices, to access various parts of the grid and so the attack surface of the power grid will
be vastly increased. Also, the new functionalities provided by the new devices such as the remote
disconnect option provided by many smart meters may be exploited by the attackers resulting in
major security risks to the system.
There are also many new privacy concerns related to smart grid deployments. The fine-grained
energy usage data collected by new devices including smart meters, smart appliances, and electric
cars will result in new privacy threats to consumers, especially because of the large-scale, more
detailed, and more frequent collection of the usage data.
[Figure: Layered security framework for IoT: applications, data, network, software, system, and hardware layers, with streaming data from IoT devices feeding threat analysis.]
smart devices in a home network may be lightweight, those in a smart grid (e.g., for telecom, oil and
gas, and power) may have more heavyweight systems.
Next, we have the network layer. IoT devices have a variety of network stacks and communi-
cation options, from embedded wireless communications in IEEE 802.15.4 to lightweight IPv6
packets for devices with small communication frames such as those provided by 6LoWPAN, new
routing protocols such as the new IETF RPL standard from the ROLL working group, and lightweight IP-based network stacks
like LwIP. Networking additions to IoT devices are new and the security analysis of all these proto-
cols is still ongoing. Network security, firewalls, and intrusion detection for IoT gateway hubs will
be essential in future IoT deployments. Software-defined networking (SDN) solutions can be lever-
aged for orchestration and to launch desired defenses on demand. Due to the developments with
SDN, we can monitor communications at a smart hub or IoT router and reconfigure the network in
case we see threats or new vulnerabilities. Our work will continue this research by looking at cur-
rent specifications and identifying threats and possible improvements to guarantee secure network
operations. Specifically, we will focus on proactive security using multipath coding.
Next is the data layer. This layer ensures features such as access control to the resources and that
the privacy of the individuals who access the IoT system are protected. For example, issues such
as privacy-preserving data management and novel access control models need to be investigated
for data management. Another aspect of data management is analyzing all the data collected. This
aspect is represented in the threat management layer. That is, the devices are connected to the data
centers via the wireless network and the data collected by the devices are sent to the data center for
analysis. The applications (discussed in the use cases) drive the hardware, software, systems, and
networks.
Our security investigation will focus on each of the layers under varying assumptions for IoT.
The challenges for each of the layers as well as for threat analysis will be discussed in Section 31.4.
For IoT users to allow data sharing, they must have trust that their data will be protected in the
manner that they intend and will not be vulnerable to attacks. Also, users must be able to compre-
hend the policy management options so that they understand their choices and the possible con-
sequences of their selections. Because these options involve the data that can be collected and the
contexts in which they can be collected, as well as who will be allowed access to their data, policy
specification by the user will necessarily be a complex process. Consequently, the interactions with
the interface necessary for a user to define his/her policies and set the data management options
need to be as straightforward as possible and allow for personalization. Not all users may desire to
have detailed control over their personal information, so the interface needs to support decisions at
different levels of granularity.
Our approach to securing IoT data will consist of multiple components. The first component
consists of an access control system supporting the specification and enforcement of access control
policies for IoT data as well as the evolution and merging of different policies ranging from event-
based sharing (e.g., share only when heart attack is detected) to emergency data release for critical
situations (e.g., access to precise locations of certain cars when a traffic accident is detected). The
second component is a data analytics system able to work on sanitized data. An important issue is
to assess the quality of data analytics models derived from sets of sanitized data. More importantly,
it is crucial to determine whether and how a data-mining model built from a large volume of IoT
data coming from many devices can then be refined by using a nonanonymized specific IoT device
to provide value to users. For example, based on the aggregate electricity usage data collected from
a given region and the user’s specific data, the most cost-effective plan can be recommended to the
user. The goal would be to create global data-mining models based on the sanitized data of multiple
IoT devices and then specialize these models for each specific IoT device to find optimal configura-
tions for a given task.
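As a purely illustrative sketch of the first component above (the policy names, resources, and events are assumptions, and a real policy engine would be far richer), event-based sharing and emergency data release can be expressed as predicates evaluated over an access request:
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Request:
    subject: str        # who is asking
    resource: str       # e.g., "heart_rate", "car_location"
    context: Dict       # current events, e.g., {"heart_attack": True}

# Each policy is a predicate over the request; any satisfied policy grants release.
policies: Dict[str, Callable[[Request], bool]] = {
    "event_based_sharing":
        lambda r: r.resource == "heart_rate" and r.context.get("heart_attack", False),
    "emergency_release":
        lambda r: r.resource == "car_location" and r.context.get("traffic_accident", False),
    "owner_access":
        lambda r: r.subject == "owner",
}

def authorize(request: Request) -> bool:
    return any(rule(request) for rule in policies.values())

print(authorize(Request("hospital", "heart_rate", {"heart_attack": True})))   # True
print(authorize(Request("hospital", "heart_rate", {})))                        # False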
Initially, we will build a framework where IoT data streams coming from different sources
will be analyzed locally. Using the computational power of the IoT devices (many devices will
have some computation power and small storage), important statistics will be computed locally
for detecting various events. In addition, as needed, important data and statistics will be sent
to a cloud-based service in encrypted format for storage. Later on, these data and statistics will
be shared based on the policies and events registered by the user. We will also develop cloud-
based secure data-mining techniques. Based on the data-sharing policies, the data that will be
submitted to data-mining models will be automatically extracted. In certain scenarios, the data
will also be aggregated using the information coming from the peers (i.e., information com-
ing from nearby IoT devices). In some cases, the information will be sanitized by adding noise
before sharing with the cloud-based services (e.g., by leveraging randomized response-based
differential-privacy techniques [DU03]). This locally computed information will be shared
using secret sharing-based techniques with multiple servers located on the public cloud, using
services such as Tor [TOR] so that linking the data back to a particular IoT device will be harder.
Furthermore, hacking into any one server will not disclose any sensitive information. Once the
data is sent to multiple servers using secret sharing mechanisms, secret sharing-based secure
protocols will be executed among the servers to build the global data-mining models from IoT
data. To scale to a large number of users, we need to combine effective sampling and random-
ized response techniques in conjunction with secret sharing-based secure multiparty ideas. For
example, we may want to combine IoT data from a certain subpopulation (e.g., smart meters) to
build a linear regression model to understand the relationship between power usage and income
on Sundays in say North Dallas.
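The following sketch (illustrative parameters only, not our implementation) shows the two building blocks just mentioned: randomized response-style perturbation of a local statistic before sharing, and additive secret sharing of a value across multiple servers so that no single server learns it.
import random

MOD = 2**31 - 1   # illustrative modulus for the shares

def randomized_response(bit, p_truth=0.75):
    """Report the true bit with probability p_truth; otherwise report its flip."""
    return bit if random.random() < p_truth else 1 - bit

def additive_shares(value, n_servers=3):
    """Split value into n shares that sum to value (mod MOD); a single share reveals nothing."""
    shares = [random.randrange(MOD) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# A smart meter perturbs an indicator ("usage above threshold?") before sharing it...
reported = randomized_response(1)
# ...and splits its daily usage total across three cloud servers.
shares = additive_shares(4217)
assert sum(shares) % MOD == 4217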
For example, on the one hand, smart meters and other sensors continuously create physical sys-
tem stream data (e.g., video, audio, pressure, voltage, altitude, etc.); on the other hand, (cyber)
data can include logs, NetFlow data, content metadata, telemetry, etc. These new sensors are
constantly collecting more data, and as a result we have the need to analyze heterogeneous cor-
related information. Examples include (1) video signal and audio signal to detect events, target
tracking, etc., (2) sensor signal (e.g., power consumption) and cyber signals (e.g., network intru-
sion detection data) to identify faults or potential attacks, and (3) multiple video signals from
different angles having different and partial views of a scene. Furthermore, cyber threat/attack
detection in a dynamic, heterogeneous IoT environment is a challenging task. The traditional
detection approach based on prior knowledge from domain experts (e.g., signature-based detec-
tion) may capture some types of existing malicious activities (including threats) but may not be
effective against untypical or stealthy attacks. As a more advanced approach, a unified learning
framework can learn from data and incidents on the fly which may better detect zero-day or
untypical malicious activities.
As stated in Section 31.1, Figure 31.1 shows a generic IoT topology where data is first collected
from sensors (data acquisition) and devices. Second, the collected data passes through a gateway (data
aggregation) which may also involve some basic and localized data analysis. Third, the processed
data is sent to a cloud for more complex analytics. These analytics will involve data from multiple
gateways. An example of a real-life implementation may be using a Raspberry Pi as a gateway to
collect temperature data from TMP36 sensors and send it to the cloud. In the cloud, a big streaming
analytics framework like Spark running on a Google Compute Engine will be used for further
processing and to retrieve meaningful insights about the data. End users will be able to view the
analytics result in real (or near-real) time through apps. The tasks for analytics are briefly enumer-
ated below:
Our goal is to build a comprehensive streaming analytics framework with applications to anom-
aly detection for IoT threat detection. Characteristics of streaming data are as follows:
• Heterogeneity: A sensor network deployed over a geographical area may consist of diverse
types of sensor nodes with different degrees of dissimilarity of sourced data. This may
add complexity in data representation and evaluation.
• Scalability: Mining on large networks is computationally intensive and requires a signifi-
cant amount of resources. This makes real-time data analytics challenging.
• Concept drift and Concept evolution: Concept drift occurs in data streams when the under-
lying data distribution changes over time [WANG03]. Concept-evolution occurs when new
classes evolve in streams [MASU11]. Furthermore, the topology of a network may also
change (e.g., due to mobility). Therefore, statistical models representing the data streams
distribution should adapt continuously to reduce misclassification errors.
• Infinite length: As streaming data occurs continuously, the streams can be assumed to be
of unbounded length, making main memory data storage difficult.
• Energy-efficient communication: Communicating data is among the most energy-expen-
sive routines across different types of wireless sensor networks (WSN). For instance,
receiving and transmitting data in WSN consisting of Mica2 nodes running TinyDB appli-
cations constitutes about 59% of the total energy consumption [SHNA04]. Reducing the
amount of data transmitted and/or energy consumed per transmission could lead to longer
network lifetimes and significantly impact network functions. Therefore, we need to design
energy-efficient communication for battery-powered wireless networks to significantly
affect network performance and lifetime.
Neither multistep methodologies and techniques nor multiscan algorithms suitable for typical
knowledge discovery and data mining can be readily applied to data streams due to well-known
limitations such as bounded memory, online data processing, and the need for one-pass techniques
(i.e., forgotten raw data). In spite of the success and extensive studies of stream-mining techniques,
there is no effort (to the best of our knowledge) that focuses on a unified study of new challenges
introduced by evolving data streams such as change detection, novelty detection, feature evolution/
heterogeneity, scalability, and energy-aware communication.
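To illustrate the change (drift) detection challenge mentioned above, the following is a minimal sketch of a drift detector in the spirit of the well-known method of Gama et al., not the technique proposed here: it watches the deployed classifier's error stream and signals when the error rate rises well above its historical minimum, at which point the model should be rebuilt from recent data. The smoothing and the 3-sigma threshold are illustrative assumptions.
import math

class DriftDetector:
    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def add(self, error):
        """error: 1 if the model misclassified the latest instance, else 0.
        Returns True when drift is signaled."""
        self.n += 1
        self.errors += error
        p = (self.errors + 1) / (self.n + 2)          # smoothed running error rate
        s = math.sqrt(p * (1 - p) / self.n)
        if p + s < self.p_min + self.s_min:           # remember the best point seen so far
            self.p_min, self.s_min = p, s
        return p + s > self.p_min + 3 * self.s_min

detector = DriftDetector()
errors = [0] * 200 + [1] * 60      # the stream's distribution changes and errors spike
for i, e in enumerate(errors):
    if detector.add(e):
        print("drift detected at instance", i)
        break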
In Figure 31.3, a scenario of anomaly detection over IoT network traffic is depicted. The sensors
and IoT devices will communicate with the rest of the Internet through their respective gateways.
At the gateway, we will apply machine-learning techniques on those traffic data to detect anoma-
lous traffic locally. The learning process will be continuous; feedback from the prediction module
along with the stream of data will be used to build and update the model to detect the anomaly.
Captured data will be IP packets or MQTT (MQTT, http://mqtt.org/) messages. Recall that MQTT
is a simple lightweight publish-subscribe messaging protocol used in IoT which runs on top of TCP/
IP protocol. There will be multiple such gateways with each generating their own views of anoma-
lous traffic detection. Those views can be combined in the cloud for refinement producing a generic
threat model. There might be some gateways where local threat detection will not be conducted.
They will share data with the cloud, and in the cloud an aggregated threat model will be built based
on the data from several such gateways.
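As a minimal sketch of the continuously updated, gateway-local detection described above (not the proposed system itself), the model below keeps an exponentially weighted mean and variance per traffic feature so that it adapts to concept drift; the features, decay, and threshold are illustrative assumptions, and in practice the detector would first be warmed up on known-benign traffic.
import math

class StreamingAnomalyDetector:
    def __init__(self, n_features, decay=0.01, threshold=4.0):
        self.decay = decay                  # higher decay -> faster forgetting
        self.threshold = threshold          # z-score above which traffic is flagged
        self.mean = [0.0] * n_features
        self.var = [1.0] * n_features

    def score(self, x):
        """Return the maximum per-feature z-score for one packet's feature vector."""
        return max(abs(xi - m) / math.sqrt(v + 1e-9)
                   for xi, m, v in zip(x, self.mean, self.var))

    def update(self, x):
        """Fold the observation into the model (test-then-train on the stream)."""
        for i, xi in enumerate(x):
            d = xi - self.mean[i]
            self.mean[i] += self.decay * d
            self.var[i] = (1 - self.decay) * (self.var[i] + self.decay * d * d)

detector = StreamingAnomalyDetector(n_features=3)
for packet in [(60.0, 1.0, 0.0)] * 50:               # warm up on benign traffic
    detector.update(packet)
for packet in [(64.0, 1.0, 0.0), (1500.0, 40.0, 1.0)]:   # e.g., (size, rate, flag)
    if detector.score(packet) > detector.threshold:
        print("anomalous traffic:", packet)           # flags the unusual second packet
    detector.update(packet)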
We will focus on particular communication scenarios in the context of wireless networks’ energy
efficiency. In many cases of WSN deployments, for instance, clusters of spatially proximate nodes
sense and potentially transmit very highly correlated, almost identical values to the sink [VURA06].
The scenario underlying the operation of cooperative transmission networks and related distributed
transmit-beamforming physical layer protocols is analogous: a group of nodes transmits the same
message in a carefully coordinated fashion. In both these scenarios, a group of source nodes has
access to the same or highly correlated pieces of data that need to be sent to a common sink. We
will introduce a novel scheme, encoded sensing (ES) that substantially reduces the energy required
for transmission of data in various types of wireless networks, where these settings hold. For ana-
lytics, a number of heterogeneous models will be maintained where models may be from relational
learning, signal processing, or stochastic processes. Classification will be performed by efficiently
aggregating the combined classifier resulting from heterogeneous models.
REFERENCES
[ATT]. “AT&T on Securing the Internet of Things: A Layered Approach Tackles Sophisticated New Threats,”
CyberTrend, 14 (6), Sandhills Publishing, June 2016.
[BARR08]. M. Barreno, A. Cardenas, J. D. Tygar, “Optimal ROC Curve for a Combination of Classifiers,” In
Proceedings of NIPS 2008, Vancouver, British Columbia, Canada, pp. 57–64, 2008.
[BIEM11]. A. Beimel, “Secret-Sharing Schemes: A Survey.” In WCC’11 Proceedings of the Third International
Conference on Coding and Cryptology, Qingdao, China, pp. 11–46, 2011.
[DU03]. W. Du and Z. Zhan, “Using Randomized Response Techniques for Privacy-Preserving Data Mining,”
In KDD ‘03 Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, Washington, DC, pp. 505-510, August 24–27, 2003.
[MASU11]. M. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “Classification and Novel Class
Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on Knowledge
and Data Engineering, 23 (6), 859–874, 2011.
[SHNA04]. V. Shnayder, M. Hempstead, B. R. Chen, G. W. Allen, M. Welsh, “Simulating the Power Consumption
of Large-Scale Sensor Network Applications,” In ACM ’04: Proceedings of the 2nd International
Conference on Embedded Networked Sensor Systems, Baltimore, MD, pp. 188–200, November 2004.
[SWE02]. L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, 10 (5), 557–570, 2002.
[TOR]. The Tor Project, https://www.torproject.org/.
[VURA06]. M. C. Vuran and I. F. Akyildiz, “Spatial Correlation-Based Collaborative Medium Access Control
in Wireless Sensor Networks,” IEEE/ACM Transactions on Networking, 14 (2), 316–329, 2006.
[WANG03]. H. Wang, W. Fan, P. S. Yu, J. Han, “Mining Concept-Drifting Data Streams Using Ensemble
Classifiers,” In Proceedings of the Ninth ACM SIGKDD KDD, Washington, DC, pp. 226–235, August
24–27, 2003.
32 Big Data Analytics for Malware
Detection in Smartphones
32.1 INTRODUCTION
As stated in Chapter 31, Internet of things (IoT) systems generate massive amounts of data that have
to be managed, integrated, and analyzed to extract useful patterns and trends. However, the perva-
sive nature of these devices is also prone to attack. That is, it is not only the device that is attacked
but also the data that is generated and integrated possibly in a cloud. In Chapter 31, we discussed
some of the security challenges for IoT devices in general. In this chapter, we will focus on a par-
ticular IoT system, that is, a connected smartphone system. These connected smartphone devices
generate massive amounts of data and can be considered to be an IoT system. We discuss how big
data analytics may be applied for detecting malware in smartphones.
The smartphone has rapidly become an extremely prevalent computing platform, with just over
968 million devices sold in 2013 around the world [GART14b], a 36% increase in the fourth quarter
of 2013. In particular, Android devices accounted for 78.4% of the market share, an increase of
12% year-on-year. This popularity has not gone unnoticed by malware authors. Despite the rapid
growth of the Android platform, there are already well-documented cases of Android malware, such
as DroidDream [BRAD11] which was discovered in over 50 applications on the official Android
market in March 2011. Furthermore, a study by [ENCK11] found that Android’s built-in security
features are largely insufficient, and that even nonmalicious programs can (unintentionally) expose
confidential information. A study of 204,040 Android applications conducted in 2011 found 211
malicious applications on the official Android market and alternative marketplaces [ZHOU12]. In
addition, sophisticated Trojans have been reported recently [UNUC13], spreading via mobile bot-
nets. Various researchers around the globe track reported security threats [SOPH14], wherein well
over 300 Android malware families have been recorded.
On the other hand, smartphone apps on App Stores have been on a steady rise, with app down-
load reaching 102 billion in 2013 [GART13] with a total revenue of $26 billion. This shows an ever
increasing popularity in smartphone apps used in a multitude of applications including banking
among others. In addition, private companies, military, and government organizations also develop
apps to be used for processing and storing extremely strategic data, including controlling jets, tanks, or
machine guns. These applications make such apps targets for malicious attacks, where an attacker
can gain information that negatively affects the peace and security of the users or the general popu-
lation at large. This shows that it is prudent to empower app users with an ability to estimate the
security threat of using an app on their smartphone. In addition, there is also a need to educate
developers on various security threats and defense mechanisms and encourage them to incorporate
these into the app design methods. A recent report [GART14c] suggested that by 2016, 25% of the
top 50 banks would have an app. In addition, it is also reported in [GART14a] that 75% of mobile
security breaches result from app misconfiguration.
To address the limitations of current secure mobile platforms, we have been conducting research
as well as infrastructure development efforts in securing the connected smartphones for the past
6 years. In particular, we have designed and developed solutions for behavior-based intrusion detec-
tion/mitigation for mobile smartphones. In addition, we are also investigating privacy aspects for
smartphones as well as integrating our secure mobile computing framework with the cloud. We are
integrating the research in an experimental infrastructure for our students and developing a curricu-
lum for them which will eventually be a part of an IoT education program.
[Figure: Big data analytics for malware detection in smartphones: behavioral feature extraction and analysis, reverse engineering, and the application of data analytics and reverse engineering for smartphones.]
The organization is as follows. Our approach is discussed in Section 32.2. The experimental
evaluation efforts will be discussed in Section 32.3. The infrastructure we are developing is dis-
cussed in Section 32.4. Finally, our education program for connected smartphones will be dis-
cussed. The concepts discussed in this chapter are illustrated in Figure 32.1.
calls that, when monitored and properly processed, provide ample information for understanding the
process behavior. However, system calls represent the lowest level of behavior semantics and mere
aggregation of system calls has inherent limitations in terms of behavior analysis. Instrumental
behavior analysis must involve all levels of the semantic pyramid, from its foundation to applica-
tion program interface (API) functions and to its highest level, that is, functionality defined as
a sequence of operations achieving well-recognized results in the program's environment. In our
approach, functionalities constitute the basis of the behavioral model.
We need to achieve the expressiveness of behavioral signatures, which is crucial for the success of
IDS in detecting new realizations of the same malware. Since most malware incidents are deriva-
tives of some original code, a successful signature must capture invariant generic features of the
entire malware family, that is, the signature should be expressive enough to reflect the most possible
malware realizations. We also need to address possible behavioral obfuscation, that is, attempts to
hide the malicious behavior of software, including the multipartite attacks perpetrated by a coor-
dinated operation of several individually benign codes. This is an emerging threat that, given the
extensive development of behavior-based detection, is expected to become a common feature of
future information attacks. Finally, we need to develop an efficient model building process utilizing
system call data and incorporating unsupervised learning (where no training is required) along with
supervised learning (where training is required), as well as mathematically rigorous and heuristic
data mining procedures. Some of the work we have carried with respect to big data analytics for
malware detection in smartphones will be discussed next.
M(S, G) = DL(G|S) + DL(S)
where G is the entire graph, S is the substructure being analyzed, DL(G|S) is the description length
of G after being compressed by S, and DL(S) is the description length of the substructure being
analyzed. The description length DL(G) of a graph G is the minimum number of bits necessary to
describe G. This framework is not easily extensible to dynamic/evolving streams (dynamic graphs)
because the framework is static in nature. Our work relies on their normative substructure-finding
methods but extends it to handle dynamic graphs or stream data by learning from evolving streams.
Recently we have tested this graph-based algorithm with stream analysis on insider threat data and
observed better results relative to traditional approaches ([PARV11a], [PARV11b]). Therefore, we
intend to apply both unsupervised and supervised learning in the graph-based technique.
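The toy sketch below (not SUBDUE or our system) illustrates the MDL score M(S, G) = DL(G|S) + DL(S): the graph G is compressed by replacing each instance of the substructure S with a single node, and the description lengths are approximated by simple bit counts. The graph, labels, and instances are illustrative assumptions, and the instances are assumed to be non-overlapping.
import math
import networkx as nx

def dl(g):
    """Crude description length: bits for node labels plus bits for edge endpoints."""
    v, e = g.number_of_nodes(), g.number_of_edges()
    labels = max(len({d.get("label") for _, d in g.nodes(data=True)}), 2)
    return v * math.log2(labels) + e * 2 * math.log2(max(v, 2))

def mdl_score(g, sub, instances):
    """M(S, G) = DL(G|S) + DL(S); lower means S compresses G better."""
    compressed = g.copy()
    for k, nodes in enumerate(instances):
        super_node = ("SUB", k)
        compressed.add_node(super_node, label="SUB")
        for n in nodes:
            for nbr in list(compressed.neighbors(n)):
                if nbr not in nodes:                       # keep edges leaving the instance
                    compressed.add_edge(super_node, nbr)
            compressed.remove_node(n)
    return dl(compressed) + dl(sub)

g = nx.Graph()
g.add_nodes_from([(i, {"label": "proc"}) for i in range(6)])
g.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])
sub = g.subgraph([0, 1, 2]).copy()
print(mdl_score(g, sub, instances=[{0, 1, 2}, {3, 4, 5}]))   # smaller than dl(g) alone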
[Figure: Classification framework: behavioral and binary data from new malware or benign executables is stored in a temporal database; feature extraction and selection update the classification model; unknown executables are classified as malware, benign, or a novel class, with novel instances quarantined and analyzed.]
different ways by different techniques, all of which have the same goal: to keep the classification
model up-to-date with the most recent concept ([MASU11a], [MASU11b], [MASU10], [SPIN08],
[HULT01]).
Data stream classifiers can be broadly divided into two categories based on how they update the
classification model, namely, single model incremental approaches [HULT01], and ensemble tech-
niques. Ensemble techniques have been more popular than their single model counterparts because
of their simpler implementation and higher efficiency [MASU11a]. Most of these ensemble tech-
niques use a chunk-based approach for learning ([MASU11b], [MASU10], [SPIN08]), in which they
divide the data stream into chunks, and train a model from each chunk. We refer to these approaches
as “chunk-based” approaches. In our work we investigate both techniques to update models.
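The following is a minimal sketch (not ECSMiner or our framework) of the chunk-based ensemble idea described above: train one model per chunk, keep only the K most recent models, and classify new instances by majority vote. Feature extraction from executables is out of scope here, so the chunks are assumed to be numeric feature vectors with synthetic labels.
from collections import Counter, deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    def __init__(self, max_models=5):
        self.models = deque(maxlen=max_models)     # the oldest model is dropped automatically

    def update(self, X_chunk, y_chunk):
        """Train a new model on the latest labeled chunk and add it to the ensemble."""
        m = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
        self.models.append(m)

    def predict(self, x):
        """Majority vote of the ensemble (0 = benign, 1 = malware)."""
        votes = Counter(int(m.predict([x])[0]) for m in self.models)
        return votes.most_common(1)[0][0]

rng = np.random.default_rng(0)
ensemble = ChunkEnsemble()
for _ in range(3):                                  # simulate three stream chunks
    X = rng.normal(size=(100, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)         # stand-in for benign/malware labels
    ensemble.update(X, y)
print(ensemble.predict(rng.normal(size=4)))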
32.2.4 Risk-Based Framework
Machine learning-based approaches need to be complemented by a risk framework. Certain param-
eters need to be set based on real-life experiments. We believe that we can find optimal values
for each parameter by running different simulations combined with a risk analysis framework.
Essentially, for different domains, the cost of having a false positive and false negative could be dif-
ferent. In addition, software that is used for some critical tasks need to be thoroughly analyzed (e.g.,
a software that is accessing top secret data). On the other hand, we may not need to be as stringent if
the software tested by our tool runs on unclassified data. This observation implies that for different
use cases, we may need to set different parameter values. To create a risk-based parameter setting
framework, we create an interface where a user could enter the information related to the software
that is being tested by our tool. Based on the given information (e.g., what kind of data the software
will be run on, whether it is sandboxed by a virtual machine while it is used), and our previous
runs on real data, we come up with optimal parameters to minimize the risks for given software by
adjusting the false positive and false negative rates of our tool. It should be noted that we have con-
ducted extensive research on risk-based security analysis ([CELI07], [CANI10]) as well as related
areas ([HAML06], [WART11]). We utilize this experience in developing the risk framework.
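As a purely illustrative sketch of this idea (the costs, scores, and labels below are made up, and our actual risk framework is richer), a decision threshold can be chosen to minimize the expected cost, where the relative costs of false positives and false negatives differ per deployment (e.g., stricter for software that touches classified data).
import numpy as np

def best_threshold(scores, labels, cost_fp, cost_fn):
    """scores: malware likelihood in [0, 1]; labels: 1 = malware, 0 = benign."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_cost = 0.5, float("inf")
    for t in np.linspace(0.0, 1.0, 101):
        pred = (scores >= t).astype(int)
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0,   0,   1,    1,   0,    1]
print(best_threshold(scores, labels, cost_fp=1, cost_fn=10))   # critical deployment
print(best_threshold(scores, labels, cost_fp=5, cost_fn=1))    # low-risk deployment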
At the heart of our approach is the classification model (Figure 32.2). This model is incremen-
tally updated with feature vector data obtained from new benign/malware applications. When a
new benign executable or malware instance is chosen as a training example, it is analyzed and its
behavioral and binary patterns are recorded. This recorded data will be stored in a temporal data-
base that holds the data for a batch of N applications at a time. Note that the temporal database can
be stored in the Android device itself in a lightweight way or at the server side. When a batch of
data has been processed, it is discarded, and a new batch is stored. Each batch of data will undergo
a feature extraction and selection phase. The selected features are used to compute feature vectors
(one vector per executable). These vectors are used to update the existing classification model using
an incremental learning technique. When an unknown application appears in the system, at first its
runtime and network behavior (e.g., system call traces) are monitored and recorded. This data then
undergoes a similar feature extraction phase and feature vector is created. This feature vector is
classified using the classification model. Based on the class label, appropriate measures are taken.
We also complement our behavioral-based malware detection algorithms with reverse engineer-
ing techniques so that we can provide a comprehensive solution to the malware evolution problem.
the implementation of each component. In many cases, the experiments simply verify that we are
improving the system with the new part.
We have collected the malware dataset from two sources: one is publicly available (VX
Heavens—http://vx.netlux.org) and the other is a restricted-access repository for malware research-
ers only (Project malfease—http://malfease.oarci.net/), to which we have access. VX Heavens con-
tains close to 80,000 malware samples, and Project malfease also contains around 90,000 malware
samples. Furthermore, these repositories are being enriched with new malware every day. VX
Heavens also serves many malware generation engines with which we may generate a virtually
infinite number of malware samples. Using these malware samples, we can easily construct a data
stream such that new types of malware appear in the stream at certain (not necessarily uniform)
intervals. Evaluation of the data occurs in either or both of the following ways:
1. We partition the dataset into two parallel streams (e.g., a 50–50 division), where one stream
is used to train and update the existing classifier and the other stream is used to evaluate
the performance of the classifier in terms of true positive, false positive, successful novel
class detection rate, and so on. For example, one suitable partitioning simply separates the
stream of odd-indexed members from the even-indexed ones.
2. A single, nonpartitioned stream may be used to train and evaluate as follows. The initial
classification model can be trained from the first n data chunks. From the (n + 1)st chunk onward,
we first evaluate the performance of the classifier on the instances of the chunk (in terms of
true positives, false positives, successful novel class detection, etc.). Then that chunk is used
to update the classifier.
Below we discuss some of the systems we have developed and our initial evaluation. It should be
noted that at present the evaluation is at the component stage. Our ultimate goal is to integrate the
various components and carry out the evaluation of the system as a whole.
in a data mining algorithm based on the analysis desired by the user. The analysis is performed and
a complete report is generated at the end for the user to view. This includes a rated decision to clas-
sify the app as a malware. The complete information as desired by the user can be used for further
analysis or study regarding the app. This virtual lab would be then used by integrating it with cyber
security courses, as shown in Figure 32.3. Our goal is to process user requests in real time or near-real
time and to support multiuser requests simultaneously. For this, we utilize our cluster (which is essen-
tially a cloud) and parallel processing (e.g., NoSQL systems). For example, Spark can be used to
process requests. Spark has faster processing power than its counterparts such as Storm and Hadoop/
MapReduce (see e.g., [SOLA14a], [SOLA14b], and [SOLA14c]).
As is evident, the online system serves parallel analysis requests. Such a deployment would require multiple emulator-based feature extraction pipelines (using dynamic analysis) and an ensemble of stochastic models to be maintained. This would involve an emulator and model management system that would be developed to handle these scenarios, extending our previous work on large-scale vulnerability detection techniques.
[Figure 32.3 components: app store and development app data feeding static analysis, dynamic analysis (taint), and data mining in the virtual lab; report and user analysis; integration with cyber security courses (network security; system security and binary code analysis; big data analytics and management).]
FIGURE 32.3 Architectural diagram for virtual lab and integration with cyber security courses.
32.4.1.3 An Intelligent Fuzzer for Automatic Android GUI Application Testing
The recent proliferation of mobile phones has led to many software vendors shifting their services
to the mobile domain. There are >700,000 Android applications in the Google Play market alone
with over 500 million activated devices [WOMA12]. On the other hand, software inevitably con-
tains bugs (because of its complexity), some of which often lead to vulnerabilities. Identifying the
buggy applications (i.e., apps) in such a huge Android market calls for new, efficient, and scalable
solutions.
application tooling framework. The ViewServer provides a set of commands that can be used to
query the Android WindowManager [GOOG2], which handles the display of UI elements and the
dispatch of input events to the appropriate element.
[Figure: IME sandbox architecture. User space holds the IME or update app, the client app, the policy engine, the checkpoint/rollback logic, and a daemon for communication and message control; kernel space holds the checkpoint/rollback kernel module.]
The checkpoint/rollback logic is used to perform logical checks for each client application component in the case of a rollback. A kernel module installed in kernel space is used by the checkpoint/rollback mechanism to save and restore the states of the client app as needed.
One approach is to use an oblivious IME sandbox that prevents IME applications from leaking
sensitive user input. The key idea is to make an IME application oblivious to sensitive input by
running the application transactionally, wiping off sensitive data from untrusted IME applications
when sensitive input is detected. Specifically, we can checkpoint the state of the IME application
before each input transaction. Then, user input can be analyzed by the policy engine to detect
whether it is sensitive. If it is, the IME application state can be rolled back to the saved checkpoint,
making it oblivious of what the user entered. Otherwise, the checkpoint can be discarded.
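A minimal, purely conceptual Python sketch of this idea (not Android code) is given below; the IME state object and the is_sensitive policy callback are hypothetical stand-ins for the real IME and policy engine:

import copy

class ObliviousIME:
    def __init__(self, ime_state, is_sensitive):
        self.state = ime_state            # whatever mutable state the IME keeps
        self.is_sensitive = is_sensitive  # policy-engine callback

    def handle_input(self, text):
        checkpoint = copy.deepcopy(self.state)   # checkpoint before the input transaction
        self.state.observe(text)                 # let the IME process the input (hypothetical method)
        if self.is_sensitive(text):
            self.state = checkpoint              # roll back: the IME stays oblivious
        # otherwise the checkpoint is simply discarded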
32.4.2 Curriculum Development
32.4.2.1 Extensions to Existing Courses
In order to integrate the virtual lab with these courses, we would design various modules that would be integrated with their existing projects. An overview is provided in Figure 32.5.
[Figure 32.5 maps modules (feature extraction and (un)supervised learning; security and privacy-preserving big data analytics and management; performance overhead of trusted execution environments) to courses (capstone course; big data management and analytics; secure encrypted stream analytics; data and application security; secure cloud computing).]
These modules are derived from our existing research and/or our virtual lab development. For example, the modules "APK file analysis" and "taint analysis" may be integrated with our regular graduate courses "digital forensic analysis" and "system security and binary code analysis." The "feature extraction" module is being integrated with our "big data management and analytics" course.
Below is a sample of our cyber security courses that are relevant to our work on smartphones.
REFERENCES
[BARR14] C. Barreto, J.A. Giraldo, A.A. Cárdenas, E. Mojica-Nava, N. Quijano, “Control Systems for the
Power Grid and Their Resiliency to Attacks,” IEEE Security and Privacy, 12 (6), 15–23, 2014.
[BLAS10] T. Blasing, A.-D. Schmidt, L. Batyuk, S.A. Camtepe, S. Albayrak, “An Android Application
Sandbox System for Suspicious Software Detection,” In 5th International Conference on Malicious and
Unwanted Software (MALWARE 2010), Nancy, France, 2010.
[BRAD11] T. Bradley, DroidDream Becomes Android Market Nightmare. March 2011, http://www.pcworld.
com/article/221247/droiddream_becomes_android_market_nightmare.html.
[BURG11] I. Burguera, U. Zurutuza, S. Nadjm-Tehrani, “Crowdroid: Behavior-Based Malware Detection
System for Android,” In Workshop on Security and Privacy in Smartphones and Mobile Devices 2011—
SPSM 2011, ACM, Chicago, IL, pp. 15–26, October 2011.
[CANI10] M. Canim, M. Kantarcioglu, B. Hore, S. Mehrotra, “Building Disclosure Risk Aware Query
Optimizers for Relational Databases,” In Proceedings of the VLDB Endowment, Singapore, Vol. 3, No. 1,
September 2010.
[CELI07] E. Celikel, M. Kantarcioglu, B.M. Thuraisingham, E. Bertino, “Managing Risks in RBAC Employed
Distributed Environments,” OTM Conferences, November 25–30, Vilamoura, Portugal, pp. 1548–1566,
2007.
[CHAN14] S. Chandra, Z. Lin, A. Kundu, L. Khan, “A Systematic Study of the Covert Channel Attacks in
Smartphones,” 10th International Conference on Security and Privacy in Communication Networks,
Beijing, China, 2014.
[CHUA11] S.L. Chua, S. Marsland, H.W. Guesgen, “Unsupervised Learning of Patterns in Data Streams Using
Compression and Edit Distance,” In Proceedings of the 22nd International Joint Conference on Artificial
Intelligence, Barcelona, Spain, Vol. 2, pp. 1231–1236, 2011.
[EBER07] W. Eberle, L.B. Holder, “Anomaly Detection in Data Represented as Graphs,” Intell. Data Anal., 11 (6),
663–689, 2007.
[ENCK10] W. Enck, P. Gilbert, B.-G. Chun, L.P. Cox, J. Jung, P. McDaniel, A.N. Sheth, “Taintdroid: An
Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones,” In OSDI ’10:
Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, Berkeley,
CA, pp. 1–6, USENIX Association, 2010.
[ENCK11] W. Enck, D. Octeau, P. McDaniel, S. Chaudhuri, “A Study of Android Application Security,”
USENIX Security Symposium, San Francisco, CA, pp. 21–21, 2011.
[ENCK14] W. Enck, P. Gilbert, B.-G. Chun, L.P. Cox, J. Jung, P. McDaniel, A.N. Sheth, “TaintDroid: An
Information Flow Tracking System for Real-Time Privacy Monitoring on Smartphones,” Communications
of the ACM, 57 (3), pp. 99–106, 2014.
[GART13] Gartner. 2013. Gartner Says Mobile App Stores Will See Annual Downloads Reach 102 Billion in
2013. September. http://www.gartner.com/newsroom/id/2592315.
[GART14a] Gartner. Gartner Says 75 Percent of Mobile Security Breaches Will Be the Result of Mobile
Application Misconfiguration. May, 2014. http://www.gartner.com/newsroom/id/2753017.
[GART14b] Gartner. 2014. Gartner Says Annual Smartphone Sales Surpassed Sales of Feature Phones for the
First Time in 2013. February, 2014. http://www.gartner.com/newsroom/id/2665715.
[GART14c] Gartner. 2014. Gartner Says by 2016, 25 Percent of the Top 50 Global Banks Will have Launched
a Banking App Store for Customers. June. http://www.gartner.com/newsroom/id/2758617.
[GREE15] G. Greenwood, E. Bauman, Z. Lin, L. Khan, B. Thuraisingham, “DLSMA: Detecting Location
Spoofing in Mobile Apps,” Technical Report,
University of Texas at Dallas.
[GOOG1] Google. n.d. UI/Application Exerciser Monkey. http://developer.android.com/tools/help/monkey.
html.
[GOOG2] Google. n.d. WindowManager. http://developer.android.com/reference/android/view/
WindowManager.html.
[GUY] R. Guy. n.d. Local server for Android’s HierarchyViewer. https://github.com/romainguy/ViewServer.
[HAML06] K.W. Hamlen, G. Morrisett, F.B. Schneider, “Certified In-Lined Reference Monitoring on .NET,” In Proceedings of the 2006 Workshop on Programming Languages and Analysis for
Security, PLAS 2006, Ottawa, Ontario, Canada, pp. 7–16, 2006.
[HORN11] P. Hornyack, S. Han, J. Jung, S. Schechter and D. Wetherall, “These Aren’t the Droids You’re
Looking For, Retrofitting Android to Protect Data from Imperious Applications,” In CCS, Chicago, IL,
pp. 639–652, 2011.
[HULT01] G. Hulten, L. Spencer, P. Domingos, “Mining Time-Changing Data Streams,” In KDD ‘01
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, San Francisco, CA, pp. 97–106, August 26–29, 2001.
[JIN03] Y. Jin, L. Khan, L. Wang, M. Awad, “Image Annotations by Combining Multiple Evidence & Wordnet,”
In MULTIMEDIA ‘05 Proceedings of the 13th Annual ACM International Conference on Multimedia,
Hilton, Singapore, pp. 706–715, November 6–11, 2005, ACM.
[LIN10] Z. Lin, X. Zhang, D. Xu, “Reverse Engineering Input Syntactic Structure from Program Execution and
Its Applications,” IEEE Transactions on Software Engineering, 36(5), 688–703, 2010.
[MASU10] M.M. Masud, Q. Chen, L. Khan, C. C. Aggarwal, J. Gao, J. Han, B.M. Thuraisingham, “Addressing
Concept-Evolution in Concept-Drifting Data Streams,” In Proceedings of ICDM ’10, Sydney, Australia,
pp. 929–934.
[MASU11a] M.M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “Classification and Novel Class
Detection in Concept-Drifting Data Streams under Time Constraints,” IEEE TKDE, 23(6), 859–874,
2011.
[MASU11b] M.M. Masud, T.M. Al-Khateeb, L. Khan, C.C. Aggarwal, J. Gao, J. Han, B.M. Thuraisingham,
“Detecting Recurring and Novel Classes in Concept-Drifting Data Streams.” In Proceedings of ICDM ’11,
Vancouver, BC, pp. 1176–1181.
[MASU12] M.M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K.W. Hamlen, N.C. Oza, “Facing the Reality
of Data Stream Classification: Coping with Scarcity of Labeled Data,” Knowledge and Information
Systems, 33 (1), 213–244, 2012.
[PARV11a] P. Parveen, J. Evans, B. Thuraisingham, K.W. Hamlen, L. Khan, “Insider Threat Detection Using
Stream Mining and Graph Mining,” In Proceedings of the 3rd IEEE International Conference on
Information Privacy, Security, Risk and Trust (PASSAT 2011), October, Boston, MA, MIT Press, 2011.
[PARV11b] P. Parveen, Z.R. Weger, B. Thuraisingham, K. Hamlen, L. Khan, “Supervised Learning for Insider
Threat Detection Using Stream Mining,” In Proceedings of 23rd IEEE International Conference on Tools
with Artificial Intelligence (ICTAI2011), November 7–9, Boca Raton, FL (Best Paper Award), 2011.
[PORT10] G. Portokalidis, P. Homburg, K. Anagnostakis, H. Bos, “Paranoid Android: Versatile Protection
for Smartphones,” In Proceedings of the 26th Annual Computer Security Applications Conference
(ACSAC’10), pp. 347–356, New York, NY, ACM, 2010.
[ROBO] Robotium. n.d. Robotium, http://robotium.com/.
[SPIN08] E.J. Spinosa, A.P. de Leon, F. de Carvalho, J. Gama, “Cluster-Based Novel Concept Detection in
Data Streams Applied to Intrusion Detection in Computer Networks,” In Proceedings of ACM SAC, pp.
976–980, 2008.
[SHAB10] A. Shabtai, U. Kanonov, Y. Elovici, “Intrusion Detection for Mobile Devices Using the Knowledge-
Based, Temporal Abstraction Method,” Journal of System Software, 83, 1524–1537, August 2010.
[SAHS12] J. Sahs and L. Khan, “A Machine Learning Approach to Android Malware Detection,” Intelligence and
Security Informatics Conference (EISIC), 2012 European. IEEE, Odense, Denmark, pp. 141–147, 2012.
[SOPH14] Sophos. “Security Threat Report,” Sophos, 2014. http://www.sophos.com/en-us/threat-center/
medialibrary/PDFs/other/sophos-security-threat-report-2014.pdf.
[SOUN14] D. Sounthiraraj, J. Sahs, G. Greenwood, Z. Lin, L. Khan, “SMV-Hunter: Large Scale, Automated
Detection of ssl/tls Man-in-the-Middle Vulnerabilities in Android Apps,” In Proceedings of the 19th
Network and Distributed System Security Symposium. San Diego, CA, 2014.
[SOLA14a] M. Solaimani, L. Khan, B. Thuraisingham, “Real-Time Anomaly Detection Over VMware
Performance Data Using Storm,” In The 15th IEEE International Conference on Information Reuse and
Integration (IRI), San Francisco, CA, 2014.
[SOLA14b] M. Solaimani, M. Iftekhar, L. Khan, B. Thuraisingham, J.B. Ingram, “Spark-Based Anomaly
Detection Over Multi-Source VMware Performance Data In Real-Time,” In Proceedings of the IEEE
Symposium Series on Computational Intelligence (IEEE SSCI 2014), Orlando, FL, 2014.
[SOLA14c] M. Solaimani, M. Iftekhar, L. Khan, B. Thuraisingham, “Statistical Technique for Online Anomaly
Detection Using Spark Over Heterogeneous Data from Multi-Source VMware Performance Data,” In the
IEEE International Conference on Big Data 2014 (IEEE BigData 2014), Washington DC, 2014.
[U1] 50 Malware applications found on Android Official Market. http://m.guardian.co.uk/technology/
blog/2011/mar/02/android-market-apps-malware?cat=technology&type=article.
[U2] Google Inc. Android market. https://market.android.com/.
[U3] Juniper Networks Inc, “Malicious Mobile Threats Report 2010/2011,” Technical Report, Juniper
Networks, Inc., 2011.
[U4] R.T. Llamas, W. Stofega, S.D. Drake, S.K. Crook, “Worldwide Smartphone, 2011–2015 Forecast and
Analysis,” Technical Report, International Data Corporation, 2011.
[UNUC13] R. Unuchek, The Most Sophisticated Android Trojan. June, 2013. https://securelist.com/blog/
research/35929/the-most-sophisticated-android-trojan/
[WART11] R. Wartell, Y. Zhou, K.W. Hamlen, M. Kantarcioglu, B.M. Thuraisingham, “Differentiating Code
from Data in x86 Binaries,” ECML/PKDD (3), 522–536, 2011.
[WELC84] T.A. Welch, “A Technique for High-Performance Data Compression,” Computer, 17 (6), 8–19,
1984.
[WOMA12] B. Womack, Google Says 700,000 Applications Available for Android. October, 2012. http://www.
businessweek.com/news/2012-10-29/google-says-700-000-applications-available-for- android-devices.
[ZHOU12] Y. Zhou, Z. Wang, W. Zhou, X. Jiang, “Hey, You, Get Off of My Market: Detecting Malicious Apps
in Official and Alternative Android Markets,” NDSS, 2012.
33 Toward a Case Study in Healthcare for Big Data Analytics and Security
33.1 INTRODUCTION
While the previous two chapters focused on security issues for Internet of Things (IoT) systems and discussed a sample connected smartphone system, in this chapter we discuss a planned case study that we are beginning to explore for healthcare systems, which we consider another example of an IoT system. As the use and combination of multiple big datasets become ubiquitous and the IoT becomes a reality, we increasingly need to deal with large volumes of heterogeneous datasets. Some of these datasets are discrete data points, others are images
or gridded datasets (e.g., from meteorological models or satellites). Some are observations, some are
demographic or social data, and others are business transactions. Some of these datasets are avail-
able for public access; others require varying levels of access control. Some of these datasets need
to be streamed in real time with low latency, others have more relaxed latency requirements. Some
of these datasets are structured, others are unstructured.
The previous chapters in this book have discussed the various concepts and techniques for big
data management and analytics. In addition, we have also applied our techniques for various appli-
cations such as cyber security. We have also discussed various experimental big data systems such
as semantic web-based query processing and cloud-centric assured information sharing. In this
chapter, we will illustrate the key points in big data analytics and security for a particular vertical
domain, that is, healthcare. It should be noted that the solutions we have discussed in this chapter are
yet to be developed. Our purpose is to illustrate how the concepts can be applied to design practical
big data systems. In particular, we will describe a planned case study in this chapter and show how
the big data analytics and security techniques can be applied in the healthcare domain where there
is a need to manage and analyze massive amounts of data securely. It should be noted that while we
have used the Veterans Administration (VA) application as an example in the system that we are
proposing to develop, our system can be applied to any related application.
The organization of this chapter is as follows. The motivation for the planned case study is dis-
cussed in Section 33.2. Some methodologies to be used are discussed in Section 33.3. The limita-
tions of current systems and the high-level design of future systems are discussed in Section 33.4.
This chapter is summarized in Section 33.5. Figure 33.1 illustrates our system architecture.
33.2 MOTIVATION
33.2.1 The Problem
Around 3 years ago, the World Health Organization released a report stating that globally seven
million premature deaths in 2012 were linked to air pollution [WHO]. To pick just one pollutant
as an illustration, the many health impacts of airborne particulate matter (PM) with a diameter of
2.5 microns have been extensively studied; they depend in part on their abundance at ground level
in the atmospheric boundary layer where they can be inhaled. With the increasing awareness of
the health impacts of air quality, there is a growing need to characterize the spatial and temporal
variations of the global abundance of ground-level pollution over the last two decades. Once the
air quality is characterized, it is imperative that we then use this information in a proactive way to
try and prevent further avoidable health issues and to improve policies. Prevention is better than
cure. The VA is the country’s largest healthcare provider, caring for seven million veterans and
their families. We need enhancements to the VA decision support tools in the area of public health
and air quality utilizing hourly in-situ observations from 55 countries, global population density at
10-km resolution, and multiple global NASA Satellite and Earth Science data on a daily basis from
August 1997 to present. With the increasing awareness of the many health impacts of PM ranging
from general mortality to specific respiratory, cardiovascular, cancer, and reproductive conditions,
to name but a few, there is a growing and pressing need to have global daily estimates of the abun-
dance of ground-level air quality.
The existing MyHealtheVet Decision Support Tool is the VA’s Personal Health Record sys-
tem. It was designed for veterans, active duty service members, their dependents, and caregivers.
MyHealtheVet helps veterans partner with their healthcare team. It provides veterans opportunities
and tools to make informed decisions. The four key features of the existing MyHealtheVet tool for
veterans are
1. Keep track of all past and upcoming visits and set preferences on how to receive appoint-
ment reminders.
2. Keep track of medications and refill prescriptions. Veterans see all the details of each
medication: refill status, refill submit date, fill date, medication name, prescription number,
the VA facility where veterans receive medication, and the refills remaining.
3. Secure messaging allows a two-way conversation between veterans and the VA healthcare
team.
4. Healtheliving assessment. The summary report in the Healtheliving assessment shows vet-
erans the positive effect of making changes. With graphic displays, it offers veterans the
chance to see the impact of specific changes.
We believe that the current MyHealtheVet system does not address the needs of the veterans.
Therefore, the following are needed:
1. Enhance the existing MyHealtheVet decision support tool to provide timely alerts when the
current environmental conditions could trigger health incidents for an individual veteran.
For example, one in 12 people (about 25 million or 8% of the U.S. population) has asthma,
including many veterans. Poor air quality can trigger an asthma event. A timely reminder
to carry an inhaler and avoid unnecessary strenuous activity on a day with poor air qual-
ity could preclude an asthma event and emergency room (ER) visit. Likewise, worldwide,
chronic obstructive pulmonary disease (COPD) affects 329 million people or nearly 5% of
the population. In 2011, it ranked as the fourth leading cause of death, killing over 3 mil-
lion people. Veterans with COPD could be sent timely alerts on days with poor air quality.
2. Prepare and manage a prototype logistical planning tool for VA ERs and walk-in clin-
ics. The tool will estimate the likely ER admissions and required supplies based on the observed relationship over the last decade between air quality, VA ER admissions across the entire USA, and supply usage. This tool has to eventually be made operational.
While we have used the air pollution monitoring domain for the planned case study, the tech-
niques that need to be designed and developed are not only applicable to a particular domain but
can span across multiple domains such as cyber security (e.g., analyzing attack data), healthcare,
and geospatial applications. In particular, the framework we plan to develop will be able to accom-
modate geospatial data from disparate sources, preprocess them, and produce actionable insights
via offline and real-time analytics.
privacy and security-aware scientific data storing and retrieval, and (4) deal with automated/semi-
automated analysis of data-enabled knowledge discovery processes.
Such tools will be beneficial to various communities including those working in public health
and air quality. The technology that we need to develop will benefit VA decision support tools that
need to analyze massive amounts of streaming big data with online responses. The study will fur-
ther advance online stream data mining since such a technique provides an ideal platform for testing
innovative stream data analysis ideas.
33.3 METHODOLOGIES
The data sources are diverse (see Figure 33.2) and produce large volumes of data that are high-dimensional and sparse, follow query-centric scientific workload formats spanning multiple DBMSs, and can be either structured or unstructured.
Under these constraints, the ideal choices for data storage and retrieval are array-centric and NoSQL databases, due to their simplicity of design, horizontal scalability, and availability. Conventional relational database management systems are a poor fit for complex analysis of scientific data, and hence
we must leverage specialized database systems such as SciDB. SciDB’s multidimensional array data
model and its ability to natively integrate complex analytics directly with data management make it
an ideal candidate for complex scientific data. Cassandra and HBase are two other viable data stor-
age platforms with optimized read–write performance for massive datasets. Since our framework
needs to support random reads and writes, Hadoop/MapReduce may not be a good choice. We
need to explore the right combination of these different data stores. The data collected and stored
from the various sources must be correlated in such a manner that privacy of sensitive data is not
breached. Therefore, explicit access control policies must be embedded in the querying and retrieval
mechanisms without any substantial impact on performance. In addition, we need to support com-
plex data analytics. Data analytics can be categorized in two ways: real time and offline. Real-time
analytics must be able to process streaming data in near real time to facilitate immediate integration
with decision support systems. Apache Storm is a distributed real-time computation system which
can reliably process fast, large streams of data in arbitrarily complex ways. For offline analytics,
SciDB offers an abstraction layer that separates data analytics from low-level data manipulation
details which can be leveraged to explore new statistical and machine learning techniques that gen-
erate more actionable insights.
Figure 33.3 illustrates the architecture of the methodologies. Scientific data will be first stored
into the NoSQL system as discussed above. For this, data may move from the streaming environment
to the NoSQL system (i.e., persistent storage). In other words, data will move from an online to an
offline environment. Second, our query processing mechanism on scientific data will interact with a
middleware that enforces security and privacy policies. Finally, data analytics has to be carried out
in real time and offline in a scalable manner. It would be desirable to integrate the database systems
and analytics platforms with VA Informatics of the Timely Health Indicators using Remote Sensing
& Innovation for the Vitality of the Environment Medical Environment Engine and VA decision
support tools. In addition, a web portal is needed to allow users to issue relevant queries. The queries
will trigger the real-time analytics and generate responses that can be directly applied to decision-
making, namely, whether an alert should be sent out to patients/authorities given the quality of air
on the present day. The responses produced by the system will follow standard formats to support
integration with standalone messaging platforms such as email servers and text-messaging systems.
values will be associated with latitude/longitude values along with timestamps. In other words, as
time passes, for each unique latitude/longitude value, we will get multiple versions of parameter
values. On the other hand, health-related records maintained by the VA informatics may be unstruc-
tured. We assume that our scientific data will follow a write-once, read-many-times model since we
do not expect the past observations to be changed frequently. Furthermore, some data are continu-
ously arriving (streaming).
Therefore, the first challenge is to store these heterogeneous data efficiently (storage). These
scientific data can be accessed/queried by latitude/longitude values and/or timestamps. In addition,
a query can be posed via a set of parameter values. Hence, the next challenge is to retrieve relevant
data efficiently based on queries from storage (query processing). Storage and query processing are
intertwined.
Our goal is to use open-source software with commodity hardware without sacrificing perfor-
mance. For this, we will investigate NoSQL and in-memory databases. We need to come up with an
efficient mechanism to store and handle large, ever-growing scientific data on commodity hardware.
data, geospatial data, sensor data, and other multifaceted complex data [ARRA14]. The array
model also accelerates linear algebra operations by 10–100 times. SciDB’s physical storage model
is customized for both sparse and dense data which translates to faster data access. SciDB-R and
SciDB-py libraries allow analysts to use the languages R and Python to leverage the native analyti-
cal power of SciDB. SciDB’s all-in-one, seamlessly integrated analytics package is unmatched in
other contemporary big data frameworks.
Our scientific data is multidimensional (e.g., each longitude/latitude value is associated with
multiple environmental parameters and multiple versions of those parameters). SciDB uses a mul-
tidimensional array as its basic storage and processing unit. Storage of these arrays is handled by a
partitioning scheme called chunking [SCID14] that distributes subsets of the array uniformly across
all instances of the database where each instance is responsible for storing and updating that subset
locally. Chunk assignment follows a hash-based scheme. Chunks that are frequently accessed are
maintained in-memory to speed up querying and computation.
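The following toy Python sketch (not SciDB code) illustrates this chunking idea under simple assumptions: a three-dimensional (latitude, longitude, time) array is split into fixed-size chunks, and each chunk is hashed to one of a fixed number of database instances. The chunk lengths and instance count are illustrative only:

CHUNK = (10, 10, 24)      # chunk lengths along the lat, lon, and time dimensions
N_INSTANCES = 4           # number of database instances

def chunk_coords(lat_idx, lon_idx, t_idx):
    # Map a cell's integer coordinates to the coordinates of the chunk containing it.
    return (lat_idx // CHUNK[0], lon_idx // CHUNK[1], t_idx // CHUNK[2])

def instance_for(chunk):
    # Hash-based assignment of a chunk to an instance, which stores and updates it locally.
    return hash(chunk) % N_INSTANCES

# Example: the reading at cell (lat=123, lon=456, t=7890) falls in chunk (12, 45, 328),
# which is stored on instance_for((12, 45, 328)).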
In-memory database SAP HANA can process queries faster and provide instant answers to com-
plex queries; however, since it is proprietary, the total cost of ownership is high.
Apache Spark [ZAHA12] is a fast and general engine for large-scale data processing including
stream data. It can run programs up to 100× faster than Hadoop/MapReduce in memory or 10×
faster on a disk. However, Spark is still an Apache incubator project and at this point its machine-
learning library MLlib does not cover a significant spectrum of analytical tools. In contrast, SciDB
is a mature, complete, and stable product that offers much higher flexibility and richness in analyt-
ics. Furthermore, in terms of adoption by industries and stability, Apache Storm [STOR14] remains
ahead of Spark.
abundance. These analytics can be carried out based on statistical analytics and/or real-time analyt-
ics (see Section 33.3). All analyses will be based on whatever relevant data was fetched. Hence, data
are moving from one component (e.g., NoSQL store) to another component (e.g., analytics).
33.4.2 Privacy and Security Aware Data Management for Scientific Data
33.4.2.1 The Problem and Challenges
Over the years, there has been much work (including our own) toward developing privacy- and
security-aware data management solutions for large data. Unfortunately, none of the existing work
tries to capture the entire life cycle of the scientific data. Basically, existing works try to address
the security issues that arise in different types of NoSQL systems. For example, in our past work
[THUR10], we developed a system that combines the HDFS [BORT10] with Hive [HIVE14] to pro-
vide secure storage and processing solutions for very large datasets. Furthermore, we use an XACML
[MOSE05] policy-based security mechanism to provide fine-grained access control over the data.
As discussed before, traditional users of large scientific datasets need to use multiple data process-
ing tools. For example, the data may be collected via a stream processing system using different
types of sources with security and privacy requirements and then it may be stored on different data
storage systems such as SciDB and HBase/Cassandra. Therefore, any security- and privacy-aware
data management system should track the data and how it is moved and processed.
We need to create a simple yet powerful security- and privacy-aware data management layer
specifically tailored for scientific data processing. Instead of trying to change each data manage-
ment tool, we need to build a security middleware on top of the existing systems that can track and
enforce policies across multiple data storage platforms. Unlike other existing work, including NSF
SATC funded project CNS-1237235 on privacy-preserving research data sharing and our own NSF
SATC funded project CNS-1228198 on distributed policy enforcement among data management
systems belonging to different organizations, our focus in this project is on addressing security and
privacy challenges in systems that belong to the same organization. Here, we do not consider the
security and policy enforcement issues in sharing research data across multiple institutions.
Given these rich sets of tools that have different data schemas, we need to develop security and
privacy management tools that can be used across multiple systems. Furthermore, we cannot afford
to change each and every system to enforce the needed security and privacy policies since these
systems change frequently and new data management tools are continuously being developed.
• Single source derivation: If a source Si is annotated with security label “si” and data mining
task Tj is using the output of Si and is not a “sanitization task,” then Tj output data is also
annotated with security label “si.”
• Sanitization task output derivation: If a task Tj is a sanitization task, regardless of its inputs’
security labels, the output of task Tj is labeled based on the sanitization definition of the
task (e.g., given a task that sanitizes patient data according to safe harbor rules of HIPAA,
the output of the task may be considered not privacy-sensitive anymore).
• Multisource derivation: If a task Tj uses inputs from different nodes (in this context, a node could be a data source or the output of a previous task) Ni1, Ni2, …, Nik, each respectively annotated with tags “si1,” “si2,” …, “sik,” then task Tj is tagged with security label “sx,” where “sx” is the tag with the highest priority among the list of security labels.
The above security label propagation rules could be considered conservative since we tag the
output of a task with the highest security label associated with potential inputs. However, based on
the application requirements the rules can be suitably modified. That is, a domain expert can restate
how the security labels are propagated in domain-specific cases.
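A minimal Python sketch of these propagation rules is given below, assuming an illustrative three-level label hierarchy; the label names, priorities, and sanitization handling are hypothetical and would be restated by a domain expert in practice:

PRIORITY = {"public": 0, "internal": 1, "phi": 2}   # higher value = more restrictive label

def propagate_label(input_labels, sanitized_label=None):
    # Return the security label of a task's output.
    if sanitized_label is not None:
        # Sanitization task: the output label is fixed by the task's definition,
        # regardless of the input labels (e.g., HIPAA safe-harbor de-identification).
        return sanitized_label
    # Single-source or multisource derivation: take the highest-priority input label.
    return max(input_labels, key=lambda label: PRIORITY[label])

# Example: a mining task over a "phi"-labeled source and a "public" source produces
# "phi"-labeled output, unless the task is declared a sanitization task.
print(propagate_label(["phi", "public"]))   # -> "phi"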
In our previous work [RACH12], we used a similar label propagation mechanism to understand
how sensitive information flows in a typical workflow setting. To address similar challenges, we
created a Web Ontology Language [OWL12] datatype property called tag, whose domain is the class tuple and whose range is an integer value (xsd:integer). We used Semantic Web Rule Language [RACH12]
rules for the propagation of the security labels. As a part of this work, we will explore how to effi-
ciently capture such rules.
Using the above workflow tagging mechanisms, we plan to implement traditional mandatory
control policies and basic role-based access control (RBAC) policies easily without changing the
internals of the underlying systems. Basically, we use a thin wrapper to provide access control for
the underlying data storage systems. We assume that each data source (e.g., array in SciDB or bolt
in a streaming data system such as Storm) defined will be also associated with a security label.
In addition, any sanitization tasks should be defined by the system administrator. To simplify the
implementation, we assume that new tasks can only be issued by users who have administrative
access to the entire system. These admin users could then define basic RBAC policies to allow fine-
grained access control to data mining results.
The solution discussed could be seen as a straightforward application of RBAC where the output
of each task in the data mining workflow is seen as a separate object. Given the initial security
labels and propagation rules, the security label of the entire workflow will be automatically inferred
and policies will be automatically enforced. At the same time, each user in the system is assigned to certain roles. Finally, each role is associated with different security labels that it can access by setting role to security label mappings. Clearly, such an approach can be used to implement basic mandatory
control policies as well as RBAC policies.
datasets. Therefore, enhancing the power of existing Markov logic technology is a key subgoal of
this project.
\Pr(x) = \frac{1}{Z} \exp\left( \sum_{i} w_i N_i(x) \right), \quad \text{where} \quad Z = \sum_{x} \exp\left( \sum_{i} w_i N_i(x) \right)
where Ni(x) is the number of groundings of fi that are true in x and Z is a normalization constant, also
called the partition function, which ensures that the distribution sums to 1.
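The following brute-force Python sketch illustrates this distribution for a tiny, already-grounded model with two ground atoms and one weighted ground formula; the rule and its weight are hypothetical, in the spirit of the "high levels of PM2.5 implies asthma attack" rule discussed below:

import itertools, math

atoms = ["HighPM25", "AsthmaAttack"]
# Weighted ground formulas: (weight, function returning True when the formula is satisfied).
formulas = [
    (1.5, lambda w: (not w["HighPM25"]) or w["AsthmaAttack"]),   # HighPM25 => AsthmaAttack
]

def unnormalized(world):
    # exp(sum of weights of satisfied formulas), i.e., exp(sum_i w_i N_i(x)) for this toy model
    return math.exp(sum(wt for wt, f in formulas if f(world)))

worlds = [dict(zip(atoms, values))
          for values in itertools.product([False, True], repeat=len(atoms))]
Z = sum(unnormalized(w) for w in worlds)                         # the partition function
Pr = {tuple(w.values()): unnormalized(w) / Z for w in worlds}    # Pr(x) for every world x
# The world (HighPM25=True, AsthmaAttack=False) gets the lowest probability,
# since it is the only one that violates the weighted rule.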
The two key tasks in Markov logic are weight learning, which is the task of learning from data
the weights attached to the first-order logic formulas, and inference, which is the task of answering queries posed over the learned model given observations (evidence). For instance, in the social network given above, an example inference task is computing the probability that Ana has cancer, given that she is friends with Bob, who smokes (evidence); the learning task is updating the weights attached to the two first-order formulas given data. For both of these tasks, we could use and enhance the power of Alchemy 2.0 [GOGA13], a state-of-the-art open-source software package for learning and inference in Markov logic. Alchemy 2.0, which is maintained and developed by Dr. Gogate and his team at UT Dallas, has been downloaded over 1800 times, and its predecessor Alchemy 1.0
[KOK08], which is maintained and developed by the University of Washington, has been downloaded more than 15,000 times. In particular, we plan to use the weight learning approaches to see whether high levels of PM2.5 correlate with certain diseases by examining the learned weights of rules of the form “high levels of PM2.5 imply an asthma attack.”
Alchemy 2.0 is based on a concept called lifted probabilistic inference ([GOGA11b], [POOL03]);
the idea is to perform inference at the more compact first-order or lifted level rather than at the
propositional level. Propositional algorithms essentially ground the MLN yielding a Markov net-
work and use probabilistic inference algorithms that do not take advantage of relational structure to
perform inference. For instance, consider a friends-smokers social network having 1 million (= 10^6) people. In this case, the second first-order formula will yield 10^12 groundings and as a result infer-
ence at the propositional (ground) level is clearly infeasible (existing inference and learning algo-
rithms for graphical models do not scale to such levels). However, in many cases, lifted inference
can take advantage of symmetries in the first-order representation and answer such queries without
generating all of the 10^12 groundings. Thus, Alchemy 2.0, which uses lifted inference, is potentially
more scalable than Alchemy 1.0.
Alchemy 1.0 and 2.0 have been applied in several domains with promising results, for example,
citation matching [SING06], link prediction ([GOGA11b], [RICH06]), and information extraction
[POON07]. Inspired by this scientific progress, we need to integrate Alchemy 2.0 into a full-blown
application which has all the characteristics of a complex machine learning system: operates online;
requires personalized solutions; is rich and diverse; and so on. Thus, our approach will serve as an
ideal test-bed for testing and evaluating the capabilities of Markov logic and Alchemy 2.0.
1. Existing MLN inference tools perform poorly on applications involving mixed continuous
and discrete datasets.
2. Learning algorithms do not scale well to massive datasets, involving billions of entries,
especially when the Markov network associated with the MLN is densely connected as in
our application.
3. Existing tools do not support scalable online inference, a must have for real-time analytics.
In the next three subsections, we outline an approach that addresses these three limitations.
([GOGA11b], [JHA10]). For other variables and when inference is not possible in closed form, we
need to explore using dynamic discretization and particle-based strategies described in [IHLE09]. In
prior work, we extended advanced, Rao-Blackwellised importance sampling algorithms that com-
bine exact inference and sampling to mixed discrete and continuous Gaussian domains ([GOGA05]).
We need to lift this and other sampling algorithms by using lifted inference rules developed in our
previous work ([GOGA11a], [GOGA12a], [JHA10], [VENU12]). The main idea is to replace the
propositional sampling step which samples individual random variables in the Markov network by a
lifted sampling step which partitions the set of variables into several groups each containing a set of
indistinguishable random variables and samples each group in one go. When the number of groups,
which are identified by looking at symmetries in the first-order representation without grounding
the MLN, is small, this approach can be quite efficient.
example, one condition that we can enforce is that the size of the lifted AND/OR graph ([DECH07],
[GOGA10]) after processing evidence is bounded by a constant.
Unfortunately, exact compilation techniques, such as the one described above, will not scale well
to large datasets. Therefore, we need to develop approximate compilation techniques which com-
press and store the output of sampling-based algorithms [GOGA12b].
having a few labeled instances and a large number of unlabeled instances. This model is built as
micro-clusters using a semisupervised clustering technique and classification is performed using
k-nearest neighbor algorithms. An ensemble of these models is used to classify unlabeled data.
During semisupervised clustering in our former work ([MASU11a], [MASU08]), we assigned a
penalty when instances having the same class label belonged to different clusters. We need to extend
the constraints beyond class labels, for example, we need to take into consideration spatial locality.
We could also utilize subspace clustering instead of k-means. Since data are high dimensional, a set
of points may form a cluster in a subset of dimensions instead of across all dimensions. Identification
of such clusters is the goal of subspace clustering ([AHME10], [JING07]).
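As an illustration, the following Python sketch builds micro-clusters from one partially labeled chunk with k-means, labels each micro-cluster from its labeled members, and classifies a new instance by its nearest labeled centroid. It omits the constraint penalties, spatial locality, and the ensemble itself, and all parameters are illustrative:

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def build_microclusters(X, y, k):
    # y holds a class label for the few labeled instances and None for the unlabeled ones.
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    centers, labels = km.cluster_centers_, []
    for c in range(k):
        members = [y[i] for i in np.where(km.labels_ == c)[0] if y[i] is not None]
        labels.append(Counter(members).most_common(1)[0][0] if members else None)
    return centers, labels

def classify(x, centers, labels):
    # k-nearest-neighbor style classification over micro-cluster centroids (here, 1-NN).
    order = np.argsort(np.linalg.norm(centers - x, axis=1))
    return next(labels[c] for c in order if labels[c] is not None)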
In addition, we need to consider efficient techniques for building traditional logistic regression
models that include PM2.5 levels and health conditions. Using log-likelihood models, we need to
explore whether high PM2.5 levels are highly correlated with certain health conditions.
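The sketch below, using synthetic data, shows the kind of logistic regression analysis we have in mind: the PM2.5 coefficient and the model log-likelihood indicate how strongly high PM2.5 levels are associated with a health outcome. All variables and values here are synthetic, purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pm25 = rng.uniform(5, 80, size=500)                 # synthetic daily PM2.5 levels
temp = rng.uniform(0, 35, size=500)                 # a second, synthetic covariate
p = 1 / (1 + np.exp(-(0.05 * pm25 - 3)))            # synthetic risk that rises with PM2.5
y = rng.binomial(1, p)                              # 1 = asthma-related ER visit occurred
X = np.column_stack([pm25, temp])

clf = LogisticRegression(max_iter=1000).fit(X, y)
pm25_coefficient = clf.coef_[0][0]                  # large positive value => higher odds with PM2.5
log_likelihood = np.sum(np.log(clf.predict_proba(X)[np.arange(len(y)), y]))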
stream data as tuples from several queuing systems (e.g., Kafka and RabbitMQ) and emit those
tuples to bolts. In addition, raw data will be moved from the online environment to persistent storage (the NoSQL system) using a bolt. In the analytics bolt, we need to implement our prediction and classification tasks, where multiple bolt instances will run in parallel.
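The following plain-Python sketch only illustrates the spout-to-bolt data flow described above; real Storm topologies are typically written in Java or via Storm's multi-lang protocol, and the queue reader, NoSQL store, and classifier used here are hypothetical stand-ins:

def spout(queue_reader):
    # Emit stream data as tuples read from a queuing system (e.g., Kafka or RabbitMQ).
    for message in queue_reader:
        yield tuple(message.split(","))      # toy parsing of a CSV-style message

def storage_bolt(tuples, nosql_store):
    # Move raw data from the online stream to persistent (NoSQL) storage, then pass it on.
    for t in tuples:
        nosql_store.write(t)
        yield t

def analytics_bolt(tuples, classify):
    # Run the prediction/classification tasks; Storm would run many such bolt instances in parallel.
    for t in tuples:
        yield classify(t)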
REFERENCES
[WHO] World Health Organization, http://www.who.int/mediacentre/news/releases/2014/air-pollution/en/.
[ACCU11] Apache Accumulo, https://accumulo.apache.org/.
[AGGA2009] C.C. Aggarwal, “On Classification and Segmentation of Massive Audio Data Streams,”
Knowledge and Information System, 20, 137–156, 2009.
[AHME10] M.S. Ahmed, L. Khan, M. Rajeswari, “Using Correlation Based Subspace Clustering for Multi-
label Text Data Classification,” ICTAI Arras, France, pp. 296–303, 2010.
[ARRA14] Why an array database. http://www.paradigm4.com/why-an-array-database/.
[BESA75] J. Besag, “Statistical Analysis of Non-Lattice Data,” The Statistician, 24, 179–195, 1975.
[BIFE009] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, R. Gavaldà, “New Ensemble Methods for Evolving Data Streams,” In Proceedings of ACM SIGKDD 15th International Conference on Knowledge
Discovery and Data Mining, Paris, France, pp. 139–148, 2009.
[BORT10] D. Borthakur, “HDFS,” http://hadoop.apache.org/common/docs/current/hdfs_design.html, 2010.
[BROE11] G. Van den Broeck, N. Taghipour, W. Meert, J. Davis, L. De Raedt, “Lifted Probabilistic Inference
by First-Order Knowledge Compilation,” In Proceedings of the 22nd International Joint Conference on
Artificial Intelligence, Jul. 16–22, Barcelona, Catalonia, Spain, pp. 2178–2185, 2011.
[CARS10] D. Carstoiu, A. Cernian, A. Olteanu, “Hadoop Hbase-0.20.2 Performance Evaluation,” In New
Trends in Information Science and Service Science (NISS), Gyeongju, South Korea, pp. 84–87, 2010.
[CHAN08] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R.
Gruber, “Bigtable: A Distributed Storage System for Structured Data,” ACM Transactions on Computer
Systems, 26 (2), Article No. 4, 2008.
[CHEN08] S. Chen, H. Wang, S. Zhou, P. Yu, “Stop Chasing Trends: Discovering High Order Models in
Evolving Data,” In Proceedings of IEEE 24th International Conference on Data Engineering (ICDE),
Cancun, Mexico, pp. 923–932, 2008.
[CHOW68] C. Chow and C. Liu. “Approximating Discrete Probability Distributions with Dependence Trees.”
Information Theory, IEEE Transactions 14 (3), 462–467, 1968.
[DARW02a] A. Darwiche and P. Marquis, “A Knowledge Compilation Map.” Journal of Artificial Intelligence
Research, 17, 229–264, 2002.
[DARW02b] A. Darwiche, “A Logical Approach to Factoring Belief Networks,” In Proceedings of the 8th
International Conference on Principles and Knowledge Representation and Reasoning, Apr. 22–25,
Toulouse, France, pp. 409–420, 2002.
[DARW03] A. Darwiche, “A Differential Approach to Inference in Bayesian Networks,” Journal of the ACM,
50, 280–305, 2003.
[DEAN04] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” In
Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), San
Francisco, CA, 2004.
[DECA07] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S.
Sivasubramanian, P. Vosshall, W. Vogels, “Dynamo: Amazon’s Highly Available Key-Value Store,”
In Proceedings of the 21st ACM Symposium on Operating System Principles (SOSP), Stevenson,
Washington, DC, 2007.
[DECH07] R. Dechter and R. Mateescu, “AND/OR Search Spaces for Graphical Models,” Artificial
Intelligence, 171 (2–3), 73–106, 2007.
[DOMI09] P. Domingos and D. Lowd, Markov Logic: An Interface Layer for Artificial Intelligence, Morgan
& Claypool, San Rafael, CA, 2009.
[FAN04] W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In Proceedings of
ACM SIGKDD 10th International Conference on Knowledge Discovery and Data Mining, Seattle, WA,
Aug. 22–25, pp. 128–137, 2004.
[GAO07] J. Gao, W. Fan, J. Han, “On Appropriate Assumptions to Mine Data Streams,” In Proceedings of
IEEE 7th International Conference on Data Mining (ICDM), Omaha, NE, pp. 143–152, 2007.
[GETO07] L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, MIT Press,
Cambridge, MA, 2007.
[GHEM06] S. Ghemawat, H. Gobioff, S. Leung, “The Google File System,” In Proceedings of 19th ACM
Symposium on Operating Systems Principles (SOSP), Lake George, NY, 2003.
[GOGA05] V. Gogate and R. Dechter, “Approximate Inference Algorithms for Hybrid Bayesian Networks with
Discrete Constraints,” In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence,
AUAI Press, Edinburgh, Scotland, pp. 209–216, 2005.
[GOGA10] V. Gogate and P. Domingos, “Exploiting Logical Structure in Lifted Probabilistic Inference,” In
AAAI 2010 Workshop on Statistical Relational Learning, Atlanta, GA, 2010.
[GOGA11a] V. Gogate and P. Domingos, “Approximation by Quantization,” In Proceedings of the 27th
Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, pp. 247–255, 2011.
[GOGA11b] V. Gogate and P. Domingos, “Probabilistic Theorem Proving,” In Proceedings of the 27th
Conference on Uncertainty in Artificial Intelligence, pp. 256–265, 2011.
[GOGA12a] V. Gogate, A. Jha, D. Venugopal, “Advances in Lifted Importance Sampling,” In Communications
of the ACM, 59 (7), 107–115, 2012.
[GOGA12b] V. Gogate and R. Dechter, “Importance Sampling-Based Estimation Over AND/OR Search
Spaces for Graphical Models,” Artificial Intelligence, 184–185, 38–77, 2012.
[GOGA13] V. Gogate and D. Venugopal, “The Alchemy 2.0 System for Statistical Relational AI,” Technical
Report, Department of Computer Science, The University of Texas at Dallas, Richardson, TX, 2013.
https://code.google.com/p/alchemy-2/.
[HASH09] S. Hashemi, Y. Yang, Z. Mirzamomen, M. Kangavari, “Adapted One-Versus-All Decision Trees
for Data Stream Classification,” IEEE Transactions on Knowledge and Data Engineering, 21 (5), 624–
637, 2009.
[HIVE14] Apache Hive. http://wiki.apache.org/hadoop/Hive.
[HUSA11a] M.F. Husain, J.P. McGlothlin, L. Khan, B.M. Thuraisingham, “Scalable Complex Query
Processing over Large Semantic Web Data Using Cloud,” IEEE CLOUD, 187–194, 2011.
[HUSA11b] M. Husain, M.M. Masud, J. McGlothlin, L. Khan, “Greedy Based Query Processing for Large
RDF Graphs Using Cloud Computing,” IEEE Transactions on Knowledge and Data Engineering, 23(9),
1312–1327, 2011.
[IHLE09] A. Ihler, A. Frank, P. Smyth, “Particle-Based Variational Inference for Continuous Systems,” In
Proceedings of NIPS 2009, Vancouver, British Columbia, Canada, pp. 826–834, 2009.
[IMPA14] Cloudera Impala. http://en.wikipedia.org/wiki/Cloudera_Impala.
[JHA10] A. Jha, V. Gogate, A. Meliou, D. Suciu, “Lifted Inference from the Other Side: The Tractable
Features,” In Proceedings of the 24th Annual Conference on Neural Information Processing Systems,
Vancouver, Canada, pp. 973–981, 2010.
[JING07] L. Jing, M.K. Ng, and J.Z. Huang, “An Entropy Weighting K-Means Algorithm for Subspace
Clustering of High-Dimensional Sparse Data,” IEEE Transactions on Knowledge and Data Engineering,
19 (8), 1026–1041, 2007.
[KOOLL09] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT
Press, Cambridge, MA, 2009.
[KAUT97] H. Kautz, B. Selman, and Y. Jiang, “A General Stochastic Approach to Solving Problems with
Hard and Soft Constraints,” The Satisfiability Problem: Theory and Applications, D. Gu, J. Du, P.
Pardalos, editors, American Mathematical Society, New York, NY, pp. 573–586, 1997.
[KOK08] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, D. Lowd, J. Wang, and P. Domingos, “The
Alchemy System for Statistical Relational AI,” Technical Report, Department of Computer Science and
Engineering, University of Washington, Seattle, WA, 2008. http://alchemy.cs.washington.edu.
[LAKS09] A. Lakshman and P. Malik, “Cassandra: Structured Storage System on a P2P Network,” In
Proceedings of 28th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing
(PODC), Calgary, Alberta, Canada, 2009.
[LERN02] U. Lerner, Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford
University, 2002.
[LIU01] J.S. Liu, Monte Carlo Strategies in Scientific Computing, Springer-Verlag, New York, 2001.
[MASU08] M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “A Practical Approach to Classify
Evolving Data Streams: Training with Limited Amount of Labeled Data,” In Proceedings of 2008 IEEE
International Conference on Data Mining (ICDM 2008), Pisa, Italy, pp. 929–934, December, 2008.
(Acceptance Rate: 19.9%).
[MASU09a] M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Integrating Novel Class Detection with
Classification for Concept-Drifting Data Streams,” In Proceedings of European Conference on Machine
Learning and Knowledge Discovery in Databases (ECML PKDD), Bled, Slovenia, pp. 79–94, 2009.
[MASU11a] M.M. Masud, J. Gao, L. Khan, J. Han, K.W. Hamlen, N.C. Oza, “Facing the Reality of Data Stream Classification: Coping with Scarcity of Labeled Data,” International Journal of Knowledge and Information Systems (KAIS), 33 (1), 213–244, Springer, 2011.
[MASU11b] M.M. Masud, J. Gao, L. Khan, J. Han, and B.M. Thuraisingham, “Classification and Novel
Class Detection in Concept-Drifting Data Streams under Time Constraints,” IEEE Transactions on
Knowledge and Data Engineering, 23 (6), 859–874, 2011.
[MCGL10] J.P. McGlothlin and L.R. Khan, Materializing and Persisting Inferred and Uncertain Knowledge
in RDF Datasets. AAAI, Atlanta, GA, 2010.
[MOSE05] Tim Moses, “eXtensible Access Control Markup Language (XACML) Version 2.0,” http://docs.
oasis-open.org/xacml/2.0/access_control-xacml-2.0-core-spec-os.pdf, 2005.
[OWL12] OWL, Web Ontology Language, http://www.w3.org/TR/owl2-quick-reference/.
[PATT13] E. Pattuk, M. Kantarcioglu, V. Khadilkar, H. Ulusoy, S. Mehrotra, “Bigsecret: A Secure Data
Management Framework for Key-Value Stores,” In IEEE CLOUD, Santa Clara, CA, 2013.
[PIG14] Apache Pig. https://cwiki.apache.org/confluence/display/PIG/Index.
[POOL03] D. Poole, “First-Order Probabilistic Inference,” In Proceedings of the 18th International Joint
Conference on Artificial Intelligence, Morgan Kaufmann, Acapulco, Mexico, pp. 985–991, 2003.
[POON07] H. Poon and P. Domingos, “Joint Inference in Information Extraction,” In Proceedings of the 22nd
National Conference on Artificial Intelligence, AAAI Press, Vancouver, Canada, pp. 913–918, 2007.
[RACH12] J. Rachapalli, M. Kantarcioglu, B. Thuraisingham, “Tag-Based Information Flow Analysis for
Document Classification in Provenance,” In USENIX TAPP Workshop, Boston, MA, 2012.
[RAHM14] T. Rahman, P. Kothalkar, V. Gogate, “Cutset Networks: A Simple, Tractable, and Scalable
Approach for Improving the Accuracy of Chow-Liu Trees,” In Proceedings of the 31st International
Conference on Machine Learning, Beijing, China, 2014. JMLR.
[RICH06] M. Richardson and P. Domingos, “Markov Logic Networks,” Machine Learning, 62, 107–136,
2006.
[ROY10] I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, E. Witchel, “Airavat: Security and Privacy for
Mapreduce,” In USENIX, San Jose, CA, pp. 20–20, 2010.
34.1 INTRODUCTION
While big data management and analytics (BDMA) is evolving into a field called data science, with significant progress over the past 5 years and various courses being taught at universities, there
is still a lot to be done. We find that many of the courses are more theoretical in nature and are
not integrated with real-world applications. Furthermore, big data security and privacy (BDSP) is
becoming a critical need and there is very little being done not only on research, but also on edu-
cation programs and infrastructures for BDSP. For example, BDMA techniques on personal data
could violate individual privacy. With the recent emergence of the quantified self (QS) movement,
personal data collected by wearable devices and smartphone apps is being analyzed to guide users
in improving their health or personal life habits. This data is also being shared with other service
providers (e.g., retailers) using cloud-based services, offering potential benefits to users (e.g., infor-
mation about health products). But such data collection and sharing are often being carried out with-
out the users’ knowledge, bringing grave danger that the personal data may be used for improper
purposes. Privacy violations could easily get out of control if data collectors could aggregate finan-
cial and health-related data with tweets, Facebook activity, and purchase patterns. While some of
our research is focusing on privacy protection in QS applications and controlling access to the data,
education and infrastructure programs in BDSP are yet to be developed.
To address the limitations of BDMA and BDSP experimental education and infrastructure pro-
grams, we are proposing to design such programs at The University of Texas at Dallas and the pur-
pose of this chapter is to share our plans. Our main objectives are the following: (1) to train highly
qualified students to become expert professionals in big data management and analytics and data
science. That is, we will be developing a course on big data management and analytics integrated
with real-world applications as a capstone course. (2) To leverage our investments in BDSP research,
BDMA research and education, and cyber security education to develop a laboratory to carry out
hands-on exercises for relevant courses as well as a capstone course in BDSP, including extensive
experimental student projects to support the education.
To address the objectives, we are assembling an interdisciplinary team with expertise in big data
management and mining, machine learning, atmospheric science, geospatial data management, and
data security and privacy to develop the programs. Essentially, our team consists of computer and
information scientists who will develop the fundamental aspects of the courses together with applica-
tion specialists (e.g., atmospheric scientists) who will develop the experimental aspects of the courses as
well as provide the data for the students to carry out experiments on. Specifically, we will be utilizing
the planned case study discussed in Chapter 33 to design our education and experimental programs.
This chapter is organized in the following way. In Section 34.2, we will discuss some of our rel-
evant current research and infrastructure development activities in BDMA and BDSP. Our new pro-
grams will be built utilizing these efforts. In Section 34.3, we describe our plan for designing
a program in BDMA. In Section 34.4, we describe our plan for designing a program in BDSP. This
chapter is summarized in Section 34.5. Figure 34.1 illustrates our plan for developing a curriculum
on integrated data science and cyber security.
FIGURE 34.1 Developing an educational program and experimental infrastructure for big data management
and analytics and big data security and privacy.
the application layer. Research was carried out on encrypted data storage in the cloud as well as
on secure cloud query processing. The secure cloud was demonstrated with assured information
sharing as an application.
34.2.7 Infrastructure Development
We have developed hardware, software and data infrastructures for our students to carry out experi-
mental research. These include secure cloud infrastructures, mobile computing infrastructures as
well as social media infrastructures. The data collected includes geospatial data, social media data, and malware data. Some of these infrastructures are discussed in our previous books
([THUR14], [THUR16]).
(NLP) or database design. The course we are designing integrates our BDMA course with real-
world applications.
Our current BDMA course focuses on data mining and machine learning algorithms for ana-
lyzing very large amounts of data. MapReduce and NoSQL systems are used as tools/standards for creating parallel algorithms that can process such data. It covers the basics of Hadoop,
MapReduce, NoSQL systems (e.g., key–value stores, column-oriented data stores), Cassandra, Pig,
Hive, MongoDB, Hbase, BigTable, SPARK, Storm, large-scale supervised machine learning, data
streams, clustering, and applications including recommendation systems. The following reference
books are used to augment the material presented in lectures:
• Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan &
Claypool Publishers, 2010. http://lintool.github.com/MapReduceAlgorithms/
• Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, Cambridge University Press, http://infolab.stanford.edu/~ullman/mmds/book.pdf
• Chuck Lam, Hadoop in Action, Manning Publications, December 2010, ISBN: 9781935182191.
• Spark: http://spark.apache.org/docs/latest/
Our capstone course to be designed will be titled Big Data and Machine Learning for Scientific Discovery. It will integrate several of the topics in the BDMA course and in our machine learning course, combining the theoretical concepts with experimental work on real-world applications such as environmental and remote sensing applications. The course will focus on the
practical application of a variety of supervised and unsupervised machine learning approaches that
can be used for nonlinear multivariate systems including neural networks, deep neural networks,
support vector machines, random forests, and Gaussian processes. Unsupervised classifiers such as self-organizing maps will also be used. Many of these datasets are non-Gaussian, so mutual information will be introduced as a dependence measure. When remote sensing from a wide variety of platforms, from satellites to aerial vehicles, is coupled with machine learning, these massive datasets can be of great use for a wide variety of scientific, societal, and business applications.
Remote sensing can provide invaluable tools for both improved understanding and making data-
driven decisions and policies. This course will introduce a wide range of big data applications in remote sensing of land, ocean, and atmosphere, together with practical applications of major societal importance such as environmental health, drought and water issues, and fire. The experimental projects will include the processing of multiple massive datasets with machine learning. The skills developed in the big data curriculum give students practical techniques for designing algorithms over large datasets. For example, after learning the NoSQL and cluster-computing tools covered in the course (MapReduce, Pig, Hive, Spark), students can apply them to query large datasets in a scalable manner on commodity hardware. In addition to the graduate education in BDMA, we are also planning to introduce senior design projects in collaboration with local corporations.
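To make this concrete, the following is a minimal sketch of the kind of scalable query a student project might run with Spark. It is our own illustration: the file path and the column names (country_code, event_date) are hypothetical and would be replaced by the actual dataset used in the course.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; on a cluster this runs over HDFS/YARN.
spark = SparkSession.builder.appName("capstone-query-demo").getOrCreate()

# Load a large (hypothetical) event dataset stored as Parquet on HDFS.
events = spark.read.parquet("hdfs:///data/events.parquet")

# Scalable aggregation: count events per country per day; Spark parallelizes
# the group-by across the commodity cluster.
daily_counts = (events
    .groupBy("country_code", "event_date")
    .agg(F.count("*").alias("num_events"))
    .orderBy(F.desc("num_events")))

daily_counts.show(10)
spark.stop()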
The Capstone BDMA course will consist of two modules. The data management part introduces
various techniques and data structures to handle large data where traditional models or data struc-
tures do not efficiently scale to address the problems involving such voluminous data. The data ana-
lytics module introduces various algorithms which are widely used for analyzing the information in
the large datasets. Supervised/unsupervised learning is a collection of algorithms used for classifi-
cation, clustering, and pattern recognition. The module introduces generic problem formulations that can then be applied to specific cases when learning a model from a dataset in the relevant field. For example, classification can be used for real-time anomaly detection, where anomalous data must be separated from nonanomalous data, either with or without class labels. Relational learning is a general term for data mining algorithms deal-
ing with data and feature relationships. A real-world dataset may have multiple features and data
items that may have specific relationships between them [CHAN14]. For example, an atmospheric
dataset may have features of temperature, pressure, type of cloud, moisture content, and so on. In
order to predict if a set of data indicates rain, the relationship between these features needs to be
considered for a better prediction model. These relationships may be causal or noncausal in nature.
A query (e.g., will it rain in the next few days?) would be better evaluated from a model that best
represents the features and evidence given. Lastly, stream mining is a collection of algorithms used
for handling continuously occurring data. Data streams are continuous flows of data. Examples
of data streams include network traffic, sensor data, call center records, and so on. Data streams
demonstrate several unique properties that together conform to the characteristics of big data (i.e.,
volume, velocity, variety, and veracity) and add challenges to data stream mining. Most existing
data stream classification techniques ignore one important aspect of stream data: the arrival of a novel
class. We have addressed this issue and proposed a data stream classification technique that inte-
grates a novel class detection mechanism into traditional classifiers, enabling automatic detection
of novel classes before the true labels of the novel class instances arrive ([MASU09], [MASU11a],
[MASU11b], [HAQU14], [HAQU15]). Overall, such analytics methodologies help students understand the practical implications of designing algorithms over large datasets.
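As an illustrative and deliberately simplified sketch of the ensemble-and-novel-class idea described above, the following toy Python code keeps a small ensemble of per-chunk class centroids, classifies instances by majority vote, and flags an instance as a potential novel class when it is an outlier for every model in the ensemble. The radius threshold and centroid-based model are our own assumptions for illustration, not the published algorithm:

import numpy as np

class ToyNovelClassEnsemble:
    # Toy sketch only: per-chunk class centroids with a fixed outlier radius.
    def __init__(self, radius=2.0, max_models=5):
        self.models = []          # each model maps class label -> centroid
        self.radius = radius
        self.max_models = max_models

    def train_chunk(self, X, y):
        # Build one micro-model (a centroid per class) from a labeled chunk.
        centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        self.models.append(centroids)
        if len(self.models) > self.max_models:
            self.models.pop(0)    # forget the oldest model (concept drift)

    def predict(self, x):
        votes, outlier_everywhere = [], True
        for centroids in self.models:
            label, dist = min(((c, np.linalg.norm(x - m))
                               for c, m in centroids.items()),
                              key=lambda t: t[1])
            votes.append(label)
            if dist <= self.radius:
                outlier_everywhere = False
        if outlier_everywhere:
            return "potential-novel-class"  # buffer; declare a novel class later
        return max(set(votes), key=votes.count)

Roughly speaking, in the full approach the buffered outliers are examined further before a novel class is declared, and the ensemble is refined as true labels arrive.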
34.3.2 Experimental Program
Figure 34.2 shows an example of case studies related to the planned Capstone BDMA course. The
data management part of the course includes learning about NoSQL, Hadoop, Spark, Storm, and so
on. These technologies will be used while studying the following cases:
1. GDELT political event data (1 and 2), which uses NoSQL, Hadoop MapReduce, and Spark concepts in supervised/semisupervised settings.
2. Timely Health Indicator (3 and 5), with data management techniques using Spark while performing supervised/semisupervised learning on stream data.
3. Scalable Inference on Graphical Models (4 and 7), which studies relational learning using Spark concepts together with Alchemy 2.0.
FIGURE 34.2 Association between big data management and case studies.
4. Real-Time Anomaly Detection (6 and 9), using Spark to manage data while performing supervised stream mining.
5. Security and Privacy (8 and 10), covering issues in implementing supervised or unsupervised learning or in implementing big data management systems such as Spark or NoSQL.
These case studies, which are part of our research related to big data, will be discussed in more detail later in this chapter and will be integrated with the planned education module. Further, they will be supported with knowledge of tools such as Mahout and R, depending on the application and working environment. The planned case studies discussed in Chapter 33 will also be utilized for the experimental programs.
We discuss a sample of our research projects that have contributed a great deal to our Big Data
education. These projects can be integrated into the case studies of the capstone course.
in the world via open source release. Our lab, together with the lab manuals and our courses, will be made available to researchers and educators in CyS. Our research in BDSP and our education program in CyS will be integrated to build strong capacity for a BDSP education program. Essentially, we aim to address the security challenges by providing a platform accessible to big data users (e.g., IoT users), developers, and researchers, providing better insights for further research in BDSP.
34.4.2 Curriculum Development
34.4.2.1 Extensions to Existing Courses
In order to integrate the proposed lab with these courses, we will design various modules to be integrated with their existing projects. An overview is provided in Figure 34.3. These modules will be
derived from our existing research (detailed in Section 34.3) and/or our proposed virtual lab devel-
opment (detailed in Section 34.4.1). For example, the module "Access Control for Secure Storage and Retrieval" may be integrated with our graduate course, "Data and Applications Security and
Privacy.” The “Performance Overhead of Trusted Execution Environments” module will be inte-
grated into the course “System Security and Binary Code Analysis.” The “Feature Extraction &
(Un) Supervised Learning,” “Distributed Trusted Execution Environments,” and “Secure Encrypted
Stream Analytics” modules will be integrated into the “Big Data Management and Analytics”
course. Below we will describe a sample of our cyber security courses that are being enhanced with
BDSP modules.
1. Data and Applications Security: This course provides a comprehensive overview of data-
base security, confidentiality, privacy and trust management, data security and privacy,
data mining for security applications, secure social media, secure cloud computing, and
web security. In addition to term papers, students also carry out a programming project
that addresses any of the topics covered in class. Typical student projects have included
data mining tools for malware detection as well as access control tools for social media
systems. We have introduced a course module in access control for secure storage and
retrieval of Big Data.
(Figure 34.3 maps the proposed modules, namely feature extraction and (un)supervised learning, secure encrypted stream analytics, and performance overhead of trusted execution environments, to the courses they are integrated with: the capstone course on security- and privacy-preserving big data analytics and management, Big Data Management and Analytics, Data and Applications Security, and Secure Cloud Computing.)
2. System Security and Binary Code Analysis: The goal of this course is to explain low-level system details from the compiler, linker, and loader to the OS kernel and computer architecture; examine the weakest link in each system component; explore the bits and bytes left over after all these transformations; and study state-of-the-art offenses and defenses. The learning outcome is that students will be able to understand how an attack is launched (e.g., how an exploit is created) and how to defend against it (e.g., by developing OS patches, analyzing binary code, and detecting intrusions). We will introduce additional units on the overhead of trusted execution environments (TEEs) for secure hardware extensions.
3. Big Data Analytics and Management: As stated earlier, our current BDMA course focuses
on data mining and machine learning algorithms for analyzing very large amounts of
data or big data. MapReduce and NoSQL systems are used as tools/standards for creating parallel algorithms that can process very large amounts of data. It covers the basics of Hadoop, MapReduce, NoSQL systems (Cassandra, Pig, Hive, MongoDB, HBase, BigTable, Spark), Storm, large-scale supervised machine learning, data streams, clustering, and applications including recommendation systems, the web, and security. This course
focuses on large-scale feature extraction and learning to leverage the big data platform to
perform parallel and distributed analysis. In addition, the course focuses on a stream ana-
lytics framework using secure hardware extension. We have also introduced a module on
BDSP for this course.
4. Secure Cloud Computing: This course introduces the concepts of secure web services and
service-oriented architecture and then describes in detail the various layers of a secure
cloud. These include secure hypervisors, secure data storage, secure cloud query process-
ing, and cloud forensics. The use of the cloud for computing intensive tasks such as mal-
ware detection is also discussed. We have introduced a module on BDSP and will introduce additional modules on the setup of TEEs for secure hardware extensions.
5. Secure Cyber-Physical Systems and Critical Infrastructures: This course introduces the
security of cyber-physical systems from a multidisciplinary point of view, from computer
science security research (network security and software security) to public policy (e.g., Executive Order 13636), risk assessment, business drivers, and control-theoretic meth-
ods to reduce the cyber risk to cyber-physical critical infrastructures. We will introduce a
module on feature extraction and (un/semi) supervised learning to find anomalies in cyber-
physical systems.
34.4.3 Experimental Program
34.4.3.1 Laboratory Setup
The lab to be developed will be accessible to all students who enroll in the relevant and capstone courses. First, we will leverage our current single Intel SGX-enabled machine and later construct a cluster for the lab experiments. We use an Intel SGX-enabled Linux system with an i7-6700
CPU (Skylake) operating at 3.40 GHz with 8 cores and 64 GB of RAM, running Ubuntu 14.04. We
have installed the latest Intel SGX SDK and SGX driver [SGXSDK]. While running SGX applica-
tions, the trusted hardware establishes an enclave by protecting isolated memory regions within
the existing address space, called processor reserved memory (PRM), against other nonenclave
memory accesses including kernel, hypervisor, and other privileged code.
Number of enclaves on a single machine: A special region inside the PRM called the Enclave Page Cache (EPC) stores sensitive code and data as encrypted 4 kB pages. The EPC size can be configured in the BIOS settings up to a maximum of 128 MB. Hence, the number of enclaves that can be run efficiently on a single machine is limited by the EPC size.
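As a rough back-of-the-envelope calculation (ours, for illustration only): a 128 MB EPC divided into 4 kB pages provides at most 128 MB / 4 kB = 32,768 protected pages, shared by all enclaves on the machine. Once the combined enclave working set exceeds that budget, EPC pages must be encrypted and paged out to regular memory, which is what limits how many enclaves can run efficiently at the same time.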
The overhead of an SGX application increases with the number of enclaves run on a single machine. Typically, 5–8 enclaves can be run simultaneously on a single machine without significant performance overhead. SGX applications also incur memory access overhead because every piece of data read or written must be resident in the EPC. Thus, running heavy computations on large data inside an enclave can produce noticeable performance overhead. We will maintain a secure enclave cluster. There are various challenges in developing such a cluster. Building an SGX-enabled cluster requires (a) using an SGX-enabled machine at each node and (b) securing communication between enclaves running on the same machine or on different machines. An SGX-enabled machine protects local code and data running on a single machine. For secure communication between enclaves running on the same or different machines, the enclaves can first authenticate each other and then establish a Diffie-Hellman-based secure communication channel.
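As a minimal illustration of that last step, the sketch below derives a shared session key for an enclave-to-enclave channel using an elliptic-curve Diffie-Hellman exchange from the Python cryptography package. This is our own illustrative stand-in: in a real SGX deployment the key exchange would be bound to remote attestation and executed inside the enclaves themselves.

from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes

# Each enclave generates an ephemeral key pair and exchanges public keys
# (in practice over the attestation channel).
enclave_a = X25519PrivateKey.generate()
enclave_b = X25519PrivateKey.generate()

# Both sides compute the same shared secret from their own private key and
# the peer's public key.
shared_a = enclave_a.exchange(enclave_b.public_key())
shared_b = enclave_b.exchange(enclave_a.public_key())
assert shared_a == shared_b

# Derive a symmetric session key for encrypting enclave-to-enclave traffic.
session_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=b"enclave-to-enclave channel").derive(shared_a)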
smallest possible footprint, namely, just the hardware and the application code itself without trust-
ing the hypervisor, operating systems, or the surrounding libraries.
At a high level, SGX allows an application or part of an application to run inside a secure enclave,
an isolated execution environment in which code and data can execute without the fear of inspec-
tion and modification. SGX hardware, as part of the CPU, prevents malicious software, including the operating system, device drivers, hypervisor, or even low-level firmware code (e.g., SMM), from compromising the enclave's integrity and confidentiality. Also, physical attacks such as memory bus snooping, memory tampering, and cold boot attacks [HALD09] will all fail, since the enclave's secrets are visible only inside the CPU. Coupled with remote attestation, SGX allows
developers to build a root of trust even in an untrusted environment. Therefore, SGX provides an
ideal platform to protect the secrets of an application in the enclave even when an attacker has full
control of the entire system.
SGX is likely to make outsourced computing in data centers and the cloud practical. However,
there has been no study that precisely quantifies the overhead of Intel SGX, partly because SGX requires programmers to use its new instructions to develop the application or system software and currently
there is no publicly available SGX test bed or benchmarks. To answer the question of how much
overhead SGX could bring to an application, we have systematically measured the overhead of SGX
programs using both macro-benchmarks and micro-benchmarks.
instances generated from a source, traditional learning techniques that require prior knowledge of
data sizes and train once on stationary data cannot be directly employed over a data stream.
As such, we would like to answer the following question for large volumes of continuously arriving stream data: how can we perform computations on a dataset while it remains encrypted from the point of view of an adversary? Our goal is to keep the privacy-sensitive streaming data encrypted except while it is processed securely inside the enclave, and to perform interesting data analytics tasks on the encrypted data by leveraging recent developments in secure hardware design (e.g., Intel SGX).
Therefore, we plan to develop a framework for performing data analytics over sensitive encrypted
streaming data [MASU11c] using SGX to ensure data privacy. In particular, we plan to design data-
oblivious mechanisms to address the two major problems of classification over continuous data
streams (i.e., concept drift and concept evolution) when deployed over an Intel SGX processor. In
addition, we will explore complex query processing over encrypted data streams. Using algorithmic
manipulation of data access within a secure environment, we will prudently design and implement
methods to perform data classification and data querying by adapting to changes in data patterns
that occur over time while suppressing leakage of such information to a curious adversary via side
channels. In addition to hiding data patterns during model learning or testing, we also aim to hide
changes in those data patterns over time. Furthermore, we plan to support basic stream query pro-
cessing on the encrypted sensitive streams in addition to building classification models. Our initial
work indicates that such adapted algorithms perform equivalently to data stream processing on unencrypted data, and achieve data privacy with a small overhead.
Concretely, we need to develop techniques to address the above challenges while utilizing
encrypted streaming data containing sensitive and private information. As shown in Figure 34.4,
encrypted data will be decrypted only inside the secure/trusted enclave protected by the hardware.
Data decrypted inside the enclave cannot be accessed by the operating system or any other software
running on the system. In addition, we appeal to the data obliviousness property required from
an algorithm to guarantee data privacy. Here, data-obliviousness refers to the algorithmic prop-
erty where memory, disk, and network accesses are performed independent of the input data. The
main idea is to use appropriate data structures and introduce algorithmic “decoy” code whenever
necessary. In addition to using oblivious testing mechanisms for predicting the class labels of test data instances, we develop data-oblivious learning mechanisms that are invoked frequently during model adaptation.
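The following toy Python fragment (our own illustration, not the actual framework) shows the flavor of such "decoy" computation: every class centroid is read and written on every update, with non-matching classes receiving a zero-valued step, so the memory access pattern does not reveal the sensitive label of the instance.

import numpy as np

def oblivious_select(flag, a, b):
    # Return a when flag == 1 and b when flag == 0; both operands are touched
    # either way, so no data-dependent branch is taken.
    return flag * a + (1 - flag) * b

def oblivious_centroid_update(centroids, counts, x, label):
    # Every class centroid is updated on every instance; the centroids of the
    # non-matching classes receive a zero "decoy" step, so the access pattern
    # is independent of the (sensitive) label.
    for c in range(len(centroids)):
        is_target = int(c == label)
        counts[c] += is_target
        step = (x - centroids[c]) / max(counts[c], 1)
        centroids[c] = centroids[c] + oblivious_select(is_target, step,
                                                       np.zeros_like(step))
    return centroids, counts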
Furthermore, existing data stream classification techniques rely on ensembles of classifiers, which have been shown to outperform single classifiers. When developing an ensemble-based classification method over an encrypted data stream, it may be necessary to use multiple secure environments (enclaves). We plan to leverage the support for multiple enclaves in Intel SGX processors to perform such ensemble operations.
REFERENCES
[ALNA14]. K.M. Al-Naami, S. Seker, L. Khan, “GISQF: An Efficient Spatial Query Processing System,”
In 7th IEEE International Conference on Cloud Computing, June 27–July 2, 2014, Anchorage, AK.
[ALNA16]. K.M. Al-Naami, S.E. Seker, L. Khan, “GISQAF: MapReduce Guided Spatial Query Processing
and Analytics System," Journal of Software: Practice and Experience, John Wiley & Sons, Ltd., 46 (10), 1329–1349, 2016.
[ANAT13]. I. Anati, S. Gueron, S.P. Johnson, V.R. Scarlata, “Innovative Technology for CPU Based Attestation
and Sealing,” In Proceedings of the 2nd International Work Shop on Hardware and Architectural
Support for Security and Privacy (HASP), Tel Aviv, Israel, pp. 1–8, 2013.
[CHAN14]. S. Chandra, J. Sahs, L. Khan, B. Thuraisingham, C. Aggarwal, “Stream Mining Using Statistical
Relational Learning,” In IEEE International Conference on Data Mining Series (ICDM), December,
Shenzhen, China, pp. 743–748, 2014.
[CHEN08]. X. Chen, T. Garfinkel, E.C. Lewis, P. Subrahmanyam, C.A. Waldspurger, D. Boneh, J. Dwoskin,
D.R. Ports, “Overshadow: A Virtualization-Based Approach to Retrofitting Protection in Commodity
Operating Systems,” In Proceedings of the 13th International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS XIII, ACM, Seattle, WA, USA, pp. 2–13,
2008.
[DINH14]. T.T. Anh Dinh, A. Datta, “Streamforce: Outsourcing Access Control Enforcement for Stream Data
to the Clouds," In CODASPY, pp. 13–24, 2014.
[DINH15]. T.T. Anh Dinh, P. Saxena, E.-C. Chang, B.C. Ooi, C. Zhang, “M2R: Enabling Stronger Privacy in
MapReduce Computation,” USENIX Security Symposium 2015, Washington, DC, pp. 447–462, 2015.
[DOOR06]. L. Van Doorn, “Hardware Virtualization Trends,” In ACM/USENIX International Conference
on Virtual Execution Environments: Proceedings of the 2nd International Conference on Virtual
Execution Environments, Ottawa, Ontario, Canada, vol. 14, pp. 45–45, 2006.
[GUPT16]. D. Gupta, B. Mood, J. Feigenbaum, K. Butler, P. Traynor, “Using Intel Software Guard Extensions
for Efficient Two-Party Secure Function Evaluation,” In Proceedings of the 2016 FC Workshop on
Encrypted Computing and Applied Homomorphic Cryptography, Barbados, pp. 302–318, 2016.
[HALD09]. J.A. Halderman, S.D. Schoen, N. Heninger, W. Clarkson, W. Paul, J.A. Calandrino, A.J.
Feldman, J. Appelbaum, E.W. Felten, “Lest We Remember: Cold-Boot Attacks on Encryption Keys,”
Communications of the ACM, 52 (5), 91–98, 2009.
[HAQU14]. A. Haque, S. Chandra, L. Khan, C. Aggarwal, “Distributed Adaptive Importance Sampling on
Graphical Models Using MapReduce,” In 2014 IEEE International Conference on Big Data (IEEE
BigData 2014), Washington, DC, USA, pp. 597–602, 2014.
[HAQU15]. A. Haque, L. Khan, M. Baron, “Semi-Supervised Adaptive Framework for Classifying Evolving
Data Stream,” PAKDD (2), Ho Chi Minh City, Vietnam, pp. 383–394, 2015.
[HOEK13]. M. Hoekstra, R. Lal, P. Pappachan, V. Phegade, J. Del Cuvillo, “Using Innovative Instructions
to Create Trustworthy Software Solutions,” In Proceedings of the 2nd International Workshop on
Hardware and Architectural Support for Security and Privacy (HASP), Tel Aviv, Israel, pp. 1–8, 2013.
[KHAD12]. V. Khadilkar, K.Y. Oktay, M. Kantarcioglu, S. Mehrotra, “Secure Data Processing over Hybrid
Clouds,” IEEE Data Eng. Bull., 35(4), 46–54, 2012.
[KIM15]. S. Kim, Y. Shin, J. Ha, T. Kim, D. Han, “A First Step Towards Leveraging Commodity Trusted
Execution Environments for Network Applications,” In Proceedings of the 14th ACM Workshop on Hot
Topics in Networks (p. 7), November, ACM, 2015.
[KOCH14]. O. Kocabas and T. Soyata, “Medical Data Analytics in The Cloud Using Homomorphic
Encryption,” Handbook of Research on Cloud Infrastructures for Big Data Analytics, P. R. Chelliah
and G. Deka (eds), IGI Global, pp. 471–488, 2014.
[LI14]. Y. Li, J. McCune, J. Newsome, A. Perrig, B. Baker, W. Drewry, “MiniBox: A Two-Way Sandbox for
x86 Native Code,” USENIX 2014, Philadelphia, PA, pp. 409–420.
[MASU09]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Integrating Novel Class Detection with
Classification for Concept-Drifting Data Streams,” In Proceedings of European Conference on Machine
Learning and Knowledge Discovery in Databases (ECML PKDD), Bled, Slovenia, pp. 79–94, 2009.
[MASU11a]. M.M. Masud, J. Gao, L. Khan, J. Han, K.W. Hamlen, N.C. Oza, “Facing the Reality of Data
Stream Classification: Coping with Scarcity of Labeled Data,” International Journal of Knowledge and
Information Systems (KAIS), Springer, 33 (1), 213–244, 2011.
[MASU11b]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham,“Classification and Novel Class
Detection in Concept-Drifting Data Streams under Time Constraints,” IEEE Transactions on Knowledge
and Data Engineering, 23 (6), 859–874, 2011.
[MASU11c]. M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class
Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on
Knowledge and Data Engineering, 23 (6), 859–874, 2011.
[MCCU07]. J.M. McCune, B. Parno, A. Perrig, M.K. Reiter, H. Isozaki, “An Execution Infrastructure for TCB
Minimization,” Technical Report CMU-CyLab-07-018, Carnegie Mellon University, Dec. 2007.
[MCCU10]. J.M. McCune, Y. Li, N. Qu, Z. Zhou, A. Datta, V. Gligor, A. Perrig, “Trustvisor: Efficient TCB
Reduction and Attestation,” In Proceedings of the 2010 IEEE Symposium on Security and Privacy,
IEEE Computer Society, pp. 143–158, 2010.
[MCKE13]. F. McKeen, I. Alexandrovich, A. Berenzon, C.V. Rozas, H. Shafi, V. Shanbhogue, U.R.
Savagaonkar, “Innovative Instructions and Software Model for Isolated Execution,” In Proceedings
of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy
(HASP), Tel Aviv, Israel, pp. 1–8, 2013.
[NAEH11]. M. Naehrig, K. Lauter, V. Vaikuntanathan, “Can Homomorphic Encryption be Practical?,” In
Proceedings of the 3rd ACM Work Shop on Cloud Computing Security Workshop, ACM, Chicago, IL,
pp. 113–124, 2011.
[OHIR16]. O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, M. Costa, “Oblivious
Multi-Party Machine Learning on Trusted Processors.” USENIX Security Austin, TX, pp. 619–636, 2016.
[PERE06]. R. Perez, R. Sailer, L. van Doorn et al., “vTPM: Virtualizing the Trusted Platform Module,”
USENIX Security Symposium, pp. 305–320, 2006.
[RANE15]. A. Rane, C. Lin, M. Tiwari, “Raccoon: Closing Digital Side-Channels through Obfuscated
Execution,” In 24th USENIX Security Symposium (USENIX Security 15), Washington, DC, pp. 431–
446, 2015.
[SANT14]. N. Santos, H. Raj, S. Saroiu, A. Wolman, “Using Arm Trustzone to Build A Trusted Language
Runtime for Mobile Applications,” ACM SIGARCH Computer Architecture News, ACM, Vol. 42(1), pp.
67–80, 2014.
[SCHU15]. F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado, G. Mainar-Ruiz, M. Russinovich,
“VC3: Trustworthy Data Analytics in the Cloud Using SGX,” In 2015 IEEE Symposium on Security and
Privacy, May, IEEE, San Jose, CA, pp. 38–54, 2015.
[SESH07]. A. Seshadri, M. Luk, N. Qu, A. Perrig, “Secvisor: A Tiny Hypervisor to Provide Lifetime Kernel
Code Integrity for Commodity OSes,” In Proceedings of 21st ACM SIGOPS Symposium on Operating
Systems Principles, SOSP ’07, Stevenson, WA, USA, pp. 335–350, 2007.
[SGXSDK]. Intel Software Guard Extensions (Intel SGX) SDK. https://software.intel.com/en-us/sgx-sdk.
[SHVA10]. K. Shvachko, H. Kuang, S. Radia, R. Chansler, “The Hadoop Distributed File System,” In 2010
IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, Incline Village, NV,
pp. 1–10, 2010.
[SOLA16]. M. Solaimani, R. Gopalan, L. Khan, B. Thuraisingham, "Spark-Based Political Event Coding," In the 2nd IEEE International Conference on Big Data Computing Service and Applications (IEEE BigDataService 2016), Oxford, UK, pp. 14–23, March 29–April 1, 2016.
[STEI10]. U. Steinberg and B. Kauer, "NOVA: A Microhypervisor-Based Secure Virtualization Architecture,"
ACM, Paris, France, pp. 209–222, 2010.
[SUN12]. K. Sun, J. Wang, F. Zhang, A. Stavrou. “SecureSwitch: BIOS-Assisted Isolation and Switch between
Trusted and Untrusted Commodity OSes,” NDSS, San Diego, CA, 2012.
[THUR14]. B. Thuraisingham, Developing and Securing the Cloud. CRC Press, Boca Raton, FL, USA, 2014.
[THUR16]. B. Thuraisingham, S. Abrol, R. Heatherly, M. Kantarcioglu, V. Khadilkar, L. Khan, Analyzing
and Securing Social Networks. CRC Press, Boca Raton, FL, USA, 2016.
[WANG10]. J. Wang, A. Stavrou, A.K. Ghosh, “Hypercheck: A Hardware-Assisted Integrity Monitor,” Recent
Advances in Intrusion Detection, 13th International Symposium, RAID 2010, Ottawa, Ontario, Canada,
September 15–17, Proceedings, pp. 158–177, 2010.
[ZHANG11]. F. Zhang, J. Chen, H. Chen, and B. Zang, “CloudVisor: Retrofitting Protection of Virtual
Machines in Multi-Tenant Cloud with Nested Virtualization.” In Proceedings of the 23rd ACM
Symposium on Operating Systems Principles (SOSP’11), Cascais, Portugal, pp. 203–216, 2011.
35 Directions for BDSP
and BDMA
35.1 INTRODUCTION
While the chapters in Sections II through IV have described the experimental systems and proto-
types we have developed in big data management and analytics (BDMA) and big data security and
privacy (BDSP) and the previous chapters in Section V have discussed some of our exploratory
work on research in BDMA and BDSP as well as our approach to developing experimental infra-
structures and courses in these fields, this chapter describes directions for BDSP and BDMA. In
particular, we provide a summary of the discussions of the National Science Foundation (NSF)
sponsored workshop on BDSP (including applications of BDMA for BDSP) held in Dallas, Texas
on September 16–17, 2014. Our goal is to build a community in BDSP to explore the challenging
research problems. We also presented the results of the workshop at the National Privacy Research
Strategy meeting in Washington, DC to set the directions for research and education on these topics.
Recently a few workshops and panels have been held on BDSP. Examples include the ACM CCS
workshop on Big Data Security, ACM SACMAT, and IEEE Big Data Conference panels. These
workshops and panels have been influenced by different communities of researchers. For example,
the ACM CCS workshop series is focusing on big data for security applications while the IEEE
Big Data conference is focusing on cloud security issues. Furthermore, these workshops and panels
mainly address a limited number of the technical issues surrounding BDSP. For example, the ACM
CCS workshop does not appear to address the privacy issues dealing with regulations or the security
violations resulting from data analytics.
To address the above limitations, we organized a workshop on Big Data Security and Privacy on
September 16–17, 2014 in Dallas, Texas sponsored by the NSF [NSF]. The participants of this work-
shop consisted of interdisciplinary researchers in the fields of high-performance computing, systems,
data management and analytics, cyber security, network science, healthcare, and social sciences who
came together and determined the strategic direction for BDSP. NSF has made substantial investments
both in cyber security and big data. It is therefore critical that the two areas work together to determine
the direction for big data security. We made a submission based on the workshop results to the National
Privacy Research Strategy [NPRS]. We also gave a presentation at the NITRD (The Networking and
Information Technology Research and Development) Privacy Workshop [NITRD]. This chapter is based on the workshop report and describes the issues in BDSP, the presentations at the workshop, and the discussions at the workshop. We hope that this effort will help toward building a community in BDSP.
The organization of this chapter is as follows. Section 35.2 describes the issues surrounding
BDSP. The workshop participants were given these issues to build upon during the workshop dis-
cussions. A summary of the workshop presentations is provided in Section 35.3. A summary of the
discussions at the workshop is provided in Section 35.4. Next steps are discussed in Section 35.5.
Figure 35.1 illustrates the topics discussed in this chapter.
cyber security. While big data has roots in many technologies, database management is at its heart.
Therefore, in this section, we will discuss how data management has evolved and will then focus
on the BDSP issues.
Database systems technology has advanced a great deal during the past four decades from the
legacy systems based on network and hierarchical models to relational and object database systems.
Database systems can now be accessed via the web and data management services have been imple-
mented as web services. Due to the explosion of web-based services, unstructured data manage-
ment and social media and mobile computing, the amount of data to be handled has increased from
terabytes to petabytes and zettabytes in just two decades. Such vast amounts of complex data have
come to be known as big data. Not only must big data be managed efficiently, such data also has
to be analyzed to extract useful nuggets to enhance businesses as well as improve society. This has
come to be known as big data analytics.
Storage, management, and analysis of large quantities of data can also result in security and privacy violations. Often data has to be retained for various reasons, including regulatory compliance.
The data retained may have sensitive information and could violate user privacy. Furthermore,
manipulating such big data, such as by combining sets of different types of data, could result in security and privacy violations. For example, even when personally identifiable information is removed from the raw data, the derived data may still contain private and sensitive information; raw data about a person may be combined with the person's address, which may be sufficient to identify the person.
Different communities are working on the big data challenge. For example, the systems commu-
nity is developing technologies for massive storage of big data. The network community is develop-
ing solutions for managing very large networked data. The data community is developing solutions
for efficiently managing and analyzing large sets of data. Big data research and development is
being carried out in academia, industry, and government research labs. However, little atten-
tion has been given to security and privacy considerations for big data. Security cuts across multiple
areas including systems, data, and networks. We need the multiple communities to come together
to develop solutions for BDSP.
This section describes some of the issues in BDSP. An overview of BDMA is provided in Section
35.2.2. Security and privacy issues are discussed in Section 35.2.3. BDMA for cyber security is discussed in Section 35.2.4. Our goal toward building a community is discussed in Section 35.2.5.
1. Building infrastructure and high performance computing techniques for the storage of big
data.
2. Data management techniques such as integrating multiple data sources (both big and
small) and indexing and querying big data.
3. Data analytics techniques that manipulate and analyze big data to extract nuggets.
We will briefly review the progress made in each of the areas. With respect to building infra-
structures, technologies such as Hadoop and MapReduce as well as Storm are being developed for
managing large amounts of data in the cloud. In addition, main memory data management tech-
niques have advanced so that a few terabytes of data can be managed in main memory. Furthermore,
systems such as Hive and Cassandra as well as NoSQL databases have been developed for manag-
ing petabytes of data.
With respect to data management, traditional data management techniques such as query pro-
cessing and optimization strategies are being examined for handling petabytes of data. Furthermore,
graph data management techniques are being developed for the storage and management of very
large networked data.
With respect to data analytics, the various data mining algorithms are being implemented on
Hadoop- and MapReduce-based infrastructures. Additionally, data reduction techniques are being
explored to reduce the massive amounts of data into manageable chunks while still maintaining the
semantics of the data.
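As a small, hedged illustration of this style of processing (the key layout and invocation are hypothetical), a Hadoop Streaming job can express a simple aggregation as a Python mapper and reducer that read and write tab-separated key/value pairs on standard input and output:

#!/usr/bin/env python3
import sys

def mapper():
    # Emit (key, 1) for every input record; the first tab-separated field is
    # treated as the key (hypothetical layout).
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop sorts map output by key, so records with the same key arrive together.
    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{total}")
            total = 0
        current_key = key
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    # Run as "count.py map" for the map phase and "count.py reduce" for the reduce phase.
    mapper() if sys.argv[1] == "map" else reducer()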
In summary, BDMA techniques include extending current data management and mining
techniques to handle massive amounts of data as well as developing new approaches including
graph data management and mining techniques for maintaining and analyzing large networked
data.
structured, semistructured, unstructured, and graph data be integrated? Since big data may result
from combining data from numerous sources, how can you ensure the quality of the data?
Finally, the entire area of security, privacy, integrity, data quality, and trust policies has to be
examined within the context of big data security. What are the appropriate policies for big data?
How can these policies be handled without affecting performance? How can these policies be made
consistent and complete?
This section has listed some of the challenges with respect to security and privacy for big data.
We need a comprehensive research program that will identify the challenges and develop solutions
for BDSP. Security cannot be an afterthought; that is, we cannot retrofit security into each and every big data technology after it has been developed. We need to have a comprehensive strategy
so that security can be incorporated while the technology is being developed. We also need to
determine the appropriate types of policies and regulations to enforce before Big Data technologies
are employed by an organization. This means researchers from multiple disciplines have to come
together to determine what the problems are and explore solutions. These disciplines include cyber
security and privacy, high-performance computing, data management and analytics, network sci-
ence, and policy management.
35.2.5 Community Building
The various issues surrounding BDSP were discussed at the beginning of the workshop and five
keynote presentations were given at the workshop that addressed many of these issues. In addition,
several position papers were submitted by the workshop participants and subsequently, presenta-
tions based on these papers were given. These papers and presentations set the stage for the two
breakout sessions held during the workshop. One of these sessions focused on the security and
privacy issues while the other focused on the applications. The presentations and the discussions at
the workshop are summarized in Sections 35.3 and 35.4 of this report. Our goal is to build a com-
munity in BDSP.
We have made some progress toward this goal over the past 2 years. In particular, we par-
ticipated in the BDSP workshops organized at the Women in Cyber Security Conference series
in Dallas in 2016 and in Tucson in 2017. In addition, we also organized a Women in Data Science
and Engineering Workshop in San Diego as part of the IEEE International Conference on Data Engineering (ICDE) series. We also continue to present papers and tutorials at various big data-related conferences.
35.3.1 Keynote Presentations
We had five keynote presentations to motivate the workshop participants. These keynote presenta-
tions discussed the various BDSP initiatives at NIST, Honeywell, and IBM as well as provided an
overview of some of the research challenges. The opening keynote given by Wo Chang from NIST
discussed the initiatives at NIST on big data and provided an overview of the big data workgroup.
Later Arnab Roy from Fujitsu provided some details of the work by the BDSP subgroup of this
workgroup. Elisa Bertino from Purdue discussed issues and challenges of providing security with
privacy. Raj Rajagopalan from Honeywell discussed BDSP challenges for industrial control sys-
tems. Sandeep Gopisetty from IBM discussed the Big Data Enterprise efforts at IBM while Murat
Kantarcioglu from UT Dallas provided an overview of the BDSP initiatives at UT Dallas.
There were several presentations given by the workshop participants. Below we give a summary
of these presentations.
35.3.1.2 Formal Methods for Preserving Privacy While Loading Big Data
Brian Blake from the University of Miami discussed how formal methods can be incorporated into
approaches to handle privacy violations when multiple pieces of information are combined. In par-
ticular, he discussed the creation of a software life cycle and framework for big data testing.
35.3.1.4 Business Intelligence Meets Big Data: An Overview of Security and Privacy
Claudio Ardagna from the University of Milano in Crema discussed the notions of full data and
zero latency analysis within the context of BDSP.
35.3.1.6 Big Data Analytics: Privacy Protection Using Semantic Web Technologies
Csilla Farkas from the University of South Carolina discussed the use of semantic web technolo-
gies for representing policies and data and subsequently reasoning about these policies to prevent
security and privacy violations.
to big data privacy are discussed in Section 35.4.5. An overview of BDMA techniques for cyber
security is provided in Section 35.4.6.
reconciling security with privacy. At the same time, there are a few approaches that focus on effi-
ciently reconciling security with privacy and we discuss them as follows:
Instead, multiple dimensions need to be tailored for different application domains to achieve
practical solutions. First of all, different domains require different definitions of data utility. For
example, if we want to build privacy-preserving classification models, 0/1 loss could be a good
utility measure. On the other hand, for privacy-preserving record linkage, F1 score could be a
better choice. Second, we need to understand the right definitions of privacy risk. For example, in
data sharing scenarios, the probability of re-identification given certain background knowledge
could be considered the right measure of privacy risk. On the other hand, ε=1 could be consid-
ered an appropriate risk for differentially private data mining models. Finally, the computation,
storage, and communication costs of given protocols need to be considered. These costs could
be especially significant for privacy-preserving protocols that involve cryptography. Given these
three dimensions, one can envisage a multiobjective framework where different dimensions could
be emphasized:
• Maximize utility, given risk and cost constraints: This would be suited for scenarios where limiting certain privacy risks is paramount.
• Minimize privacy risks, given the utility and cost constraints: In some scenarios (e.g.,
medical care), significant degradation of the utility may not be allowed. In this setting,
the parameter values of the protocol (e.g., ε in differential privacy) are chosen in such a
way that we try to do our best in terms of privacy given our utility constraints. Please note
that in some scenarios, there may not be any parameter settings that can satisfy all the
constraints.
• Minimize cost, given the utility and risk constraints: In some cases (e.g., cryptographic protocols), one may want to find the protocol parameter settings that allow for the least expensive protocol that can satisfy all the utility and risk constraints.
To better illustrate these dimensions, consider the privacy-preserving record matching problem
addressed in [INAN08]. Existing solutions to this problem generally follow two approaches: sanitization techniques and cryptographic techniques. In [INAN08], a hybrid technique that combines
these two approaches is presented. This approach enables users to make trade-offs between privacy,
accuracy, and cost. This is similar to the multiobjective optimization framework discussed in this
chapter. These multiobjective optimizations are achieved by using a blocking phase that operates
over sanitized data to filter out pairs of records, in a privacy-preserving manner, that do not sat-
isfy the matching condition. By disclosing more information (e.g., differentially private data statis-
tics), the proposed method incurs considerably lower costs than those for cryptographic techniques.
On the other hand, it yields matching results that are significantly more accurate when compared to
the sanitization techniques, even when privacy requirements are high. The use of different privacy-
parameter values allows for different cost, risk, and utility outcomes.
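To make the role of the privacy parameter concrete, here is a minimal sketch (our own illustration) of the Laplace mechanism for a counting query: the smaller the value of ε, the more noise is added, so privacy risk goes down but utility, meaning the accuracy of the released count, goes down with it.

import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: add noise with scale sensitivity/epsilon to the true answer.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_count = 1000
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: released count = {noisy_count(true_count, eps):.1f}")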
To enable the multiobjective optimization framework for data privacy, we believe that more
research needs to be done to identify appropriate utility, risk, and cost definitions for different appli-
cation domains. In particular, defining correct and realistic privacy risks is paramount. Many human actions, ranging from oil extraction to airline travel, involve risks and benefits. In many cases, such as trying to develop an aircraft that may never malfunction, avoiding all risks is either too costly or impossible. Similarly, we believe that avoiding all privacy risks for all individuals would be too costly. In addition, assuming that an attacker may know everything is too pessimistic. Therefore, privacy risk definitions under realistic attacker scenarios are needed.
• Data confidentiality: Several data confidentiality techniques and mechanisms exist, the
most notable being access control systems and encryption. Both techniques have been
widely investigated. However, for access control systems for big data we need approaches
for the following:
• Merging large numbers of access control policies. In many cases, big data entails inte-
grating data originating from multiple sources; these data may be associated with their
own access control policies (referred to as “sticky policies”) and these policies must be
enforced even when the data is integrated with other data. Therefore, policies need to
be integrated and conflicts solved.
• Automatically administering authorizations for big data and in particular for grant-
ing permissions. If fine-grained access control is required, manual administration on
large datasets is not feasible. We need techniques by which authorization can be auto-
matically granted, possibly based on the user digital identity, profile, and context, and
on the data contents and metadata.
• Enforcing access control policies on heterogeneous multimedia data. Content-based
access control is an important type of access control by which authorizations are
granted or denied based on the content of data. Content-based access control is critical
when dealing with video surveillance applications which are important for security.
As for privacy, such videos have to be protected. Supporting content-based access con-
trol requires understanding the contents of protected data and this is very challenging
when dealing with multimedia large data sources.
• Enforcing access control policies in big data stores. Some of the recent big data
systems allow its users to submit arbitrary jobs using programming languages
such as Java. For example, in Hadoop, users can submit arbitrary MapReduce
jobs written in Java. This creates significant challenges to enforce fine-grained
access control efficiently for different users. Although there is some existing work
([ULUS14]) that tries to inject access control policies into submitted jobs, more
research needs to be done on how to efficiently enforce such policies in recently
developed big data stores.
• Automatically designing, evolving, and managing access control policies. When deal-
ing with dynamic environments where sources, users, and applications as well as the
data usage are continuously changing, the ability to automatically design and evolve
policies is critical to make sure that data is readily available for use while at the same
time assuring data confidentiality. Environments and tools for managing policies are
also crucial.
• Privacy-preserving data correlation techniques: A major issue arising from big data is
that in correlating many (big) datasets, one can extract unanticipated information. Relevant
issues and research directions that need to be investigated include
• Techniques to control what is extracted and to check that what is extracted can be
used and/or shared.
• Support for both personal privacy and population privacy. In the case of population
privacy, it is important to understand what is extracted from the data as this may lead
to discrimination. Also, when dealing with security with privacy, it is important to
understand the trade-off of personal privacy and collective security.
• Efficient and scalable privacy-enhancing techniques. Several such techniques have
been developed over the years, including oblivious RAM, secure multiparty com-
putation, multi-input encryption, homomorphic encryption. However, they are not yet
practically applicable to large datasets. We need to engineer these techniques, using
for example parallelization, to fine tune their implementation and perhaps combine
them with other techniques, such as differential privacy (like in the case of the record
linkage protocols described in [SCAN07]). A possible further approach in this respect
is to first use anonymized/sanitized data, and then depending on the specific situation
to get specific nonanonymized data.
• Usability of data privacy policies. Policies must be easily understood by users. We
need tools for the average users and we need to understand user expectations in terms
of privacy.
• Approaches for data services monetization. Instead of selling data, organizations own-
ing datasets can sell privacy-preserving data analytic services based on these datas-
ets. The question to be addressed then is: how would the business model around data
change if privacy-preserving data analytic tools were available? Also, if data is con-
sidered as a good to be sold, are there regulations concerning contracts for buying/
selling data? Can these contracts include privacy clauses requiring, for example, that the users to whom the data pertains have been notified?
• Data publication. Perhaps we should abandon the idea of publishing data, given the
privacy implications, and rather require the user of the data to utilize a controlled envi-
ronment (perhaps located in a cloud) for using the data. In this way, it would be much
easier to control the proper use of data. An issue would be the case of research data
used in universities and the repeatability of data-based research.
• Privacy implications for data quality. Recent studies have shown that people lie, especially in social networks, because they are not sure that their privacy is preserved. This
results in a decrease in data quality that then affects decisions and strategies based on
these data.
• Risk models. Different types of relationships between risk and big data can be identified: (a) big data can increase privacy risks and (b) big data can reduce risks in many domains (e.g., national security). The development of models for these two types of risk is critical in order to identify suitable trade-offs and privacy-enhancing techniques to be used.
• Data ownership. The question of who owns a piece of data is often difficult to answer. It is perhaps better to replace this concept with the concept of stake-
holder. Multiple stakeholders can be associated with each data item. The concept of
stakeholder ties well with risks. Each stakeholder would have different (possibly con-
flicting) objectives and this can be modeled according to multiobjective optimization.
In some cases, a stakeholder may not be aware of the others. For example, a user to whom the data pertains (and thus a stakeholder for the data) may not be aware that
a law enforcement agency is using this data. Technology solutions need to be investi-
gated to eliminate conflicts.
• Human factors. All solutions proposed for privacy and for security with privacy need
to be investigated in order to determine human involvement (e.g., how the user would interact with the data and what his or her specific tasks are concerning the use and/or protection of the data) in order to enhance usability.
• Data lifecycle framework. A comprehensive approach to privacy for big data needs
to be based on a systematic data lifecycle approach. Phases in the lifecycle need to be
identified and their privacy requirements and implications also need to be identified.
Relevant phases include
• Data acquisition: We need mechanisms and tools to prevent devices from acquiring data
about other individuals (relevant when devices like Google Glass are used); for example,
can we come up with mechanisms that automatically block devices from recording/acquir-
ing data at certain locations (or notify a user that recording devices are around). We also
need techniques by which each recorded subject may have a say about the use of the data.
• Data sharing: Users need to be informed about data sharing/transferred to other parties.
Addressing the above challenges requires multidisciplinary research drawing from many differ-
ent areas, including computer science and engineering, information systems, statistics, risk models,
economics, social sciences, political sciences, human factors, and psychology. We believe that all
these perspectives are needed to develop effective solutions to the problem of privacy in the era of
big data as well as to reconcile security with privacy.
• What is different about big data management and analytics (BDMA) for cyber security? The
workshop participants pointed out that BDMA for cyber security needs to deal with adap-
tive and malicious adversaries who can potentially launch attacks to avoid being detected
(i.e., data poisoning attacks, denial of service, denial of information attacks, etc.). In
addition, BDMA for cyber security needs to operate in high volume (e.g., data coming
from multiple intrusion detection systems and sensors) and high noise environments (i.e.,
constantly changing normal system usage data is mixed with stealth advanced persistent
threat-related data). One of the important points that came out of this discussion is that
we need BDMA tools that can integrate data from hosts, networks, social networks, bug
reports, mobile devices, and internet of things sensors to detect attacks.
• What is the right BDMA architecture for cyber security? We also discussed whether we
need different types of BDMA system architectures for cyber security. Based on the use
cases discussed, participants felt that existing BDMA system architectures can be adapted
for cyber security needs. One issue pointed out was that a successful BDMA system for cyber security must support real-time data analysis. For example, once a certain type of attack is known, the system needs to be updated to look for such attacks in real time, including re-examining historical data to see whether prior attacks have occurred (a small sketch following this list illustrates this idea).
• Data sharing for BDMA for cyber security: It emerged quickly during our discussions
that cyber security data needs to be shared both within as well as across organizations. In
addition to obvious privacy, security and incentive issues in sharing cyber security data,
participants felt that we need common languages and infrastructure to capture and share
such cyber security data. For example, we need to represent certain low-level system infor-
mation (e.g., memory, CPU states, etc.) so that it can be mapped to similar cyber security
incidents.
• BDMA for preventing cyber attacks: There was substantial discussion on how BDMA
tools could be used to prevent attacks. One idea that emerged is that BDMA systems that
can easily track sensitive data using the captured provenance information can potentially
detect attacks before too much sensitive information is disclosed. Based on this observa-
tion, building provenance-aware BDMA systems would be needed for cyber attack preven-
tion. Also, BDMA tools for cyber security can potentially mine useful attacker information
such as their motivations, technical capabilities, modus operandi, and so on to prevent
future attacks. A small provenance-tracking sketch also follows this list.
• BDMA for digital forensics: BDMA techniques could be used for digital forensics by com-
bining or linking different data sources. The main challenge that emerged was identifying
the right data sources for digital forensics. In addition, answers to the following questions
were not clear immediately: What data to capture? What to filter out (big noise in big
data)? What data to link? What data to store and for how long? How to deal with machine-
generated content and Internet of Things?
• BDMA for understanding the users of cyber systems: Participants believed that BDMA
could be used to mine human behavior to learn how to improve the systems. For example,
an organization may send phishing e-mails to its users and carry out security re-training
for those who are fooled by such a phishing attack. In addition, BDMA techniques could
be used to understand and build normal behavior models per user to find significant devia-
tions from the norm; a minimal per-user baseline sketch follows this list as well.
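To make the real-time plus retrospective analysis point from the architecture discussion concrete, here is a minimal Python sketch (all function and field names are hypothetical, not from the workshop): the same newly learned attack signature is applied to a live event stream and retroactively to stored history.

from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]               # e.g., {"src": "10.0.0.5", "signature": "beacon-xyz"}
Detector = Callable[[Event], bool]

def matches_beacon(event: Event) -> bool:
    """Hypothetical signature for a newly discovered beaconing attack."""
    return event.get("signature") == "beacon-xyz"

def scan_live(stream: Iterable[Event], detector: Detector) -> Iterable[Event]:
    """Apply the detector to events as they arrive (the real-time path)."""
    for event in stream:
        if detector(event):
            yield event

def scan_history(history: List[Event], detector: Detector) -> List[Event]:
    """Re-examine stored history to see whether prior attacks occurred."""
    return [event for event in history if detector(event)]

history = [{"src": "10.0.0.5", "signature": "beacon-xyz"},
           {"src": "10.0.0.7", "signature": "normal"}]
print(scan_history(history, matches_beacon))            # retrospective scan
print(list(scan_live(iter(history), matches_beacon)))   # same detector applied to a "stream"

The only point of the sketch is that one detector definition serves both paths; a production system would push the same logic into its streaming and batch layers.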
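The provenance-tracking idea for attack prevention can be sketched as follows (hypothetical data and names; a real system would capture the provenance edges automatically): an outbound transfer is flagged when its payload derives, directly or transitively, from sensitive items.

from typing import Dict, List, Set

# derived_from[x] lists the items x was produced from (captured provenance).
# The sketch assumes the provenance graph is acyclic.
derived_from: Dict[str, List[str]] = {
    "report.pdf": ["salary_table", "org_chart"],
    "summary.txt": ["report.pdf"],
}
SENSITIVE: Set[str] = {"salary_table"}

def tainted(item: str) -> bool:
    """An item is tainted if it is sensitive or derived (transitively) from something sensitive."""
    if item in SENSITIVE:
        return True
    return any(tainted(parent) for parent in derived_from.get(item, []))

def check_outbound(item: str, destination: str) -> None:
    if tainted(item):
        print(f"ALERT: {item} (derived from sensitive data) sent to {destination}")

check_outbound("summary.txt", "external-host.example.com")   # raises an alert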
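Finally, a minimal sketch of the per-user behavior baseline idea mentioned in the last item (illustrative only; a real deployment would use far richer features than a daily event count): today's activity is compared against the same user's history with a simple z-score test.

import statistics

def is_anomalous(history, new_count, threshold=3.0):
    """Compare today's activity count against this user's historical baseline
    using a simple z-score test (a stand-in for a richer behavior model)."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
    return abs(new_count - mean) / stdev > threshold

baseline = [95, 102, 99, 101, 98, 104]     # a user who normally issues ~100 events per day
print(is_anomalous(baseline, 900))         # True: a strong deviation from this user's norm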
Overall, during our workshop discussions, it became clear that all of the above topics have sig-
nificant research challenges and more research needs to be done to address them. Furthermore,
regardless of whether we are using BDMA for cyber security or for other applications (e.g., healthcare, finance), it is critical that we design scalable BDMA solutions. These include parallel BDMA techniques as well as BDMA techniques implemented on cloud platforms such as Hadoop/MapReduce, Storm, and Spark. In addition, we need to explore the use of BDMA systems such as HBase and CouchDB for use in various applications.
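As a small illustration of the scalability point, the following PySpark sketch (the file path and column names are assumptions) aggregates a large event log by user; the same code runs unchanged on a single machine or scaled out across a cluster.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("UserEventCounts").getOrCreate()

# Hypothetical JSON event log with at least a "user" field per record.
events = spark.read.json("hdfs:///security/logs/events/*.json")

# Count events per user; Spark executes the aggregation in parallel across the cluster.
per_user = (events.groupBy("user")
                  .agg(F.count("*").alias("n_events"))
                  .orderBy(F.desc("n_events")))

per_user.show(10)
spark.stop()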
Conclusion to Part V
Part V, consisting of seven chapters, described some of our exploratory systems and plans for both
big data management and analytics (BDMA) and big data security and privacy (BDSP).
Chapter 29 provided an overview of confidentiality, privacy, and trust considerations with respect
to inference with an emphasis on big data systems such as social media systems. We first discussed
a framework for enforcing confidentiality, privacy, and trust (CPT) for the semantic web. Next, we
described our approach to confidentiality and inference control. Next, we discussed privacy for
the semantic web. This was followed by a discussion of trust management as well as an integrated
framework for CPT. Finally, we discussed how we can adapt the CPT framework for big data
systems such as social media systems. Chapter 30 essentially integrated much of the design and
implementation of the systems we have developed and described a unifying framework for big data
systems. The framework includes components both for access control and inference control as well
as information sharing control. We also discussed the modules of the framework as well as building
the global inference controller. Chapter 31 discussed the need for IoT security and provided some
use cases. Next, we described a layered architecture for secure IoT. We argued that data security
and analytics are at the heart of IoT security because data is being collected from the various IoT
devices. This data has to be secured. In addition, the threat/attack data has to be analyzed to deter-
mine anomalies. Chapter 32 discussed malware detection in one type of IoT system, namely, connected smartphones. We discussed the security challenges for smartphones and then discussed our
approach to malware detection for smartphones based on the Android operating system. Next, we
discussed our virtual laboratory for experimentation. In Chapter 33, we discussed a planned case
study in the healthcare domain that illustrates how the various techniques in big data management,
analytics, security, and privacy could be applied. In Chapter 34, we discussed our plans for developing
an education curriculum that will be integrated with an experimental infrastructure both for BDMA
and BDSP. Finally, in Chapter 35, we provided a summary of the discussions at the NSF workshop
we organized on BDSP and BDMA for cyber security.
36 Summary and Directions
FIGURE 36.1 Layered framework for big data management and analytics and big data security and privacy.
Chapter 5 discussed semantic web technologies such as the resource description framework (RDF), ontologies, and the Web Ontology Language (OWL). This was followed by a discussion of security issues for the semantic web. Finally, we discussed cloud computing frameworks based on semantic web technologies. Chapter 6 discussed the problem of insider
threat and our approach to graph mining for insider threat detection. We represent the insiders and
their communication as RDF graphs and then query and mine the graphs to extract the nuggets. We
also provided a comprehensive framework for insider threat detection. Much of the discussion in
Section III was built on the concepts discussed in Chapter 6. Chapter 7 discussed three types of big
data systems. First, we discussed frameworks for managing big data. These are essentially massive
data processing platforms such as the Apache Hadoop, Spark, Storm, and Flink. Then we discussed
various big data management systems. These included SQL- and NoSQL-based systems. This was
followed by a discussion of big data analytics systems. Finally, we discussed cloud platforms that
provide the capability for the management of massive amounts of data.
Section II, consisting of six chapters, 8 through 13, focused on stream data analytics. Chapter
8 stressed the need for mining data streams and discussed the challenges. The challenges include
infinite length, concept drift, concept evolution, and limited labeled data. We also provided an
overview of our approach to mining data streams. Specifically, our approach determines whether
an item belongs to a pre-existing class or whether it is a novel class. Chapter 9 discussed several
prior approaches that have influenced our work as well as our approach for stream analytics. For
example, in the single-model classification approach, incremental learning techniques are used.
Ensemble-based techniques can be built more efficiently than the single-model approach. Our
novel class detection approach integrates both data stream classification and novelty detection. Our
data stream classification technique with limited labeled data uses a semisupervised technique.
Chapter 10 introduced a multiple partition, multiple chunk (MPC) ensemble method for classifying
concept-drifting data streams. Our ensemble approach is a generalization over previous ensemble
approaches that train a single classifier from a single data chunk. By introducing this MPC ensem-
ble, we have reduced the errors significantly over the single-partition, single-chunk approach. We
have proved our claims theoretically, tested our approach on both synthetic data and real-world data,
and obtained better classification accuracies compared to other approaches. Chapter 11 presented a
novel technique to detect new classes in concept-drifting data streams. Most of the novelty detection
techniques either assume that there is no concept drift, or build a model for a single “normal” class
and consider all other classes as novel. But our approach is capable of detecting novel classes in the
presence of concept drift and even when the model consists of multiple “existing” classes. Besides,
our novel class detection technique is nonparametric, meaning that it does not assume any specific dis-
tribution of data. We also show empirically that our approach outperforms the state-of-the-art data
stream-based novelty detection techniques in both classification accuracy and processing speed. It
might appear to readers that in order to detect novel classes, we are in fact examining whether new
clusters are being formed, and therefore, the detection process could go on without supervision.
But supervision is necessary for classification. Without external supervision, two separate clusters
could be regarded as two different classes, although they are not. Conversely, if more than one novel
class appears in a chunk, all of them could be regarded as a single novel class if the labels of those
instances are never revealed. Chapter 12 addressed a more realistic problem of stream mining: train-
ing with a limited amount of labeled data. Our technique is a more practical approach to the stream
classification problem since it requires less labeled data, saving much time and cost that would be
otherwise required to manually label the data. Previous approaches for stream classification did not
address this vital problem. We designed and implemented a semisupervised clustering-based stream
classification algorithm to solve this limited labeled data problem. We tested our technique on a
synthetically generated dataset and a real botnet dataset, and obtained better classification accura-
cies than other stream classification techniques. Chapter 13 discussed our findings and provided
directions for further work in stream data analytics in general and stream data classification in par-
ticular. We need to enhance the algorithms by providing greater accuracy and fewer false positives
and negatives. Furthermore, we need to enhance the performance of the algorithms for handling
massive amounts of stream data. Toward this end, we believe that a cloud-based implementation is
a viable approach for high performance data analytics.
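To illustrate the chunk-based ensemble idea summarized above at a very high level (a generic sketch, not the MPC or ECSMiner algorithms themselves; the classifier, chunk size, and ensemble size are arbitrary choices), each labeled chunk trains a new model, and only the models that remain accurate on the newest chunk are retained.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

ENSEMBLE_SIZE = 5
ensemble = []   # list of (model, score-on-latest-chunk) pairs

def process_chunk(X_chunk, y_chunk):
    """Train a model on the new chunk, re-score existing models on it, and keep
    the best ENSEMBLE_SIZE models so the ensemble tracks concept drift."""
    global ensemble
    new_model = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
    candidates = [(m, m.score(X_chunk, y_chunk)) for m, _ in ensemble]
    candidates.append((new_model, new_model.score(X_chunk, y_chunk)))
    ensemble = sorted(candidates, key=lambda ms: ms[1], reverse=True)[:ENSEMBLE_SIZE]

def predict(X):
    """Majority vote over the current ensemble (binary 0/1 labels assumed)."""
    votes = np.array([m.predict(X) for m, _ in ensemble])
    return np.round(votes.mean(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)       # one synthetic toy chunk
process_chunk(X, y)
print(predict(X[:5]))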
Section III, consisting of nine chapters, 14 through 22, discussed stream data analytics for insider
threat detection. Chapter 14 defined insider threat detection as a stream mining problem and
proposed two methods (supervised and unsupervised learning) for efficiently detecting anoma-
lies in stream data. To cope with concept evolution, our supervised approach maintains an evolv-
ing ensemble of multiple one-class support vector machine (OCSVM) models. Our unsupervised
approach combines multiple graph-based anomaly detection (GBAD) models in an ensemble of
classifiers. The ensemble updating process is designed in both cases to keep the ensemble current
as the stream evolves. This evolutionary capability improves the classifier’s survival of concept
drift as the behavior of both legitimate and illegitimate agents varies over time. In the experi-
ments, we use test data that records system call data for a large, UNIX-based, multiuser system.
Chapter 15 discussed aspects of stream mining as well as applying stream mining for massive data.
We argued that many of the learning techniques that have been proposed in the literature do not
handle data streams. As a result, these techniques do not address the evolving nature of streams.
Our goal was to adapt SVM techniques for data streams so that such techniques can be used to
handle the insider threat problem. Chapter 16 discussed ensemble-based learning for insider threat
detection. In particular, we described techniques for both supervised and unsupervised learning
and discussed the issues involved. We believe that ensemble-based approaches are suited for data
streams as they are unbounded. Chapter 17 described the different classes of learning techniques
for nonsequence data. It described exactly how each method arrives at detecting insider threats
and how ensemble models are built, modified, and discarded. First, we discussed supervised learn-
ing in detail and then unsupervised learning. Chapter 18 discussed our testing methodology and
experimental results for mining data streams consisting of nonsequence data. In particular, the
datasets used, our experimental setup, and the results were discussed. We examined various aspects
such as false positives, false negatives, and accuracy. Our results indicate that supervised learning
yields better results for certain datasets. However, we need to carry out more extensive experiments
for a variety of datasets. Nevertheless, our work has given guidance to experimentation for insider
threat detection. Chapter 19 discussed sequenced data. We argued that insider threat detection-
related sequence data is stream-based in nature. Sequence data may be gathered over time, maybe
even years. We assumed a continuous data stream will be converted into a number of chunks. For
example, each chunk may represent a week and contain the sequence data which arrived during that
time period. We then described various techniques, both supervised and unsupervised, for mining
data streams for sequence data. Chapter 20 discussed our testing methodology and experimental
results for mining data streams consisting of sequence data. In particular, the datasets used, our experimental setup, and the results were discussed. We examined various aspects such as false positives,
false negatives, and accuracy. We also explained the results obtained. In Chapter 21, we discussed
the scalability of our techniques and the issues in designing big data analytics techniques for insider
threat detection. Finally, Chapter 22 discussed future directions for using stream data analytics for
insider threat detection.
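As a hedged illustration of the one-class ensemble idea from Chapter 14 (scikit-learn's OneClassSVM is used here as a stand-in; the parameters, chunk sizes, and data are illustrative and not those of our experiments), each chunk of observed behavior yields one model, and a test instance is declared anomalous by majority vote.

import numpy as np
from sklearn.svm import OneClassSVM

K = 4
models = []   # sliding ensemble of one-class models, one per recent chunk

def update_ensemble(chunk_of_behavior):
    """Fit a one-class model on the newest chunk and drop the oldest model,
    so the ensemble follows drifting notions of 'normal' behavior."""
    model = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale")
    model.fit(chunk_of_behavior)
    models.append(model)
    if len(models) > K:
        models.pop(0)

def is_anomaly(x):
    """Majority vote: OneClassSVM.predict returns -1 for outliers."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return votes.count(-1) > len(votes) / 2

rng = np.random.default_rng(1)
for _ in range(K):                                   # four chunks of benign activity
    update_ensemble(rng.normal(0.0, 1.0, size=(100, 3)))
print(is_anomaly(np.array([8.0, 8.0, 8.0])))         # far from the norm, so very likely True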
Section IV, consisting of six chapters, 23 through 28, described some of the experimental sys-
tems we have developed. These systems have also been discussed in our previous books. However,
in this discussion, we also focused on the BDMA and BDSP techniques that we used to design
these systems. Chapter 23 presented a framework capable of handling enormous amounts of RDF
data that can be used to represent big data systems such as social networks. Since our framework
is based on Hadoop, which is a distributed and highly fault-tolerant system, it inherits these two
properties automatically. The framework is highly scalable. To increase the capacity of our system,
all that needs to be done is to add new nodes to the Hadoop cluster. We proposed a schema to store
RDF data, an algorithm to determine a query processing plan, whose worst case is bounded, to
answer a SPARQL query and a simplified cost model to be used by the algorithm. Our experiments
demonstrated that our system is highly scalable. Chapter 24 described the design of InXite which
is a social media system. InXite will be a great asset to the analysts who have to deal with massive
amounts of data streams in the form of billions of blogs, messages, and other postings. For example,
by analyzing the behavioral history of a particular group of individuals as well as details of concepts
such as events, analysts will be able to predict behavioral changes in the near future and take neces-
sary measures. We have also discussed our use of cloud computing in the implementation of InXite
as well as provided an overview of the use of big data technologies to improve the performance of
InXite. Chapter 25 described our design and implementation of a cloud-based information sharing
system called CAISS. CAISS utilizes several of the technologies we have developed as well as open
source tools. We also described the design of an ideal cloud-based assured information sharing
system called CAISS++. Chapter 26 described techniques to protect data by encrypting it before
storing on cloud computing servers like Amazon S3. We proposed to use two key servers to gener-
ate and store the keys. Also, our approach provides more security than some of the other known approaches
as we do not store the actual key used to encrypt the data. We also discussed our implementation
that was based on a semantic web-based framework. Chapter 27 discussed big data techniques for
malware detection. We formulated both malicious code detection and botnet traffic detection as
stream data analytics problems, and introduced extended MPC (EMPC), a novel ensemble learning
technique for automated classification of infinite-length, concept-drifting streams. Applying EMPC
to real data streams obtained from polymorphic malware and botnet traffic samples yielded better
detection accuracies than other stream data classification techniques. Finally, Chapter 28 described
a first-of-its-kind inference controller that will control certain unauthorized inferences in a data
management system. Our approach is powerful due to the fact that we have applied semantic web
technologies for both policy representation and reasoning. Furthermore, we have described infer-
ence control for provenance data represented as RDF graphs. We also discussed the use of big data
techniques for inference control.
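To make the RDF representation and querying discussed for Chapter 23 concrete at a toy scale (this sketch uses the rdflib Python library on an in-memory graph, not our Hadoop-based engine; the URIs and predicate are made up), communication links are stored as triples and retrieved with a SPARQL query.

from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/> .
ex:alice ex:communicatesWith ex:bob .
ex:alice ex:communicatesWith ex:carol .
ex:bob   ex:communicatesWith ex:carol .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# Who does alice communicate with? (A toy stand-in for querying a communication graph.)
query = """
PREFIX ex: <http://example.org/>
SELECT ?other WHERE { ex:alice ex:communicatesWith ?other . }
"""
for row in g.query(query):
    print(row.other)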
Section V, consisting of seven chapters, 29 through 35, discussed aspects of big data analytics,
security, and privacy together with a presentation of some of our exploratory systems. Chapter 29
provided an overview of security, privacy, and trust considerations with respect to inference. We first
discussed a framework for enforcing confidentiality, privacy, and trust for the semantic web. Next,
we described our approach to confidentiality and inference control as well as privacy for the seman-
tic web. This was followed by a discussion of trust management as well as an integrated framework
for confidentiality, privacy, and trust (CPT). Finally, we discussed how we can adapt the CPT frame-
work for big data systems such as social media data systems. Chapter 30 essentially integrated much
of the design and implementation of the various systems that we have developed and described a
unifying framework for big data systems. The framework includes components both for access con-
trol and inference control and also information sharing control. Our framework can also include the
modules for risk and game theoretic approaches for access and inference control as well as integrity
management modules. We discussed the modules of the framework as well as developing a global
inference controller. Chapter 31 discussed the need for IoT security and provided some use cases.
Next, we described a layered architecture for secure IoT. We argued that data security, privacy,
and analytics are at the heart of IoT security because data is being collected from the various IoT
devices. This data has to be secured. In addition, the threat/attack data has to be analyzed to deter-
mine anomalies. Also, the privacy of the individuals using IoT systems has to be ensured. Chapter
32 discussed malware detection in one type of IoT system, namely, connected smartphones. We
discussed the security challenges for smartphones and then discussed our approach to malware
detection for smartphones based on the Android operating system. Next, we discussed our virtual
laboratory for experimentation. The technologies include Hadoop/MapReduce as well as Storm
and Spark for designing scalable systems. We also discussed the course modules we are introduc-
ing for the security of smartphones. These include mobile system forensics as well as mobile data
management security. Chapter 33 discussed a planned case study in the healthcare domain that
illustrated how the various techniques in big data management, analytics, security, and privacy
could be applied. In particular, we focused on methodologies as well as designs of a framework for
the proposed case study. Finally, Chapter 34 discussed a planned education curriculum that will
be integrated with an experimental infrastructure both for BDMA and BDSP. Finally, Chapter 35
provided a summary of the discussions at the National Science Foundation Workshop on BDSP.
In particular, we explored the issues surrounding BDSP as well as applying BDMA techniques for
cyber security. We argued that as massive amounts of data are being collected, stored, manipulated,
merged, analyzed, and expunged, security and privacy concerns will explode. Therefore, we need
to develop technologies to address security and privacy issues throughout the lifecycle of the data.
This book has two appendices. In Appendix A, we provide the broad picture for data manage-
ment and discuss how all the books we have written fit together. In Appendix B, we discuss database
management systems. Much of the work discussed in this book has evolved from the concepts and
technologies discussed in Appendix B.
FIGURE 36.2 Directions and challenges in big data management and analytics and big data security and privacy. For BDSP, the directions include privacy-enhanced systems, integrity and trust management, and scalability; for BDMA, they include novel machine learning models, data analytics for cyber security, and scalability.
In addition, we need to explore areas such as identity management, handling identity theft, as well as auditing and forensics for BDMA systems.
Big data analytics for cyber security applications such as malware detection has emerged as a major research direction. The challenge here is to design scalable data analytics (essentially machine learning) techniques to detect attacks on BDMA systems. Because BDMA systems such as Facebook have close to a billion active daily users, we need big data analytics techniques to monitor the activity and determine suspicious behavior. Furthermore, social media data, which is essentially a type of big data that includes not only text but also photos, images, video, audio, and animation, has to be secured. That is, we need appropriate policies for such data. Finally, the privacy of the individuals using big data systems has to be ensured.
Appendix A
exabyte-sized databases, which is now called the "big data problem." Furthermore, while machine-learning techniques used to be applied to what were then called "toy problems," they are now used on "real-world problems." This is because of the tremendous advances in hardware, advances in our understanding of data, and more sophisticated learning techniques, especially in the field of what is called "deep learning." Therefore, as long as organizations (commercial, academic, and government) collect and analyze data, big data analytics is here to stay. It also means we have to ensure that security and privacy policies are enforced at all stages of the data lifecycle.
We have addressed some aspects of the challenges in this book.
The organization of this appendix is as follows. Since database systems are a key component of
data management systems, we first provide an overview of the developments in database systems.
These developments are discussed in Section A.2. Then we provide a vision for data management
systems in Section A.3. Our framework for data management systems is discussed in Section A.4.
Note that data mining, warehousing, as well as web data management are components of this frame-
work. Building information systems from our framework with special instantiations is discussed in
Section A.5. It should be noted that Sections A.2 through A.5 have been taken from our first
book and duplicated in each of our subsequent books. The relationship between the various texts
that we have written for CRC Press is discussed in Section A.6. This appendix is summarized in
Section A.7.
Well-known relational database system products include those of Oracle Corporation, Sybase Inc., Informix Corporation, INGRES Corporation, IBM, Digital Equipment Corporation, and Hewlett Packard Company. During the 1990s, products from
other vendors emerged (e.g., Microsoft Corporation). In fact, to date numerous relational database
system products have been marketed. However, Codd has stated that many of the systems that are
being marketed as relational systems are not really relational (see, e.g., the discussion in [DATE90]).
He then discussed various criteria that a system must satisfy to be qualified as a relational database
system. While the early work focused on issues such as data model, normalization theory, query
processing and optimization strategies, query languages, and access strategies and indexes, later the
focus shifted toward supporting a multiuser environment. In particular, concurrency control and
recovery techniques were developed. Support for transaction processing was also provided.
Research on relational database systems as well as on transaction management was followed
by research on distributed database systems around the mid-1970s. Several distributed database
system prototype development efforts also began around the late 1970s. Notable among these
efforts include IBM’s System R*, DDTS (Distributed Database Testbed System) by Honeywell Inc.,
SDD-1 and Multibase by CCA (Computer Corporation of America), and Mermaid by SDC (System
Development Corporation). Furthermore, many of these systems (e.g., DDTS, Multibase, Mermaid)
function in a heterogeneous environment. During the early 1990s, several database system vendors
(such as Oracle Corporation, Sybase Inc., Informix Corporation) provided data distribution capa-
bilities for their systems. Most of the distributed relational database system products are based on
client–server architectures. The idea is to have the client of vendor A communicate with the server
database system of vendor B. In other words, the client–server computing paradigm facilitates a
heterogeneous computing environment. Interoperability between relational and nonrelational com-
mercial database systems is also possible. The database systems community is also involved in stan-
dardization efforts. Notable among the standardization efforts are the ANSI/SPARC 3-level schema
architecture, the IRDS (Information Resource Dictionary System) standard for Data Dictionary
Systems, the relational query language SQL (Structured Query Language), and the RDA (Remote
Database Access) protocol for remote database access.
Another significant development in database technology is the advent of object-oriented database
management systems. Active work on developing such systems began in the mid-1980s and they
are now commercially available (notable among them include the products of Object Design, Inc.,
Ontos, Inc., Gemstone Systems, Inc., and Versant Object Technology). It was felt that new genera-
tion applications such as multimedia, office information systems, CAD/CAM, process control, and
software engineering have different requirements. Such applications utilize complex data struc-
tures. Tighter integration between the programming language and the data model is also desired.
Object-oriented database systems satisfy most of the requirements of these new generation applica-
tions [CATT91].
According to the Lagunita report published as a result of a National Science Foundation (NSF)
workshop in 1990 (see [SILB90] and [KIM90]), relational database systems, transaction process-
ing, and distributed (relational) database systems are stated as mature technologies. Furthermore,
vendors are marketing object-oriented database systems and demonstrating the interoperability
between different database systems. The report goes on to state that as applications are getting
increasingly complex, more sophisticated database systems are needed. Furthermore, since many
organizations now use database systems, in many cases of different types, the database systems
need to be integrated. Although work has begun to address these issues and commercial products
are available, several issues still need to be resolved. Therefore, challenges faced by the database
systems researchers in the early 1990s were in two areas. One was next-generation database systems
and the other was heterogeneous database systems.
Next-generation database systems include object-oriented database systems, functional database
systems, special parallel architectures to enhance the performance of database system functions,
high-performance database systems, real-time database systems, scientific database systems, tem-
poral database systems, database systems that handle incomplete and uncertain information and
intelligent database systems (also sometimes called logic or deductive database systems). Ideally,
a database system should provide the support for high-performance transaction processing, model
complex applications, represent new kinds of data, and make intelligent deductions. While signifi-
cant progress has been made during the late 1980s and early 1990s, there is much to be done before
such a database system can be developed.
Heterogeneous database systems have been receiving considerable attention during the past
decade [MARC90]. The major issues include handling different data models, different query
processing strategies, different transaction processing algorithms, and different query languages.
Should a uniform view be provided to the entire system or should the users of the individual systems
maintain their own views of the entire system? These are questions that have yet to be answered
satisfactorily. It is also envisaged that a complete solution to heterogeneous database management
systems is a generation away. While research should be directed toward finding such a solution,
work should also be carried out to handle limited forms of heterogeneity to satisfy the customer
needs. Another type of database system that received some attention is a federated database system.
Note that some have used the terms heterogeneous database system and federated database system
interchangeably. While heterogeneous database systems can be part of a federation, a federation can
also include homogeneous database systems.
The explosion of users on the web as well as developments in interface technologies has resulted
in even more challenges for data management researchers. A second workshop was sponsored by
NSF in 1995, and several emerging technologies were identified to be important as we entered
into the twenty-first century [WIDO96]. These include digital libraries, managing very large data-
bases, data administration issues, multimedia databases, data warehousing, data mining, data man-
agement for collaborative computing environments, and security and privacy. Another significant
development in the 1990s is the development of object-relational systems. Such systems combine the
advantages of both object-oriented database systems and relational database systems. Also, many
corporations are now focusing on integrating their data management products with web technolo-
gies. Finally, for many organizations there is an increasing need to migrate some of the legacy data-
bases and applications to newer architectures and systems such as client–server architectures and
relational database systems. We believe there is no end to data management systems. As new tech-
nologies are developed, there are new opportunities for data management research and development.
A comprehensive view of all data management technologies is illustrated in Figure A.2. As
shown, traditional technologies include database design, transaction processing, and benchmarking.
Then there are database systems based on data models such as relational and object-oriented models. Database systems may also be characterized by the features they provide, such as security and real-time processing. These data-
base systems may be relational or object oriented. There are also database systems based on mul-
tiple sites or processors such as distributed and heterogeneous database systems, parallel systems,
and systems being migrated. Finally, there are the emerging technologies such as data warehousing
and mining, collaboration, and the web. Any comprehensive text on data management systems
should address all of these technologies. We have selected some of the relevant technologies and put
them in a framework. This framework is described in Section A.5.
The framework is a layered framework, as illustrated in Figure A.5, where each layer utilizes the technologies provided by the lower layer. Layer I is the database technology and distribution
layer. This layer consists of database systems and distributed database systems technologies. Layer
II is the interoperability and migration layer. This layer consists of technologies such as heteroge-
neous database integration, client/server databases, and multimedia database systems to handle
heterogeneous data types and migrating legacy databases. Layer III is the information extraction
and sharing layer. This layer essentially consists of technologies for some of the newer services
supported by data management systems. These include data warehousing, data mining [THUR98],
web databases, and database support for collaborative applications. Data management systems may
utilize lower level technologies such as networking, distributed processing, and mass storage. We
have grouped these technologies into a layer called the supporting technologies layer. This support-
ing layer does not belong to the data management systems framework. This supporting layer also
consists of some higher-level technologies such as distributed object management and agents. Also,
shown in Figure A.5 is the application technologies layer. Systems such as collaborative comput-
ing systems and knowledge-based systems which belong to this layer may utilize data management
systems. Note that the application technologies layer is also outside of the data management systems
framework.
The technologies that constitute the data management systems framework can be regarded to be
some of the core technologies in data management. However, features like security, integrity, real-
time processing, fault tolerance, and high-performance computing are needed for many applications
utilizing data management technologies, such as medical, financial, or military applications. We
illustrate this in Figure A.6, where a three-dimensional view relating data management technologies
with features and applications is given. For example, one could develop a secure distributed data-
base management system for medical applications or a fault-tolerant multimedia database manage-
ment system for financial applications.
Integrating the components belonging to the various layers is important to developing efficient
data management systems. In addition, data management technologies have to be integrated with
the application technologies to develop successful information systems. However, at present, there is
limited integration between these various components. Our books have addressed concepts related
to the various layers of this framework.
Note that security cuts across all the layers. Security is needed for the supporting layers such as
agents and distributed systems. Security is needed for all of the layers in the framework including
database security, distributed database security, warehousing security, web database security, and
collaborative data management security.
Application technologies include collaboration and teleconferencing. These application technologies will make use of data
management technologies such as distributed database systems and multimedia database systems.
That is, one may need to support multimedia data such as audio and video. The data management
technologies in turn draw upon lower level technologies such as distributed processing and network-
ing. We illustrate this in Figure A.9.
In summary, information systems include data management systems as well as application layer
systems such as collaborative computing systems and supporting layer systems such as distributed
object management systems.
While application technologies make use of data management technologies and data management
technologies make use of supporting technologies, the ultimate user of the information system is the
application itself. Today numerous applications make use of information systems. These applications
are from multiple domains such as medical, financial, manufacturing, telecommunications, and
defense. Specific applications include signal processing, electronic commerce, patient monitoring,
and situation assessment. Figure A.10 illustrates the relationship between the application and the
information system. The evolution from data to big data is illustrated in Figure A.11.
FIGURE A.11 The evolution of data management: 1970–2000 (relational, object-oriented, distributed, federated, real-time, deductive, warehousing, mining, and security); 2000–2010; and 2010–present (big data management and analytics for zettabyte-sized, complex, heterogeneous, and social media data).
Our first series of books consists of Data Management Systems: Evolution and Interoperation [THUR97], Data Mining: Technologies,
Techniques, Tools and Trends [THUR98], Web Data Management and Electronic Commerce
[THUR00], Managing and Mining Multimedia Databases for the Electronic Enterprise [THUR01],
XML, Databases and The Semantic Web [THUR02], Web Data Mining and Applications in Business
Intelligence and Counter-terrorism [THUR03], Database and Applications Security: Integrating
Data Management and Information Security [THUR05], Building Trustworthy Semantic Webs [THUR07], and Secure Semantic Service-Oriented Systems [THUR10]. Our last book in this series, Developing and Securing the Cloud [THUR14], has evolved from our previous book Secure Semantic Service-Oriented Systems. All of these books have evolved from the framework that we illustrated in this appendix and address different parts of the framework. The connection between these texts is illustrated in Figure A.12.
We have published four books in the second series. The first is titled Design and Implementation of Data Mining Tools [AWAD09] and the second is titled Data Mining Tools for Malware Detection [MASU11]. Our book Secure Data Provenance and Inference Control with Semantic Web [THUR15] was the third in this series. Our book Analyzing and Securing Social Networks [THUR16] is the fourth in this series. Our current book, Big Data Analytics with Applications in Insider Threat Detection [THUR17], is the fifth in this series. The relationship between these books as well as with our previous books is illustrated in Figure A.13.
REFERENCES
[AWAD09]. M. Awad, L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools,
CRC Press, Boca Raton, FL, 2009.
[CATT91]. R. Cattell, Object Data Management Systems, Addison- Wesley, MA, 1991.
[CODD70]. E.F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the
ACM, 13 (#6), 377–387, June 1970.
[DATE90]. C.J. Date, An Introduction to Database Management Systems, Addison-Wesley, MA, 1990 (6th
edition published in 1995 by Addison-Wesley).
[KIM90]. W. Kim, editor, “Directions for future database research & development,” ACM SIGMOD Record, 19
(4), December 1990.
[MARC90]. S.T. March, editor, Special Issue on Heterogeneous Database Systems, ACM Computing Surveys,
22 (3), September 1990.
[MASU11]. M. Masud, B. Thuraisingham, L. Khan, Data Mining Tools for Malware Detection, CRC Press,
Boca Raton, FL, 2011.
[SILB90]. A. Silberschatz, M. Stonebraker, J.D. Ullman, editors, "Database systems: Achievements and
Opportunities,” The “Lagunita” Report of the NSF Invitational Workshop on the Future of Database
Systems Research, February 22–23, Palo Alto, CA, 1990 (TR-90-22), Department of Computer Sciences,
University of Texas at Austin, Austin, TX. (also in ACM SIGMOD Record, December 1990).
[THUR97]. B. Thuraisingham, Data Management Systems: Evolution and Interoperation, CRC Press, Boca
Raton, FL, 1997.
[THUR98]. B. Thuraisingham, Data Mining: Technologies, Techniques, Tools and Trends, CRC Press, Boca
Raton, FL, 1998.
[THUR00]. B. Thuraisingham, Web Data Management and Electronic Commerce, CRC Press, Boca Raton, FL,
2000.
[THUR01]. B. Thuraisingham, Managing and Mining Multimedia Databases for the Electronic Enterprise,
CRC Press, Boca Raton, FL, 2001.
[THUR02]. B. Thuraisingham, XML, Databases and The Semantic Web, CRC Press, Boca Raton, FL, 2002.
[THUR03]. B. Thuraisingham, Web Data Mining Applications in Business Intelligence and Counter-Terrorism,
CRC Press, Boca Raton, FL, 2003.
[THUR05]. B. Thuraisingham, Database and Applications Security: Integrating Data Management and
Information Security, CRC Press, Boca Raton, FL, 2005.
[THUR07]. B. Thuraisingham, Building Trustworthy Semantic Webs, CRC Press, Boca Raton, FL, 2007.
[THUR10]. B. Thuraisingham, Secure Semantic Service-Oriented Systems, CRC Press, Boca Raton, FL, 2010.
[THUR14]. B. Thuraisingham, Developing and Securing the Cloud, CRC Press, Boca Raton, FL, 2013.
[THUR15]. B. Thuraisingham, Tyrone Cadenhead, Murat Kantarcioglu and Vaibhav Khadilkar, Secure Data
Provenance and Inference Control with Semantic Web, CRC Press, Boca Raton, FL, 2014.
[THUR16]. B. Thuraisingham, S. Abrol, R. Heatherly, M. Kantarcioglu, V. Khadilkar, L. Khan, Analyzing and
Securing Social Networks, CRC Press, Boca Raton, FL, 2016.
[THUR17]. B. Thuraisingham et al., Big Data Analytics with Applications in Insider Threat Detection, CRC
Press, Boca Raton, FL, 2017.
[WIDO96]. J. Widom, editor, In Proceedings of the Database Systems Workshop, Report published by the National Science Foundation, 1995 (also in ACM SIGMOD Record, March 1996, Vol 25 (1), Database Research: Achievements and Opportunities into the 21st Century).
Appendix B: Database
Management Systems
B.1 OVERVIEW
Database systems technology has advanced a great deal during the past five decades from the legacy
systems based on network and hierarchical models to relational database systems to object databases
and more recently big data management systems. We consider a database system to include both the
database management system (DBMS) and the database (see also the discussion in [DATE90]). The
DBMS component of the database system manages the database. The database contains persistent
data. That is, the data are permanent even if the application programs go away.
We have discussed database systems in this appendix as they are at the heart of big data technolo-
gies. For example, the supporting technologies discussed in Part I of this book have their roots in
database systems (e.g., data mining, data security, big data management). Also, big data manage-
ment systems have evolved from database query processing and transaction management that were
initially developed in the 1970s. Furthermore, some of the experimental systems we have discussed
in this book such as cloud-centric assured information sharing have evolved from the concepts in
federated database systems. Therefore, an understanding of database systems is essential to master
the concepts discussed in this book.
The organization of this appendix is as follows. In Section B.2, relational data models as well as
entity-relationship models are discussed. In Section B.3, various types of architectures for data-
base systems are described. These include architecture for a centralized database system, schema
architecture, as well as functional architecture. Database design issues are discussed in Section
B.4. Database administration issues are discussed in Section B.5. Database system functions are
discussed in Section B.6. These functions include query processing, transaction management, meta-
data management, storage management, maintaining integrity and security, and fault tolerance.
Distributed database systems are the subject of Section B.7. Heterogeneous database integration
aspects are summarized in Section B.8. Object models are discussed in Section B.9. Other types of
database systems and their relevance to BDMA are discussed in Section B.10. The appendix is sum-
marized in Section B.11. More details on database systems can be found in [THUR97].
In the entity-relationship model, consider a relationship WORKS between the entities EMP and DEPT. If it is assumed that each employee works in exactly one department and each department has exactly one employee, then WORKS is a one–one relationship. If it is assumed that an employee works in one department and each depart-
ment can have many employees, then WORKS is a many–one relationship. If it is assumed that an
employee works in many departments, and each department has many employees, then WORKS is
a many–many relationship.
Several extensions to the entity-relationship model have been proposed. One is the entity-
relationship-attribute model where attributes are associated with entities as well as relationships,
and another has introduced the notion of categories into the model (see, e.g., the discussion in
[ELMA85]). It should be noted that ER models are used mainly to design databases. That is, many
database CASE (computer-aided software engineering) tools are based on the ER model, where the
application is represented using such a model and subsequently the database (possibly relational) is
generated. Current database management systems are not based on the ER model. That is, unlike the
relational model, ER models did not take off in the development of database management systems.
A DBMS consists of modules such as a query processor, a transaction processor, a metadata manager, a storage manager, an integrity manager, and a security manager. This architecture may also be extended with additional layers, for example, to process rules, to handle multimedia data types, or even to do mining. Such an extensible architecture is illustrated in Figure B.6.
The schema architecture of a database system consists of an external schema, a conceptual schema, and an internal schema, with mappings between the levels. Database design typically proceeds by representing the application using the ER model, generating relations, and then normalizing the relations. Database administration functions include data modeling and schema design; auditing, backup, and recovery; and the selection of access methods and index strategies.
A common heuristic in query optimization is to push selections and projections down in the query tree as much as possible. If selections and projections are performed before the joins,
then the cost of the joins can be reduced by a considerable amount.
Figure B.9 illustrates the modules in query processing. The user-interface manager accepts que-
ries, parses the queries, and then gives them to the query transformer. The query transformer and
query optimizer communicate with each other to produce an execution strategy. The database is
accessed through the storage manager. The response manager gives responses to the user.
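The benefit of pushing selections down the query tree can be seen in a toy example (pure Python over hypothetical EMP and DEPT relations): filtering EMP before the join gives the same answer while the join touches far fewer rows.

# Toy relations: EMP(ss, ename, salary, d_no) and DEPT(d_no, dname, mgr).
EMP = [(i, f"emp{i}", 30 + i % 40, i % 10) for i in range(1000)]
DEPT = [(d, f"dept{d}", f"mgr{d}") for d in range(10)]

def join(emp_rows, dept_rows):
    return [(e, d) for e in emp_rows for d in dept_rows if e[3] == d[0]]

def high_paid(emp_rows):
    return [e for e in emp_rows if e[2] > 65]

# Plan A: join first, then select (the join examines 1000 x 10 row pairs).
plan_a = [(e, d) for e, d in join(EMP, DEPT) if e[2] > 65]

# Plan B: push the selection below the join, so the join sees only the qualifying rows.
plan_b = join(high_paid(EMP), DEPT)

print(len(plan_a) == len(plan_b))                                   # same answer
print(len(high_paid(EMP)), "of", len(EMP), "EMP rows reach the join in plan B")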
B.6.3 Transaction Management
A transaction is a program unit that must be executed in its entirety or not executed at all. If trans-
actions are executed serially, then there is a performance bottleneck. Therefore, transactions are
executed concurrently. Appropriate techniques must ensure that the database is consistent when
multiple transactions update the database. That is, transactions must satisfy the ACID (Atomicity,
Consistency, Isolation, and Durability) properties. Major aspects of transaction management are
serializability, concurrency control, and recovery. We discuss them briefly in this section. For a
detailed discussion of transaction management, we refer to [KORT86] and [BERN87].
Serializability: A schedule is a sequence of operations performed by multiple transactions. Two
schedules are equivalent if their outcomes are the same. A serial schedule is a schedule where no
two transactions execute concurrently. An objective in transaction management is to ensure that any
schedule is equivalent to a serial schedule. Such a schedule is called a serializable schedule. Various
conditions for testing the serializability of a schedule have been formulated for a DBMS.
Concurrency Control: Concurrency control techniques ensure that the database is in a consistent
state when multiple transactions update the database. Three popular concurrency control techniques
which ensure the serializability of schedules are locking, time-stamping and validation (which is
also called optimistic concurrency control).
Recovery: If a transaction aborts due to some failure, then the database must be brought to a
consistent state. This is transaction recovery. One solution to handling transaction failure is to
maintain log files. The transaction’s actions are recorded in the log file. So, if a transaction aborts,
then the database is brought back to a consistent state by undoing the actions of the transaction.
The information for the undo operation is found in the log file. Another solution is to record the
actions of a transaction but not make any changes to the database. Only if a transaction com-
mits should the database be updated. This means that the log files have to be kept in stable stor-
age. Various modifications to the above techniques have been proposed to handle the different
situations.
When transactions are executed at multiple data sources, then a protocol called two-phase com-
mit is used to ensure that the multiple data sources are consistent. Figure B.10 illustrates the various
aspects of transaction management.
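The two-phase commit protocol mentioned above can be sketched as follows (a minimal, failure-free illustration in Python; a real implementation would also force log records to stable storage and handle timeouts and participant crashes).

class Participant:
    """One data source taking part in a distributed transaction."""
    def __init__(self, name):
        self.name = name
    def prepare(self, txn):
        # Phase 1 vote: in a real system the participant forces its log to disk
        # before voting "yes"; here every participant simply agrees.
        return True
    def commit(self, txn):
        print(f"{self.name}: commit {txn}")
    def abort(self, txn):
        print(f"{self.name}: abort {txn}")

def two_phase_commit(txn, participants):
    # Phase 1: the coordinator asks every participant to prepare (vote).
    votes = [p.prepare(txn) for p in participants]
    # Phase 2: commit everywhere only if every vote was "yes"; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit(txn)
        return "committed"
    for p in participants:
        p.abort(txn)
    return "aborted"

print(two_phase_commit("T1", [Participant("site_1"), Participant("site_2")]))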
B.6.4 Storage Management
The storage manager is responsible for accessing the database. To improve the efficiency of query
and update algorithms, appropriate access methods and index strategies have to be enforced. That
is, in generating strategies for executing query and update requests, the access methods and index
strategies that are used need to be taken into consideration. The access methods used to access the
database would depend on the indexing methods. Therefore, creating and maintaining an appropri-
ate index file is a major issue in database management systems. By using an appropriate indexing
mechanism, the query-processing algorithms may not have to search the entire database. Instead,
the data to be retrieved could be accessed directly. Consequently, the retrieval algorithms are more
efficient. Figure B.11 illustrates an example of an indexing strategy where the database is indexed
by projects.
Much research has been carried out on developing appropriate access methods and index strate-
gies for relational database systems. Some examples of index strategies are B-Trees and Hashing
[DATE90]. Current research is focusing on developing such mechanisms for object-oriented data-
base systems with support for multimedia data as well as for web database systems, among others.
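A simple in-memory hash index illustrates why index strategies matter: a lookup goes directly to the matching records instead of scanning the entire database (purely illustrative; disk-based systems use B-trees or hashed files over pages).

from collections import defaultdict

# Records of the form (employee_name, project); a scan would touch every record.
records = [("John", "Project A"), ("Mary", "Project B"),
           ("Jane", "Project A"), ("Paul", "Project C")]

# Build a hash index on the project attribute: project -> positions of matching records.
index = defaultdict(list)
for pos, (_, project) in enumerate(records):
    index[project].append(pos)

def lookup(project):
    """Fetch matching records directly via the index, without a full scan."""
    return [records[pos] for pos in index.get(project, [])]

print(lookup("Project A"))   # [('John', 'Project A'), ('Jane', 'Project A')]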
B.6.5 Metadata Management
Metadata describes the data in the database. For example, in the case of the relational database illus-
trated in Figure B.1, metadata would include the following information: the database has two rela-
tions, EMP and DEPT; EMP has four attributes and DEPT has three attributes, etc. One of the main
issues is developing a data model for metadata. In our example, one could use a relational model to
model the metadata also. The metadata relation REL shown in Figure B.12 consists of information
about relations and attributes.
In addition to information about the data in the database, metadata also includes information
on access methods, index strategies, security constraints, and integrity constraints. One could
also include policies and procedures as part of the metadata. In other words, there is no standard
definition for metadata. There are, however, efforts to standardize metadata (see, e.g., the IEEE
Mass Storage Committee efforts as well as the IEEE Conferences on Metadata [MASS]). Metadata
continues to evolve as database systems evolve into multimedia database systems and web data-
base systems.
Once the metadata is defined, the issues include managing the metadata. What are the techniques
for querying and updating the metadata? Since all of the other DBMS components need to access
the metadata for processing, what are the interfaces between the metadata manager and the other
components? Metadata management is fairly well understood for relational database systems. The
current challenge is in managing the metadata for more complex systems such as digital libraries
and web database systems.
B.6.6 Database Integrity
Concurrency control and recovery techniques maintain the integrity of the database. In addition,
there is another type of database integrity and that is enforcing integrity constraints. There are
two types of integrity constraints enforced in database systems. These are application indepen-
dent integrity constraints and application specific integrity constraints. Integrity mechanisms also
include techniques for determining the quality of the data. For example, what is the accuracy of the
data and that of the source? What are the mechanisms for maintaining the quality of the data? How
accurate is the data on output? For a discussion of integrity based on data quality, we refer to [DQ].
Note that data quality is very important for mining and warehousing. If the data that is mined is not
good, then one cannot rely on the results.
Application independent integrity constraints include the primary key constraint, the entity
integrity rule, referential integrity constraint, and the various functional dependencies involved in
the normalization process (see the discussion in [DATE90]). Application specific integrity con-
straints are those constraints that are specific to an application. Examples include “an employee’s
salary cannot decrease” and “no manager can manage more than two departments.” Various tech-
niques have been proposed to enforce application specific integrity constraints. For example, when
the database is updated, these constraints are checked and the data are validated. Aspects of data-
base integrity are illustrated in Figure B.13.
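Checking an application specific constraint at update time can be sketched as follows. The constraint "an employee's salary cannot decrease" is encoded as a validation function that is applied before the update takes effect; the function and table names are illustrative only.

class IntegrityViolation(Exception):
    pass

def salary_cannot_decrease(old_row, new_row):
    """Application specific constraint: an employee's salary cannot decrease."""
    return new_row["salary"] >= old_row["salary"]

CONSTRAINTS = [salary_cannot_decrease]

def update_employee(table, key, new_row):
    old_row = table[key]
    # Validate the update against every application specific constraint.
    for constraint in CONSTRAINTS:
        if not constraint(old_row, new_row):
            raise IntegrityViolation(constraint.__name__)
    table[key] = new_row

emp = {1: {"ename": "John", "salary": 60000}}
update_employee(emp, 1, {"ename": "John", "salary": 65000})      # accepted
try:
    update_employee(emp, 1, {"ename": "John", "salary": 50000})  # rejected
except IntegrityViolation as violation:
    print("update rejected:", violation)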
B.6.7 Fault Tolerance
The previous two sections discussed database integrity and security. A closely related feature is
fault tolerance. It is almost impossible to guarantee that the database will function as planned. In
reality, various faults could occur. These could be hardware faults or software faults. As mentioned
earlier, one of the major issues in transaction management is to ensure that the database is brought
back to a consistent state in the presence of faults. The solutions proposed include maintaining
appropriate log files to record the actions of a transaction in case its actions have to be retraced.
Another approach to handling faults is checkpointing. Various checkpoints are placed during the
course of database processing. At each checkpoint it is ensured that the database is in a consistent
state. Therefore, if a fault occurs during processing, then the database must be brought back to the
last checkpoint. This way it can be guaranteed that the database is consistent. Closely associated
with checkpointing are acceptance tests: after various processing steps, an acceptance test is applied,
and processing proceeds further only if the test passes; otherwise, the system rolls back to the last
checkpoint. Some aspects of fault tolerance are illustrated in Figure B.14.
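Checkpointing with acceptance tests can be sketched as follows; the example is purely illustrative, copying the database state at each checkpoint and rolling back to the last copy whenever the acceptance test fails.

import copy

def run_with_checkpoints(state, steps, acceptance_test):
    """Apply processing steps; keep each result only if it passes the
    acceptance test, otherwise roll back to the last checkpoint."""
    checkpoint = copy.deepcopy(state)            # Checkpoint A
    for step in steps:
        step(state)
        if acceptance_test(state):
            checkpoint = copy.deepcopy(state)    # advance to the next checkpoint
        else:
            state.clear()
            state.update(checkpoint)             # roll back to the last checkpoint
    return state

def good_step(s):
    s["balance"] = s["balance"] + 100

def faulty_step(s):
    s["balance"] = -1                            # leaves the database inconsistent

consistent = lambda s: s["balance"] >= 0
print(run_with_checkpoints({"balance": 0}, [good_step, faulty_step, good_step], consistent))
# {'balance': 200}: the faulty step was rolled back, the other steps were kept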
B.6.8 Other Functions
In this section we will briefly discuss some of the other functions of a database system. They are:
security, real-time processing, managing heterogeneous data types, auditing, view management, and
backup and recovery.
Security: Note that security is a critical function. Therefore, both discretionary security and
mandatory security will be discussed throughout this book.
Real-time processing: In some situations, the database system may have to meet real-time
constraints. That is, the transactions will have to meet deadlines.
Heterogeneous data types: The database system may have to manage multimedia data types
such as voice, video, text, and images.
Auditing: The databases may have to be audited so that unauthorized access can be monitored.
View management: As stated earlier, views are virtual relations created from base relations.
There are many challenges related to view management.
Backup and recovery: The DBA has to back up the databases and ensure that the database is
not corrupted. Some aspects were discussed under fault tolerance. More details are given
in [DATE90].
[Figure: The modules of the distributed processor (DP), including the DQP, DTM, DMM, DSM, and DIM, serving the global user over the network.]
The DIM is responsible for maintaining integrity at the global level. Note that the modules of DP
communicate with their peers at the remote nodes. For example, the DQP at node 1 communicates
with the DQP at node 2 for handling distributed queries.
[Figure B.18: A federated database system: cooperating database systems (A, B, and C) that maintain some degree of autonomy, grouped into federations F1 and F2.]
The development in heterogeneous data management was then extended into federated data man-
agement in the 1990s. As stated by Sheth and Larson [SHET90], a federated database system is a
collection of cooperating but autonomous database systems belonging to a federation. That is, the
goal is for the database management systems, which belong to a federation, to cooperate with one
another and yet maintain some degree of autonomy. Note that to be consistent with the terminology,
we distinguish between a federated database management system and a federated database system.
A federated database system includes the federated database management system, the local
DBMSs, and the databases. The federated database management system is the component that
manages the different databases in a federated environment.
Figure B.18 illustrates a federated database system in which database system A is an object database
system, database system B is a relational database system, and database system C is a legacy database
system. Database systems A and B belong to federation F1 while database systems B and C belong to
federation F2. We can use the architecture illustrated in Figure B.18 for a federated database system.
In addition to handling heterogeneity, the
HDP also has to handle the federated environment. That is, techniques have to be adapted to handle
cooperation and autonomy. We have called such an HDP an FDP (Federated Distributed Processor).
An architecture for an FDS is illustrated in Figure B.19.
Figure B.20 illustrates an example of an autonomous environment. There is communication
between components A and B and between B and C. Due to autonomy, it is assumed that compo-
nents A and C do not wish to communicate with each other. Now, component A may get requests
from its own user or from component B. In this case, it has to decide which request to honor first.
Also, there is a possibility for component C to get information from component A through compo-
nent B. In such a situation, component A may have to negotiate with component B before it gives a
reply to component B. The developments to deal with autonomy are still in the research stages. The
challenge is to handle transactions in an autonomous environment. Transitioning the research into
commercial products is also a challenge.
A key concept in object-oriented data modeling is encapsulation. That is, an object has well-
defined interfaces. The state of an object can only be accessed through interface procedures called
methods. For example, EMP may have a method called Increase-Salary. The code for Increase-
Salary is illustrated in Figure B.22. A message, say Increase-Salary(1, 10K), may be sent to the
object with object ID of 1. The object’s current salary is read and updated by 10K.
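A sketch of such an encapsulated object in Python is given below; it is only an illustration and is not the code shown in Figure B.22. The salary is hidden in the object's state and can be changed only through the Increase-Salary method.

class EMP:
    """An employee object whose state is accessed only through methods."""
    def __init__(self, object_id, ename, salary):
        self._object_id = object_id
        self._ename = ename
        self._salary = salary            # state hidden behind the interface

    def increase_salary(self, amount):
        # Read the current salary and update it by the given amount.
        self._salary = self._salary + amount
        return self._salary

emp1 = EMP(object_id=1, ename="John", salary=50000)
print(emp1.increase_salary(10000))       # the message Increase-Salary(1, 10K) -> 60000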
A second key concept in an object model is inheritance where a subclass inherits properties from
its parent class. This feature is illustrated in Figure B.22 where the EMP class has MGR (manager)
and ENG (engineer) as its subclasses. Other key concepts in an object model include polymorphism
and aggregation. These features are discussed in [BANE87]. Note that a second type of inheritance
is when the instances of a class inherit the properties of the class.
A third concept is polymorphism. This is the situation where one can pass different types of
arguments for the same function. For example, to calculate the area, one can pass a sphere or a
cylinder object. Operators can be overloaded also. That is, the add operation can be used to add two
integers or real numbers.
Another concept is the aggregate hierarchy also called the composite object or the is-part-of
hierarchy. In this case an object has component objects. For example, a book object has component
section objects. A section object has component paragraph objects. Aggregate hierarchy is illus-
trated in Figure B.23.
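Inheritance and the aggregate (is-part-of) hierarchy can be sketched together as follows; the class definitions are illustrative only. MGR and ENG inherit the properties of EMP, while a Book object is composed of Section objects, which in turn are composed of Paragraph objects.

class EMP:
    def __init__(self, ename, salary):
        self.ename, self.salary = ename, salary

class MGR(EMP):        # a manager is an employee (inheritance)
    def __init__(self, ename, salary, department):
        super().__init__(ename, salary)
        self.department = department

class ENG(EMP):        # an engineer is an employee (inheritance)
    pass

class Paragraph:
    def __init__(self, text):
        self.text = text

class Section:         # a section is part of a book (aggregation)
    def __init__(self, paragraphs):
        self.paragraphs = paragraphs

class Book:
    def __init__(self, sections):
        self.sections = sections

book = Book([Section([Paragraph("Databases store data."),
                      Paragraph("Objects encapsulate state.")])])
print(isinstance(MGR("Jane", 90000, "D1"), EMP))   # True: MGR inherits from EMP
print(len(book.sections[0].paragraphs))            # 2 component paragraph objects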
Objects also have relationships between them. For example, an employee object has an association
with the department object for the department in which the employee works. Also, the instance
variables of an object could take integers, lists, arrays, or even other objects as values. Many of these
concepts are discussed in the book by Cattell [CATT91]. The Object Data Management Group has
also proposed standards for object data models [ODMG93].
Relational database vendors are extending their system with support for objects. In one approach
the relational model is extended with an object layer. The object layer manages objects while the
relational database system manages the relations. Such systems are called extended relational data-
base systems. In another approach, the relational model has objects as its elements. Such a model
is called an object-relational data model and is illustrated in Figure B.24. A system based on the
object-relational data model is called an object-relational database system.
[Figure B.23: The aggregate hierarchy for a book object, whose components include an introduction, a set of sections, and references.]
[Figure B.24: An object-relational data model in which the elements of a relation may themselves be objects.]
These data models have also influenced OLAP (on-line analytical processing) models, which in turn
have influenced data mining systems.
Data warehousing is one of the key data management technologies to support data mining and
data analysis. As stated by Inmon [INMO93], data warehouses are subject oriented. Essentially
data warehouses carry out analytical processing for decision-support functions of an enterprise. For
example, while the data sources may have the raw data, the data warehouse may have correlated
data, summary reports, and aggregate functions applied to the raw data. Big data analytics has
evolved from such data warehouse systems.
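The difference between the raw data in the sources and the summarized, subject oriented data in the warehouse can be sketched with a toy example; the record layout below is hypothetical.

from collections import defaultdict

# Raw records as they might appear in an operational data source.
raw_sales = [
    {"region": "East", "amount": 120.0},
    {"region": "West", "amount": 75.5},
    {"region": "East", "amount": 200.0},
]

# The warehouse keeps aggregated summary data: total and count per region.
warehouse = defaultdict(lambda: {"total": 0.0, "count": 0})
for sale in raw_sales:
    summary = warehouse[sale["region"]]
    summary["total"] += sale["amount"]
    summary["count"] += 1

print(dict(warehouse))
# {'East': {'total': 320.0, 'count': 2}, 'West': {'total': 75.5, 'count': 1}}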
We have discussed only a sample of the database systems that have been developed over the past
40 years. The challenge is to develop data models, query and transaction processing techniques,
as well as security and integrity mechanisms, for database systems that manage zettabyte- and
exabyte-sized databases.
REFERENCES
[BANE87]. J. Banerjee et al. “A Data Model for Object-Oriented Applications,” ACM Transactions on Office
Information Systems, 5 (1), 3–26, 1987.
[BERN87]. P. Bernstein et al. Concurrency Control and Recovery in Database Systems. Addison-Wesley, MA,
1987.
[BUNE82]. P. Buneman et al. “An Implementation Technique for Database Query Languages,” ACM
Transactions on Database Systems, 7 (2), 164–180, 1982.
[CATT91]. R. Cattell, Object Data Management Systems. Addison-Wesley, MA, 1991.
[CERI84]. S. Ceri and G. Pelagatti, Distributed Databases: Principles and Systems. McGraw-Hill, NY, 1984.
[CHEN76]. P. Chen, “The Entity-Relationship Model—Toward a Unified View of Data,” ACM Transactions on
Database Systems, 1 (1), 9–36, 1976.
[CODD70]. E.F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the
ACM, 13 (6), 377–387, 1970.
[DATE90]. C. Date, An Introduction to Database Systems. Addison-Wesley, Reading, MA, 1990.
[DE]. Proceedings of the IEEE Data Engineering Conference Series, IEEE Computer Society Press, CA.
[DEWI90]. D.J. DeWitt et al. “The Gamma Database Machine Project,” IEEE Transactions on Knowledge and
Data Engineering, 2 (1), 44–62, 1990.
[DMH96]. B. Thuraisingham, editor. Data Management Handbook Supplement. Auerbach Publications, NY,
1996.
[DMH97]. B. Thuraisingham, editor. Data Management Handbook. Auerbach Publications, NY, 1997.
INDEX
Cloud computing, 51, 173, 237, 263, 307, 331, 332 Concept-adapting very fast decision tree learner
cloud storage and data management, 54–55 (CVFDT), 106
components, 52 Concept-drifting data streams, 127, 171
framework, 173 baseline approach, 142–143
frameworks based on semantic web technologies, classification with novel class detection, 133–141
63–65 datasets, 141–142
for malware detection, 341 datasets and experimental setup, 122
model, 51–52 ECSMiner, 127–133
preliminaries, 52–53 ensemble development, 115
secure, 454–455, 461 error reduction using MPC training, 116–121
technologies, 52 evaluation approach, 143
tools, 56–57 experiments, 121, 141, 142
virtualization, 53–54 MPC, 115–116
Cloud platforms, 83 performance study, 122–125, 143
Amazon Web Services’ DynamoDB, 83 results, 143–147
Google’s cloud-based big data solutions, 84 Concept-drifting synthetic dataset (SynD), 161
IBM’s cloud-based big data solutions, 84 Concept-evolving synthetic dataset (SynDE), 161, 166
Microsoft Azure’s Cosmos DB, 83–84 Concept drift, 93–95, 141, 160, 253, 340, 373, 410
Cloud query processing system for big data management issues, 416
approach, 264 in sequence stream, 238
architecture, 267–269 in stream data, 198, 218
cloud computing, 263 SynDE, 161
contributions, 265 synthetic data with, 99
evaluation, 280–281 in training set, 228–230
experimental setup, 264, 279–280 Concept evolution, 93, 95–97, 410
MapReduce framework, 269–278 synthetic data with, 99
related work, 265–267 Concept Instantiation, 60
results, 279 Concept satisfiability, 60
security extensions, 281–285 Concept subsumption, 60
Cloud services, 394–396 Concurrency control, 392, 513
for integrity management, 394 Confidentiality, 379
models, 54 approach to confidentiality management, 384–385
Clustering, 27, 132 Confidentiality inference engine (CIE), 382–383
algorithm, 39 Confidentiality, privacy, and trust (CPT), 379, 380,
cluster-impurity, 109 483, 489
techniques, 28 advanced, 382–383, 384
CM, see Compression method approach to confidentiality management, 384–385
CMRJ, see Conflicting MapReduceJoins big data systems, 379
CNSIL, see Computer Networks and Security within context of big data and social networks,
Instructional Lab 388–390
Collusion attack, 252 framework, 381
Command sequences (cseq), 244 integrated system, 387–388, 389
Communication privacy for social media systems, 385–387
data, 410 process, 382, 383
devices, 405 role of server, 381–382
energy-efficient, 410 trust for social networks, 387
small communication frames, 407 trust, privacy, and confidentiality, 379–381, 383–384
wireless communication networks, 404 Conflicting MapReduceJoins (CMRJ), 271
Community building, 472 Conflicts, 284–285
Complete elimination, 275 resolution, 17
Completely labeled training data, 94 Consistency and completeness of rules, 18
Complexity Constraints, 24
analysis, 224 constraint-based approaches, 109
of Bestplan, 276 Content-based access control, 478
of inference engine, 365 Content-based image retrieval (CBIR), 38
Compound impurity-measure, 109–110 Content-based score computation, 294
Compressed/quantized dictionary construction, 251–252 Control processing units, 405
Compression-based techniques, 203 Control systems, 405
Compression method (CM), 221 Conventional data mining, 476
Compression/quantization using MR, 243 Conventional relational database management system, 436
Computer attacks, 47 COPD, see Chronic obstructive pulmonary disease
Computer Corporation of America (CCA), 495 CoreNLP, 458
Computer Networks and Security Instructional Lab Cost estimation for query processing, 270–274
(CNSIL), 427 CouchDB, 56, 438, 490
False positive rates (FPR), 183, 230 GFS, see Google File System
False positives (FP), 186, 190, 197, 212, 230, 251 Gibbs sampling, 444
Farthest-first traversal heuristic, 155 Gini index, 132
Fast classification model, 174 Global big data security and privacy controller,
Fault 400–401
detection, 95 Global data-mining models, 408
fault-tolerant computing, 24 Global Database of Event, Language, and Tone
tolerance, 393, 516 (GDELT), 458
FDP, see Federated data processor geospatial data processing on, 458
Feature extraction, 341, 347 Google, 266
Feature selection, 341, 347 BigQuery, 79, 81
Feature weighting, 175 BigTable, 82
Federated data management, 518–520 Calendar, 405
Federated data processor (FDP), 519 cloud-based big data solutions, 84
Field actuation mechanisms, 404 Compute Engine, 409
File organization, 73, 268 Google+, 289
predicate object split, 74 Monkey tool, 423
predicate split, 73–74 Google File System (GFS), 82, 193, 438
Filtered outlier (F outliers), 97, 134–135 GPS-equipped vehicles techniques, 405
Firewalls, 407 Graph
First-order logic formulas and inference, 443 analysis, 70
First-order Markov model, 34 graph-based behavior analysis, 415–416
Five Vs, see Volume, velocity, variety, veracity, and value mining techniques, 69
FN, see False negatives rewriting, 361
Forecasting, 409 transformation, 361
Forest cover dataset, 100 Graph-based anomaly detection (GBAD), 183–184, 190,
from UCI repository, 142 197, 203–204, 251; see also Anomaly detection
Formal policy analysis, 321, 324 GBAD-MDL, 204
Forming associations, 27 GBAD-MPS, 205
Foursquare, 289 GBAD-P, 204–205
F outliers, see Filtered outlier models, 488
FP, see False positives Graphical models and rewriting, 361
FPR, see False positive rates Graphical user interface (GUI), 421
F pseudopoints, 135–136 GREE88 dataset, 227
Framework design, 437 Ground truth, 198, 199, 220
mixed continuous and discrete domains, 444–446 Guest machine, 54
offline scalable statistical analytics, 442–444 Guests, 54
privacy and security aware data management for GUI, see Graphical user interface
scientific data, 440–442
real-time stream analytics, 446–448 H
storing and retrieving multiple types of scientific data,
437–440 Hadoop, 193, 265, 463, 488
Framework integration, 320 cluster, 244
Frequency, 221 distributed system setup, 351
Frequent itemset graph, 36, 37 storage architecture, 312, 318, 325
“Friends-smokers” social network domain, 443, 444 Hadoop distributed file system (HDFS), 51, 70, 79, 173,
Functional architecture, 510 174, 184, 237, 265, 312, 322
Functional database systems, 522–523 Hadoop/MapReduce, 438
Functionality, 415 framework, 181, 345–347
Future system, 439–442, 444, 446 platform, 237–238, 490
online structure learning methods for stream technologies, 373
classification, 447–448 HAN, see Home area network
semisupervised classification/prediction, 446–447 HAQU13a approach, 193
HAQU13b approach, 193
G Hard subspace clustering, 71
Hardware, 279, 339
Gaussian distribution, 141, 163, 204 hardware-assisted security, 406
GBAD, see Graph-based anomaly detection hardware-level security, 406
GDELT, see Global Database of Event, Language, services, 52
and Tone virtualization, 54
Generating and populating knowledge base, 366 Hardware security modules (HSMs), 406
Generic problems, 456 HBase, 56, 436, 438, 490
Genetic algorithms, 109 HDFS, see Hadoop distributed file system
Geospatial data processing on GDELT, 458 HDP, see Heterogeneous data processor
INGRES, 15, 16, 494, 495 Intrusion detection systems (IDS), 27, 414
project at University of California at Berkeley, 23 InXite, 290, 291
Input events generation, 424 application of SNOD, 300
Input files selection, 270 cloud-based system, 289
Input method editor (IME), 424 cloud design of InXite to handle big data,
Insider threat detection, 51, 67–68, 189–191, 209, 251; 301–302
see also Malware detection; Security policies expert systems support, 300–301
additional experiments, 252 implementation, 302
anomaly detection in social network and author information engine, 291–293
attribution, 252–253 InXite-Law, 302
big data analytics for, 454 InXite-Marketing, 302
big data issues, 184 InXite-Security, 302
challenges, related work, and approach, 68–69 plug-and-play approach, 291
collusion attack, 252 threat detection and prediction, 298–300
comprehensive framework, 75–76 InXite POI
contributions, 185–186 analysis, 293–298
data mining, 68, 69, 74–75 profile generation and analysis, 293–294
data storage, 73–74 threat analysis, 294–296
feature extraction and compact representation, 70–72 IoT, see Internet of Things
GBAD, 183–184 IPAM, see Incremental probabilistic action modeling
incorporate user feedback, 252 IRDS, see Information Resource Dictionary System
RDF repository architecture, 72–73 IRM, see In-line reference monitor
for sequence data, 217–224 IT, see Information technology
sequence stream data, 184 Iterative conditional mode algorithm (ICM algorithm), 155
solution architecture, 69–70
stream data analytics applications for, 3–4 J
stream mining as big data mining problem, 253
as stream mining problem, 183, 184 Jena (Java application programming package), 266, 385
SVMs, 251 Job JB, 271
Insider threats, 43–44, 67, 197, 203 JobTracker, 79
analysis, 46 Joining variable, 275
Instrumental behavior analysis, 415
Integrated system, 387–388, 389 K
Integration framework, 310–311
Integrity, 380, 391–392 Kafka, 448
aspects, 392–393 KDD cup 1999 intrusion detection dataset (KDD99), 100,
for big data, 396 141–142, 160–161
constraints, 24, 393, 395 KEND98 dataset, 207
of data, 380 Keynote presentations, 473
management, 394–396 access control and privacy policy challenges in big
Intellidimension RDF Gateway, 385 data, 474
Intelligence Advanced Research Project Activity additional presentations, 474
(IARPA), 331 authenticity of digital images in social media, 473
Intelligent fuzzier for automatic android GUI application big data analytics, 473
testing, 423 business intelligence meets big data, 473
Intelligent transportation systems, 404 final thoughts, 474
Intel SGX, 463, 465 formal methods for preserving privacy while loading
Intel SGX-enabled machine, 461 big data, 473
SDK and SGX driver, 462 privacy in world of mobile devices, 474
Interface manager, 358 securing big data in cloud, 473
International Business Machine Corporation (IBM), 494 timely health indicators using remote sensing and
International Classification of Diseases (ICD), 439 innovation for validity of environment, 474
International Conference on Data Engineering toward privacy aware big data analytics, 473
(ICDE), 472 K-means clustering, 28
Internet of Things (IoT), 2, 377, 403–404, 433, 485 K-means clustering with cluster-impurity minimization
data protection, 407–408 (MCI-K means), 152–154
layered framework for securing, 406–407 K models, 209
scalable analytics for IOT security applications, 408–411 k-nearest neighbor algorithm (KNN algorithm), 40,
use cases, 404–406 149, 342
Interoperability, 57, 391 classification model, 131
of heterogeneous database systems, 518 k-NN-based approach, 108
Interuser parallelization, 244 KNN algorithm, see k-nearest neighbor algorithm
Intrusion, 46, 47 Knowledge base, 282
detection, 189, 407 Knowledge representation (KR), 59
L Malicious insiders, 3
Malicious intrusions, 45
Labeled data, 149, 211 Malware, 339, 347
K-means clustering with cluster-impurity behavior modeling, 415
minimization, 152–154 dataset, 350
optimizing objective function with E-M, 154–155 Malware detection, 46, 95, 340–342, 414–419; see also
problem description, 152 Insider threat detection
storing classification model, 155–156 application to Smartphones, 418–419
training with limited, 152 behavioral feature extraction and analysis, 415–417
unsupervised K-means clustering, 152 challenges, 414–415
Labeled points, 155 cloud computing for, 341
Laboratory setup, 461–462 contributions, 341–342
Language-based security, 428 as data stream classification problem, 340–341
Large scale, automated detection of SSL /TLS, 421 experimental activities, 419–421
Last technique, 122, 123 infrastructure development, 421–426
Layered framework for secure IOT, 406–407 reverse engineering methods, 417
Layered security framework, 403 risk-based framework, 417–418
LBAC, see Location based access control in Smartphones, big data analytics for, 413, 414
Learning classes Mandatory security policies, 15
supervised learning, 203 Manual labeling of data, 149
unsupervised learning, 203–205 Map input phase (MI phase), 272
Learning models, 183 Map keys (MKey), 346
Lehigh University Benchmark (LUBM), 314 Map output phase (MO phase), 272
Lempel−Ziv–Welch algorithm (LZW algorithm), 220, Mappings, 509
224, 237 MapReduce framework (MR framework), 51, 56, 70, 79,
constructing LZW Dictionary by selecting patterns, 184, 193, 237, 265–266, 269, 348, 428, 438, 456
221–222 breaking ties by summary statistics, 277–278
dictionary construction using MR, 241–242 compression/quantization, 243
scalable LZW and QD construction using MR job, cost estimation for query processing, 270–274
238–244 input files selection, 270
Leveraging randomized response-based differential- join execution, 278–279
privacy technique, 408 LZW dictionary construction, 241–242
LIBSVM, 209 paradigm, 458
Lifted learning and approximations of pseudolikelihood, 445 processes, 265
Lightweight IP-based network stacks, 407 query plan generation, 274–277
Lincoln Laboratory Intrusion Detection dataset, 207, scalable LZW and QD construction, 238–244
210–211 technology, 193
“Lineage”, 394 MapReduceJoin (MRJ), 271
Link analysis, 28 Map values (MVal), 346
LinkedIn, 289 Markov logic, 442
L-model, 158 Markov logic networks (MLNs), 443
Location based access control (LBAC), 359, 398 Markov model, 27, 32–35
Location spoofing detecting in mobile apps, 420 Markov network, 443
Logic database systems, see Next-generation database Masquerade detection, 189, 190, 191
systems Massive data problem, 493–494
LOGITBOOST.PL algorithms, 193 Maximum likelihood tree, 447
Loop detectors, 404 MaxWalksat, 444
Lossy compression process, 221 MCI-K means, see K-means clustering with cluster-
6LoWPAN, 407 impurity minimization
LUBM, see Lehigh University Benchmark MDL approach, see Minimum description length approach
LZW algorithm, see Lempel−Ziv–Welch algorithm Mean distance (µd), 133
Medical domain implementation, 365–366
M Mermaid, 495
Metadata, 391
Machine learning, 409 controller, 398
algorithms, 83 management, 514–515
techniques, 410, 417 Meteorological data, 446
Mahout, 193 Mica2 nodes running TinyDB applications, 410
Major mechanical problem, 98 Microcluster, 99, 132, 149
Malicious applications, 418 Microlevel location mining, 296
Malicious code detection, 347 Microsoft Azure’s Cosmos DB, 83–84
distributed feature extraction and selection, 348–349 Minimum cost plan generation problem, 275
nondistributed feature extraction and selection, Minimum description length approach (MDL approach),
347–348 69, 190, 204
big data analytics for insider threat detection, 454 Scalable LZW and QD construction using MR job,
binary code analysis, 455 238–244
CPS security, 455 1MRJ approach, 241–244
infrastructure development, 455 2MRJ approach, 238–241
secure cloud computing, 454–455 Schema, 509
secure data provenance, 454 SciDB, 438–439
TEE, 455 multidimensional array data model, 436
Research challenges, 477–480 Scientific data
Resilient distributed dataset (RDD), 80 privacy and security aware data management, 440–442
Resource description framework (RDF), 3, 15, 57, 58, 263, storing and retrieving multiple types, 437–440
290, 308, 364, 373, 438, 487, 488 SDB, see SPARQL database
data manager, 308 SDC, see System Development Corporation
Gateway, 385 SDN, see Software-defined networking
graphs, 69 Search space size, 276
integration, 63–64 Second-order Markov model, 34
policy engine, 323–324 Secret sharing-based techniques, 408
processing engines, 326 Secure big data management and analytics, unified
RDF-3X, 267 framework for, 392
RDF-based policy engine, 325, 367 design of framework, 397–400
repository architecture, 72–73 global big data security and privacy controller,
security, 62–63 400–401
Reverse engineering methods, 417 integrity management and data provenance for big data
REWARDS technique, 417, 419 systems, 391–396
RI phase, see Reduce input phase Secure cloud computing, 454–455, 461
Risk-based framework, 417–418 Secure cyber-physical systems, 461
Risk analyzer, 399 Secure data
Risk models, 479 integration framework, 339
Robotium (ROBO), 423 provenance, 454
ROC curves, see Receiver operating characteristic curves storage and retrieval in cloud, 322, 324–325, 462
Role-based access control (RBAC), 15, 18–19, 331, 359, Secure encrypted stream data processing, 463–465
398, 442 SecureMR, 440
Role hierarchy, 19 Secure multiparty computation (SMC), 476
RO phase, see Reduce output phase Secure SPARQL query processing on cloud,
Routing protocols, 407 322–323
Rule-combining algorithms, 335 Security, 516
and IoT, 403–411
S labels, 441
and ontologies, 63
SaaS, see Software as a Service query and rules processing, 63
SAMOA, 253, 447 RDF, 62–63
Sanitization semantic web AND, 61
task output derivation, 441 XML, 62
tasks, 441 Security and privacy for big data, 459
techniques, 477 approach, 459–460
Satellite AOD data, 446 curriculum development, 460–461
SCADA systems, see Supervisory control and data experimental program, 461–465
acquisition systems Security applications
Scalability, 69, 184, 186, 391, 410 data mining for cyber security, 43–47
big dataset for insider threat detection, 244–245 data mining tools, 47–48
big data techniques for, 192–193 Security extensions, 281
experimental setup and results, 244 access control model, 282–283
Hadoop cluster, 244 access token assignment, 283–284
Hadoop MapReduce platform, 237–238 conflicts, 284–285
issues, 447 Security policies, 15, 16; see also Insider threat detection
results for big data set relating to insider threat access control policies, 16–19
detection, 245–248 administration policies, 20
scalable analytics for IOT security applications, auditing, 21
408–411 authentication, 20–21
scalable LZW and QD construction using MR job, discretionary security policies, 16
238–244 identification, 20–21
test, 147 views for security, 21
Scalable, high-performance, robust and distributed SElinux, 440
(SHARD), 266, 325 Semantic gap, 38
Semantic web-based inference controller for provenance Simple Protocol and RDF Query Language (SPARQL),
big data 58–59, 69, 263, 269, 488
architecture for inference controller, 356–360 query modification, 364–365
big data management and inference control, query processor, 312, 325
367–368 Single-chunk approach, 171
implementing inference controller, 365–367 Single-partition, single-chunk approach (SPC approach),
inference control through query modification, 115, 340, 344
361–365 ensemble approach, 116
Semantic web, 51, 57 Single map reduce job approach (1MRJ approach), 238,
cloud computing frameworks based on technologies, 241–244
63–65 Single model approach, 94
DL, 59–60 classification, 106–107
graphical models and rewriting, 361 incremental approaches, 417
inferencing, 60–61 Single pass algorithm, 220
OWL, 59 Single source derivation, 441
preliminaries in, 52 Singular value decomposition (SVD), 40
RDF, 58 Small communication frames, 407
and security, 61–63 Smart grid, 405–407
semantic web-based models, 360–361 Smart home, 405
semantic web-based security policy engines, 326 Smart meters, 408
SPARQL, 58–59 Smartphones application, 418
SWRL, 61 classification model, 418
technologies, 52, 263, 360, 396 data gathering, 419
technology stack for, 57 data reverse engineering, 419
XML, 58 malware detection, 419
Semantic Web Rules Language (SWRL), 58, 61, 309, SMC, see Secure multiparty computation
358–359, 387 SMM, see System management mode
Semisupervised classification/prediction, 446–447 SNOD, see Stream-based novel class detection
Semisupervised clustering Social factor-based technique, 297
stream classification algorithm, 172 Social graph-based score computation, 295
techniques, 109, 131, 149 Social media
Sensing infrastructure, 404 authenticity of digital images in, 473
Sensor network, 408–409 privacy for, 385–387
Sensor signal, 409 sites, 291
Sentiment mining, 297–298 systems, 27, 379
Sequence-based behavior analysis, 416 Social network, 388–389
Sequence data, 217; see also Nonsequence data community, 263
anomaly detection, 223–224 trust for, 387
choice of ensemble size, 233–235 Soft subspace clustering, 71
classification, 217–220 Software, 280
complexity analysis, 224 Software as a Service (SaaS), 53, 307, 332
concept drift in training set, 228–230 Software-defined networking (SDN), 407
dataset, 227–228 SOWT, see Special operations weather specialists
experiments and results for, 227 Space complexity, 140–141
insider threat detection for, 217 Space sensors, 404–405
NB-INC vs. USSL-GG for various drift values, Spark, 422, 458
231–232 emerge, 490
results, 230 running, 409
stream data, 184, 251 SPARQL, see Simple Protocol and RDF Query Language
TN, 230–231 SPARQL database (SDB), 321
USSL, 220–223 SpatialHadoop, 458
Serializability, 513 Spatiotemporal Database Systems, 522
Server role, 381–382 SPC approach, see Single-partition, single-chunk approach
Service models, 53 Special operations weather specialists (SOWT), 459
SETM algorithm, 35 Split using explicit type information of object, 269
SGX hardware, 463 Spout, 447–448
SHARD, see Scalable, high-performance, robust and SQL, see Structured Query Language
distributed SSL/TLS, large scale, automated detection, 421
Signature(s), 47 SSO, see System security officer
behavior, 189, 191 Stand-alone systems, 497
database, 342 Stanford framework, 458
detection, 339 State-of-the-art stream classification techniques, 127, 149, 171
signature-based malware detectors, 342 Static analysis, 421
Silver Lining, 440 Static GBAD approaches, 190
TRBAC, see Time based access control; Time role-based User demographics-based, 297
access control User feedback, 252
Triple Pattern Join (TPJ), 271 User interface (UI), 423–424
Triple patterns (TPs), 264, 271 manager, 357, 398
Triples, 72 User-level applications, 189
True negatives (TNs), 197, 230 U.S. Homeland Security, 67
True positive rate (TPR), 186, 230 USSL, see Unsupervised stream-based sequence learning
True positives (TPs), 197, 230
“Truncated” UNIX shell commands, 189, 191 V
Trust, 379, 380
probabilities, 387 VA, see Veterans Administration
for social networks, 387 Vector representation of content (VRC), 70–71
Trusted execution environments (TEE), 454, 455, 459 Vertically partitioned layout, 318–319
systematic performance study, 462–463 Very Fast Decision Trees (VFDTs), 106, 340
Trust inference engine (TIE), 382–383 Veterans Administration (VA), 433, 434
Trust, privacy, and confidentiality, 379 decision support tools, 436
current successes and potential failures, 380–381 Personal Health Record system, 434
inference engines, 383–384 VFDTs, see Very Fast Decision Trees
motivation for framework, 381 Victim selection, 220
TrustZone security, 406 Video signal, 409
Twitter, 289 View management, 517
Two-class SVM, 209, 211 ViewServer, 424
Two MapReduce jobs (2MRJ), 238 Vigiles, 441
approach, 238–241 Virtualization, 53–54
Two-phase commit, 513 Virtual laboratory development, 421
Type sink, 417 architectural diagram for virtual lab and
integration, 422
U experimental system, 425–426
input events generation, 424
UAV could, 409 intelligent fuzzier for automatic android GUI
UCON, see Usage control application testing, 423
UI, see User interface interface, 423–424
Unbounded data stream, 221 laboratory setup, 421–422
Unified framework mitigating data leakage in mobile apps, 424–425
design of framework, 397–400 policy engine, 426
global big data security and privacy controller, problem statement, 423
400–401 programming projects to supporting virtual lab, 423
integrity management and data provenance for big data technical challenges, 425
systems, 391–396 Virtual machine manager (VMM), 462
learning framework, 409 Virtual machines (VM), 244
for secure big data management and analytics, 392 image, 55
Uniform resource identifiers (URIs), 58, 74, 269, 318, 331 monitor, 54
UNIX shell commands, 189 Vision, 497
Unsupervised ensemble classification and updating, 198 VM, see Virtual machines
Unsupervised K-means, 131–132 VMM, see Virtual machine manager
clustering, 152 VMware, 54
Unsupervised learning, 191, 203, 210, 212–214, 415; see Volume, velocity, variety, veracity, and value (Five Vs), 1
also Supervised learning Voting, 409
algorithm, 183, 184 VRC, see Vector representation of content
ensemble for, 199–200
GBAD-MDL, 204 W
GBAD-MPS, 205
GBAD-P, 204–205 WA., see Weighted average
GBAD, 203–204 Wang, 122, 123, 124, 125
Unsupervised method, 183 W3C, see World Wide Web Consortium
Unsupervised stream-based sequence learning (USSL), WCE, see Weighted classifier ensemble
184, 185, 218, 219, 220, 230 WCOP, see Web rules, credentials, ontologies, and policies
constructing LZW Dictionary, 221–222 Weak authorization, 17
data chunk, 220–221 Web-based interface, 421
USSL-GG algorithms, 230–235 Web Ontology Language (OWL), 58, 59, 263, 309, 355,
URIs, see Uniform resource identifiers 364, 487
Usage control (UCON), 19 OWL 2 specification, 400
U.S. Bureau of Labor and Statistics (BLS), 1 Web rules, credentials, ontologies, and policies
Use cases, 404–406 (WCOP), 388
Weighted average (WA), 199 World Wide Web, 20, 24, 53, 57, 365, 462
Weighted classifier ensemble (WCE), 142 World Wide Web Consortium (W3C), 57, 380
Weight learning, 443 Wrapper-based simultaneous feature weighing, 39
Weka (machine learning open source package), 83, 122 WSN, see Wireless sensor networks
Whitepages, 366
WHO, see World Health Organization X
Wireless communication networks, 404
Wireless sensor networks (WSN), 410 XACML, see eXtensible Access Control Markup
Workgroups, 474 Language
Workshop discussions, 474 XEN, 54
BDMA for cyber security, 480–481 XML, see eXtensible Markup Language
examples of privacy-enhancing techniques, 475–476 XQuery, 23
multiobjective optimization framework for data
privacy, 476–477 Y
philosophy for BDSP, 475
research challenges and multidisciplinary approaches, Yahoo!, 266
477–480 Yellowpages, 366
workgroups, 474
Workshop presentations Z
keynote presentations, 473–474
summary, 472–474 Zero-knowledge proof of knowledge protocols (ZKPK
World Health Organization (WHO), 433 protocols), 476