Analytics and Knowledge Management
Data Analytics Applications
Series Editor: Jay Liebowitz
PUBLISHED
Actionable Intelligence for Healthcare
by Jay Liebowitz and Amanda Dawson
ISBN: 978-1-4987-6665-4
Analytics and Knowledge Management
by Suliman Hawamdeh and Hsia-Ching Chang
ISBN: 978-1-1386-3026-0
Edited by
Suliman Hawamdeh
Hsia-Ching Chang
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Preface
Editors
Contributors
Index
Preface
The terms data analytics, Big Data, and data science have gained popularity in
recent years for a number of good reasons. The most obvious reason is the exponen-
tial growth in digital information and the challenge of managing large sets of data.
Big Data is both a challenge and an opportunity. It is a challenge if it is not managed properly and the organization does not make the needed investment in a knowledge infrastructure that recognizes the value of data as an organizational asset. Knowledge infrastructure is made up of several components, including intellectual capital (human capital, social capital, intellectual property, and content), physical capital, and financial capital. What makes Big Data an opportunity is the prospect of knowledge discovery from Big Data and the value of such knowledge
in enhancing an organization’s competitive advantage through improved products
and services, as well as enhanced decision-making processes.
Given the cost associated with managing Big Data, organizations must adopt
a knowledge management strategy in which Big Data is viewed as a key organi-
zational asset. This also includes making the necessary investment in data science
and data analytics tools and technologies. Knowledge management places a higher
emphasis on people and human capital as a key to realizing the concept of the
knowledge-based economy. This means any knowledge management strategy must
include a plan to educate and enhance the capacity of those working with Big Data
and knowledge discovery.
The shift toward the knowledge economy and the realization of the importance
of data as an organizational asset within the context of knowledge management
have given rise to the emerging fields of data science and data analytics. The White
House’s “Data to Knowledge to Action” initiative in 2013 aimed at building Big
Data partnerships with academia, industries, and public sectors. This initiative
led to the National Science Foundation (NSF) increasing the nation’s data science
capacity by investing in human capital and infrastructure development. The Big
Data to Knowledge (BD2K) initiative by the National Institutes of Health (NIH)
in 2012 and Google’s Knowledge Graph were also aimed at building Big Data capacity. Such capabilities will be based on well-established knowledge infrastructures made up of networks of individuals, organizations, routines, shared norms, and
practices. Building knowledge infrastructures requires human interactions through
Contributors
for the master’s degree in data science program at the Department of Information
Science in the College of Information at the University of North Texas.
Angela R. Martin is an intelligence specialist who works for the New Generation Warfare Study Group at USA TRADOC ARCIC as a Cyber Security Strategic Planning and Development Analyst. She has 10 years of experience in the Department of Defense in various roles, including human intelligence collector, senior interrogator, general military intelligence analyst, human terrain specialist, research manager, knowledge management and information technology supervisor, core curriculum instructor, and senior all-source socio-cultural dynamics analyst.
Eduardo Rodriguez, PhD, MSc, MBA, is the Sentry Endowed Chair in Business Analytics at the University of Wisconsin-Stevens Point. He created the Analytics Stream of the MBA at the University of Fredericton, Fredericton, Canada; is an analytics adjunct professor at the Telfer School of Management, University of Ottawa, Ottawa, Canada; corporate faculty of the MSc in Analytics at Harrisburg University of Science and Technology, Harrisburg, Pennsylvania; senior associate faculty of the Center for Dynamic Leadership Models in Global Business at The Leadership Alliance Inc., Toronto, Canada; and principal at IQAnalytics Inc., a research centre and consulting firm in Ottawa, Canada. He has been a visiting scholar at Chongqing University, Chongqing, China, and at EAFIT University, Medellín, Colombia, for the Master of Risk Management.
Eduardo has extensive experience in analytics, knowledge management, and risk management, mainly in the insurance and banking industries. He has been knowledge management advisor and quantitative analyst at Export Development Canada (EDC) in Ottawa, regional director of PRMIA (Professional Risk Managers International Association) in Ottawa, and vice-president of Marketing and Planning for insurance companies and banks in Colombia. Moreover, he has worked as a part-time professor at Andes University and CESA in Colombia, is the author of six books on analytics, and is a reviewer for several journals, with publications in peer-reviewed journals and conferences. He created and chairs the Analytics Think-Tank, organized and chairs the International Conference on Analytics (ICAS), serves on academic committees for conferences in knowledge management, and is an international lecturer in the analytics field.
Eduardo holds a PhD from Aston Business School, Aston University, Birmingham, UK; an MSc in Mathematics from Concordia University, Montreal, Canada; a certification from the Advanced Management Program at McGill University, Montreal, Canada; and an MBA and a bachelor’s degree in Mathematics from Los Andes University, Bogotá, Colombia. His main research interest is in the field of Analytics and Knowledge Management applied to Enterprise Risk Management.
Eric R. Schuler received his PhD in experimental psychology from the University
of North Texas, Denton, Texas. He has worked as a teaching fellow in the
Department of Psychology and as the research consultant for the Department of Information Science’s Information Research and Analysis Lab. His research interests include refining best-practice quantitative techniques based on Monte Carlo
statistical simulations, measurement development and validation, and how belief
systems shift after a traumatic event. When Eric is not running R-code, he enjoys
playing Dungeons and Dragons.
Hillary Stark, with a background in business and marketing, focuses her research on nutrition marketing, specifically the information that individuals rely upon when purposefully making more healthful eating choices. She is excited and encouraged by the present age of Big Data: the tools now available to analyze robust amounts of information, from conversations across social media platforms to documents submitted in support of new dietary legislation, can help academics, health-care professionals, manufacturers, and individuals themselves to be more in control of their actions through this newly acquired knowledge. She is currently finishing her PhD studies in Information Science at the
University of North Texas, Denton, Texas, and enjoys running and volunteering in
her community in her free time.
Christian Stary received his diploma degree in computer science from the Vienna
University of Technology, Austria, in 1984, and his PhD degree in usability engineering and his Habilitation degree from the same university in 1988 and 1993, respectively. He is currently a full Professor of Business Information Systems at the University of Linz. His current research interests
include the area of interactive distributed systems, with a strong focus on method-
driven learning and explication technologies for personal capacity building and
organizational development.
Knowledge Management for Action-Oriented Analytics
John S. Edwards and Eduardo Rodriguez
Contents
Introduction
Categorizing Analytics Projects
Classification by Technical Type
Classification from a Business Perspective
How Analytics Developed
Analytics, Operations Research/Management Science, and Business Intelligence
Overview of Analytics Examples
Single-Project Examples
Strategic Analytics
Managerial Analytics
Operational Analytics
Customer-Facing Analytics
Scientific Analytics
Multiple Project Examples
Introduction
Analytics, Big Data, and especially their combination as “Big Data analytics” or
“Big Data and analytics” (BDA) continue to be among the hottest current topics in
applied information systems. With the exception of artificial intelligence methods
such as those focused on deep learning, the emphasis on analytics is now moving
away from the purely technical aspects to the strategic, managerial, and longer-term
issues: not a moment too soon, some people would say.
In this chapter, we take a strategic perspective. How can an organization use
analytics to help it operate more successfully? Or even to help it become the orga-
nization that those who run it would like it to be? The need for more thought about
this is certainly evident, but the starting point is far from clear.
Our choice of starting point is to build on previous work on the relationship
between analytics, Big Data, and knowledge management. This is represented as an
action-oriented model connecting data, analytics techniques, knowledge, and pat-
terns and insights. These elements will be used to categorize and analyze examples of
analytics and big data projects, comparing the United Kingdom with other countries.
A particular focus is the difference between one-off projects, even if they lead to
ongoing results, such as sensor monitoring, and ongoing activities or a series of proj-
ects. In the latter, our model adds influences of knowledge on data (what to collect and
standardization of meaning), knowledge on techniques (new techniques developed in
the light of past examples), and action on knowledge (learning from skilled practice).
Links will also be made to the higher-level issues that drive analytics efforts, or at
least should do. These include the definition of problems, goals, and objectives, and
the measurement of organizational and project performance. This raises the question
of the relationship between knowledge management and strategic risk management.
Looking at the analytics “landscape,” we see that opinions, and even reported
facts, differ fundamentally about the extent to which organizations are already
using analytics. For example, several “white papers” attempt to describe the extent
and nature of the use of analytics by employing variants of the well-known five-
stage software engineering capability maturity model (CMM) (Paulk, Curtis,
Chrissis, & Weber, 1993).
A report commissioned from the UK magazine Computing by Sopra Steria
uses the same names as the CMM stages for respondents to describe their orga-
nization’s approach to data and analytics (Sopra Steria, 2016). The results were:
initial 18%, repeatable 18%, defined 19%, managed 33%, and optimized 12%.
International Data Corporation (IDC) has a very similar five-stage model for
“BDA competency and maturity”—ad hoc, opportunistic, repeatable, manage-
able, and optimized—but their report gives no figures (Fearnley, 2015). MHR
Analytics (MHR Analytics, 2017), a UK consultancy group focusing on human
resource management, has developed a five-stage model for what they call the
“data journey,” comprising the following: unaware 33%, opportunistic 7%, stan-
dards led 18%, enterprise 19%, and transformational 30%. However, the fact that
these percentages add up to 107% does call their accuracy into question, even if
only in the proofreading of the report. Nevertheless, the impression given from
these three analyses is that at least half of the organizations responding are well on
the way to making good use of analytics and Big Data.
On the more skeptical side, the American Productivity and Quality Center
(APQC) reported in 2016 that four-fifths of organizations had not yet begun to
take advantage of Big Data (Sims, 2016). This does seem to be closer to our own
experience, although research by the McKinsey Global Institute (Henke et al.,
2016) found considerable differences between sectors, with retail more advanced
than most. As for the trend in Big Data and analytics use, again it may not be as
rapid as some articles—usually, those not quoting data—imply. A report by the
Economist Intelligence Unit (Moustakerski, 2015) suggests about a 10% movement
toward the strategic use of data between 2011 and 2015.
One reason for this difference of opinion/fact is the type of decisions for which
analytics and Big Data are being used. Are they the most significant decisions about
the strategic direction of the business, characterized by the Economist Intelligence
Unit in a report for PwC (Witchalls, 2014) as “big” decisions, or are they every-
day decisions? Later in the chapter we will use a classification by Chambers and
Dinsmore (2015) to help understand the effect of these differences.
The other crucial element is the extent to which the organization’s managers
accept, and plan for the use of, analytics; indeed, it might not be an exaggeration to
phrase that as the extent to which they believe in analytics. The Computing survey
(Sopra Steria, 2016) found that only 25% of respondents’ organizations had a well-
defined strategy and—not surprisingly in the light of that figure—that only 10%
of analytics projects were “always” driven by business outcomes, with 44% being
driven “most of the time.”
We see knowledge and its management as the connection between the orga-
nization and its strategy on the one hand, and the availability of data and the use
of analytics on the other. It is well-recognized that organizations with a culture of
knowledge sharing are more successful than those without such a culture (Argote,
2012). A key element in the use of BDA is the presence of a data-driven culture.
A data-driven, or data-oriented, culture refers to “a pattern of behaviors and prac-
tices by a group of people who share a belief that having, understanding and using
certain kinds of data and information plays a critical role in the success of their
organization” (Kiron, Ferguson, and Prentice, 2013, p.18). This is the link to the
“belief” in analytics that we mentioned earlier.
Nevertheless, belief by itself is not enough for effective use of analytics. A bal-
ance needs to be struck between the “having” and “using” elements of the defini-
tion, and the “understanding” element. Spender (2007) has found that there are
three different types of understanding, although he prefers to call it organizational
knowing. These are data, meaning, and skilled practice. To use the distinction
identified by Polanyi (1966), some of this understanding and knowing is explicit
(codifiable in language), and some is tacit (not readily expressible). Crucially, for
skilled practice, the tacit knowledge dominates: even having good data and high-
quality technical analysis is not enough on its own.
Another element in our thinking is that the relationship between knowledge and analytics is permanently evolving. Each time an analytics solution is implemented, the search starts for a new one, based on the knowledge acquired from the previous implementation. Data changes, and that modifies the outcomes of the models; but at the same time the models are changing, the problems are better defined, and the scope is clearer, based on the experience of successive implementations. In addition, there is a clear challenge in using the freshest data that it is possible to access. In strategic risk, for example, every second the changes in stock prices are potentially modifying model results; in credit risk, a permanent data flow will potentially modify the outcomes of the decision-making process (granting loans).
When we consider analytics, meaning and skilled practice refer both to the ana-
lytics and to the target domain where the analytics techniques are being applied.
Skilled practice and meaning also have a two-way relationship with knowledge;
they influence it and they are influenced by it.
This two-way interaction of meaning and skilled practice is illustrated by
the process of management control systems creation. In one direction the target
domain is the development of systems to assure the implementation of strategy in
organizations. The analytics domain concerns how to implement the knowledge
from accounting, cost control, and so on to develop performance indicators and
build up a measurement system for the organization’s performance at any level.
Nucor Corporation (Anthony and Govindarajan, 2007) is an example where the
organization started implementing knowledge management as a management con-
trol system, and evolved to develop capabilities to use data in its strategic develop-
ment and strategy implementation and in particular to the use of business analytics
(Hawley, 2016) for improving supply chain management practice, a core process
in the business. Thus a better understanding of the requirements of management
control systems led to a transition from a general analytics view to a specific analytics
application in a key area of Nucor Corporation’s strategy.
Figure 1.1 shows how knowledge, and the various elements involved in an ana-
lytics study, fit together when just a single study is considered. The knowledge falls
into one or more of four categories:
Figure 1.1 The interactions between data, analytics, and human knowledge in a single study. (From Edwards, J.S., and Rodriguez, E., Proc. Comp. Sci., 99, 36–49, 2016.)
◾ Knowledge about the domain, whether that is medical diagnosis, retail sales,
bank customer loyalty, etc.
◾ Knowledge about the data: are the sources reliable and accurate, are the
meanings clear and well defined, etc.
◾ Knowledge about the analytics techniques being used
◾ Knowledge about taking action: how to make real change happen; this can
be further subdivided into the aspects of people, processes, and technology
(Edwards, 2009)
We see analytics acting to supplement gut feel and intuition in the development of solutions to business process problems. Knowledge management supports the development of actions in organizations seeking to reduce the storage of knowledge that is never used: knowledge that is not related to the process of finding solutions adds no value to the organization. The same is true of data; data without analytics to create knowledge for use in business solutions cannot provide value. The Economist Intelligence Unit/PwC report mentioned earlier (Witchalls, 2014) indicates that several factors in big decisions can affect the use of analytics. Analytics requires preparation time to create a process that adds value in the organization, yet many decisions are made in a short time, forcing reactive decision making. The preparation time needed to avoid reactive solutions, or solutions and decisions enforced without good knowledge, can be made available through a systematic analytics monitoring system and continuous feedback on the decisions made and the value added by the analytics.
In the remainder of this chapter, we first offer some relevant definitions, then
look at the history of analytics, and move into the related issue of terminology.
This enables us to categorize the examples that we consider in the main body of the
chapter. We then conclude by considering likely future influences of the political
and business environments.
aim to recommend a particular course of action. Robinson et al. (2010) add that
“Most of us would probably agree that in fact most operations research (OR) tech-
niques reside in this space.”
Robinson et al. (2010) see the three types as a distinct hierarchy, commenting
that “Businesses, as they strive to become more analytically mature, have indicated
a goal to move up the analytics hierarchy to optimize their business or operational
processes. They see the prescriptive use of analytics as a differentiating factor for
their business that will allow them to break away from the competition.” This is
somewhat more contentious, much as its appeal to those pushing analytics services
and relevant IT products is obvious. INFORMS commissioned the 2010 work to
help it decide how to address the rise of analytics as a hot topic in business. Haight
and Park (2015), in a report for Blue Hill Research about the Internet of Things
(IoT), also see descriptive, predictive, and prescriptive analytics as the stages in a
maturity model. In their case, the ultimate goal is complete automation, that is, to
“provide specific recommendations based on live streaming data to an employee or
to automatically initiate a process” (p.4).
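To make the technical classification concrete, the short Python sketch below (our illustration, with invented price and sales figures, not drawn from any of the reports cited) applies all three types to one dataset: it describes past sales, predicts sales at a new price with a least-squares line, and prescribes the revenue-maximizing price.

# Illustrative only: descriptive, predictive, and prescriptive analytics
# applied to the same invented price/sales data.
import statistics

prices = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]  # past unit prices
sales = [980, 900, 810, 700, 620, 500]   # units sold at each price

# Descriptive: summarize what happened.
print("mean units sold:", statistics.mean(sales))

# Predictive: fit a least-squares line, sales = a + b * price.
mx, my = statistics.mean(prices), statistics.mean(sales)
b = sum((x - mx) * (y - my) for x, y in zip(prices, sales)) / sum(
    (x - mx) ** 2 for x in prices)
a = my - b * mx

def predict(price):
    return a + b * price

print("predicted sales at price 1.5:", round(predict(1.5)))

# Prescriptive: recommend the candidate price maximizing predicted revenue.
candidates = [1.0 + 0.05 * i for i in range(25)]
best = max(candidates, key=lambda p: p * predict(p))
print("recommended price:", round(best, 2))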
To understand this issue better, we need to look at a classification of analytics
along a different dimension.
Operational analytics are “embedded into the front-line processes for an organi-
zation and are executed as part of the day-to-day operations. Operational analytics
ranges from real-time (now) to short-time horizons (today or this week).” (Chambers
and Dinsmore, 2015, p.66). No doubt about the timescales for this type.
Customer-facing analytics “typically provide value by providing insights
about customers. They also tend to range from real-time to short-time horizons.”
(Chambers and Dinsmore, 2015, p.67). The difference between this type and
the operational category is more clear-cut in manufacturing than in some ser-
vice industry sectors. In the latter, sometimes all that matters is what the cus-
tomer thinks of the service.
Scientific analytics “add new knowledge—typically in the form of new intellec-
tual property—to an organization. The frequency may be periodic (every year) or
occasional (once every several years).” (Chambers and Dinsmore, 2015, p.67). This
type of analytics is specifically related to intellectual capital, and is rather different
from the managerial, operational, and customer-facing types, in that it is related to
building capacity or capability rather than to operations directly.
We will now begin to use these categories in considering the history and termi-
nology of analytics.
(Kirby and Capey, 1998, p.308). In the United Kingdom, there is the clear example
of the Shirley Research Institute, analyzing large amounts of data from the cot-
ton textiles industry in the 1920s (Kirby and Capey, 1998). In the United States,
management consultants such as Arthur D. Little, McKinsey & Company, and
Booz Allen Hamilton were well-known by the 1930s, and some of their work used
analytical approaches.
What we would now call analytics really began to gain traction in the 1950s,
spurred on by military uses in the 1940s. In the United Kingdom, J. Lyons & Co.
built the world’s first business computer in 1952 to carry out analytics work on food
sales in their chain of cafés. In the United States, UPS initiated its first corporate
analytics group in 1954 (Davenport and Dyché, 2013). For more early examples,
see Kirby and Capey (1998) for the United Kingdom and Holsapple, Lee-Post, and
Pakath (2014) for the United States.
Improvements in information technology in the twenty-first century have led to
the current boom in interest in analytics. Some authors have attempted to identify
different generations of analytics. Chen, Chiang, and Storey (2012) argued that
Business Intelligence and Analytics (BI&A), as they call it, evolved from BI&A
1.0 (database management system-based structured content) to BI&A 2.0 (web-
based unstructured content) and BI&A 3.0 (mobile and sensor-based content).
Davenport’s list, also of three generations, describes an evolution from Analytics
1.0 (the era of business intelligence) to Analytics 2.0 (the era of Big Data), and
moving toward Analytics 3.0 (the era of data-enriched offerings) (Davenport, 2013;
Davenport and Dyché, 2013). Davenport and Dyché (2013) actually date the three
eras, with Analytics 1.0 running to 2009, Analytics 2.0 from 2005 (so there is an
overlap) to 2012, and Analytics 3.0 from 2013.
algorithm for a brand of cat food more than 30 years ago. Again, this was prescrip-
tive analytics in action. Both this and the Coca-Cola orange juice example are
operational analytics, although in the cat food example the recalculation was only
done every few days. This was because of the purchasing cycle for the ingredients
rather than technological limitations; calculating the results (using what was then
called a mini-computer) took less than ten minutes.
On the other side of the divide, some operations research/management science
professionals seem to be ignoring the analytics “bandwagon” entirely. For example,
the study into improving paper production in Finland by Mezei, Brunelli, and
Carlsson (2016) uses detailed data and machine operators’ knowledge to fine-tune
the production machinery. This could appropriately have been described as an
example of predictive analytics, but neither the term analytics nor the term Big Data appears anywhere in their paper.
The problem with the bandwagon effect is that analytics has now reached the
point where consultancies and other players seek to differentiate themselves from
others in the market by using somewhat different terminology, and academics try to
make sense of the various labels. Early in the “boom” period, Davenport and Harris
(2007) described analytics as a subset of business intelligence. We have already seen
that Chen, Chiang, and Storey (2012) couple the two terms together. Business intel-
ligence, derived from the military sense of that term, and so meaning information
rather than cleverness, also has a long history. Holsapple, Lee-Post, and Pakath
(2014) point out that it was first coined at IBM in the 1950s (Luhn, 1958), although
it is often credited incorrectly to the Gartner Group in the 1990s (see e.g., Watson &
Wixom, 2007), much as Gartner helped to popularize the term. Summing up the
confusion, Watson & Wixom (2007) observed that “BI [Business Intelligence] is now
widely used, especially in the world of practice, to describe analytic applications.”
The MHR report (MHR Analytics, 2017) is a good illustration of how this
confusion of terminology continues. It has data analytics in its subtitle, but begins
with the statement “Business Intelligence (BI) is not a new topic,” and refers most
frequently to “BI and [data] analytics.” Just to add to the mix, the Computing report
(Sopra Steria, 2016) concentrates on the phrase “applied data and analytics.”
[Table: examples categorized by Domain, Data, Techniques, and Action (People, Process, Technology)]
To explain this further, let us look again at the two UK public health examples
from the nineteenth century. Although a “business” perspective does not strictly
apply, the categories of Chambers and Dinsmore can still be used. It is clear that
Snow’s first study was operational analytics, but later work became strategic analyt-
ics as more studies were carried out. In the first study, the crucial knowledge ele-
ments were knowledge of techniques and then knowledge of action. Snow needed
to develop a new technique—an early form of data visualization—to identify the
patterns in the data. Having identified the offending pump, Snow knew that just
putting a notice on it would not stop people using it, especially as many people in
London at that time could not read, and so the only way to stop them from drink-
ing the contaminated water was to disable the pump: the people element of action
was crucial. His later strategic work contributed to knowledge of the domain—the
theory of how diseases are transmitted.
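In modern terms, the core of Snow’s pattern-finding can be restated as a nearest-pump attribution; the sketch below is our reconstruction with invented coordinates, not Snow’s actual data.

# A modern restatement of Snow's analysis (coordinates invented):
# attribute each cholera death to its nearest pump and count.
from collections import Counter
from math import dist  # Python 3.8+

pumps = {"Broad Street": (0.0, 0.0), "Rupert Street": (1.2, -0.8),
         "Warwick Street": (-0.9, 1.1)}
deaths = [(0.1, 0.2), (-0.2, 0.1), (0.3, -0.1), (1.0, -0.7), (0.05, 0.3)]

nearest = Counter(min(pumps, key=lambda name: dist(pumps[name], case))
                  for case in deaths)
print(nearest.most_common())  # the Broad Street pump dominates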
Nightingale’s work also relied on knowledge of the people element of action,
specifically how to influence army generals and politicians to adopt her ideas, espe-
cially given that they were being proposed by a woman. Similarly, her work went on
to represent a major contribution to knowledge in the domain of public sanitation.
Single-Project Examples
Strategic Analytics
Let’s begin the section with another historical example. When retail banks in
the United Kingdom began to automate their processes in the late 1950s, based
on magnetic ink character recognition technology, the key piece of data was the
account number. As the use of computers expanded through the 1960s and 1970s,
databases continued to be structured with the account number as the primary key.
However, this was no help in identifying which accounts belonged to the same
customer. Integrating data to make the customer central involved considerable
rewriting of the banks’ systems, and only the twin drivers of concerns about Year
2000 (Y2K) problems and the rise of online banking finally persuaded UK banks
to make the necessary investment to do it properly. The whole process of switching
the data focus from the account to the person took about 20 years in the United
Kingdom, and arguably is still not entirely complete. This is an example of knowl-
edge about data linked to strategic analytics, which was simply not possible (even
for descriptive analytics) with the old account-centric computer systems.
For the next example, we present Tesco, the United Kingdom’s number one
supermarket, and one of the world’s largest retailers. Tesco has always been an
analytics leader among UK retailers, even before the term came into regular use.
For example, it pioneered the use of geographic information systems in United
Kingdom retailing in the 1980s, to analyze the “catchment area” of a supermarket,
aided by one valuable strategic insight. Tesco managers realized that deciding where
to locate a new store and evaluating the performance of an existing store were basi-
cally variants of the same question: if there is a store located here, how well should
it perform? To gather the data and do the predictive analytics, they set up the Tesco
Site Research Unit. This enabled them to take a portfolio approach to opening new
stores (and acquiring stores from competitors) instead of considering each potential
new store decision in isolation. This was a strategic use of analytics resulting from
knowledge about the domain and was one of the elements that enabled Tesco to
become the United Kingdom’s number one supermarket, a position it has retained
ever since. We shall return to that latter point later in the chapter.
Another example is a project by the RAND Corporation (Chenoweth, Moore,
Cox, Mele, & Sollinger, 2012), developing supplier relationship management (SRM)
for the U.S. Air Force. Chenoweth et al. (2012) observed “As is the case with all mil-
itary services, the U.S. Air Force (USAF) is under pressure to reduce the costs of its
logistics operations while simultaneously improving their performance.” The USAF
addressed the problem by developing analytics capacity in its SRM system to manage good relationships with suppliers. The scope of
this project made it another strategic use of analytics based on domain knowledge.
The analytical capabilities were concentrated on the construction of appropriate
supplier scorecards that are associated with identification of improvement opportu-
nities in areas ranging from maintenance to resource planning systems. According
to Teradata, the main benefits for the USAF come from the use of data and analytics tools “…to streamline workflows, increase efficiencies and productivity, track
and manage assets, and replace scheduled/time-based maintenance with condition-
based maintenance (CBM) – saving at least U.S.$1.5 million in one year.”
Returning to retail, the case of Luxottica Retail North America is also a strate-
gic view of analytics, this time based on knowledge of techniques. Luxottica sells
luxury and sports eyewear through multiple channels. The project consisted of the
integration of data and applications from internal and external sources. Its purpose
was to support the marketing strategy associated with data integration of multiple
distribution channels and the appropriate use of resources in marketing develop-
ment. According to IBM “…Luxottica gained a 360-degree view of its customers
and can now fine tune its marketing efforts to ensure customers are targeted with
products they actually want to buy.” (Pittman, 2016).
For an example of a focus on knowledge about technology leading to disruptive
innovation at the strategic level, we turn to Uber. The understanding of costs related
to taxi operations and the analysis of customers’ needs led to the development of a
substitute taxi service based on network and data analytics, where the technology
(phone applications) was crucial to effective action. Berger, Chen, and Frey (2017)
pointed out, “Unlike a traditional taxi business, Uber does not own any cars; instead,
it provides a matching platform for passengers and self-employed drivers and profits
by taking a cut from each ride.” The use of technology in understanding customers’
needs, transportation operations, and development is leading to better customer service and, at the same time, to improvements in the organization’s performance.
Innovation is not the only engine to connect strategic analytics and actions. The
Hong Kong Efficiency Unit (Hong Kong Efficiency Unit, 2017) illustrates strategic
analytics and all three aspects of knowledge about action: people, process, and
technology. To provide a good service to Hong Kong’s citizens, the Efficiency Unit
created a system for managing the responses to the contacts that people have with the government. The use of analytics on data relating to prior contacts has provided the capacity to anticipate complaints and to offer a 24 × 7 service giving rapid answers to the issues that have been reported.
Managerial Analytics
Moving to the slightly shorter timescales and lower-level decisions of managerial
analytics, we return to retail (remember we said it was one of the most advanced
sectors?), but cross to the other side of the globe and The Warehouse Group,
New Zealand’s largest retailer. Warehouse Stationery, part of the group, brought in
a people-counting mechanism based on thermal imaging, so that it could understand how many people were coming into each store. This was put alongside the transaction data from the tills so that stores could understand how their conversion worked,
and higher-level managers could target stores that needed special emphasis. Kevin
Rowland, Group Business Intelligence Manager, commented “we very quickly
managed to create a series of dashboards that showed by hour, by day, to store level,
what was going on in terms of counts through the door and the conversion rate of
that… at the end of it, the GM of operations stood up and said, ‘I could kiss you,’”
(Ashok, 2014). In this case, the descriptive analytics project relied on knowledge
of techniques in data visualization, and in turn contributed to knowledge of the
domain: patterns of conversion rates in the stores.
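The conversion calculation behind such dashboards is simple; here is a minimal sketch using pandas, with hypothetical column names and invented counts rather than Warehouse Stationery’s actual schema.

# Sketch of the conversion-rate calculation behind such dashboards
# (column names and data are hypothetical).
import pandas as pd

footfall = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "hour": [9, 10, 9, 10],
    "visitors": [120, 150, 80, 95],
})
tills = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "hour": [9, 10, 9, 10],
    "transactions": [30, 45, 12, 19],
})

hourly = footfall.merge(tills, on=["store", "hour"])
hourly["conversion"] = hourly["transactions"] / hourly["visitors"]
# Store-level view for managers targeting underperformers.
print(hourly.groupby("store")["conversion"].mean().sort_values())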
Expanding our scope to a worldwide example, the Ford Motor Company oper-
ates the Global Warranty Measurement System. This is now available in the form
of an application, enabling Ford dealers to carry out descriptive analytics on war-
ranty repair costs, claims, and performance. It also enables Ford to gauge if dealers
will meet Ford’s expectations on their performance, so moving into the predictive
category. Cohen and Kotorov (2016) quote Jim Lollar, Ford’s Business Systems
Manager of Global Warranty Operations as saying: “We have always had data
on actual costs but we weren’t giving it to dealers” (p.107). The crucial knowl-
edge focus here was knowledge of action (process): how to deliver the data in a way
(an application) that was convenient to use. One of the most important contribu-
tions of the Global Warranty Measurement System is its effect on the dealerships’
performance management practice. Given the volume of data, the capabilities to
generate reports, and the governance of the data, it has been possible to develop
metrics that provide benchmarks as an indication of how the dealerships are operat-
ing. For example, it is possible to predict and manage the cost per vehicle serviced;
repairs per 1,000 vehicles serviced; and cost per repair (Cohen and Kotorov, 2016).
Dealerships can thus adjust their business processes to obtain better results.
UPS has installed telematics sensors in its delivery vehicles (Davenport and Dyché, 2013). The data recorded from their trucks
includes, for example, their speed, direction, braking, and drive train perfor-
mance. UPS has embarked on a higher-level use of managerial analytics, which is
clearly OR on a big scale—predictive analytics. “The data is not only used to mon-
itor daily performance, but to drive a major redesign of UPS drivers’ route struc-
tures. This initiative, called On-Road Integrated Optimization and Navigation
(ORION), is arguably the world’s largest operations research project.” (Davenport
and Dyché, 2013, p.4) Davenport and Dyché (2013) also report that ORION had
“already led to savings in 2011 of more than 8.4 million gallons of fuel by cutting
85 million miles off of daily routes.” This is an example of using both knowledge
of suitable OR techniques, to take advantage of real online map data, and knowledge of how to put them into action: the technology to carry out the calculations quickly enough.
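ORION’s actual algorithms are proprietary and operate at a vastly larger scale; the toy sketch below shows only the flavor of the routing problem, ordering invented delivery stops with a nearest-neighbor heuristic.

# Toy route optimization in the spirit of ORION (the real system is
# proprietary and far more sophisticated): nearest-neighbor ordering of
# invented delivery stops to shorten total driving distance.
from math import dist

depot = (0.0, 0.0)
stops = [(2.0, 3.0), (5.0, 1.0), (1.0, 7.0), (6.0, 6.0), (3.0, 0.5)]

route, here, remaining = [depot], depot, stops[:]
while remaining:
    nxt = min(remaining, key=lambda s: dist(here, s))
    remaining.remove(nxt)
    route.append(nxt)
    here = nxt
route.append(depot)  # return to the depot

total = sum(dist(a, b) for a, b in zip(route, route[1:]))
print("stop order:", route)
print("total distance:", round(total, 2))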
The next example uses somewhat similar technology—sensors in transportation—
but this time we see analytics that makes a contribution to knowledge of the domain.
It comes from Caterpillar Marine (Marr, 2017). Caterpillar’s Marine Division pro-
vides services to shipping fleet owners for whom fuel usage is a crucial factor affecting
profitability. One of Caterpillar’s customers knew that hull contamination (corro-
sion, barnacles, seaweed, etc.) must affect fuel consumption, but had previously had
no way of quantifying its effect. “Data collected from ship-board sensors as the fleet
performed maneuvers under a variety of circumstances and conditions—cleaned
and uncleaned—was used to identify the correlation between the amount of money
spent on cleaning, and performance improvements.” (Marr, 2017)
The outcome of the predictive analytics was the conclusion that ship hulls
should be cleaned much more frequently; roughly every six months as opposed
to every two years. Potential savings from these changes would amount to several
hundred thousand U.S. dollars per year (Marr, 2017).
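Marr gives no technical detail, but the heart of such a study is a correlation between time since cleaning and fuel burn; the sketch below uses invented sensor readings to show the calculation (Caterpillar’s actual models are not public).

# Sketch of the kind of correlation analysis described (data invented):
# does fuel consumption rise with days since the hull was last cleaned?
import statistics

days_since_clean = [10, 60, 120, 200, 320, 450, 600]
fuel_per_nm = [41.0, 41.8, 42.9, 44.2, 46.0, 47.9, 50.1]  # litres

r = statistics.correlation(days_since_clean, fuel_per_nm)  # Python 3.10+
print("Pearson r:", round(r, 3))  # strongly positive => clean more often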
We complete this section of managerial analytics examples with two exam-
ples where the knowledge focus is on the data, both related to healthcare. The
first is from the healthcare sector in the Netherlands (SAS Institute, 2016b).
A key performance indicator for surgery in public hospitals in the Netherlands,
as in the United Kingdom, is the length of time patients must wait before being
admitted for surgery. The Gelderse Vallei hospital’s waiting time for hernia sur-
gery had jumped. Rik Eding, Data Specialist and Information Analyst, explains.
“When we looked closer, it was because two patients postponed their operations
due to holidays. If we left these two cases out of consideration, our waiting period
had actually decreased” (SAS Institute, 2016b, p.13). Understanding outliers like
this in the data is a crucial element of descriptive analytics. Perhaps postpone-
ments by the patient should not be included in a calculation of the average wait-
ing time, but that would depend on the reason. A postponement because the
patient’s condition had worsened surely should be included. Standardization of
data is necessary for reliable analytics results, but it is not always easy to do it
appropriately.
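A minimal sketch of the point Eding makes, with invented waiting times: the filter has to key on the recorded reason for a delay, not merely on the size of the value.

# Invented waiting-time data illustrating the outlier problem:
# two patient-initiated postponements inflate the average.
import statistics

waits = [  # (days waited, reason for any delay)
    (21, None), (18, None), (25, None), (19, None),
    (84, "patient holiday"), (90, "patient holiday"),
]

raw = statistics.mean(days for days, _ in waits)
adjusted = statistics.mean(days for days, reason in waits
                           if reason != "patient holiday")
print(f"raw average: {raw:.1f} days; adjusted: {adjusted:.1f} days")
# A delay because the patient's condition worsened should still count,
# so the exclusion rule must inspect the reason, not just the number.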
Wider issues still can arise. North American retailer Target offers a caution-
ary tale (Hill, 2012). They analyzed purchase histories of women who had entered
themselves on Target’s baby registry, and found patterns so clear that they could
estimate not just if a customer was pregnant, but roughly when her baby was due.
They went on to send coupons for baby items to customers according to their preg-
nancy scores. (It seems highly unlikely that this behavior would have been legal in
most European countries.) Unfortunately, they did not restrict this to customers on
the baby registry. This hit the press when an angry father went to the Target store
manager in Minneapolis to complain about the coupons that were being sent to his
daughter, who was still in high school. Now, the store manager had little under-
standing of the analytics process behind this marketing campaign, and was very
surprised to discover what was happening. It did transpire a few days later that the
girl was indeed pregnant. So, it was a technically sound use of predictive analytics
based on knowledge of data, but Target had not had the sense to restrict it only
to women already on the baby registry, and it earned them a great deal of dubious
publicity. Hence we have classified this under the managerial category. It is not
clear if Target still uses the pregnancy scores in its marketing.
Operational Analytics
The Coca-Cola orange juice case mentioned earlier is an example of operational
analytics, with the knowledge focus being on action—the technology needed to
carry out the predictive analytics fast enough.
Another brand name that is well-known worldwide gives us an operational ana-
lytics example where the knowledge focus is about the data and the techniques.
eBay uses machine learning to translate listings of items for sale into different lan-
guages, thus reaching more potential buyers (Burns, 2016). This is not a straight-
forward use of natural language processing (NLP), because the eBay listings that
form the data have specific characteristics. For example, does the English word
“apple” in a listing refer to the fruit (which should be translated) or the technology
company or brand (which should not)? Similarly, should an often-used acronym
like NIB, which stands for “new in box,” be translated or not? Some non-English
speakers will still recognize the meaning of NIB even if they do not know what the
acronym actually stands for. Therefore, eBay has had to develop its own statistical
algorithms to supplement more common NLP techniques, thus contributing to
knowledge of techniques (Burns, 2016). Machine learning for NLP does not fit as
well into the descriptive/predictive/prescriptive classification as most techniques,
but is most often regarded as predictive—the translation effectively serves as a fore-
cast of what the best translation might be.
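eBay’s pipeline is proprietary, but the general idea of protecting listing-specific tokens before handing text to a generic translation model can be sketched as below; the protected-term list and placeholder scheme are our invention, and real disambiguation (Apple the brand versus apple the fruit) needs statistical context rather than a static list.

# Illustrative pre-processing for listing translation (eBay's real system
# is proprietary): mask protected terms so a generic machine-translation
# model leaves them untouched, then restore them afterwards.
import re

DO_NOT_TRANSLATE = {"Apple", "NIB", "OEM"}  # hypothetical protected terms

def mask(text):
    mapping = {}
    for i, term in enumerate(sorted(DO_NOT_TRANSLATE)):
        pattern = rf"\b{re.escape(term)}\b"
        if re.search(pattern, text):
            token = f"KEEPTOKEN{i}"
            mapping[token] = term
            text = re.sub(pattern, token, text)
    return text, mapping

def unmask(text, mapping):
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text

masked, mapping = mask("Apple iPhone 7, NIB, unlocked")
translated = masked  # stand-in for a call to a real translation model
print(unmask(translated, mapping))
# Note: a static list cannot decide brand-versus-fruit for "apple";
# that is exactly why eBay built its own statistical algorithms.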
Equifax, the credit scoring agency, has also developed its own techniques (the
knowledge focus) for key operational decisions (Hall, Phan, & Whitson, 2016).
Automated credit lending decisions have typically been based on logistic regres-
sion models. However, Equifax has now developed its own Neural Decision Technology (NDT), which applies neural networks to these decisions with increased accuracy.
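As an illustration of the baseline technique mentioned here, a logistic regression credit model looks roughly like the following sketch; the features, data, and applicant are invented, and NDT itself is proprietary.

# The traditional baseline for automated lending decisions: a logistic
# regression credit model (features and data invented).
from sklearn.linear_model import LogisticRegression

# Each row: [income (k$), debt-to-income %, years of credit history]
X = [[30, 45, 1], [55, 30, 6], [80, 20, 12], [25, 60, 0],
     [62, 25, 9], [41, 50, 3], [90, 15, 20], [33, 55, 2]]
y = [0, 1, 1, 0, 1, 0, 1, 0]  # 1 = repaid, 0 = defaulted

model = LogisticRegression().fit(X, y)
applicant = [[48, 35, 4]]
print("probability of repayment:", model.predict_proba(applicant)[0][1])
# The coefficients make the model easy to explain to regulators; the
# attraction of neural approaches such as NDT is higher accuracy.
print(dict(zip(["income", "dti", "history"], model.coef_[0])))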
Customer-Facing Analytics
Gaining insights into customers and their behavior is one of the most active areas
of analytics at present, perhaps because of the perception that this has led to the
success of companies like Amazon.
For our first example in this section, we look to Tesco again. In the 1990s, Tesco
was the first UK retailer to make a success of using customer data from loyalty cards, through its Clubcard scheme.
Lopez-Rabson and McCloy (2013) carried out a descriptive analytics study of col-
leges in Toronto, Canada, for the Higher Education Quality Council of Ontario.
Their study discovered factors that provide insights to education institutions to
develop strategies and tactics to support students’ success, so it is indeed customer-
facing even though it is in the strategic knowledge category.
Closing this section, we look at Bank of America for an example of predictive
analytics where knowledge of the data is the main driver (Davenport and Dyché,
2013), although the business goal, as with all customer-facing analytics, is under-
standing the customer. The developments come from combining customer data
across all channels. The predictive models establish that “primary relationship cus-
tomers may have a credit card, or a mortgage loan that could benefit from refinanc-
ing at a competitor. When the customer comes online, calls a call center, or visits a
branch, that information is available to the online app, or the sales associate to pres-
ent the offer” (Davenport and Dyché, 2013, p.16). There is also a minor action focus
on process knowledge here, in that having combined the data from all channels, the
offer is also made available across all channels, and an incomplete application (e.g.,
online) can be followed up through a different channel (e.g., an e-mail offering to
set up a face-to-face appointment).
Scientific Analytics
Our chapter devotes less space to scientific analytics than the other categories, since
by definition, the purpose of work in this category is to add to an organization’s
intellectual capital. The close links between intellectual capital and knowledge
management should mean that the case for the relevance of knowledge to action
does not need to be made. Nevertheless, one example is particularly instructive. The
so-called Panama Papers (McKenna, 2016a) represented a success for descriptive
analytics using the new technique of graph databases. The papers comprised a huge
set of documents (11.5 million) from the Panama-based legal services firm Mossack
Fonseca, which specializes in offshore organizations, including those in countries
often regarded as tax havens. The new technique made it possible for news orga-
nizations such as the International Consortium of Investigative Journalists (ICIJ),
the BBC, UK newspaper the Guardian, and German newspaper the Süddeutsche
Zeitung to analyze connections in the data in ways that would have been near-
impossible before. So the knowledge of the technique turned the Panama Papers
into a resource they would not otherwise have been, especially for the ICIJ.
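The essence of the graph approach can be sketched in a few lines; the example below uses networkx and invented names (the ICIJ’s actual work used graph-database tooling over millions of documents).

# Minimal sketch of the graph idea behind the Panama Papers analysis
# (names invented; the real work ran at vastly larger scale).
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Officer: A. Smith", "Entity: Seaview Holdings"),
    ("Entity: Seaview Holdings", "Intermediary: XYZ Law"),
    ("Intermediary: XYZ Law", "Entity: Palm Trust"),
    ("Officer: B. Jones", "Entity: Palm Trust"),
])

# Connection questions that are near-impossible over flat documents
# become one-line graph queries: how are two officers linked?
path = nx.shortest_path(g, "Officer: A. Smith", "Officer: B. Jones")
print(" -> ".join(path))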
Multiple Project Examples
Figure 1.2 The interactions between data, analytics, and human knowledge over several studies. (From Edwards, J.S., and Rodriguez, E., Proc. Comp. Sci., 99, 36–49, 2016.)
and publication. However, even in blogs and press releases, relatively few articles
look at the evolution of analytics in an organization, and the way that the orga-
nization progresses from one project to another. We have already hinted at some
examples of this, and in this section we consider them in more detail.
The essential relationships between data, analytics techniques, patterns/insights,
knowledge, and action remain the same across a series of projects or studies, but
we expect to see three additional influences, as shown by the dashed arrows in
Figure 1.2. These are all types of learning: influences of knowledge on data (what to
collect and standardization of meaning); of knowledge on techniques (new analyt-
ics techniques developed in the light of past examples); and of action on knowledge
(learning from skilled practice).
Beginning again with Tesco, many of the analytical activities are still the same
as when they first went into analytics decades ago, like sales forecasting. But there,
for example, their priority is now real-time analytics: doing the same things, but
doing them much faster (Marr, 2016). This involves both newer technology, like
Teradata, Hadoop, and a data lake model, and also changing business processes. So
we see the contribution of knowledge of action—processes and technology—with
Tesco learning a very great deal from their existing skilled practice, and trying to
preserve their lead over competitors in this respect. The actual examples can be
managerial, operational, or customer-facing, depending on the detail. However,
especially at the managerial level, at present it can take 9–10 months to progress
from data to action (Marr, 2016). This is not unusual: the findings of Halper (2015)
were that just the stage of putting an agreed predictive model into action can take
several months. She found that only 31% of respondents said this process took
a month or less, with 14% responding nine months or more, and the mean and
modal response being three to five months.
Other examples of learning from skilled practice include Ocado and SEB. Before
the customer email project, Ocado already used machine learning in its warehouse
operations, co-locating items that are frequently bought together to reduce packing
and shipping time (Donnelly, 2016). Similarly, SEB first used Amelia to answer que-
ries coming to its internal IT help desk (Flinders, 2016) before rolling out their
customer-facing system. Both of these were operational examples of predictive ana-
lytics, whereas the later projects were customer-facing. We thus see that the analytics
type and the knowledge focus can change between related studies or projects, without
affecting the possibility of transferring learning—of action influencing knowledge.
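Ocado’s warehouse systems are proprietary, but the co-location decision rests on simple co-occurrence counting of the kind sketched below, with invented orders.

# Co-occurrence counting of the kind underlying "items frequently bought
# together" co-location decisions (orders invented for illustration).
from collections import Counter
from itertools import combinations

orders = [
    {"pasta", "tomato sauce", "parmesan"},
    {"pasta", "tomato sauce", "bread"},
    {"tea", "milk", "bread"},
    {"pasta", "parmesan"},
]

pair_counts = Counter()
for order in orders:
    pair_counts.update(combinations(sorted(order), 2))

# Top pairs are candidates for co-location in the warehouse.
print(pair_counts.most_common(3))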
Another change of type is the plan for UPS’s ORION to include online map and
traffic data, so that it will reconfigure a driver’s pickups and drop offs in real time,
moving from managerial to operational (Davenport and Dyché, 2013). However, as
far as we know, this part of the system has not yet been rolled out.
In both the Warehouse Group and Marks & Spencer, the new knowledge cov-
ers the whole domain of using analytics. Kevin Rowland of the Warehouse Group
again: “I would actually design something in the beginning and give it to the busi-
ness, and ask them what they think, as opposed to going to the business with
a blank page and asking them what they want. Now they understand what the
tool can do but back then they didn’t.” (Ashok, 2014). Equally, Paul Williams,
M&S Head of Enterprise Analytics, observes: “People are asking questions they
wouldn’t have asked before. And asking one question breeds curiosity, often leading
to another 10, 20, or 30 questions.” (TIBCO Software, 2015).
In terms of learning about the data, the UK banking example has relied crucially
on standardization of meaning, by shifting from account number to a customer
identifier, even when that leads to counter-intuitive structures (in database terms)
such as two “unique” customer identifiers for each joint account. As for Caterpillar
Marine, James Stascavage, Intelligence Technology Manager, concluded: “I think
the best lesson we learned is that you can’t collect too much information…the days
of data storage being expensive are gone…if you don’t collect the data, keep the
data and analyse the data, you might not ever find the relationships.” (Marr, 2017)
Equifax is a prime example of developing new techniques in the light of past
examples. The NDT’s increased accuracy could also lead to credit lending in a
broader portion of the market, such as new-to-credit consumers, than previously
possible (Hall et al., 2016).
The history of the online dating website eHarmony shows the evolution of
data knowledge and the operational analytics solutions that were developed
from it. eHarmony started at the end of the last century with the original use of
questionnaire-style variables and data to predict matching and compatibility for
good relationships among couples. Given the need to develop trust among the
service users, eHarmony evolved to use social media data to manage (some of)
the dating uncertainty. Nevertheless, the business these days is affected not only by how customers use the service but also by the business model. As Piskorski, Halaburda,
and Smith (2008) pointed out “eHarmony’s CEO needs to decide how to react to
imitations of its business model, encroachment by competing models and ascen-
dance of free substitutes.” Thus, as we said earlier, all five types of analytics may
turn out to be linked.
Future Developments
In this section we present three main aspects of analytics and knowledge in the
action-oriented view: the evolution of the organizational environment, the evolution of the political environment, and the transformation of the analytics workflow into models that can be embedded into business processes and evolve according to new data input.
Organizational Environment
One of the key difficulties in organizational innovation revealed by the busi-
ness process re-engineering movement in the 1990s (Davenport and Short, 1990;
Hammer, 1990) was the presence of organizational silos: vertical divisions that
acted as barriers to better organizational communication and hence performance.
Over 25 years later, silos still rule in far too many organizations. Henke et al.
(2016) point out that the old organizational silos that often still exist have been
joined by data silos: different systems that cannot be connected to each other.
They give examples of the barriers presented by these silos in three sectors: bank-
ing, manufacturing, and smaller specialized retailers. A survey carried out by Cap
Gemini Consulting in November 2014 (reported by Burns, 2015, p.11) backs this
up. They found that “data scattered in silos across various units,” cited by 46% of
respondents, was the greatest challenge in making better use of Big Data.
Virgo (2017) understands these issues better than most, because of his particu-
lar interest in IT in national and local government, observing that there seems to
have been little improvement generally in the past 10 years. He comments on a
successful example of smart city analytics from New York (unfortunately his article
gives no more details) that “the breakthrough had been to get departments to share
sensitive information on problems they were shy of discussing with others.” Virgo
goes on to identify the three main reasons for this shyness as:
1. Fear that others will learn just how poor (quality, accuracy, etc.) our own
data is
2. The inability to agree on common terminology
3. Fear that others will abuse “our” data
These are specific examples of what Alvesson and Spicer (2012) have labeled a
stupidity-based theory of organizations.
Organizational silos come under the people heading of action issues, and data
silos under the technology (and people) headings. However, taking action to move
forward can also raise difficulties of changing processes. In Equifax, the company’s
operations are entirely structured around the existing analytics techniques. This is
an instance of the well-known problem that first emerged in the 1970s: it is usually
easier to computerize a manual process than to switch from one computer system
to another.
The last big silo, especially for customer-facing analytics, is the one surround-
ing personal data. How much of our personal data are we willing to share with
organizations? A report for the Future Foundation/SAS about UK attitudes (SAS
Institute, 2016a) finds that younger people are much more willing to share personal
data than older people when there is some direct benefit in it for them, but less
willing when the benefits are for society as a whole (e.g., sharing energy use data to
reduce the risk of brownouts or blackouts).
Despite these barriers, some organizations have succeeded in analytics and will
endeavor to continue to do so. Marr (2016) reports four key challenges for Tesco in
the future, all of which relate directly to the data, people, process, and technology issues discussed in this chapter.
Political Environment
As the use of analytics, especially predictive and prescriptive systems based on
deep machine learning, becomes more widespread, the regulatory and gover-
nance environment will undoubtedly need to change to cope with it. This is not
a new issue. The practice of “redlining” in the banking and insurance industries,
whereby loans or insurance are refused to anyone in a particular geographic area
(originally shown by a red line on a map) has been outlawed by many laws, legal
rulings, and judgments over the past 50 years. Similarly, the European Court of
Justice ruled in 2011 that it was against European Union equality laws to use gen-
der in calculating car driver insurance premiums. A machine learning system that
“reinvented” redlining or gender discrimination would put its owners in a difficult
legal position.
As a consequence, the UK Labour Party (currently the main opposition) is call-
ing for the algorithms used in big data analytics to be subject to regulation (Burton,
2016). This would not be a long step from current policy: the UK National Audit
Office already has a detailed framework for the evaluation of models used in gov-
ernment departments (White & Jordan, 2016). However, enforcement in the pri-
vate sector would present a new challenge.
With a worldwide perspective, many feel that artificial intelligence (AI) needs
regulation, resulting in developments such as the Asilomar AI Principles (Future
of Life Institute, 2017), endorsed by Elon Musk and Stephen Hawking amongst
others. Two of the principles are very relevant to our discussions.
Judicial use of analytics would certainly be a big step forward from credit scoring,
a chatbot, or even a self-driving car.
Conclusion
We have explained why we believe that knowledge management is the key to suc-
cessfully putting analytics into action. We discussed many examples, looking at
data, the analytics techniques, and the people, process, and technology issues of
implementing the system.
Despite barriers such as silos, and the potential legal issues of relying too much
on a machine learning system that cannot explain its decisions in terms that a
human can understand, the use of analytics will surely continue to expand rapidly.
The millennial generation seems to be very much at home with it.
However, at the heart of any analytics project there has to be data. Sometimes
the data will pose problems that cannot be overcome. For example, predicting how
long it will take to develop computer software has been an issue for over 40 years.
Sadly, the data required to drive a predictive model giving an accurate (within
10%) forecast is simply not available at the time when the estimate is needed for
business purposes (Edwards and Moores, 1994). Alternatively, the data may not be
collected, even now. As we saw, Hall et al. (2016) reported Equifax as claiming that
the new technique delivers “measurably more accurate” results than their older one.
This raises the question of what data, if any, they had on loans that were declined.
It is almost a logical impossibility to know that a loan that was declined should in
fact have been offered.
Thus an increasing strategic risk stems from poor understanding of what to do
with the knowledge created from data. Knowledge creation through analytics can
be costly in both the search for solutions and in their implementation. Companies
can lose this investment because they are insufficiently prepared to use the new
knowledge (in our terms, lack of domain knowledge). Another strategic risk is to
consider data as a resource and not as an asset. If data are an asset, adequate gover-
nance and care should be provided. The beauty of data and analytics development
is that in many cases organizations will be able to do more with the same resources.
Economies of scope will take a predominant place in generating benefits, instead of
continuing only with the paradigm of economies of scale and growth. Several com-
panies are returning to their core business to use their resources in a better way to
create wealth (GE is one example); data are a particular resource to use to manage
uncertainty and risk.
Turning to the strategic risks of data analytics, these are mainly associated with
the creation of an analytics process that can support accuracy in predictions and
prescriptions, and possibly most important the generation of meaning and under-
standing for the organization’s management. And of course, it all depends on the
accuracy of the data…
So, even in the age of Big Data, analytics, and deep machine learning, an adage
that is as old as the use of computers holds true—“garbage in, garbage out”!
References
Alford, J. (2016). Keep them coming back: Your guide to building customer loyalty with
analytics. Cary, NC: SAS Institute.
Alvesson, M., & Spicer, A. (2012). A stupidity-based theory of organizations. Journal of
Management Studies, 49(7), 1194–1220. doi:10.1111/j.1467-6486.2012.01072.x.
Anthony, R., & Govindarajan, V. (2007). Management control systems. New York, NY:
McGraw-Hill.
Argote, L. (2012). Organizational learning: Creating, retaining and transferring knowledge.
New York, NY: Springer Science and Business Media.
Ashok, S. M. (2014). Case study: Warehouse Group does more with data. Computerworld
New Zealand. Retrieved January 30, 2018 from http://www.computerworld.co.nz/
article/541672/case_study_warehouse_group_does_more_data/
Berger, T., Chen, C., & Frey, C. B. (2017, January 23). Drivers of disruption? Estimating
the Uber effect. Retrieved March 11, 2017, from http://www.oxfordmartin.ox.ac.uk/
downloads/academic/Uber_Drivers_of_Disruption.pdf
Burns, E. (2015). Big data challenges include what info to use—and what not to.
The Key to Maximizing the Value of Data (pp. 7–16): Information Builders/
SearchBusinessAnalytics. Retrieved January 30, 2018 from http://docs.media.bitpipe.
com/io_12x/io_122156/item_1108017/InformationBuilders_sBusinessAnalytics_
IO%23122156_Eguide_032315_LI%231108017.pdf.
Burns, E. (2016). Ebay uses machine learning techniques to translate listings. Retrieved
January 30, 2018 from http://searchbusinessanalytics.techtarget.com/feature/EBay-
uses-machine-learning-techniques-to-translate-listings?utm_content=control&utm_
medium=EM&asrc=EM_ERU_72462481&utm_campaign=20170210_ERU%
20Transmission%20for%2002/10/2017%20(UserUniverse:%202298243)&utm_
source=ERU&src=5608542
Burton, G. (2016). Labour Party to call for regulation of technology companies’ algo-
rithms. Computing. Retrieved January 30, 2018 from http://www.computing.co.uk/
ctg/news/3001432/labour-party-to-call-for-regulation-of-technology-companies-
algorithms
Chambers, M., & Dinsmore, T. W. (2015). Advanced analytics methodologies: Driving busi-
ness value with analytics. Upper Saddle River, NJ: Pearson Education.
Chandler, N., Hostmann, B., Rayner, N., & Herschel, G. (2011). Gartner’s business analytics
framework. Stamford, CT: Gartner.
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics:
From big data to big impact. MIS Quarterly, 36(4), 1165–1188.
Chenoweth, M., Moore, N., Cox, A., Mele, J., & Sollinger, J. (2012). Best practices in sup-
plier relationship management and their early implementation in the Air Force materiel
command. Santa Monica, CA: The RAND Corporation.
Cohen, G., & Kotorov, R. (2016). Organizational intelligence: How smart companies use infor-
mation to become more competitive and profitable. New York, NY: Information Builders.
Davenport, T. H. (2013). Analytics 3.0. Harvard Business Review, 91(12), 64–72.
Davenport, T. H., & Dyché, J. (2013). Big data in big companies: International Institute
for Analytics. Retrieved January 30, 2018 from http://www.sas.com/resources/asset/
Big-Data-in-Big-Companies.pdf
Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning.
Boston, MA: Harvard Business School Press.
Davenport, T. H., & Short, J. E. (1990). The new industrial engineering: Information tech-
nology and business process redesign. Sloan Management Review 31(4), 11–27.
Donnelly, C. (2016). Machine learning helps Ocado’s customer services team wrap up
email overload. Computer Weekly. 29 November 2016, 10–12.
Edwards, J. S. (2009). Business processes and knowledge management. In M. Khosrow-
Pour (Ed.), Encyclopedia of information science and technology (2nd ed., Vol. I, pp.
471–476). Hershey, PA: IGI Global.
Edwards, J. S., & Moores, T. T. (1994). A conflict between the use of estimating and
planning tools in the management of information systems. European Journal of
Information Systems, 3(2), 139–147.
Edwards, J. S., & Rodriguez, E. (2016). Using knowledge management to give context
to analytics and big data and reduce strategic risk. Procedia Computer Science, 99,
36–49. doi:10.1016/j.procs.2016.09.099.
Fearnley, B. (2015). IDC MaturityScape: Big data and analytics in financial services.
Framingham, MA: IDC Research. Retrieved January 30, 2018 from https://www.idc.
com/getdoc.jsp?containerId=US40619515.
Flinders, K. (2016). Case study: How Swedish bank prepared robot for customer services. In B.
Glick (Ed.), Artificial intelligence in the enterprise: Your guide to the latest thinking in AI and
machine learning (pp. 22–26): Computer Weekly. Retrieved January 30, 2018 from http://
www.computerweekly.com/ehandbook/Focus-Artificial-intelligence-in-the-enterprise.
Future of Life Institute. (2017). Asilomar AI principles. Retrieved February 11, 2017, from
https://futureoflife.org/ai-principles/
Goodwin, B. (2014). M&S turns to predictive analytics to keep shelves stocked over Christmas.
Computer Weekly. Retrieved January 30, 2018 from http://www.computerweekly.com/
news/2240236043/MS-turns-to-predictive-analytics-to-keep-shelves-stocked-over-Christmas
Haight, J., & Park, H. (2015). IoT analytics in practice. Boston, MA: Blue Hill Research.
Hall, P., Phan, W., & Whitson, K. (2016). The evolution of analytics: Opportunities and chal-
lenges for machine learning in business. Sebastopol, CA: O’Reilly Media.
Halper, F. (2015). Operationalizing and embedding analytics for action. Renton, WA: TDWI
Research.
Hammer, M. (1990). Re-engineering work: Don’t automate, obliterate. Harvard Business
Review, 68(4), 104–112.
Hawley, D. (2016). Implementing business analytics within the supply chain: Success and
fault factors. Electronic Journal of Information Systems Evaluation, 19(2), 112–120.
Henke, N., Bughin, J., Chui, M., Manyika, J., Saleh, T., Wiseman, B., & Sethupathy, G. (2016).
The age of analytics: Competing in a data-driven world: McKinsey Global Institute.
Retrieved January 30, 2018 from http://www.mckinsey.com/business-functions/
mckinsey-analytics/our-insights/the-age-of-analytics-competing-in-a-data-driven-world
Hill, K. (2012). How Target figured out a teen girl was pregnant before her father did.
Retrieved January 30, 2018 from https://www.forbes.com/sites/kashmirhill/2012/02/16/
how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#5f0167436668
Holsapple, C., Lee-Post, A., & Pakath, R. (2014). A unified foundation for business analyt-
ics. Decision Support Systems, 64, 130–141.
Hong Kong Efficiency Unit. (2017). Efficiency unit. Retrieved March 12, 2017, from http://
www.eu.gov.hk/en/index.html
Kirby, M. W., & Capey, R. (1998). The origins and diffusion of operational research in
the UK. Journal of the Operational Research Society, 49(4), 307–326. doi:10.1057/
palgrave.jors.2600558.
Kiron, D., Ferguson, R. B., & Prentice, P. K. (2013). From value to vision: Reimagining the
possible with data analytics. MIT Sloan Management Review, 54(3), 1–19.
Lopez-Rabson, T., & McCloy, U. (2013). Understanding student attrition in the six greater
Toronto area (GTA) colleges. Toronto, Canada: Higher Education Quality Council of
Ontario.
Luhn, H. P. (1958). A business intelligence system. IBM Journal of Research and
Development, 2(4), 314–319.
Marr, B. (2016). Big data at Tesco: Real time analytics at the UK grocery retail giant. Retrieved
January 30, 2018 from http://www.forbes.com/sites/bernardmarr/2016/11/17/
big-data-at-tesco-real-time-analytics-at-the-uk-grocery-retail-giant/#1c4d5cd4519a
Marr, B. (2017). IoT and big data at Caterpillar: How predictive maintenance saves
millions of dollars. Retrieved January 30, 2018 from https://www.forbes.com/sites/
bernardmarr/2017/02/07/iot-and-big-data-at-caterpillar-how-predictive-maintenance-
saves-millions-of-dollars/#7a18a9867240
McKenna, B. (2016a). Case study: Panama papers revealed by graph database visualisation
software. In B. McKenna (Ed.), Business intelligence in the world of big data (pp. 22–25):
Computer Weekly. Retrieved January 30, 2018 from http://www.computerweekly.
com/ehandbook/IT-Project-Business-intelligence-in-the-world-of-big-data.
McKenna, B. (2016b). Case study: The Christie speeds up SPC charts to improve clinical pro-
cesses. In B. McKenna (Ed.), Business intelligence in the world of big data (pp. 14–17):
Computer Weekly. Retrieved January 30, 2018 from http://www.computerweekly.
com/ehandbook/IT-Project-Business-intelligence-in-the-world-of-big-data.
Mezei, J., Brunelli, M., & Carlsson, C. (2016). A fuzzy approach to using expert knowl-
edge for tuning paper machines. Journal of the Operational Research Society, Advance
Online Publication. doi:10.1057/s41274-016-0105-3
MHR Analytics. (2017). Plotting the data journey in the boardroom: The state of data
analytics 2017. Nottingham, UK: MHR Analytics. Retrieved January 30, 2018 from
http://www.mhr.co.uk/analytics
Moustakerski, P. (2015). Big data evolution: Forging new corporate capabilities for the long
term. London, UK: Economist Intelligence Unit.
Nexidia. (2014). Nexidia and Blue Cross and Blue Shield of North Carolina–voice of the
customer (VoC) analytics to increase clarity and ease of use for customers. Mountain
View, CA: Frost & Sullivan. Retrieved January 30, 2018 from http://www.nexidia.
com/media/2222/fs_casestudy-bcbsnc_final.pdf
Paulk, M. C., Curtis, B., Chrissis, M. B., & Weber, C. V. (1993). Capability maturity
model, version 1.1. IEEE Software, 10(4), 18–27.
Pechter, R. (2009). What’s PMML and what’s new in PMML 4.0? ACM SIGKDD
Explorations Newsletter, 11(1), 19–25.
Piskorski, M. J., Halaburda, H., & Smith, T. (2008). eHarmony. Cambridge, MA: Harvard
Business School. Retrieved January 30, 2018 from http://www.hbs.edu/faculty/Pages/
item.aspx?num=36554
Pittman, D. (2016). Big data, analytics and the retail industry: Luxottica. Retrieved
March 11, 2017, from http://www.ibmbigdatahub.com/presentation/big-data-
analytics-and-retail-industry-luxottica
Contents
Introduction to Data Analytics Processes .............................................................32
Knowledge Discovery in Databases Process .....................................................32
Cross-Industry Standard Process for Data Mining ......................................... 34
Step 1—Business Understanding................................................................35
Step 2—Data Understanding .....................................................................35
Step 3—Data Preparation ..........................................................................36
Step 4—Model Building ............................................................................37
Step 5—Testing and Evaluation .................................................................38
Step 6—Deployment ................................................................................39
Sample, Explore, Modify, Model, Assess Process .............................................39
Step 1—Sample ........................................................................................ 40
Step 2—Explore .........................................................................................41
Step 3—Modify .........................................................................................41
Step 4—Model...........................................................................................41
Step 5—Assess .......................................................................................... 42
Knowledge Discovery in Databases Process
Figure 2.1 The knowledge discovery in databases (KDD) process: raw data are converted to knowledge through (1) data selection (producing the target data), (2) data cleaning (preprocessed data), (3) data transformation (transformed data), (4) data mining (extracted patterns), and (5) internalization (knowledge), with a feedback loop that allows the flow to return from any step to an earlier one.
The KDD process consists of many individual steps or tasks to convert data into knowledge (i.e., actionable
insight). A pictorial representation of the KDD process is given in Figure 2.1.
In Figure 2.1, the processing steps are shown as directed arrows with callout
labels, and the result of each step is shown as a graphical image representing the arti-
fact. As shown therein, the input to the KDD process is a collection of data coming
from organizational databases and/or other external mostly structured data sources.
These data sources are often combined in a centralized data repository called a data
warehouse. A data warehouse enables the KDD process to be implemented effec-
tively and efficiently because it provides a single source for data to be mined. Once
the data are consolidated in a unified data warehouse, the problem-specific data
are extracted and prepared for further processing. As the data are usually in a raw,
incomplete, and dirty state, a thorough preprocessing effort needs to be conducted before
the modeling can take place. Once the data are preprocessed and transformed into
a form for modeling, a variety of modeling techniques are applied to the data to
convert it into patterns, correlations, and predictive models. Once the discovered
patterns are validated, they need to be interpreted and internalized so that they can
be converted into actionable information (i.e., knowledge). One important part of
this process is the feedback loop that allows the process flow to redirect backward,
from any step to any other previous steps, for rework and readjustments.
Figure 2.2 The six-step CRISP-DM process: (1) business understanding, (2) data understanding, (3) data preparation, (4) model building, (5) testing and evaluation, and (6) deployment, arranged as a cyclical, iterative flow.
Step 2—Data Understanding
To better understand the data, the data scientist often uses a variety of statistical
and graphical tools and techniques, such as simple statistical descriptors or summa-
ries of each variable (e.g., for numeric variables the average, minimum, maximum,
median, and standard deviation are among the calculated measures, whereas for
categorical variables, the mode and frequency tables are calculated), correlation
analysis, scatterplots, histograms, box plots, and cross-tabulation. A thorough process
of identifying and selecting data sources and the most relevant variables
can make it more straightforward for downstream algorithms to quickly and accu-
rately discover useful knowledge patterns.
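As a brief illustration (a sketch, not part of the original chapter), the following Python code uses the pandas library with an invented customer dataset to compute the numeric and categorical descriptors just mentioned:

```python
import pandas as pd

# Invented customer records, for illustration only
df = pd.DataFrame({
    "income": [42000, 58000, 39000, 75000, 61000, 47000],
    "age": [23, 45, 31, 52, 38, 29],
    "credit_rating": ["excellent", "fair", "fair", "excellent", "bad", "fair"],
})

# Numeric variables: average, minimum, maximum, median, standard deviation
numeric = df.select_dtypes("number")
print(numeric.agg(["mean", "min", "max", "median", "std"]))

# Categorical variables: mode and frequency table
print(df["credit_rating"].mode())
print(df["credit_rating"].value_counts())

# Correlation analysis among the numeric variables
print(numeric.corr())
```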
Data sources for data selection can vary. Normally, data sources for business
applications include demographic data (such as income, education, number of
people in a household, and age), sociographic data (such as hobby, club member-
ship, and entertainment), and transactional data (such as sales record, credit card
spending, and issued checks), among others. Regardless of the sources, data can be
categorized as quantitative or qualitative. Quantitative data are measured using
numeric values and can be discrete (such as integers) or continuous (such as real
numbers). Qualitative data, also known as categorical data, contain both nominal
and ordinal data. Nominal data have finite nonordered values (e.g., gender data,
which has two values: male and female). Ordinal data have finite ordered values.
For example, customer credit ratings are considered ordinal data because the ratings
can be excellent, fair, and bad. Quantitative data can be readily represented by some
sort of probability distribution. A probability distribution describes how the data are
dispersed and shaped. For instance, normally distributed data are symmetric and are
commonly described as forming a bell-shaped curve. Qualitative data may be coded
to numbers and then described by frequency distributions. Once the relevant data
are selected according to the analytics project objectives, another critical task—data
preparation and preprocessing—would be conducted.
Step 3—Data Preparation
Because real-world data are typically gathered from multiple sources in different formats, such as flat files, databases, and web pages, they need to be converted to a consistent and unified format. In general, data cleaning means to filter, aggregate, and fill in missing values (also known
as imputation). By filtering the data, the selected variables are examined for outliers
and redundancies. Outliers are data points that differ greatly from the majority of the data, or that are clearly out of the range of the selected data groups. For example, if the age of a customer included in the data is 190, it is most likely a data entry error and should be identified and fixed (or perhaps removed from a data mining project in which age is perceived to be a critical component of the customer characteristics). Outliers may be caused by many reasons, such as human
errors or technical errors, or may naturally occur in a dataset due to extreme events.
Suppose the age of a credit card holder is recorded as “12.” This is likely a data entry
error, most likely by a human. However, there might actually be an independently
wealthy pre-teen with important purchasing habits. Arbitrarily deleting this outlier
could dismiss valuable information.
Redundant data are the same information recorded in several different ways.
Daily sales of a particular product are redundant to seasonal sales of the same
product, because we can derive the sales from either daily data or seasonal data. By
aggregating data, data dimensions are reduced to obtain aggregated information.
Note that although an aggregated dataset has a small volume, the information will
remain. If a marketing promotion for furniture sales is considered in the next three
or four years, then the available daily sales data can be aggregated as annual sales
data. The size of the sales data is dramatically reduced. By smoothing the data, missing values in the selected data are found and replaced with new, reasonable values, such as the average (mean) or the mode of the variable. A missing value often prevents a data-mining algorithm from producing a solution when it is applied
to discover the knowledge patterns.
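A minimal pandas sketch of the filtering, imputation, and aggregation operations described above (the column names and values are invented, and mean imputation is only one of several options):

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-06-06", "2023-07-10", "2023-12-01"]),
    "customer_age": [34, 190, 27, np.nan],   # 190 is an apparent entry error
    "daily_sales": [120.0, 95.0, 210.0, 180.0],
})

# Filtering: treat out-of-range ages as missing rather than deleting the record
valid = sales["customer_age"].between(15, 100)
sales.loc[~valid, "customer_age"] = np.nan

# Imputation: fill missing values with the mean (the mode is another option)
sales["customer_age"] = sales["customer_age"].fillna(sales["customer_age"].mean())

# Aggregation: reduce daily sales to annual sales for a multi-year analysis
annual = sales.groupby(sales["date"].dt.year)["daily_sales"].sum()
print(annual)
```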
Step 4—Model Building
Depending on the business need, the analytics task can be of a prediction (either
classification or regression), an association, or a clustering or segmentation type.
Each of these tasks can use a variety of analytics and data mining methods and
algorithms. For instance, classification type data mining tasks can be accomplished
by developed neural networks, decision trees, support vector machines (SVMs), or
logistic regression, among others.
Step 5—Testing and Evaluation
The standard procedure for modeling in data mining is to take a large prepro-
cessed dataset and divide it into two or more subsets for training and validation
or testing. Then, use a portion of the data (the training set) for development of
the models (no matter what modeling technique or algorithms is used), and use
the other portion of the data (the test set) for testing the model that is just built.
The principle is that if you build a model on a particular set of data, it will of course
test quite well on the data that it was built on. Dividing the data, using one part
for model development and a separate part for testing, creates an objective estimate
of the accuracy and reliability of the model. The idea of splitting the data
into components is often carried to additional levels with multiple splits in the
practice of data mining. Further details about data splitting and other evaluation
methods can be found in Delen (2015).
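To make the idea concrete, the sketch below (assuming scikit-learn as the tooling; the chapter does not prescribe a library) trains a model on one portion of a public dataset and reports the objective estimate from the reserved portion:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Reserve one-third of the records so the model is judged on data it never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42, stratify=y)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))  # optimistically high
print("Test accuracy:", model.score(X_test, y_test))        # objective estimate
```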
Step 6—Deployment
Development and assessment of the models is not the end of the analytics project.
Even if the purpose of the model is to have a simple exploration of the data, the
knowledge gained from such exploration will need to be organized and presented
in a way that the end user can understand and benefit from it. Depending on the
requirements, the deployment phase can be as simple as generating a report or
as complex as implementing a repeatable computer-based decision support system
across the enterprise (Delen, Sharda, and Kumar, 2007). In many cases, it is the
customer, not the data analyst, who carries out the deployment steps. However,
even if the analyst will not carry out the deployment effort, it is important for the
customer to understand up front what actions need to be carried out to actually
make use of the created models.
The deployment step may also include maintenance activities for the deployed
models. Because everything about the business is constantly changing, the data that
reflect the business activities are also changing. Over time, the models (and the pat-
terns embedded within them) built on the old data become obsolete, irrelevant, or
misleading. Therefore, monitoring and maintenance of the models are important,
if the analytics results are to become a part of the day-to-day business decision-
making environment. A careful preparation of a maintenance strategy helps to
avoid unnecessarily long periods of incorrect usage of analytics results. To monitor
the deployment of the analytics results, the project needs a detailed plan on the
monitoring process, which may not be a trivial task for complex analytics models.
The CRISP-DM process is the most complete and most popular data mining
methodology practiced in industry as well as in academia. Rather than using it
as is, practitioners typically add their own insights to tailor it to their organization's
style of practice.
Figure 2.3 The SEMMA process: Sample (generate a representative sample of the data), Explore (visually explore and describe the data), Modify (select variables and transform variable representations), Model (use a variety of statistical and machine learning models), and Assess (evaluate the accuracy and usefulness of the models), connected by feedback loops.
By assessing the outcome of each stage in the SEMMA process, one can
determine how to model new questions raised by the previous results, and thus
proceed back to the exploration phase for additional refinement of the data.
That is, as is the case in CRISP-DM, SEMMA is also driven by a highly iterative
experimentation cycle. Here are short descriptions for the five steps that consti-
tute SEMMA.
Step 1—Sample
This is where a portion of a large dataset (big enough to contain the significant
information yet small enough to manipulate quickly) is extracted. For optimal
cost and computational performance, some (including the SAS Institute) advocate
a sampling strategy, which applies a reliable, statistically representative sample
of the full detail data. In the case of very large datasets, mining a representa-
tive sample instead of the whole volume may drastically reduce the processing
time required to get crucial business information. If general patterns appear in
the data as a whole, these will be traceable in a representative sample. If a niche
(rare pattern) is so tiny that it is not represented in a sample and yet so important
that it influences the big picture, it should be discovered using exploratory data description methods.
A more detailed discussion and relevant techniques for assessment and validation of
data mining models can be found in Sharda et al. (2017).
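As an illustrative sketch of representative sampling (not SAS's actual implementation), the following pandas code draws both a simple random sample and a stratified sample that preserves each subgroup's share of a large synthetic file:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
full = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=100_000),
    "spend": rng.gamma(2.0, 50.0, size=100_000),
})

# Simple random sample of 1% of the rows
simple = full.sample(frac=0.01, random_state=7)

# Stratified sample: keep each region's share intact so broad patterns survive
stratified = full.groupby("region", group_keys=False).sample(frac=0.01,
                                                             random_state=7)

print(full["region"].value_counts(normalize=True))
print(stratified["region"].value_counts(normalize=True))
```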
Step 2—Explore
This is where a user searches for unanticipated trends and anomalies to gain a
better understanding of the dataset. After sampling your data, the next step is to
explore it visually or numerically for inherent trends or groupings. Exploration
helps refine and redirect the discovery process. If visual exploration does not reveal
clear trends, one can explore the data through statistical techniques including
factor analysis, correspondence analysis, and clustering. For example, in data min-
ing for a direct mail campaign, clustering might reveal groups of customers with
distinct ordering patterns. Limiting the discovery process to each of these distinct
groups individually may increase the likelihood of exploring richer patterns that
may not be strong enough to be detected if the whole dataset is to be processed
together.
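As a hypothetical illustration of clustering-driven exploration, the scikit-learn sketch below segments synthetic "customers" into groups that could then be mined separately, in the spirit of the direct mail example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer ordering data (five behavioral variables)
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)  # put the variables on comparable scales

# Group customers with similar ordering patterns, then mine each group separately
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
for label in sorted(set(clusters)):
    print(f"cluster {label}: {(clusters == label).sum()} customers")
```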
Step 3—Modify
This is where the user creates, selects, and transforms the variables upon which
to focus the model construction process. Based on the discoveries in the explora-
tion phase, one may need to manipulate data to include information such as the
grouping of customers and significant subgroups, or to introduce new variables.
It may also be necessary to look for outliers and reduce the number of variables,
to narrow them down to the most significant ones. One may also need to modify
data when the “mined” data changes. Because data mining is a dynamic, iterative
process, you can update data mining methods or models when new information
is available.
Step 4—Model
This is where the user searches for a variable combination that reliably predicts a
desired outcome. Once you prepare your data, you are ready to construct models
that explain patterns in the data. Modeling techniques in data mining include
artificial neural networks (ANN), decision trees, rough set analysis, SVMs, logistic
models, and other statistical models—such as time series analysis, memory-based
reasoning, and principal component analysis. Each type of model has particular
strengths, and is appropriate within specific data mining situations depending on
the data. For example, ANN are very good at fitting highly complex nonlinear
relationships while rough sets analysis is known to produce reliable results with
uncertain and imprecise problem situations.
Step 5—Assess
This is where the user evaluates the usefulness and the reliability of findings from
the data mining process. In this final step of the data mining process, the user
assesses the model to estimate how well it performs. A common means of assessing
a model is to apply it to a portion of a dataset put aside (and not used during the
model building) during the sampling stage. If the model is valid, it should work
for this reserved sample as well as for the sample used to construct the model.
Similarly, you can test the model against known data. For example, if you know
which customers in a file had high retention rates and your model predicts reten-
tion, you can check to see whether the model selects these customers accurately. In
addition, practical applications of the model, such as partial mailings in a direct
mail campaign, help prove its validity.
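A sketch of such a holdout assessment on synthetic data follows (scikit-learn is an assumed tooling choice and the retention labels are invented); the confusion matrix shows how accurately the known outcomes are selected:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a customer file with a known retention outcome
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, target_names=["retained", "left"]))
```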
The SEMMA process is quite compatible with the CRISP-DM process. Both
aim to streamline the knowledge discovery process. Both were created as broad
frameworks, which need to be adapted to specific circumstances. In both, once
models are obtained and tested, they can then be deployed to gain value with
respect to business or research application. Even though they have the same goal
and are similar, the SEMMA and CRISP-DM processes have a few differences.
Table 2.1 presents these differences.
Table 2.1 Differences between CRISP-DM and SEMMA

Project initiation (CRISP-DM: Business understanding; SEMMA: N/A). In this phase, CRISP-DM includes activities like project initiation, problem definition, and goal setting; SEMMA does not have a step for this phase.
Data access (CRISP-DM: Data understanding; SEMMA: Sample and Explore). In this phase, both CRISP-DM and SEMMA have steps to access, sample, and explore the data.
Data transformation (CRISP-DM: Data preparation; SEMMA: Modify). In this phase, both CRISP-DM and SEMMA process the data to make it amenable to machine processing.
Model building (CRISP-DM: Modeling; SEMMA: Model). In this phase, both CRISP-DM and SEMMA suggest building and testing various models.
Project evaluation (CRISP-DM: Evaluation; SEMMA: Assess). In this phase, both CRISP-DM and SEMMA suggest assessing the findings against the project goals.
Project finalization (CRISP-DM: Deployment; SEMMA: N/A). In this phase, CRISP-DM prescribes deployment of the results, while SEMMA does not explicitly state it.
Figure 2.4 The Six Sigma DMAIC process: Step 1, Define (understand the problem and define the project goals); Step 2, Measure (measure the suitability and quality of data); Step 3, Analyze (experiment with different models to identify the best fit); Step 4, Improve (assess the knowledge against the project goals); and Step 5, Control (deploy the solutions and control their usability), with feedback between the steps.
Step 1—Define
This is the first step in DMAIC where several tasks are to be accomplished to get
the project set up and started. These tasks include (1) thoroughly understanding the business needs, (2) identifying the most pressing problem, (3) defining the goals and objectives, (4) identifying and defining the data and other resources needed to investigate the business problem, and (5) developing a detailed project plan. As you may have noticed, there is a significant overlap between this step and "Business Understanding," the first step in the CRISP-DM process.
Step 2—Measure
In this step, the mapping between organizational data repositories and the business
problem is assessed. Since data mining requires problem-relevant, clean, and usable
data, identification and creation of such a resource is of critical importance to the
success of the project. In this step, the identified data sources are to be consolidated
and transformed into a format that is amenable to machine processing.
Step 3—Analyze
Now that the data are prepared for processing, in this step, a series of data mining
techniques is used to develop models. Since there is not a single best technique for
a specific data mining task (because there are many and most of them are machine
learning techniques with many parameters to optimize), several probable tech-
niques need to be applied and experimented with to identify and develop the most
appropriate model.
Step 4—Improve
Once the analysis results are obtained, in this step, the improvement possibilities
are investigated. Improvements can be at the technique level or they can be at the
business problem level. For instance, if the model results are not satisfactory, other
more sophisticated techniques (e.g., ensemble systems) can be used to boost the
performance of the models. Also, if the modeling results are not clearly addressing
the business problem, via a feedback loop to previous steps, the very structure of the
analysis can be re-examined and improved, or the business problem can be further
investigated and restated.
Step 5—Control
In this step, a final examination of the project outcomes is conducted and, if they are found satisfactory, the models and results are disseminated to decision makers and/or integrated into the existing business intelligence systems for automation.
The Six Sigma-based DMAIC methodology bears a strong resemblance to the
CRISP-DM process. We have no evidence that one inspired the other; since both
portray rather logical and straightforward steps for any business system analysis
effort, the similarity may simply reflect that fact. In any case, users of DMAIC for
data analytics are rare compared to users of CRISP-DM and SEMMA.
Figure 2.5 Preference poll for standard analytics processes, comparing two successive KDnuggets polls: CRISP-DM 42.0% and 43.0%; "My own" 19.0% and 27.5%; SEMMA 13.0% and 8.5%; KDD process 7.3% and 7.5%; "My organization's" 5.3% and 3.5%; a domain-specific methodology 4.7% and 2.0%. (From Piatetsky, G., CRISP-DM, still the top methodology for analytics, data mining, or data science projects, KDnuggets. Retrieved from http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html, 2014.)
As the poll results suggest, CRISP-DM remains the most complete and most mature methodology for data analytics projects. Many of the approaches that fall under "My Own" are also known to be small deviations (specializations) of CRISP-DM.
In this application case, the focus will be on predicting student attrition for better
management of student retention in higher education institutions.
Student retention is a critical part of many enrollment management systems.
Affecting university rankings, school reputation, and financial well-being, student
retention has become one of the most important priorities for decision makers in
higher education institutions. Improving student retention starts with a thorough
understanding of the reasons behind the attrition. Such an understanding is the
basis for accurately predicting at-risk students and appropriately intervening to
retain them. In this study, using five years of institutional data along with several
data mining techniques (both individual models and ensembles), we developed ana-
lytical models to predict and to explain the reasons behind freshmen student attri-
tion. The comparative analysis results showed that the ensembles performed better
than the individual models, while a balanced dataset produced better prediction
results than an unbalanced dataset. The sensitivity analysis of the models revealed
that the educational and financial variables are among the most important predic-
tors of the phenomenon.
of civic life, social cohesion and the appreciation of diversity, and the improved
ability to adapt to and use technology; and (4) individual social benefits, such as
improved health and life expectancy, improved quality of life for offspring, better
consumer decision-making, increased personal status, and more hobbies and leisure
activities (Hermaniwicz, 2003).
Traditionally, student attrition at a university has been defined as the num-
ber of students who do not complete a degree at that institution. Studies have
shown that far more students withdraw during their first year of college than
during the rest of their higher education (Deberard, Julka, and Deana,
2004; Hermaniwicz, 2003). Since most of the student dropouts occur at the end
of the first year (the freshman year), many of the student retention and attrition
studies (including this study) have focused on first-year dropouts (or the number
of students not returning for the second year). This definition of attrition does not
differentiate between the students who may have transferred to other universities
and obtained their degrees there. It only considers the students dropping out at the
end of the first year voluntarily and not by academic dismissal.
Research on student retention has traditionally been survey driven (e.g., sur-
veying a student cohort and following them for a specified period of time to deter-
mine whether they continue their education) (Caison, 2007). Using such a design,
researchers worked on developing and validating theoretical models including
the famous student integration model developed by Tinto (1993). Elaborating on
Tinto’s theory, others have also developed student attrition models using survey-
based research studies (Berger and Braxton, 1998; Berger and Milem, 1999). Even
though they have laid the foundation for the field, these survey-based research
studies have been criticized for their lack of generalized applicability to other insti-
tutions and the difficulty and costliness of administering such large-scale survey
instruments (Cabrera, Nora, and Castaneda, 1993). An alternative (and/or a com-
plementary) approach to the traditional survey-based retention research is an ana-
lytic approach where the data commonly found in institutional databases is used.
Educational institutions routinely collect a broad range of information about their
students, including demographics, educational background, social involvement,
socioeconomic status, and academic progress. A comparison between the data-
driven and survey-based retention research showed that they are comparable at
best, and to develop a parsimonious logistic regression model, data-driven research
was found to be superior to its survey-based counterpart (Caison, 2007). But in
reality, these two research techniques (one driven by surveys and theories and the
other driven by institutional data and analytic methods) complement and help each
other (Miller and Tyree, 2009). That is, the theoretical research may help identify
important predictor variables to be used in analytical studies, while analytical studies
may reveal novel relationships among the variables that can lead to the development
of new theories and the improvement of existing ones.
To improve student retention, one should try to understand the non-trivial rea-
sons behind the attrition. To be successful, one should also be able to accurately
identify those students that are at risk of dropping out. So far, the vast majority
of student attrition research has been devoted to understanding this complex, yet
crucial, social phenomenon. Even though these qualitative, behavioral, and survey-
based studies revealed invaluable insight by developing and testing a wide range of
theories, they do not provide the much-needed instrument to accurately predict
(and potentially improve) student attrition (Delen, 2011; Miller and Herreid, 2010;
Veenstra, 2009). In this project, we propose a quantitative research approach where
the historical institutional data from student databases is used to develop models
that are capable of predicting, as well as explaining, the institution-specific nature
of the attrition problem. Though the concept is relatively new to higher educa-
tion, for almost a decade now, similar problems in the field of marketing have
been studied using predictive data mining techniques under the name of “churn
analysis,” where the purpose is to identify among the current customers who are
most likely to leave the company so that some kind of intervention process can be
executed for the ones who are worthwhile to retain. Retaining existing customers
is crucial because the related research shows that acquiring a new customer costs
roughly ten times more than keeping the one that you already have (Lemmens and
Croux, 2006).
Analytics Methodology
In this research, we followed a popular data mining methodology called CRISP-DM
(Shearer, 2000), which, as explained in the previous section, is a six-step process: (1)
understanding the domain and developing the goals for the study, (2) identifying,
accessing, and understanding the relevant data sources, (3) preprocessing, clean-
ing, and transforming the relevant data, (4) developing models using comparable
analytical techniques, (5) evaluating and assessing the validity and the utility of the
models against each other and against the goals of the study, and (6) deploying the
models for use in decision-making processes. This popular methodology provides
a systematic and structured way of conducting data mining studies, and hence
increasing the likelihood of obtaining accurate and reliable results. The attention
paid to the earlier steps in CRISP-DM (i.e., understanding the domain of study,
understanding the data, and preparing the data) sets the stage for a successful data
mining study. Roughly 80% of the total project time is usually spent on these first
three steps.
The method evaluation step in CRISP-DM requires comparing the data mining
models for their predictive accuracy. Traditionally, in this comparison process the
complete dataset is split into two subsets, two-thirds for training and one-third for
testing. The models are trained on the training subset and then evaluated on the
testing subset. The prediction accuracy on the testing subset is used to report the
actual prediction accuracies of all evaluated models. Since the dataset is split into
two exclusive subsets randomly, there always is a possibility of those two datasets
not being “equal.” To minimize this bias associated with the random sampling
of the training and testing data samples, we used an experimental design called
k-fold cross-validation. In k-fold cross-validation, also called rotation estimation,
the complete dataset is randomly split into k mutually exclusive subsets of approxi-
mately equal size. The classification model is trained and tested k times. Each time,
it is trained on all but one fold and tested on the remaining single fold. The cross-
validation estimate of the overall accuracy is calculated as simply the average of the
k individual accuracy measures as in the following equation:
$$CV = \frac{1}{k}\sum_{i=1}^{k} PM_i$$

where $CV$ stands for the cross-validation result for a method, $k$ is the number of folds used, and $PM_i$ is the performance measure obtained on fold $i$.
In this case study, to estimate the performance of the prediction models, a ten-
fold cross-validation approach was used. Empirical studies showed that 10 seems
to be an optimal number of folds (that optimizes the time it takes to complete the
test while minimizing the bias and variance associated with the validation process)
(Kohavi, 1995). In tenfold cross-validation the entire dataset is divided into 10
mutually exclusive subsets (or folds). Each fold is used once to test the performance
of the prediction model that is generated from the combined data of the remaining
nine folds, leading to 10 independent performance estimates.
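For illustration (a sketch assuming scikit-learn, on synthetic data), the code below applies the equation above with k = 10: each fold's accuracy is one PM_i, and the cross-validation estimate is their average:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Tenfold cross-validation: train on nine folds, test on the tenth, ten times
scores = cross_val_score(SVC(), X, y, cv=10, scoring="accuracy")

# CV = (1/k) * sum of the per-fold performance measures
print("Per-fold accuracy:", np.round(scores, 3))
print("Cross-validation estimate:", scores.mean())
```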
A pictorial depiction of this evaluation process is shown in Figure 2.6. With this
experimental design, if k is set to 10 (which is the case in this study and a common
practice in most predictive data mining applications), for each of the seven model
types (four individual and three ensembles), 10 different models are developed and
tested. Combined with the tenfold experimentation conducted on the original (i.e.,
unbalanced) datasets using the four individual model types, the total number of
models developed and tested for this study was 110.
Data Description
The data for this study came from a single institution (a comprehensive public uni-
versity located in the Midwest region of the United States) with an average enroll-
ment of 23,000 students, of which roughly 80% are residents of the same state and
roughly 19% of the students are listed under some minority classification. There is
no significant difference between the two genders in the enrollment numbers. The
average freshman student retention rate for the institution is about 80%, and the
average 6-year graduation rate is about 60%.
In this study we used five years of institutional data, which entailed 16,066
students enrolled as freshmen between (and including) the years of 2004 and 2008.
The data was collected and consolidated from various university student databases.
A brief summary of the number of records (i.e., freshman students) by year is given
in Table 2.2.
Figure 2.6 Analytics process employed for the student attrition prediction study: preprocessed data enter a tenfold design of experiments; models are built and tested on the folds, producing experiment results as confusion matrices (correctly and incorrectly predicted yes/no cases); the prediction models are then deployed, and a sensitivity analysis ranks the variables by importance.
We removed the international student records from the dataset because they did not contain some
of the presumed important predictors (e.g., high school GPA and SAT scores). In
the data transformation phase, some of the variables were aggregated (e.g., “Major”
and “Concentration” variables aggregated to binary variables MajorDeclared and
ConcentrationSpecified) for better interpretation for the predictive modeling.
Additionally, some of the variables were used to derive new variables (e.g., Earned/
Registered and YearsAfterHighSchool).
$$EarnedByRegistered = \frac{EarnedHours}{RegisteredHours}$$

$$YearsAfterHighSchool = FreshmenEnrollmentYear - HighSchoolGraduationYear$$
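A small pandas sketch (with invented records) of how such derived and aggregated variables might be computed:

```python
import pandas as pd

students = pd.DataFrame({
    "EarnedHours": [15, 9, 12],
    "RegisteredHours": [15, 15, 12],
    "FreshmenEnrollmentYear": [2008, 2007, 2008],
    "HighSchoolGraduationYear": [2008, 2004, 2007],
    "Major": ["BIOL", None, "CS"],
})

# Derived ratio and gap variables, as defined above
students["EarnedByRegistered"] = (students["EarnedHours"]
                                  / students["RegisteredHours"])
students["YearsAfterHighSchool"] = (students["FreshmenEnrollmentYear"]
                                    - students["HighSchoolGraduationYear"])

# Aggregate a detailed variable into a binary flag for easier interpretation
students["MajorDeclared"] = students["Major"].notna().astype(int)
print(students)
```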
Figure 2.7 MLP-type artificial neural network architecture used in this study: socio-demographic, financial, educational, and other input variables feed a multilayer perceptron trained with backpropagation to predict whether the student returned for the second fall semester (yes/no), with predicted values compared against actual outcomes.
Well-known decision tree algorithms include ID3, C4.5, C5, and Breiman et al. (1984)'s classification and regression trees
(CART) and chi-squared automatic interaction detector (CHAID). In this
study, we used the C5 algorithm, which is an improved version of the C4.5
and ID3 algorithms.
Logistic regression is a generalization of linear regression. It is used primarily for
predicting binary or multiclass dependent variables. Because the response vari-
able is discrete, it cannot be modeled directly by linear regression. Therefore,
rather than predicting a point estimate of the event itself, it builds the model
to predict the odds of its occurrence. While logistic regression has been a
common statistical tool for classification problems, its restrictive assumptions
on normality and independence led to an increased use and popularity of
machine learning techniques for real-world prediction problems.
Support vector machines (SVMs) belong to a family of generalized linear mod-
els that achieves a classification or regression decision based on the value of
the linear combination of features. The mapping function in SVMs can be
either a classification function (used to categorize the data, as is the case in
this study) or a regression function (used to estimate the numerical value of
the desired output). For classification, nonlinear kernel functions are often
used to transform the input data (inherently representing highly complex
nonlinear relationships) to a high-dimensional feature space in which the data become more easily separable.
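As an illustrative sketch only (scikit-learn's CART-style tree and generic MLP stand in for the C5 algorithm and the study's backpropagation network, and the data are synthetic), the four individual model types could be compared as follows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student records used in the study
X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

# The four individual model types, in generic scikit-learn form
models = {
    "ANN (MLP)": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                               random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: tenfold CV accuracy = {acc:.3f}")
```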
Sensitivity Analysis
In machine-learning algorithms, sensitivity analysis is a method for identifying
the “cause and effect” relationship between the inputs and outputs of a prediction
model (Delen et al., 2017). The fundamental idea of sensitivity analysis is that it
measures the importance of predictor variables based on the change in model-
ing performance that occurs if a predictor variable is not included in the model.
Hence, the measure of sensitivity of a specific predictor variable is the ratio of the
error of the trained model without the predictor variable to the error of the model
that includes this predictor variable. The more sensitive the network is to a particu-
lar variable, the greater the performance decrease would be in the absence of that
variable, and therefore the greater the ratio of importance. This method is often
followed in machine learning techniques to rank the variables in terms of their
importance according to the sensitivity measure defined in the following equation
(Saltelli, 2002):
$$S_i = \frac{V_i}{V(F_t)} = \frac{V(E(F_t \mid X_i))}{V(F_t)}$$

where $V(F_t)$ is the unconditional output variance. In the numerator, the expectation operator $E$ calls for an integral over $X_{-i}$, that is, over all input variables but $X_i$; the variance operator $V$ then implies a further integral over $X_i$. Variable importance is then computed as the normalized sensitivity. Saltelli et al. (2004) showed that the
equation above is the proper measure of sensitivity to rank the predictors in order of
importance for any combination of interaction and nonorthogonality among pre-
dictors. As for the decision trees, variable importance measures were used to judge
the relative importance of each predictor variable. Variable importance ranking
uses surrogate splitting to produce a scale which is a relative importance measure
for each predictor variable included in the analysis. Further details on this proce-
dure can be found in Breiman et al. (1984).
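The variance-based measure above requires integrating over the model's inputs; as a practical and widely used approximation (not the study's exact procedure), the scikit-learn sketch below ranks predictors by permutation importance, that is, by the performance drop observed when one variable's information is scrambled:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# How much does accuracy drop when one predictor's values are shuffled?
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = sorted(zip(result.importances_mean, range(X.shape[1])), reverse=True)
for score, idx in ranking:
    print(f"feature {idx}: importance = {score:.4f}")
```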
Results
In the first set of experiments, we used the original dataset which was composed
of 16,066 records. Based on the tenfold cross-validation, the SVMs produced the
best results with an overall prediction rate of 87.23%; the decision tree came out
as the runner up with an overall prediction rate of 87.16%; followed by ANN and
logistic regression with overall prediction rates of 86.45% and 86.12% respectively
(see Table 2.4). A careful examination of these results reveals that the prediction
accuracy for the “yes” class is significantly higher than the prediction accuracy of
the "no" class. In fact, all four model types predicted the students who are likely to
return for the second year with better than 90% accuracy while they did poorly on
predicting the students who are likely to drop out after the freshman year with less
than 50% accuracy. Since the prediction of the “no” class is the main purpose of
this study, less than 50% accuracy for this class was deemed unacceptable. Such a
difference in prediction accuracy of the two classes can be attributed to the skew-
ness of the original dataset (i.e., approximately 80% “yes” and approximately 20%
“no” samples). Previous studies also commented on the importance of having a
balanced dataset for building accurate prediction models for binary classification
problems (Wilson and Sharda, 1994).
In the next round of experiments, we used a well-balanced dataset where the
two classes were represented equally. To realize this approach, we took all of the
samples from the minority class (i.e., the “no” class herein) and randomly selected
an equal number of samples from the majority class (i.e., the “yes” class herein),
and repeated this process ten times to reduce the bias of random sampling. Each of
these sampling processes resulted in a dataset of 7,018 records, of which 3,509 were
labeled as “no” and 3,509 were labeled as “yes.” Using a tenfold cross-validation
methodology, we developed and tested prediction models for all four model types.
The results of these experiments are shown in Table 2.5. Based on the hold-out
sample results, SVMs generated the best overall prediction accuracy with 81.18%,
followed by decision trees, ANN, and logistic regression with overall prediction
accuracy of 80.65%, 79.85%, and 74.26% respectively. As can be seen in the per-
class accuracy figures, the prediction models did significantly better on predicting
the “no” class with the well-balanced data than they did with the unbalanced data.
Overall, the three machine learning techniques performed significantly better than
their statistical counterpart, logistic regression.
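The balanced-sampling procedure itself is straightforward to reproduce. The sketch below is a hypothetical reconstruction with scikit-learn; the chapter does not specify an implementation, and the data generated here are synthetic stand-ins for the institutional records.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic, skewed stand-in: roughly 80% "yes" (1) and 20% "no" (0), as in the study.
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.8).astype(int)

minority = np.flatnonzero(y == 0)                  # the "no" class
majority = np.flatnonzero(y == 1)                  # the "yes" class

scores = []
for _ in range(10):                                # repeat sampling to reduce bias
    picked = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, picked])       # well-balanced subsample
    scores.append(cross_val_score(SVC(), X[idx], y[idx], cv=10).mean())

print(f"Mean 10-fold CV accuracy over 10 balanced samples: {np.mean(scores):.3f}")
```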
Next, another set of experiments was conducted to assess the predictive ability
of the three ensemble models. Based on the tenfold cross-validation methodology,
the information fusion type ensemble model produced the best results with an over-
all prediction rate of 82.10%, followed by the bagging type ensembles and boosting
type ensembles with overall prediction rates of 81.80% and 80.21% respectively.
(See Table 2.6 for a complete list of results for the ensembles.) Even though the pre-
diction results of the ensembles are only slightly better than those of the individual
models, ensembles are known to produce more robust prediction systems than a
single-best prediction model (Delen, 2015).
In addition to assessing the prediction accuracy for each model type, a sen-
sitivity analysis was conducted on the developed models to identify the relative
importance of the independent variables (i.e., the predictors). In realizing the
overall sensitivity analysis results, each of the four individual model types gener-
ated its own sensitivity measures ranking all of the independent variables in a
prioritized list. Each model type generated slightly different sensitivity rankings
of the independent variables. After collecting all four sets of sensitivity numbers,
the sensitivity numbers were normalized and aggregated into a single table (see
Table 2.7).
Using the numerical figures from Table 2.7, a horizontal bar chart is created
to pictorially illustrate the relative sensitivity or importance of the independent
variables (see Figure 2.8). In Figure 2.8, the y-axis lists the independent variables
in the order of sensitivity or importance from top (most important) to bottom (the
least important) while the x-axis shows the aggregated relative importance of each
variable.
[Figure 2.8 Relative sensitivity-based importance of the independent variables. The x-axis denotes the normalized relative importance measure; the y-axis lists the variables from most to least important: EarnedByRegistered, SpringStudentLoan, FallGPA, SpringGrantTuitionWaiverScholarship, FallRegisteredHours, FallStudentLoan, MaritalStatus, AddmisionType, Ethnicity, SATHighMath, SATHighEnglish, FallFederalWorkStudy, FallGrantTuitionWaiverScholarship, PermenantAddressState, SATHighScience, CLEPHours, SpringFederalWorkStudy, SATHighComprehensive, SATHighReading, TransferedHours, ReceivedFallAid, MajorDeclared, ConsentrationSpecified, Sex, StartingTerm, HighSchoolGraduationMonth, HighSchoolGPA, Age, YearsAfterHS.]
As this ranking indicates, the most important predictors relate to the educational success of the student and whether they are getting financial help. To
improve the retention rates, institutions may choose to enroll more academically
successful students, and provide them with financial assistance. Also, it might be
of interest to monitor the academic experience of freshmen students in their first
semester through looking at a combination of grade point average and the ratio of
completed hours over enrolled hours.
The focus (and perhaps the limitation) of this study is the fact that it aims to
predict attrition using institutional data. Even though it leverages the findings of
the previous theoretical studies, this study is not meant to develop a new theory;
rather, it is meant to show the viability of predictive analytics methods as a means to
provide an alternative to understanding and predicting student attrition at higher
education institutions. From the practicality standpoint, an information system
encompassing these prediction models can be used as a decision aid to student
success and management departments who are serious about improving retention.
Potential future directions of this study include (1) extending the predictive
modeling methods and ensembles with more recent techniques such as rough set
analysis and meta-modeling, (2) enhancing the information sources by including
the data from survey-based institutional studies (which are intentionally crafted
and carefully administered for retention purposes) in addition to the variables in
the institutional databases, and (3) deploying the system as a decision aid for
administrators to assess its suitability and usability in the real world.
References
Berger, J.B. and Braxton, J.M. (1998). Revising Tinto’s interactionalist theory of student
departure through theory elaboration: Examining the role of organizational attri-
butes in the persistence process, Research in Higher Education 39(2): 103–119.
Berger, J.B. and Milem, J.F. (1999). The role of student involvement and perceptions of
integration in a causal model of student persistence, Research in Higher Education
40(6): 641–664.
Breiman, L. (2001). Random forests, Machine Learning 45(1): 5–32.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and
Regression Trees. Monterey, CA: Wadsworth and Brooks/Cole Advanced Books and
Software.
Cabrera, A.F., Nora, A., and Castaneda, M.A. (1993). College persistence: Structural equa-
tions modeling test of an integrated model of student retention, Journal of Higher
Education 64(2): 123–139.
Caison, A.L. (2007). Analysis of institutionally specific retention research: A comparison
between survey and institutional database methods, Research in Higher Education
48(4): 435–449.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R.
(2000). CRISP-DM 1.0 step-by-step data mining guide. https://www.the-modeling-agency.com/crisp-dm.pdf (accessed January 2018).
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines
and Other Kernel-based Learning Methods. London, UK: Cambridge University Press.
DeBerard, M.S., Spielmans, G.I., and Julka, D.L. (2004). Predictors of academic achievement
and retention among college freshmen: A longitudinal study, College Student Journal
38(1): 66–81.
Delen, D. (2010). A comparative analysis of machine learning techniques for student reten-
tion management, Decision Support Systems 49(4): 498–506.
Delen, D. (2011). Predicting student attrition with data mining methods, Journal of College
Student Retention: Research, Theory & Practice 13(1): 17–35.
Delen, D. (2015). Real-World Data Mining: Applied Business Analytics and Decision Making.
Upper Saddle River, NJ: FT Press.
Delen, D., Sharda, R., and Kumar, P. (2007). Movie forecast guru: A web-based DSS for
Hollywood managers, Decision Support Systems 43(4): 1151–1170.
Delen, D., Tomak, L., Topuz, K., and Eryarsoy, E. (2017). Investigating injury severity
risk factors in automobile crashes with predictive analytics and sensitivity analysis
methods. Journal of Transport & Health 4: 118–131.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowledge
discovery in databases. AI Magazine 17(3): 37–54.
Gansemer-Topf, A.M. and Schuh, J.H. (2006). Institutional selectivity and institutional
expenditures: Examining organizational factors that contribute to retention and
graduation, Research in Higher Education 47(6): 613–642.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. New York: Springer.
Hermanowicz, J.C. (2003). College Attrition at American Research Universities: Comparative
Case Studies. New York: Agathon Press.
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an
unknown mapping and its derivatives using multilayer feed-forward network, Neural
Networks 3: 359–366.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and
model selection, in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI),
San Mateo, CA: Morgan Kaufmann, pp. 1137–1145.
Lemmens, A. and Croux, C. (2006). Bagging and boosting classification trees to predict
churn, Journal of Marketing Research 43(2): 276–286.
Miller, T.E. and Herreid, C.H. (2010). Analysis of variables: Predicting sophomore persis-
tence using logistic regression analysis at the University of South Florida, College and
University 85(1): 2–11.
Miller, T.E. and Tyree, T.M. (2009). Using a model that predicts individual student attri-
tion to intervene with those who are most at risk, College and University 84(3): 12–21.
Piatetsky, G. (2014). CRISP-DM, still the top methodology for analytics, data mining, or
data science projects, KDnuggets. Retrieved from http://www.kdnuggets.com/2014/10/
crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
Quinlan, J. (1986). Induction of decision trees, Machine Learning 1: 81–106.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan
Kaufmann.
Saltelli, A. (2002). Making best use of model evaluations to compute sensitivity indices,
Computer Physics Communications 145: 280–297.
Saltelli, A., Tarantola, S., Campolongo, F., and Ratto, M. (2004). Sensitivity Analysis in
Practice—A Guide to Assessing Scientific Models. Hoboken, NJ: John Wiley & Sons.
Sharda, R., Delen, D., and Turban, E. (2017). Business Intelligence, Analytics, and Data
Science: A Managerial Perspective (4th ed.). London, UK: Pearson Education.
Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining, Journal of
Data Warehousing 5: 13–22.
Thammasiri, D., Delen, D., Meesad, P., and Kasap, N. (2014). A critical assessment of
imbalanced class distribution problem: The case of predicting freshmen student attri-
tion, Expert Systems with Applications 41(2): 321–330.
Thomas, E.H. and Galambos, N. (2004). What satisfies students? Mining student opinion
data with regression and decision tree analysis, Research in Higher Education 45(3):
251–269.
Tinto, V. (1993). Leaving College: Rethinking the Causes and Cures of Student Attrition.
(2nd ed.). Chicago, IL: The University of Chicago Press.
Veenstra, C.P. (2009). A strategy for improving freshman college retention, Journal for
Quality and Participation 31(4): 19–23.
Wilson, R.L. and Sharda, R. (1994). Bankruptcy prediction using neural networks, Decision
Support Systems 11: 545–557.
Chapter 3
Transforming Knowledge
Sharing in Twitter-Based
Communities Using
Social Media Analytics
Nicholas Evangelopoulos, Shadi Shakeri,
and Andrea R. Bennett
Contents
Introduction ........................................................................................................68
Collective Knowledge within Communities of Practice .......................................69
Evolution of Analytics in Knowledge Management .............................................71
Social Media Analytics.........................................................................................75
Twitter-Based Communities as Communities of Practice ................................76
Twitter-Based Communities as Organizations ................................................ 77
Transforming Tacit Knowledge in Twitter-Based Communities ..................78
Representing Twitter-Based Community Knowledge in
a Dimensional Model .................................................................................78
User Dimension ..................................................................................................81
Interaction among Users......................................................................................85
Time Dimension .................................................................................................92
Location Dimension............................................................................................94
Topic Dimension ................................................................................................97
Introduction
In the early years of knowledge-based systems development, two influential
approaches emerged within the artificial intelligence community. The first
approach was that of expert systems, where knowledge was acquired by
engineers through copious interviews with subject-matter experts. Typically, the
engineer would spend as many as 100–200 hours performing the interviews, to
collect the raw material needed to build a knowledge base that represented the
declarative and procedural knowledge of the human experts in a format that
facilitated its retrieval (e.g., Diederich, Ruhmann, & May, 1987; Gonzalez &
Dankel, 1993; Rich, Knight, & Nair, 2009, p. 427; Sharda, Delen, & Turban,
2018, p. 13). A second approach was rule-induction systems, where algorithms
processed training sets of examples and searched automatically for patterns that
could be coded or quantified (e.g., Gonzalez & Dankel, 1993, p. 62; Rich et al.,
2009, p. 355). This approach had an inherent advantage: interaction between
knowledge engineers and human experts was no longer necessary. All that was
necessary was a good algorithm, a fast machine, and a good amount of train-
ing data. Eventually, this approach led to the explosion of machine learning,
data mining, and stock-trading bots. However, as the effort to acquire knowledge
from individual experts was side-tracked, a new opportunity became practically
possible: the acquisition of socially constructed, tacit knowledge from human
communities and organizations.
In the era of social media, Big Data, and the internet of things, social media
communities fully operate in cyberspace, their primary function is contributing to
a social media document collection (corpus), and the knowledge of their members
is inscribed in the corpus in ways that do not make it readily available without some
processing. Social media analytics (SMAs) offer the opportunity to probe a social
media corpus, extract its inscribed knowledge, and transform its tacit elements to
explicit. In this chapter, we present a systematic approach to this probing and we
demonstrate its application using vignettes that involve various Twitter-based com-
munities (TBCs).
Data analytics tools and techniques, as part of the knowledge infrastructure (KI), have improved access to
rich sources of data and have improved knowledge discovery. They have also con-
tributed to the development of knowledge bases extracted from vast amounts
of unstructured, messy data found on social media and the web. The insights
obtained from Big Data through the application of data or text analytics have
enhanced the production of innovation and helped in sustaining the organization
in the constantly changing business environment. Therefore, the emergence of
analytics offers new avenues for knowledge managers for creating, disseminating,
adopting, and applying the knowledge that was otherwise unusable. Enhanced
with analytic tools, KM can support the decision-making process within an
organization.
Figure 3.1 depicts the evolution of KM. It is around the early 2000s that
modern analytics, such as data and text mining (Tsui, 2003), enter the fold.
These tools, such as those employed in the studies featured in this chapter, enable
analysts to discern meaningful patterns and associations within data (including
words and phrases). Data and text mining tools are important for businesses seek-
ing to engage in direct marketing, implement customer-relationship management
applications, and generate business intelligence, because their outcomes can be
utilized in processes such as decision-making, content management, and match-
ing customer segments to products and services. Though Figure 3.1 begins in the
1970s, the evolution of KM is much older, emerging in early economies based on
natural resources (Wiig, 1997b), as indicated by the emphasis on and apprecia-
tion of the knowledge held and employed by master craftsmen (e.g., blacksmiths,
masons, tailors, etc.) and members of trade guilds.

[Figure 3.1 The evolution of knowledge management. Milestones noted in the time line include: IBM enters the PC marketplace; the first knowledge management system (KMS) is developed; Tim Berners-Lee proposes the framework for the World Wide Web; social networking sites and Enterprise 2.0 emerge; social media analytics; and individuals and culture are recognized as critical to knowledge creation, dissemination, and application.]

This specialized emphasis on
knowledge remained largely consistent until the late twentieth century (where
the time line of Figure 3.1 begins), when the mass availability of information
technology (IT) systems meant that business leaders had more control over the
efficiency of their firms’ manufacturing, marketing, and logistics networks.
Extensive information gathering on customers and other firms, and the ability to
store this information in databases, led to business practices such as point-of-sale
analysis and total quality management. The 1980s and the 1990s saw an overall
shift in business emphasis away from products, manufacturing, and skills and
toward the use of knowledge and other intellectual capital for increasing market
share (Wiig, 1997b). This emphasis encouraged many organizations to pursue
KM strategies, including working collaboratively with customers to understand
their wants and needs.
The emergence of social networking sites (SNSs) in the early 2000s enabled
individuals to build and maintain larger social networks, including with others
whom they knew online and in person; self-publish their ideas and opinions;
and collaborate with others globally to generate and manage knowledge (as evi-
denced by the coproduction of Wikipedia entries) (Hemsley & Mason, 2013).
These SNS-associated abilities mean that SNSs enable users to engage in collective
sense making, construction of meaning, and maintenance of collective knowl-
edge. The current and ongoing tools of social media analysis (SMA) allow for a
contemporary understanding of knowledge as an object, a process, and as access
to information. SMA allows for the conversion of tacit knowledge to an explicit
knowledge object that can be stored, updated, and referenced. SMA allows for the
process of tacit knowledge creation, sharing, distribution, and usage to be explic-
itly traced across social networks. And, because SMA enables the conversion of
tacit knowledge to explicit, knowledge holders can oversee who has access to this
codified knowledge.
Figure 3.2 depicts the evolution of business intelligence and analytics (BI&A)
terminology since the 1970s, when the primary focus of such applications was
the provision of structured reports that business leaders could reference during
broader society level). As Gruber (2008) argues, “true collective intelligence can
emerge if the data collected from all those people is aggregated and recombined”
to develop new insights and new learning that would otherwise be hard to achieve
at the individual level (p. 5). The success in the achievement of this goal depends
on the use of appropriate data analytic techniques. We present some of these
techniques later in this chapter.
◾ Search and retrieval metadata, such as search keys and indices, document
source, versioning, URL, source type, waiting time, username, password,
directory, search engine, etc.
◾ Text mining metadata, such as keywords, topics, clusters, other features,
summaries, opinion categories, etc.
◾ Storage metadata, such as store document index, summary index, URL
index, pathname, translation information, etc.
[Figure 3.3 A dimensional model (star schema) for a corpus of Tweets: the central Corpus_fact table (author_key, time_key, location_key, text, favorited, favorite_count, reply_to_SN, status_source, retweet_count) references an Author dimension (author_key, author_id, name, screen_name, address, contact_info, category), a Time dimension (time_key, timestamp, hour, day, date, month, year), and a Location dimension (location_key, longitude, latitude, time_zone, sector, area).]

The model consists of a central fact table and several
dimension tables. The fact table records each Tweet with references to who is the
author, when it was posted, and from which location. Each reference is expanded
as an entry in a dedicated table, or dimension, where details for the referenced
authors, locations, and time periods are provided. This dimensional model pro-
vides a platform that facilitates the efficient generation of reports and visualiza-
tions that constitute a well-established mechanism for providing organizational
intelligence.
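As a rough sketch of one possible physical design, the star schema of Figure 3.3 could be declared as follows in SQLite; the table and column names follow the figure, time_dim is our renaming of the Time dimension, and the sample report query is illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE author   (author_key INTEGER PRIMARY KEY, author_id TEXT, name TEXT,
                       screen_name TEXT, address TEXT, contact_info TEXT, category TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, timestamp TEXT, hour INTEGER,
                       day INTEGER, date TEXT, month INTEGER, year INTEGER);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, longitude REAL, latitude REAL,
                       time_zone TEXT, sector TEXT, area TEXT);
CREATE TABLE corpus_fact (
    author_key   INTEGER REFERENCES author(author_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    location_key INTEGER REFERENCES location(location_key),
    text TEXT, favorited INTEGER, favorite_count INTEGER,
    reply_to_SN TEXT, status_source TEXT, retweet_count INTEGER);
""")

# A report such as "Tweets per author category per month" is then a single query.
rows = con.execute("""
    SELECT a.category, t.year, t.month, COUNT(*) AS tweets
    FROM corpus_fact f
    JOIN author a   ON a.author_key = f.author_key
    JOIN time_dim t ON t.time_key   = f.time_key
    GROUP BY a.category, t.year, t.month
""").fetchall()
```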
Occasionally, custom fact tables and custom dimensions are included in the
dimensional model (Adamson & Venerable, 1998, p. 20). As we will show later in
this chapter, our additional facts and dimensions are produced with the use of social
analytics. Such knowledge might be derived days or months after the recording of
the original core facts (or, in the case of real-time analytics, ex post, that is, a frac-
tion of a second after the core facts are captured). To present the model, we employ
database management terminology to refer to new facts as derived facts and their
corresponding dimensions as derived dimensions. Figure 3.4 shows the process of
producing these derived facts and dimensions.
The process shown in Figure 3.4 starts with the extraction of topics or opin-
ions expressed in Tweets, using text analytics. In a preliminary phase, the results
of these analytics are stored as Tweet metadata, or, in database modeling terms,
as derived attributes of the fact table. Subsequently, data warehouse extract,
transform, and load (ETL) operations are used to organize topics and opinions
in separate tables, so that they can be managed independently and serve as a
standardized, common reference across the entire data warehouse. As shown in
Figure 3.5, the data warehouse design now looks like an expanded star schema
that includes derived (i.e., computed using analytics or database queries) facts, as
well as derived dimensions. We will refer to this configuration as a knowledge base
data warehouse (KBDW) schema.
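A hypothetical ETL step of this kind, again in SQLite, might promote a derived Tweet-level attribute into its own dimension; the topic_label column and the toy rows below are assumptions for illustration, not the chapter's data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE corpus_fact (tweet_id INTEGER PRIMARY KEY, text TEXT, "
            "topic_label TEXT)")                    # topic_label: derived attribute
con.executemany("INSERT INTO corpus_fact VALUES (?, ?, ?)",
                [(1, "zika symptoms in adults", "zika symptoms"),
                 (2, "zika travel advisory", "zika travel")])

# ETL: normalize the derived attribute into a Topic dimension that the rest of
# the warehouse can reference as a standardized, common dimension.
con.executescript("""
CREATE TABLE topic (topic_key INTEGER PRIMARY KEY, label TEXT UNIQUE);
INSERT OR IGNORE INTO topic (label) SELECT DISTINCT topic_label FROM corpus_fact;
ALTER TABLE corpus_fact ADD COLUMN topic_key INTEGER;
UPDATE corpus_fact SET topic_key =
    (SELECT topic_key FROM topic WHERE label = corpus_fact.topic_label);
""")
```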
[Figure 3.5 detail: Corpus_fact (author_key, time_key, location_key, text, favorited, favorite_count, reply_to_SN, status_source, retweet_count) gains the derived attributes topic_key, topic_strength, opinion_key, and opinion_strength, and references the derived dimensions Topic (topic_key, label, topic_type, grain_level) and Opinion (opinion_key, opinion_type) in addition to the Author, Time, and Location dimensions.]
Figure 3.5 Expanded dimensional model with derived dimensions and their
interactions.
Our proposed dimensional model for a KBDW offers the opportunity to create
a knowledge base that can measure the community member contributions, identify
time trends, locate the distribution of events and information across geographical
or other types of places, and track the diffusion of sentiments and ideas across time,
space, and communities. On a conceptual level, our model bears a lot of similarity
with the analytic variant of a customer relationship management data warehouse
(Kimball & Ross, 2002, p. 144). However, the focus of this chapter is on a corpus
that reflects the activity, the accomplishments, and the knowledge of a CoP, the
analytics that can uncover this knowledge, and the database models that can orga-
nize this knowledge for easy access, retention, and transfer outside the minds of
the most entrenched community members. In the sections that follow, we examine
various dimensions of the KBDW and their interactions. We begin with the user
dimension, presented in the next section.
User Dimension
The author of a document is arguably its most important attribute. On a collec-
tive level, multiple authors produce the user dimension of a corpus. Understanding
the authors includes an investigation of who they are as community members or
organizational stakeholders. Authors can vary significantly by level of involvement
and document authoring activity. Vignette 3.1 presents an example of how the user
dimension contributes to the knowledge base of a specific TBC.
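For concreteness, the derived user attributes that appear later in Figure 3.8 (user_rank, user_freq, and user_cumulat%) could be computed along the following lines; this is a pandas sketch over toy screen names, not the vignette's actual data.

```python
import pandas as pd

# One row per Tweet, keyed by the author's screen name (assumed input frame).
tweets = pd.DataFrame({"screen_name": ["hniman", "Zika_News", "hniman",
                                       "Crof", "hniman", "Zika_News"]})

freq = tweets["screen_name"].value_counts()            # user_freq, most active first
user_dim = pd.DataFrame({
    "user_rank": range(1, len(freq) + 1),              # 1 = most active author
    "user_freq": freq.values,
    "user_cumulat%": (freq.cumsum() / freq.sum() * 100).values,
}, index=freq.index)
print(user_dim)
```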
Table 3.1 The most active users in the Zika TBC (excerpt)

Rank | User (Name) | Description | Tweets | Cumulative %
1 | Hniman (Henry L. Niman) | Individual; tracks and analyzes infectious diseases (e.g., H1N1, Zika, etc.) | 340 | 1.95
2 | Zika_News (Zika News) | Organization; detects and shares news about the Zika virus | 292 | 3.63
3 | TheFlaviviruses (Flavivirus Virus) | Organization; detects and shares news about diseases caused by the Flaviviridae family | 227 | 4.93
6 | MackayIM (Ian M. Mackay) | Individual; virologist who provides medical advice | 151 | 8.01
7 | ironorehopper (ironorehopper) | Individual; no information about the author was available | 136 | 8.79
8 | Crof (Crawford Kilian) | Individual; retired college teacher and writer | 127 | 9.52
[Figure 3.6 Cumulative percentage of Tweets (y-axis, Tweet %) versus cumulative percentage of users (x-axis, User %) in the Zika TBC.]
[Figure 3.7 Tweet frequency (y-axis) by user rank (x-axis, ranks 1 through approximately 1,000) in the Zika TBC.]
[Figure 3.8 detail: User/Author (user_key, user_id, name, screen_name, address, contact_info, category, user_rank, user_freq, user_cumulat%) is referenced by Corpus_fact (user_key, time_key, location_key, text, favorited, favorite_count, reply_to_ScrNm, status_source, retweet_count).]
Figure 3.8 The user dimension in the data warehouse model, with added derived
attributes.
Community detection algorithms (CDAs) compute the edge weight for every pair of users
in the network graph and create a weight index. The edge weights are the calculated
edge betweenness measures (i.e., the number of shortest paths between pairs of
vertices passing through an edge), adapted from Freeman's (1978) vertex betweenness
centrality. Then, the edges are added to a network of n vertices with no edges between
them in order of their weights, from the strongest to the weakest (the weighted index).
As a result, communities begin to appear, as users with the most powerful network
ties are linked to one another. Unlike the traditional community detection model,
Girvan and Newman (2002) propose a method that can also determine the structure
of the communities using a different clustering approach. The method focuses on
identifying the community peripheries (using the edge betweenness), rather than the
ties formed between highly connected individuals (the individuals with high
betweenness centralities).
As discussed earlier, CDAs compute important network features and detect
communities. Because the algorithms differ in their ability to handle networks of
different sizes and complexities, and are thus utilized for different purposes, their
performance has to be tested and evaluated through the use of benchmarks, or
benchmark graphs. The GN (for Girvan and Newman) benchmark, for example,
identifies communities within very small graphs by computing only specific
network features (e.g., degree distribution, community sizes, etc.). Although most
algorithms perform well on the GN benchmark graphs, they might not produce good
results for extremely large, heterogeneous networks with overlapping communities
(Yang, Algesheimer, & Tessone, 2016).
Alternatively, the LFR benchmark, which stands for Lancichinetti,
Fortunato, & Radicchi, produces artificial networks for evaluating the performance
of CDAs in detecting communities in large, complex networks (Lancichinetti,
Fortunato, & Radicchi, 2008; Lancichinetti & Fortunato, 2009). For instance,
Yang, Algesheimer, and Tessone (2016) employ LFR benchmark graphs to
examine and compare the accuracy and computing time of eight CDAs (i.e.,
edge betweenness, fastgreedy, infomap, label propagation, leading eigenvector,
multilevel, spinglass, and walktrap) available in the R package igraph. The study
results indicate that the multilevel algorithm was superior to all the other algo-
rithms on the LFR benchmark used for the analyses. When the algorithms were
strictly tested for accuracy (computing time is irrelevant for computing com-
munities in small networks), infomap, label propagation, multilevel, walktrap,
spinglass, and edge betweenness algorithms outperformed the others. However,
for large networks, where computing time is a critical criterion in selecting an
algorithm, infomap, label propagation, multilevel, and walktrap were revealed
to be superior options.
In Vignette 3.2, we present an example of identified communities within
the network of users featured in Vignette 3.1 using the leading eigenvector
algorithm.
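As a small stand-in for the igraph workflow used in the vignette, the edge betweenness (Girvan–Newman) approach described above is available in the Python networkx package; the karate club graph below is a toy substitute for a Twitter interaction network.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()          # toy social network

# Girvan-Newman: repeatedly remove the edge with the highest edge betweenness;
# each removal step yields a progressively finer community partition.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])
```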
Figure 3.9 Network of users within the Zika social network in April 2016.
Figure 3.10 Communities of users within the Zika social network in April 2016.
[Figure 3.11 detail: a derived Cluster dimension (cluster_key, cluster_rank, cluster_size) is referenced by the User/Author dimension, which gains a cluster_key attribute alongside its existing and derived attributes; Corpus_fact is unchanged.]
Figure 3.11 The derived cluster dimension in the data warehouse model.
Having computed the size and the rank of each cluster of users through SNA, clusters now stand on their own, represented as separate derived dimensions. Using a data warehouse with this configuration, we could generate Table 3.2 by running a straightforward SQL query that draws information from all user clusters and their associated users.
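Such a query might look as follows; this is a SQLite sketch in which the cluster and user_author tables are assumed stand-ins for the Cluster and User/Author dimensions of Figure 3.11.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cluster (cluster_key INTEGER PRIMARY KEY,
                      cluster_rank INTEGER, cluster_size INTEGER);
CREATE TABLE user_author (user_key INTEGER PRIMARY KEY, screen_name TEXT,
                          user_freq INTEGER,
                          cluster_key INTEGER REFERENCES cluster(cluster_key));
""")

# One join-and-sort query turns the derived cluster dimension into a report.
rows = con.execute("""
    SELECT c.cluster_rank, c.cluster_size, u.screen_name, u.user_freq
    FROM cluster c
    JOIN user_author u ON u.cluster_key = c.cluster_key
    ORDER BY c.cluster_rank, u.user_freq DESC
""").fetchall()
```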
Time Dimension
The time at which a document is created not only communicates the author’s
thoughts at that specific date, hour, and minute, but also provides clues to the
environmental context on which he or she is commenting. The time period in
which documents in a collection were generated comprises the time dimen-
sion of a corpus. Examples of time-stamped documents include news stories
(which are “news” only with reference to a specific point in time), emails and
other communication documents, server logs, customer service calls, and social
media postings.
Once the time dimension is established, various corpus statistics can be
tracked over time, forming a data structure known as a time series. These include
counts, averages, ranges, and so on, that can be easily obtained by executing data-
base queries. Time series models include regression with linear and polynomial
time trends, auto-regressive models, exponential smoothing models, time series
decomposition, and autoregressive-integrated-moving average (ARIMA) models.
For more details on these time series analytics, refer to a standard time series
forecasting text, such as Bowerman, O’Connell, & Koehler (2005). Vignette 3.3
presents an example of how the time dimension contributes to the knowledge
base of a TBC.
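As one small illustration, an hourly Tweet-count series can be smoothed with exponential smoothing, one of the model families listed above; the counts below are synthetic.

```python
import pandas as pd

# Hourly Tweet counts, as would be returned by a query against the time dimension.
ts = pd.Series([120, 95, 80, 140, 210, 260, 300, 280],
               index=pd.date_range("2017-03-08", periods=8, freq="h"))

smoothed = ts.ewm(alpha=0.5).mean()   # simple exponential smoothing
print(smoothed.round(1))
```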
Figure 3.12 Tweet volume by hour of the day (1 = 1:00 AM, 13 = 1:00 PM, etc.).
[Figure: The time dimension in the data warehouse model, with the added derived attribute hourly_frequency. Time (time_key, date, month, year, hour, hourly_frequency) is referenced by Corpus_fact (user_key, time_key, location_key, text, favorited, favorite_count, reply_to_ScrNm, status_source, retweet_count).]
Location Dimension
The proliferation of geographic information systems (GIS) has brought attention to
the spatial dimension of various processes in the social, political, financial, and sci-
entific domains. Efficient use of spatial data, including users' geographic location,
place of origin, place of destination, or place of interest in general, helps uncover and
add to the knowledge base the location dimension, which can often be hidden in a corpus.
[Figure: The location dimension in the data warehouse model, with the added derived attribute location_freq. Location (location_key, longitude, latitude, location_freq) is referenced by Corpus_fact (user_key, time_key, location_key, text, favorited, favorite_count, reply_to_ScrNm, status_source, retweet_count).]
Topic Dimension
As community members communicate their ideas by exchanging oral or written
documents, word usage patterns tend to exhibit characteristics of a spontaneous
order: From long and seemingly flat lists of words, a structure of organized, socially
constructed, corpus-level topics emerges. The modeling of such topics has taken
two main approaches: the generative, or probabilistic, approach, and the varia-
tional, or statistical, approach.
The generative approach assumes that document authors are already aware of a
number of topics that express issues relevant to their community, and are aware of
the probability distribution of each topic across all terms in the dictionary. They then
generate words for the documents they author by selecting a topic from a mix of top-
ics that characterizes the particular document, and selecting a term from the mix of
terms that characterizes the particular topic. Latent Dirichlet allocation (LDA), the
most widely-cited algorithm that follows the probabilistic approach to topic modeling,
uses a Bayesian approach with Markov chain Monte Carlo simulations to estimate the
parameters of the multinomial distributions of topics across the documents and terms
across the topics (Blei, Ng, & Jordan, 2003; Blei, 2012). The variational approach
assumes that humans acquire meaning by being repeatedly exposed to multiple exam-
ples of documents containing such meaning (Kintsch & Mangalath, 2011). Because
of a need to minimize cognitive effort, they create a mental space that quantifies the
patterns of co-occurrence of terms within similar contexts as linear combinations,
or vectors, of terms and documents. Terms and documents are then projected into
that space in a lower dimensionality that drops thousands of dimensions that describe
the specific document content and keeps only a few abstract concepts that describe
terms and documents in a broad sense. Latent Semantic Analysis (LSA), the most
widely-cited algorithm that follows the variational approach to topic modeling, uses
the matrix operation of singular value decomposition to project the original term fre-
quency matrix to a space of principal components that explain maximum variation
using a minimum number of dimensions (Deerwester et al., 1990; Dumais, 2004).
The two approaches seem to be somewhat complementary, since LDA explains how
documents are produced from topics without explaining how topics were acquired and
LSA explains how topics are acquired without explaining how they were produced.
And, in practice, the two algorithms often extract very similar topics after process-
ing the same corpus. We now continue with Vignette 3.5, where we follow the LSA
approach and perform text analytics to extract topics from a corpus of Tweets.
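A minimal LSA sketch with scikit-learn is shown below; the Tweets are toy stand-ins, and TruncatedSVD performs the singular value decomposition described above, exposing the term loadings for each extracted topic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["here is to strong women may we know them",
        "celebrate women today and every day",
        "wear red in support of women",
        "love and respect for all women"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                     # tf-idf term-document matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)                 # document loadings per topic

terms = tfidf.get_feature_names_out()
for k, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]           # top loading terms per topic
    print(f"Topic {k}:", [terms[i] for i in top])
```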
Topic analysis for the IWD 2017 Tweets produced four overarch-
ing topics. The first topic—Here is to Strong Women—included
Figure 3.16 Scree plot of the first 20 eigenvalues for the IWD 2017 Tweets.
Table 3.3 Topics and Terms for IWD 2017 Tweets and Their Associated Document Counts (columns: Topic Label, Top Loading Terms, Doc. Count)
[Figure: The derived topic dimension and its bridge table. Topic (topic_key, label, top_terms, doc_count) connects to Corpus_fact (author_key, time_key, location_key, tweet_id, text, favorited, favorite_count, reply_to_SN, status_source, retweet_count) through the bridge table Tweet_by_topic (topic_key, tweet_id, topic_strength).]
Topic-Time Interaction
Topics do not have a static preference among TBC members. In a constantly
changing environment, the community members adapt discourse and let the
topics in the corpus wax and wane. In KBDW terms, the dynamic behavior
of topics across time can be studied by considering the interaction between
the topic and time dimensions. Vignette 3.6 provides an example of this
interaction.
[Table: Tweet counts for the four IWD topics by hour of the day (excerpt)]

Hour    Counts by topic
0       12    25     7    17
1       10    12    10    14
2       19    48    17    32
3       27    84    23    49
4       40    95    37    57
5       53   105    50    65
6       84   153    81    96
18      55    61    46    58
Figure 3.18 Time trends for the four IWD Tweet topics (Here is to Strong Women; Celebrate Women; Love; Wear Red in Support) over the 24-hour day on March 8, 2017 (x-axis: hour of the day, 1–24; y-axis: Tweet count).
Figure 3.19 The dimensions of topic and time, and their relationship in the
KBDW model.
Opinion Dimension
Textual data can generally be divided into the categories of fact or opinion, with
facts defined as “objective expressions about entities, events, and their properties,”
and opinions defined as “subjective expressions that describe people’s sentiments,
appraisals, or feelings toward entities, events, and their properties” (Liu, 2010,
p. 627). Sentiments, therefore, represent a specific type of textual opinion. The study
of opinions is important, because individuals and organizations seek opinions (or
word-of-mouth recommendations) when engaged in decision-making, and because
an individual’s opinion reflects the knowledge that he or she has gleaned from vari-
ous experiences. The process of uncovering the sentiments conveyed by documents is
called sentiment analysis or opinion mining, which is defined as “the computational
study of opinions, sentiments, and emotions expressed in text” (Liu, 2010, p. 629).
Opinionated texts contain either an explicit (overt) or an implicit (implied)
expression of opinion (Liu, 2010, p. 649). Opinions can
be expressed on a gamut of things, including people, places, organizations, events,
and topics. In sentiment analysis, the entity targeted by an opinion is termed its
object (Liu, 2010, p. 629). Objects are comprised of components (parts) and attri-
butes (properties), which can be further deconstructed into continuous layers of
subcomponents and subattributes, such that the objects of opinionated texts can
be depicted using “part of” the relationships represented by taxonomies, hierar-
chies, and trees. In addition to the object of the opinion, sentiment analysis seeks to
uncover its orientation (i.e., whether it is positive, negative, or neutral) and its source
(i.e., the opinion holder, or the person or organization that originally expresses the
opinion). Analysts seeking to conduct sentiment analysis can either find public opin-
ions about a specific object or find opinions from specific sources about an object.
The advent of the internet has been paramount in enabling the study of opinions,
and the popularity of user-generated content that began with Web 2.0 has made the
mass opinions conveyed via product reviews, forums, discussion groups, and blogs
readily available for analysis (Liu, 2010, p. 627). Though an opinion holder’s sentiment
might be explicit in a product review, it is likely more implicit in longer texts and more
conversational documents, such as Twitter posts. Therefore, specialized techniques and
applications are required to uncover the underlying sentiments of these documents.
The feature-based sentiment analysis model (Hu & Liu, 2004; Liu, 2006; Liu,
Hu, and Cheng, 2005) instructs researchers to mine direct opinions for discover-
ing all quintuple groupings of opinion objects, object features, polarity, holders,
and time and identifying synonyms for words and phrases that convey opinions
that occur within the documents (Liu, 2010, p. 632). Results can then be conveyed
using opinion summaries and visualized with bar graphs and pie charts (Pang &
Lee, 2004). However, to successfully implement the model, researchers must also
engage in sentiment classification and the generation of an opinion lexicon.
Sentiment classification seeks to classify opinionated documents as expressing
positive, negative, or neutral opinions. The sentiment classification process is five-
fold and consists of identifying terms and their frequencies and then tagging those
terms according to their parts of speech. From there, opinion words and phrases are
isolated and evaluated for both their syntactic dependency (the degree to which the
interpretation of a word or phrase is dependent on its surrounding words or phrases)
and the effects of any negation in the phrase (the degree to which the inclusion of
a negative word, such as “not” or “but,” affects whether the expressed opinion is
positive or negative) (Liu, 2010, p. 638). For example, application rules dictate
that a negation word reverses the polarity of the opinion word it modifies.
An opinion lexicon is the equivalent of a dictionary to which the text of the opin-
ionated documents will be compared, and its generation depends on the identifica-
tion and compilation of opinion phrases and idioms, in addition to base-type and
comparative-type positive and negative opinion words (Liu, 2010, pp. 641–642).
Positive-opinion words express desired states, while negative-opinion words express
undesired states. Base-type opinion words are basic adjectives, such as “good” or “bad,”
while comparative-type opinion words express opinions of comparison or superlatives
using terms such as “better,” “worse,” “best,” and “worst,” which are derived from their
associated base-type words. Comparative-type opinion words are unique because they
state an opinion that two objects differ from each other on a certain feature, rather
than stating a simple opinion on a single object. Documents can contain four types of
comparative relationships, which can be subcategorized into gradable and nongradable
comparisons. Gradable comparisons express opinions of non-equality (that the quali-
ties of one object are greater or less than those of another), equality (that the qualities of
two objects are the same or similar), or superlative (that one object is the best or worst
of all the objects compared).
Though this chapter deals specifically with expressions of positive, negative, and
neutral sentiment, opinion analyses, in general, can be undertaken to assess more
complex phenomena, including emotions (Aman & Szpakowicz, 2007) such
as joy, sadness, anger, fear, surprise, and disgust (Strapparava & Mihalcea, 2008;
Chaffar & Inkpen, 2011), political leanings (Grimmer & Stewart, 2013; Ceron
et al., 2014), and media slant (Gentzkow & Shapiro, 2010).
Analyses that extend beyond opinion polarity must employ the tactics men-
tioned here, in addition to those for identifying, classifying, and cataloging a lex-
icon of words that align with the various emotions the analysts aim to uncover
(Strapparava & Mihalcea, 2008). This process requires that researchers uncover not
only the words that refer directly to emotional states (e.g., “fearful” or “excited”), but
also the ones that denote an indirect contextual reference (e.g., words like “laugh” or
“cry,” which indicate emotional responses to a stimulus). Then, the researchers pri-
marily employ the same methods for analysis. Regarding assessing political leanings
(e.g., Ceron et al., 2014), there is no doubt that parties, candidates, and news media
worldwide are interested in the promise of utilizing the opinions expressed on SNS
to predict the outcomes of elections by forecasting voting behavior. Advances in this
area even have the potential to render obsolete more traditional means of forecast-
ing, such as public opinion polling. Finally, the modern proliferation of perceived
media biases (Gentzkow & Shapiro, 2010) offers applications of opinion-analysis
techniques to discern whether text-based reports actually contain a media slant, or
whether such leanings are mere matters of readers’ and pundits’ interpretations.
Vignette 3.7 provides an example that illustrates a simple approach to opinion
mining: dictionary-based sentiment analysis.
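The core of a dictionary-based scorer can be sketched in a few lines; the lexicon and its valences below are hypothetical, and an actual analysis would rely on a curated opinion lexicon of the kind discussed above.

```python
import re

# Toy opinion lexicon: word -> valence (hypothetical values).
lexicon = {"good": 1, "great": 2, "love": 2, "support": 1,
           "bad": -1, "awful": -2, "fear": -2}

def sentiment_score(text: str) -> int:
    """Sum the valences of the lexicon words found in a Tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(t, 0) for t in tokens)

print(sentiment_score("love the support"))   # 2 + 1 = 3
```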
[Table: Distribution of Tweet sentiment scores]

Sentiment score    Frequency
−4                         0
−3                         2
−2                        25
−1                       265
0                       3156
1                       2664
2                        186
3                         15
4                          2
[Figure: Histogram of the sentiment scores (x-axis: sentiment score, −4 to +4; y-axis: frequency).]
[Figure: The derived opinion dimension. Corpus_fact (author_key, time_key, location_key, text, favorited, favorite_count, reply_to_SN, status_source, retweet_count, opinion_key, opinion_strength) references Opinion (opinion_key, opinion_type).]
Opinion-Location Interaction
Since human communities tend to self-organize and cluster together in space with
similar others, opinions can vary significantly across geographical locations. In
KBDW terms, the spatial dynamics of opinion can be studied by considering the
interaction between the opinion and location dimensions. Vignette 3.8 provides an
example of such interaction.
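A sketch of such a spatial aggregation is shown below, using synthetic geotagged sentiment scores; the 8-by-10 binning mirrors the Seattle grid in the vignette that follows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "longitude": rng.uniform(-122.44, -122.24, 500),   # synthetic coordinates
    "latitude":  rng.uniform(47.50, 47.70, 500),
    "sentiment": rng.integers(-2, 3, 500),             # synthetic scores
})

# Bin the coordinates into an 8-column by 10-row grid of sectors and
# average the sentiment score within each sector.
df["col"] = pd.cut(df["longitude"], bins=8, labels=[f"X{i}" for i in range(1, 9)])
df["row"] = pd.cut(df["latitude"], bins=10)
grid = df.pivot_table(index="row", columns="col", values="sentiment",
                      aggfunc="mean", observed=False)
print(grid.round(2))
```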
Table 3.6 Average Sentiment Score in the 80 Sectors in the 10-by-8 Grid of Seattle Locations (columns X1–X8, west to east)
Figure 3.22 Distribution of average sentiment scores across the grid of 80 Seattle
sectors.
[Figure 3.23 detail: the Location dimension, redefined with sector bounds and derived attributes (location_key, longitude_bound1, longitude_bound2, latitude_bound1, latitude_bound2, location_freq, location_ave_sentiment), and the Opinion dimension (opinion_key, opinion_type) are both referenced by Corpus_fact (author_key, time_key, location_key, text, favorited, favorite_count, reply_to_SN, status_source, retweet_count, opinion_key, opinion_strength).]
Figure 3.23 The dimensions of opinion and location, and their relationship in
the KBDW model.
[Figure: A community of practice contributes documents to a NoSQL repository (the corpus); SQL queries against the resulting warehouse produce reports.]

[Figure 3.25 A generalized KBDW model: a Corpus_fact table (author_key, time_key, location_key, document_key, platform_key, storage_ref, access_rights, presentation_details, content_details, metadata, cluster_key, cluster_centrality, opinion_key, opinion_strength, school_key, school_position, impact_strength) references the dimensions Author/Creator, Time, Location, Platform/Medium, Topic, Document/Artifact, Cluster/Category, Opinion, and School of Thought/Community, with bridge tables Author_by_topic (author_key, topic_key, assoc_strength) and Document_by_topic (topic_key, document_key, assoc_strength).]

The school-of-thought dimension, for example, can capture the school
of creative thought associated with the document or artifact. Additional fact tables
can account for interactions between the author and topic dimensions, or the
document and topic dimensions. Some of these ideas are depicted in Figure 3.25.
We conclude this chapter by reiterating the importance of communities of prac-
tice, the collective knowledge they accumulate, and the proper management of their
knowledge through well-designed KM systems. We believe that KM researchers and
practitioners need to pay more attention to community knowledge. To put it in the
words of Gruber, such collective knowledge should be “taken seriously as a scientific
and social goal” (Gruber, 2008, p. 5). We hope our chapter has shed some light on
analytic and modeling considerations as one begins to accomplish this goal.
Summary
In this chapter, we examine the knowledge that develops within a community of
Twitter users. Focusing our discussion on a community of users that is built around
a certain social or scientific interest, and a subset of actively involved contributors, we
view such a Twitter-based community (TBC) as an online CoP. We view a corpus of
Tweets produced by the TBC as the community’s store of common knowledge. Using
various kinds of SMAs that operate on such a corpus, we uncover the collective tacit
knowledge that is embedded in it and discuss the process of its transfer to a data ware-
house. We present modeling elements of this data warehouse based on the dimensions
of user, time, location, topic, and opinion. We then discuss how physical database
designs would include these dimensions and their interactions as database tables and
how the execution of simple database queries would then transform the TBC’s tacit
collective knowledge into an explicit form. We include eight illustrative vignettes that
examine various aspects of collective knowledge built within TBCs that discuss the
Zika virus, International Women’s Day, and the city of Seattle.
References
Adamson, C., & Venerable, M. (1998). Data warehouse design solutions. New York, NY:
John Wiley & Sons.
Aman, S., & Szpakowicz, S. (2007). Identifying expressions of emotion in text. In
V. Matoušek and P. Mautner (Eds.), Text, speech and dialogue, lecture notes on artificial
intelligence (Vol. 4629, pp. 196–205). Berlin, Germany: Springer-Verlag.
The Atlantic. (2017). Brazil declares an end to its Zika health emergency. Retrieved from
https://www.theatlantic.com/news/archive/2017/05/brazil-ends-zika-emergency/526509/.
BBC News. (2017). Zika virus: Brazil says emergency is over. BBC News, May 12, 2017.
Retrieved from http://www.bbc.com/news/world-latin-america-39892479.
Berger, P.L., & Luckmann, T. (1966). The social construction of reality. New York, NY:
Random House.
Blei, D.M. (2012). Probabilistic topic models. Communications of the ACM 55(4), 77–84.
Blei, D.M., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine
Learning Research, 3, 993–1022.
Bodnar, T., & Salathé, M. (2013). Validating models for disease detection using Twitter.
Proceedings of the 22nd International Conference on World Wide Web Rio de Janeiro,
Brazil (pp. 669–702). doi:10.1145/2487788.2488027.
Bowerman, B.L., O’Connell, R., & Koehler, A. (2005). Forecasting, time series, and regres-
sion (4th ed.). Stamford, CT: Thomson Learning.
Burns, T., & Stalker, G. M. (1961). The management of innovation. London, UK: Tavistock.
Carneiro, H.A., & Mylonakis, E. (2009). Google trends: A web-based tool for real-time
surveillance of disease outbreaks. Clinical Infectious Diseases, 49(10), 1557–1564.
CDC (2017). [Zika] Symptoms. Centers for disease control and prevention. Retrieved from
https://www.cdc.gov/zika/symptoms/symptoms.html.
Ceron, A., Curini, L., Iacus, S.M., & Porro, G. (2014). Every Tweet counts? How sentiment
analysis of social media can improve our knowledge of citizens’ political preferences
with an application to Italy and France. New Media & Society, 16(2), 340–358.
Chaffar, S., & Inkpen, D. (2011). Using a heterogeneous dataset for emotion analysis in
text. In C. Butz & P. Lingras (Eds.), Advances in artificial intelligence, lecture notes in
artificial intelligence (Vol. 6657, pp. 62–67). Berlin, Germany: Springer-Verlag.
Chenoweth, E., & Pressman, J. (2017, February 7). Analysis: This is what we learned
by counting the women’s marches. Washington Post. Retrieved from https://www.
washingtonpost.com/news/monkey-cage/wp/2017/02/07/this-is-what-we-learned-by-
counting-the-womens-marches/.
Coussement, K., & Van Den Poel, D. (2008). Improving customer complaint management
by automatic email classification using linguistic style features as predictors. Decision
Support Systems, 44(4), 870–882.
Culotta, A. (2010). Towards detecting influenza epidemics by analyzing Twitter
messages. Proceedings of the First Workshop on Social Media Analytics, USA.
doi:10.1145/1964858.1964874.
Gruber, T. (2008). Collective knowledge systems: Where the social web meets the Semantic
Web. Journal of Web Semantics, 6(1), 4–13.
Grudin, J. (1994). Computer supported cooperative work: History and focus. IEEE Computer, 27(5), 19–26.
Gunawardena, C.N., Hermans, M.B., Sanchez, D., Richmond, C., Bohley, M., & Tuttle, R.
(2009). A theoretical framework for building online communities of practice with
social networking tools. Educational Media International, 46(1), 3–16.
Hemsley, J., & Mason, R.M. (2013). Knowledge and knowledge management in the social media
age. Journal of Organizational Computing and Electronic Commerce, 23(1–2), 138–167.
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of
the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
(pp. 168–177). Seattle, WA.
Ibarra, H., & Andrews, S. (1993). Power, social influence, and sense making: Effects of
network centrality and proximity on employee perceptions. Administrative Science
Quarterly 38(2), 277–303. doi:10.2307/2393414.
Iedema, R. (2003). Discourses of post-bureaucratic organization. Amsterdam, the Netherlands:
John Benjamins Publishing Company.
International Women’s Day. (2017). About International Women’s Day (March 8).
Retrieved from https://www.internationalwomensday.com/About
Jubert, A. (1999). Communities of practice. Knowledge Management, 3(2), 1999.
Kabir, N., & Carayannis, E. (2013). Big data, tacit knowledge and organizational competi-
tiveness. Journal of Intelligence Studies in Business, 3(3), 54–62.
Khan, G.F. (2015). Seven layers of social media analytics: Mining business insights from
social media text, actions, networks, hyperlinks, apps, search engine, and location data.
Lexington, KY: CreateSpace Independent Publishing Platform.
Kim, S.M., & Hovy, E. (2004). Determining the sentiment of opinions. Proceedings of the
International Conference on Computational Linguistics (COLING). Geneva, Switzerland.
Kimball, R., & Ross, M. (2002). The data warehouse toolkit, (2nd ed.). New York, NY: John
Wiley & Sons.
Kintsch, W., & Mangalath, P. (2011). The construction of meaning. Topics in Cognitive
Science, 3(2), 346–370.
Kulkarni, S., Apte, U., & Evangelopoulos, N. (2014). The use of latent semantic analysis in
operations management research. Decision Sciences, 45(5), 971–994.
Lancichinetti, A., & Fortunato, S. (2009). Benchmarks for testing community detection
algorithms on directed and weighted graphs with overlapping communities. Physical
Review E, 80, 1–9. doi:10.1103/PhysRevE.80.016118
Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for test-
ing community detection algorithms. Physical Review E, 78, 1–6. doi:10.1103/
PhysRevE.78.046110
Larsson, R., Bengtsson, L., Henriksson, K., & Sparks, J. (1998). The interorganiza-
tional learning dilemma: Collective knowledge development in strategic alliances.
Organization Science, 9(3), 285–305.
Lave, J. (1988). Cognition in practice: Mind, mathematics, and culture in everyday life. New
York, NY: Cambridge University Press.
Lave, J. (1991). Situated learning in communities of practice. In L. Resnick, J. M. Levine, &
S. D. Teasley (Eds.), Perspectives on socially shared cognition (pp. 63–82). Washington,
DC: American Psychological Association.
Lave, J., & Wenger, E. (1991). Situated learning: Legitimate peripheral participation.
Cambridge, MA: Cambridge University Press.
Leonard, D., & Sensiper, S. (1998). The role of tacit knowledge in group innovation.
California Management Review, 40(3), 112–132. doi:10.2307/41165946.
Levin, D. Z., & Cross, R. (2004). The strength of weak ties you can trust: The mediating
role of trust in effective knowledge transfer. Management Science, 50(11), 1477–1490.
Liu, B. (2006). Web data mining: Exploring hyperlinks, contents, and usage Data. Berlin,
Germany: Springer-Verlag.
Liu, B. (2010). Sentiment analysis and subjectivity. In N. Indurkhya & F. J. Damereau (Eds.),
Handbook of natural language processing (pp. 627–666). Boca Raton, FL: CRC Press.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opin-
ions on the web. Proceedings of WWW, Chiba. Japan.
Mishra, N., Schreiber, R., Stanton, I., & Tarjan, R.E. (2007). Clustering social networks.
In A. Bonato & F.R.K. Chung (Eds.) Algorithms and models for the web-graph. WAW
2007. Lecture notes in computer science (vol. 4863). Heidelberg, Germany: Springer.
The New York Times. (2017, June 3). India acknowledges three cases of Zika Virus. The New
York Times. Retrieved from https://www.nytimes.com/2017/06/03/world/asia/india-
zika-virus.html.
Nonaka, I. (1991). The knowledge-creating company. Harvard Business Review 69(6), 96–104.
Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company: How Japanese compa-
nies create the dynamics of innovation. New York, NY: Oxford University Press.
Palmer, J.D., & Fields, A.N. (1994). Computer supported cooperative work. IEEE Computer, 27(5), 15–17.
Panahi, S., Watson, J., & Partridge, H. (2012). Social media and tacit knowledge sharing:
Developing a conceptual model. World Academy of Science, Engineering and Technology
Index 64, International Journal of Social, Behavioral, Educational, Economic, Business
and Industrial Engineering 6(4), 648–655. http://scholar.waset.org/1307-6892/5672.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. Proceedings of the 42nd annual meeting on
Association for Computational Linguistics (p. 271). Barcelona, Spain: Association for
Computational Linguistics.
Pruitt, S. (2017, March 6). The surprising history of International Women’s
Day. The History Channel. Retrieved from http://www.history.com/news/
the-surprising-history-of-international-womens-day.
Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., & Parisi, D. (2004). Defining and
identifying communities in networks. Proceedings of the National Academy of Sciences,
101, 2658–2663.
Rich, E., Knight, K., & Nair, S.B. (2009). Artificial intelligence, (3rd ed.). New Delhi,
India: Tata McGraw-Hill.
Riege, A. (2005). Three-dozen knowledge-sharing barriers managers must consider. Journal
of Knowledge Management, 9(3), 18–35.
Salathé, M., & Khandelwal, S. (2011). Assessing vaccination sentiments with online social
media: Implications for infectious disease dynamics and control. PLOS Computational
Biology, 7(10), e1002199.
Sanghani, R. (2017, March 8). What is international women’s day? The Telegraph. Retrieved
from http://www.telegraph.co.uk/women/life/international-womens-day-2017-did-start-
important/.
SAS (2017). SAS® OnDemand for Academics. Retrieved from http://support.sas.com/
software/products/ondemand-academics/.
Sharda, R., Delen, D., & Turban, E. (2018). Business intelligence, analytics, and data science:
A managerial perspective, (4th ed.). Upper Saddle River, NJ: Pearson Education.
Sidorova, A., Evangelopoulos, N., Valacich, J.S., & Ramakrishnan, T. (2008). Uncovering
the intellectual core of the information systems discipline. MIS Quarterly, 32(3), 467–
482 & A1–A20.
Strapparava, C., & Mihalcea, R. (2008). Learning to identify emotions in text. Proceedings
of the 2008 ACM symposium on Applied computing, Fortaleza, Ceara, Brazil (pp.
1556–1560). New York, NY: ACM.
Sullivan, D. (2001). Document warehousing and text mining. New York, NY: John Wiley & Sons.
Tsui, E. (2003). Tracking the role and evolution of commercial knowledge management
software. In C.W. Holsapple (Ed.), Handbook of knowledge management. Heidelberg,
Germany: Springer-Verlag.
United Nations. (2017). International Women’s Day March 8. Retrieved from http://www.
un.org/en/events/womensday/.
University of Chicago. (2014). International Women’s Day History. Retrieved from https://
iwd.uchicago.edu/page/international-womens-day-history.
Walsh, J. P., & Ungson, G. R. (1991). Organizational memory. Academy of Management
Review, 16, 57–91.
Weick, K. E. (1995). Sensemaking in organizations. Thousand Oaks, CA: SAGE Publications.
Wenger, E. (1998). Communities of practice: Learning, meaning, and identity. Cambridge, UK: Cambridge University Press.
Wenger, E. (2000). Communities of practice and social learning systems. Organization, 7(2), 225–246.
Wenger, E., McDermott, R.A., & Snyder, W. (2002). Cultivating communities of practice: A
guide to managing knowledge. Boston, MA: Harvard Business Press.
Wenger, E.C., & Snyder, W.M. (2000). Communities of practice: The organizational fron-
tier. Harvard Business Review, 78(1), 139–145.
Wiig, K.M. (1997a). Knowledge management: An introduction and perspective. Journal of
Knowledge Management, 1(1), 6–14. doi:10.1108/13673279710800682.
Wiig, K.M. (1997b). Knowledge management: Where did it come from and where will it
go? Journal of Expert Systems with Applications, 13(1), 1–14.
World Health Organization. (2017). Zika virus infection—India. World Health Organization,
May 26, 2017. Retrieved from http://www.who.int/csr/don/26-may-2017-zika-ind/en/.
Xu, W.W., Sang, Y., Blasiola, S., & Park, H.W. (2014). Predicting opinion leaders in Twitter
activism networks: The case of the Wisconsin recall election. American Behavioral
Scientist, 58(10), 1278–1293.
Yang, Z., Algesheimer, R., & Tessone, C.J. (2016). A comparative analysis of community detec-
tion algorithms on artificial networks. Scientific Reports, 6(30750). doi:10.1038/srep30750.
Zappavigna-Lee, M.S. (2006). Tacit knowledge in communities of practice. In Encyclopedia
of communities of practice in information and knowledge management (pp. 508–513).
Hershey, PA: IGI Global.
Zappavigna-Lee, M.S., & Patrick, J. (2004). Literacy, tacit knowledge, and organizational
learning. Proceeding of the 16th Euro-International Systematic Functional Linguistic
Workshop. Madrid, Spain.
Zappavigna-Lee, M.S., Patrick, J., Davis, J., & Stern, A. (2003). Assessing knowledge man-
agement through discourse analysis. Proceedings of the 7th Pacific Asia Conference on
Information Systems. Adelaide, South Australia.
Zarya, V. (2017). A brief but fascinating history of International Women’s Day. Fortune, March 7,
2017. Retrieved from http://fortune.com/2017/03/07/international-womens-day-
history.
Chapter 4

Data Analytics for Deriving Knowledge from User Feedback
Contents
Introduction ......................................................................................................121
Collecting User Feedback ..................................................................................123
Analyzing User Feedback ...................................................................................124
Opinion Mining ...........................................................................................125
Link Analysis ................................................................................................127
User Feedback Analysis: The Existing Work.......................................................128
Deriving Knowledge from User Feedback..........................................................130
Data Management ........................................................................................133
Data Analytics ..............................................................................................134
Knowledge Management ..............................................................................136
Conclusions and Future Work ...........................................................................137
References .........................................................................................................138
Introduction
User feedback plays a critical role in the evaluation and enhancement of the qual-
ity of any organizational setup. User feedback is the meaningful information given
with the purpose of sharing experiences, suggesting improvements, and expressing
opinions regarding a system. In today's online ecosystems,
gathering user feedback is no longer a tedious process. However, suitable methods are needed to discover and interpret meaningful patterns and knowledge in these huge datasets. Getting, and then acting on, user feedback is essential for running an organization successfully in the present-day competitive market. In this chapter, we identify the role of analytics in understanding user feedback in an organizational setting.
The Merriam-Webster dictionary defines feedback as the transmission of evalua-
tive or corrective information about an action, event, or process to the original or con-
trolling source. User feedback is the reaction of a user after using a service or product.
The reaction can be positive, negative, or neutral depending upon the user’s experi-
ence with the service or product. Given the proliferation of the Internet and social
media, user feedback can be instantly captured. People now have unprecedented
opportunities to voice their opinions on public platforms. Public opinion is valuable for organizations, as they can learn about the success or failure of their policies and products from the user feedback readily available on social media.
However, most of the time, people express their opinions indirectly. Using
natural languages, they mix facts with their opinions (Pawar et al. 2016). A fact is
an objective expression regarding an event, experience, policy, or product, whereas
an opinion is a subjective expression (of emotion, perceptions, or perspectives).
Analysis of this subjective expression of the opinion holder is a key consideration
while mining knowledge from publicly shared discourses. Opinion mining, or
sentiment analysis, is the discipline that combines the techniques from different
fields such as natural language processing, text mining, and information retrieval
to segregate facts from opinions as they appear in unstructured text. Extracting user opinion from billions of user posts, and then analyzing it for decision-making, is a formidable task that requires automatic, rather than manual, methods to accomplish successfully.
In the past, investigations reported in the research literature in this area have
been largely restricted to reviews of business establishments such as hotels and res-
taurants, or electronic products such as cameras, mobile phones, and so on (Hridoy
et al., 2015). Nowadays, there is tremendous interest in gauging public sentiment
during elections to understand the public's response to electoral issues and their interest in different political parties and politicians (Mohammad et al., 2015). However, there
is a need to analyze public response regarding government policies or other issues of
public interest. Government policies sometimes draw a lot of flak from the general
public. Therefore, it will be interesting to watch the Twitter stream to see the public
opinion regarding a government policy. Data analytics of user feedback on social
media may indicate the public’s responsiveness as well as their interest in a govern-
ment’s policies. In this chapter, we analyze public sentiment as expressed on Twitter
regarding the Indian government demonetizing high-denomination currency notes in November 2016 to curb corruption. We stress the importance of extracting knowledge from
user feedback, and then integrating and reusing this knowledge from the crowd in
the decision-making process.
The study looks at what is known and not known about user feedback and
seeks to establish user feedback processes as a valuable resource for improved understanding of user needs and better decision-making. This chapter suggests an approach for better integrating user feedback into the decision-making process through a knowledge management system in an organization. Such improvements will assist policy-makers because knowledge obtained from user feedback can be preserved as an asset for future use with the help of a knowledge management system.
The major goals of this chapter are as follows:
◾ To examine how user feedback can be collected and analyzed, using opinion mining and link analysis
◾ To review the existing work on user feedback analysis
◾ To propose an approach for integrating the knowledge derived from user feedback into an organization's knowledge management system to support decision-making
The rest of the chapter is organized as follows: The next section discusses possible
ways to collect user feedback; the third section gives details of opinion mining and
data mining as techniques for user feedback analysis; the fourth section presents the
existing work on user feedback analysis; our proposed approach is presented in the fifth section; and the last section concludes the chapter.
Analyzing User Feedback
A social network can be modeled as a graph in which nodes represent the communicating entities (e.g., users), and
edges represent the relationships or links (interactions) between the communi-
cating entities. To understand user feedback, interested parties can analyze the
content part to get the users’ sentiments. Along with this, links help to gauge the
spread of user sentiment through the social interactions of the user. Therefore, it
is important to understand user sentiment through content analysis, augmented
with link analysis to get meaningful insights. This section discusses opinion min-
ing as an approach to content analysis followed by data mining to perform link
analysis.
Opinion Mining
Opinion mining, or sentiment analysis, is an emerging technique for analyzing user feedback (Appel et al., 2015). Sentiment analysis extracts the user's expression from feedback written in a natural language. It makes use of text analysis
and computational linguistics to identify and extract attitudes expressed in user
feedback. A user may give positive, negative, or neutral feedback. An opinion is
expressed by a person (opinion holder) who expresses a viewpoint (positive, nega-
tive, or neutral) about an entity (target object, e.g., person, item, organization,
event, policy, and service) (Khan et al., 2014). Therefore, an opinion broadly
has the following parts: (1) opinion holders: people who hold the opinions; (2)
sentiments: positive or negative; (3) opinion targets: entities and their features or
aspects. As opinion may change with time, Liu (2011) adds a dimension of time,
as well.
Therefore, an opinion must have five parts: opinion holder, target entity, aspects or features of the target, sentiment, and time. Appel et al. (2015) give a formal expression: an opinion is represented as a quintuple (e_j, a_jk, so_ijkl, h_i, t_l), where e_j is a target entity; a_jk is an aspect or feature of the entity e_j; so_ijkl is the sentiment value of the opinion from the opinion holder h_i on feature a_jk of entity e_j at time t_l (so_ijkl is positive, negative, or neutral, or can have more granular ratings); h_i is an opinion holder; and t_l is the time when the opinion is expressed.
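To make the quintuple concrete, the following is a minimal sketch, in Python, of how one opinion record might be represented; the field names and sample values are illustrative assumptions, not part of Appel et al.'s formalism.

from dataclasses import dataclass
from datetime import date

@dataclass
class Opinion:
    """One opinion quintuple (e_j, a_jk, so_ijkl, h_i, t_l)."""
    entity: str      # e_j: target entity (product, policy, service, ...)
    aspect: str      # a_jk: feature of the entity being judged
    sentiment: str   # so_ijkl: "positive", "negative", "neutral", or a rating
    holder: str      # h_i: the person expressing the opinion
    time: date       # t_l: when the opinion was expressed

# Hypothetical example: one user judging one aspect of a policy.
op = Opinion(entity="demonetization policy", aspect="implementation",
             sentiment="negative", holder="@some_user", time=date(2016, 11, 10))
print(op)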
Identifying all parts of an opinion quintuple from a given user discourse is
a challenging problem. An opinion holder is a user who has expressed an opin-
ion. Sometimes the opinion holder is mentioned explicitly; otherwise, it has to be presumed to be the author of the piece. A user may not express intent explicitly, instead using pronouns and context to relate different parts of the quintuple. The target entity can be a product or policy. Sentiment, also known as
polarity of opinion, tells the orientation of an opinion, such as positive, nega-
tive, or neutral. Furthermore, an opinion can be expressed in the form of a
document or a collection of sentences. It could also be that a single sentence
contains a multitude of opinions on different features of an object. There is a
need to develop aspect-based systems to identify sentiments in such situations
(Mohammad, 2015).
Subjectivity classification: In the first step, a user’s statements are categorized into
opinionated and nonopinionated ones. An opinionated statement is one that carries an opinion explicitly or implicitly; it is a subjective sentence. An
objective sentence contains only factual information. Subjectivity classification
deals with distinguishing between subjective sentences and objective sentences.
Sentiment classification: After applying the subjectivity classification function,
objective sentences are dropped from further analysis. Sentiment classifica-
tion focuses on subjective sentences to identify positive or negative sentiments
expressed in them. In order to identify the polarity of a sentiment, several
supervised, as well as unsupervised, learning methods can be employed. In
supervised learning-based sentiment classification, popular methods include Naïve Bayes classification algorithms, the nearest neighbor algorithm, decision tree classifiers, artificial neural networks, and support vector machines.
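As an illustration of the supervised route, the following is a minimal sketch of a Naïve Bayes sentiment classifier built with the scikit-learn library; the tiny training set is an invented stand-in for a properly labeled corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus; a real system would train on thousands of labeled posts.
train_texts = ["great move, will curb corruption",
               "terrible implementation, long ATM lines",
               "no cash anywhere, very poor planning",
               "bold and welcome step by the government"]
train_labels = ["positive", "negative", "negative", "positive"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["ATM lines are unbearable"]))  # likely ['negative']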
For identifying sentiments, creation of a sentiment resource beforehand
is also necessary (Joshi et al. 2015). A sentiment resource is the knowledge
base that a sentiment analysis tool can learn from. It can be in the form of
a lexicon or a dataset. A lexicon is a collection of simple units such as words
or phrases annotated with labels representing different sentiments. A senti-
ment dataset is a corpus of a higher order collection of words (e.g., sentences,
documents, blogs, and Tweets) annotated with one or more sentiments. So a
sentiment resource has two components: a textual unit and labels.
Annotation is the process of associating textual units with a set of prede-
termined labels. Sentiment lexicon annotation maps textual units with labels
representing different sentiments. There are three schemes to accomplish
sentiment lexicon annotation (Joshi et al. 2015): absolute, overlapping, and
fuzzy. In absolute annotation, only one out of multiple labels is assigned to a
textual unit. The overlapping scheme is used when labels are related to emo-
tions. Multiple emotions may correspond to one positive (or negative) senti-
ment, for example, an unexpected guest or gift not only makes us happy but
surprises us as well. Emotions are more complex to represent than sentiments.
In the third scheme, a label is assigned on the basis of likelihood of a textual
unit belonging to the label. Assigning a label using a distribution of positive:
0.8, and negative: 0.2, means that the textual unit occurs more in a positive
sense but is not completely positive.
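The three annotation schemes can be pictured side by side in a small sketch; the textual unit and label values below are invented for illustration.

# Absolute annotation: exactly one label per textual unit.
absolute = {"what a pleasant surprise!": "positive"}

# Overlapping annotation: several emotion labels may apply at once.
overlapping = {"what a pleasant surprise!": {"joy", "surprise"}}

# Fuzzy annotation: a likelihood distribution over labels.
fuzzy = {"what a pleasant surprise!": {"positive": 0.8, "negative": 0.2}}

for scheme, mapping in [("absolute", absolute),
                        ("overlapping", overlapping),
                        ("fuzzy", fuzzy)]:
    print(scheme, "->", mapping)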
For sentiment-annotated datasets, the major sources are Tweets and blogs available on social media. These include sentence-level and document (discourse)-level datasets. Existing labeling techniques include manual anno-
tation and distant supervision. Twitter hashtags, as provided by Twitter users,
can be used for the purpose of annotation in the distant supervision scheme.
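A minimal sketch of distant supervision using hashtags as noisy labels follows; the hashtag-to-sentiment mapping is an assumption made for illustration.

# Hypothetical hashtag-to-sentiment rules for distant supervision.
HASHTAG_LABELS = {"#happy": "positive", "#love": "positive",
                  "#angry": "negative", "#fail": "negative"}

def label_by_hashtag(tweet: str):
    """Return a noisy sentiment label if the Tweet carries a known hashtag."""
    for tag, label in HASHTAG_LABELS.items():
        if tag in tweet.lower():
            return label
    return None  # unlabeled; left out of the training set

print(label_by_hashtag("Stuck in an ATM line again #fail"))  # -> negative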
Link Analysis
Data mining uses several techniques such as association rule mining, Bayesian clas-
sification algorithms, rule-based classifiers, and support vector machines to identify
hidden patterns in a dataset (Han, 2006). Data mining techniques, when applied
to link analysis on online social media, can provide insights into user behavior to
help policy-makers understand user feedback in a more meaningful way (Barbier
and Liu, 2011).
In social media, data mining helps to identify influential users. When a user
with a large number of followers gives negative feedback for a policy, product, or
service, the organization representatives have to be proactive to solve the problem,
otherwise, it may leave a bad impression on a large user segment. The ability to
identify influential users can also help in targeting marketing efforts on people
with greater influence who are most likely to gain support for a policy, product,
or service. Therefore, it is important to understand the factors that determine the
influence of an individual in an online social network community. A simple fac-
tor can be to look at the structure of the social network that the user is a part
of. For example, a user whose Twitter account has a large number of followers,
or whose Tweets get a significant number of replies, or are retweeted widely may
indicate popularity and influence of that individual in the community. Agarwal
and Liu (2009) identify four measures to determine the influence of a blogger in a
blog community. They are recognition, activity generation, novelty, and eloquence.
Recognition is the number of inlinks to a blog post. Activity generation means
the number of comments a blog receives. Novelty follows a blog’s outlinks and the
influence value of the blog post to which the outlink points to. If the outlinked blog
post is an influential post, then the novelty of the current post is less. Lastly, the
length of a blog post determines its eloquence.
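As a toy illustration of how these four measures might be combined into a single score, consider the sketch below; the weights and the inverse relationship used for novelty are our assumptions, not values given by Agarwal and Liu (2009).

def influence_score(inlinks, comments, outlink_influences, post_length,
                    w=(1.0, 1.0, 1.0, 0.1)):
    """Toy blogger-influence score built from the four measures."""
    recognition = inlinks                            # inlinks to the post
    activity = comments                              # comments received
    novelty = 1.0 / (1.0 + sum(outlink_influences))  # lower when outlinks are influential
    eloquence = post_length                          # proxied by post length
    return (w[0] * recognition + w[1] * activity
            + w[2] * novelty + w[3] * eloquence)

# Hypothetical post: 12 inlinks, 30 comments, two influential outlinks, 400 words.
print(influence_score(12, 30, [5.0, 2.0], 400))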
Participation of influential users in a topic increases the longevity of the topic in
the social domain. Yang and Leskovec (2011) studied temporal patterns of online
content using time series clustering. They observed that the attention that a piece
of content receives depends upon many factors including the participants who talk
about it and the topic that it relates to.
Clustering also helps to identify a community on a social network. A social
network representing a large user base can be partitioned into smaller subnetworks
on the basis of similarity of user interactions thus creating small communities of
users.
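A minimal sketch of such community detection with the NetworkX library follows; the toy interaction graph is invented.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Invented user-interaction graph: edges stand for replies or retweets.
G = nx.Graph()
G.add_edges_from([("ann", "bob"), ("bob", "cal"), ("ann", "cal"),   # community 1
                  ("dev", "eli"), ("eli", "fay"), ("dev", "fay"),   # community 2
                  ("cal", "dev")])                                  # weak bridge

# Partition the network into communities by greedy modularity maximization.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")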
Closely related to data mining are two other multidisciplinary areas: digital
ethnography and netnography. These approaches give a scientific description of a
human society, for example, its culture or its demography. More informed decisions can be made if we know the culture of the community to which the users belong (the integrity and honesty of the contributing users, which helps determine the authenticity of the content) or their demography (young or old).
User Feedback Analysis: The Existing Work
An early study by Abookire et al. (2000) analyzed user feedback as a tool for improving software quality, categorizing the software functions the users reported about. The study then analyzed the impact of this
feedback on different aspects of the software system such as efficiency, cost,
patient safety considerations, and so on. The authors opined that user feedback
can play an important role in learning and improving the system in future. In
a recent work in this domain, Vergne et al. (2013) proposed to analyze user
feedback of a software project (e.g., open source projects receive feedback from
the user community) for the purpose of understanding user requirements after
the software project is released to end users. When the feedback is analyzed and
combined with the most effective requirement suggestions for improving the
software project, it helps to identify expert users who can contribute effectively to the requirements analysis.
Qiu et al. (2011) reported sentiment analysis dynamics of the online forum,
American Cancer Society Cancer Survivors Network (CSN), in which the users
are the cancer survivors (or their caregivers) who share their experiences of the
disease and the treatments that they took. Unlike the previous research in this
domain, this study applies computational techniques in collecting and analyzing
user responses. The dataset comprises approximately half a million posts from
the forum participants over a 10-year period from July 2000 to October 2010. The study uses machine learning classifiers to identify positive or negative sentiments reflected in the posts. It also analyzes the sentiment dynamics over this period, that is, the change in sentiment of a user's posts from positive to negative or vice versa, and the contributing factors to this change, such as the number of positive and negative replies to the
post. The study reports that 75%–85% of users had a positive change in sentiment
after learning about the experiences of others with the disease and its treatment
options. However, in another study related to user feedback regarding the use of
certain types of drugs, it is found that online user reviews of the drugs provided an
exaggerated account of the impact of the drug treatment, and were much different
from the results obtained from clinical trials (Bower, 2017). This may be due to
the fact that such reviews are from people who have perhaps benefitted the most
from the drugs.
Recommendation systems also use user-provided ratings of a product to rec-
ommend a product to other similar users. Collaborative filtering is the method
that identifies the ratings that different users provide for a product. Products with good ratings are deemed eligible for recommendation to other users. Another
scenario is when a user rates an item as good, other similar items (identified using
content or context filtering) are also recommended to that user. Such automatic
rating-based recommendation systems have applications in several areas including
electronic commerce and entertainment. In electronic commerce, recommending
popular items to potential buyers has been found to help in increasing sales. In the
entertainment industry, movie and music ratings on the basis of user feedback help
to recommend artists to a user on the basis of the user’s taste. Many people look for
reviews by movie critics before deciding to watch a movie in the theater. In addition
to the number of reviews from movie critics for a movie, Sun (2016) takes into con-
sideration 27 other attributes such as the number of faces in the poster, and applies
data analytics to a dataset of more than 5000 movies to understand the goodness
of the movies.
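A minimal sketch of the user-based collaborative filtering idea, with an invented ratings matrix, is shown below; real systems use far larger matrices and more robust similarity estimates.

import numpy as np

# Invented user-item rating matrix (rows: users, columns: items; 0 = unrated).
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    return float(u[mask] @ v[mask]
                 / (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])))

# Predict user 1's rating for item 3 from the users who did rate it.
target_user, target_item = 1, 3
sims = np.array([cosine(R[target_user], R[u]) for u in range(len(R))])
rated = R[:, target_item] > 0
pred = (sims[rated] @ R[rated, target_item]) / (sims[rated].sum() + 1e-9)
print(f"predicted rating: {pred:.2f}")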
It is easier to find popular items from a large inventory on the basis of user
feedback, as the number of popular items is very small. However, it becomes difficult when the inventory is too big (e.g., online catalogues) and one has to filter for items which are not very popular (and lie in the long tail) but may be "hidden jewels" that some users would like.
As the discussion in this section shows, most of the research studies apply ana-
lytics to analyze and acquire information from data. Currently, as far as we know,
there is no link between user feedback and the knowledge management system in
an organization. A knowledge management system may support the preservation of
information acquired from user feedback as a knowledge asset, and policy-makers
in the organization can use these knowledge assets in decision-making. This chapter
proposes a three-stage approach to create a link between user feedback and the
knowledge management system of an organization.
Figure: The proposed process model, in which user feedback flows through data acquisition and management, then aggregating/filtering of data with data analytics, and finally effective knowledge management, to reach the policy maker.
implementing the new policy was not in the right spirit. However, for some of
them, there was no change in opinion. They appreciated the government for this
bold step.
We envisage a process model in which user feedback is first analyzed from two
perspectives—from the sentiment point of view and from the user point of view. At
a later stage, the knowledge pieces available in the information extracted from the
previous step are preserved as knowledge assets in a knowledge management system
that can further support policy-makers in decision-making. We can explain our
proposed model for extracting knowledge from user feedback using data analytics
with the help of a reference model (see Figure 4.3) in knowledge management as
suggested by Botha et al. (2008).
As Figure 4.3 shows, the user is the main focus of the model by Botha, Kourie,
and Snyman (2008). The model senses the knowledge contributed by users, organizes
it, and then disseminates the knowledge thus created using the available technology
stack. Similarly, in the model envisaged in this research, the user is the main focus.
A government organization can sense the public mood by analyzing their response
regarding a public policy on social media. Data analytics can be employed to extract
knowledge from this feedback, and then use knowledge management tools to main-
tain the knowledge assets. Lastly, the policy-makers can use and share the knowledge
repository for decision-making in the public interest.
Figure 4.3 The knowledge management process model, comprising knowledge creation and sensing, knowledge organizing and capture, and knowledge sharing and dissemination, built around rich KM solutions. (From Botha, A. et al., Coping with Continuous Change in the Business Environment: Knowledge Management and Knowledge Management Technology, Chandice Publishing, Oxford, UK, 2008.)
The major challenges involved in different stages of the proposed process model
are identified as follows:
Data Management
Acquisition and recording user feedback: In this age of Web 2.0, users generate a
lot of data every second; examples are Facebook, Twitter, and RSS feeds of
news sources and blogs. Some social media sites such as Twitter and Facebook
provide application programming interfaces (APIs); an example is Twitter's OAuth-based API, which helps to retrieve data from data sources. However, these sites
normally put a limit on the number of API transactions per day. For example,
for Twitter, a user cannot get history data beyond one to three weeks. And for
Facebook, the developer application can make 200 calls per hour, per user in
aggregate. This limit gives the data service companies (e.g., podargos or Gnip)
a place in the data market, and they trade data as a business entity. Given the
vast size of social media data available, an individual cannot afford to store
volumes of data even if a crawler can collect it. Moreover, a business cannot
wait for days to collect data when historical data can be purchased in minutes
or hours. Data service companies lay down the infrastructure to collect data,
and then offer data to researchers or businesses at the rate of thousands of
dollars for some months of data. Some free data sources, such as Reddit and
Kaggle, are also available.
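As a sketch of API-based acquisition, the following uses the Tweepy library's Client interface, assuming Tweepy v4 and a valid bearer token; the query string is illustrative.

import tweepy

# Placeholder credential; requires a Twitter developer account.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Recent search covers roughly the last seven days, matching the history
# limits mentioned above.
response = client.search_recent_tweets(
    query="#demonetization -is:retweet lang:en", max_results=100)

for tweet in response.data or []:
    print(tweet.id, tweet.text[:80])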
Extracting, cleaning, and data annotation: After collecting data, some post-processing is required to validate and clean up the data. Huge amounts of subjective, opinionated text documents are available in a dataset. To improve the quality of the input, it needs to be cleaned up by removing all the residual HTML tags; unwanted tags; stop words such as is; repeated letters such as the "i" in hiiiii; and punctuation marks. The data is often collected from different data sources, each of which may have the data in a different representation and format, so all of this data has to be cleaned and prepared for the process of data analysis. Annotation is the process of associating textual units with a set of predetermined labels. Tweets can be labeled positive, negative, or neutral on the basis of the words in the text. There are still challenges in annotating a Tweet that portrays sarcasm: such a Tweet depicts a positive emotional state of mind of a speaker who actually has a negative attitude. Furthermore, extracting sentiments from Tweets written in multilingual text is another challenge. A solution is to translate the text before using it, but the quality of that solution depends on the quality of the translator.
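A minimal sketch of this clean-up step with Python's standard re module follows; the rules mirror the examples above and are illustrative rather than exhaustive.

import re

STOP_WORDS = {"is", "the", "a", "an"}   # tiny illustrative stop-word list

def clean_tweet(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)           # residual HTML tags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # hiiiii -> hii
    text = re.sub(r"[^\w\s#@]", " ", text)         # punctuation marks
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_tweet("<b>This is soooo bad!!!</b> #demonetization"))
# -> "this soo bad #demonetization"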
Data Analytics
Such large volumes of data are worthless if they are not analyzed and used. Data
analytics refers to the analysis of structured and unstructured data for the purpose
of mining meaningful information (intelligence) out of millions of entries in a
dataset. If we focus on data analytics on user feedback available on social media,
the analytics can be put into two categories: content analytics and link analytics. In
this context, content analytics corresponds to Tweet-level sentiment analysis, and
in link analysis the focus is on user-level analysis.
Various metrics have emerged to extract information from user posts.
◾ Tweet-level Analysis
– Sentiment score: Sentiment score is a numeric metric that presents the
polarity of the sentiment expressed in a Tweet. A lower score indicates
negative sentiment. The sentiment score can be further filtered on the
basis of location and gender to understand the demographic structure of
the users providing feedback.
– Understand the temporal change: The time dimension considers how frequently users Tweet on a topic. Tweet frequency on a topic may fade as a function of time. Content on online social media is very volatile: popularity increases in a matter of hours, and fades away at the same speed.
– Change in sentiment score over a period of time: User feedback can turn from positive to negative or from negative to positive over a period of time. Therefore, it is interesting to track the change in user response (see the sketch after this list). This happened in the case of demonetization. People in favor of the move in the beginning criticized the government when they started facing the cash crunch and had to stand in long ATM lines to get cash. They blamed the government for poorly implementing the policy. Though they considered that the move might benefit the economy in the long run, the poor implementation put them off. Sample these Tweets:
“#Demonetization All most three months passed away 90% ATM’s are
still closed as no cash, PMO must intervene and solve the issue”
“Modi is good at theory, bad in implementation. #Demonetization
effect on the poor & unorganized economy is proof ”
“How much Black Money u recovered by #demonetization? who
responsible for 200+ death, job loss, low GDP growth”
– Response length: User feedback in the form of longer posts may be more help-
ful. A user who spends a long time writing a response may be more involved
or have a detailed idea of the situation. But one has to look for the use of
tags, or repeated words in a post. For microblogging sites like Twitter, with a limit of 140 characters, the response length cannot be long.
– What are the commonly used words: The words used in a post can indicate not only the emotions, but can also point toward the issues affecting the users. During demonetization, for example, Tweets highlighted the status of the cottage industry (see Figure 4.4).
Figure 4.4 (a) A word cloud on the day of demonetization; (b) a word cloud created after a few days.
– Understand the spatial change: Spatial change shows how sentiment varies across the states of a country and around the world. People living in small cities may have different opinions than people living in big cities.
However, the topic of a Tweet also plays a role in how it spreads through the community and gets people talking about it. On the other hand, participation of influential users also increases the longevity of a topic in the community discussion.
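A minimal sketch of tracking the change in sentiment score over time with the pandas library; the scored Tweets are invented.

import pandas as pd

# Invented Tweet-level sentiment scores (negative values = negative sentiment).
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-11-08", "2016-11-09",
                                 "2016-11-20", "2016-11-21"]),
    "score": [0.7, 0.5, -0.4, -0.6],
})

# Daily mean sentiment makes the drift from support to criticism visible.
daily = df.set_index("timestamp")["score"].resample("D").mean().dropna()
print(daily)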
Knowledge Management
The purpose of knowledge management is to ensure that the right information is deliv-
ered to the appropriate place or person for making informed decisions. User feed-
back is a valuable source of meaningful information for an organization or enterprise.
There is a need to create a knowledge management system to capture knowledge from
the information extracted from the user feedback in the previous step (explained in
the previous section) of the process. The primary purpose of knowledge management
is to support policy-makers in efficient decision-making. A knowledge base stores and
maintains knowledge assets. A knowledge management system follows standard pro-
cesses to contribute knowledge assets to the knowledge base, and to retrieve knowledge assets from the knowledge base for reuse in unanticipated situations.
At present, many organizations have knowledge management systems in which
employees share their experiences in a central repository. Other members of the
organization can search the knowledge base, and reuse that knowledge in other sit-
uations. For example, Infosys employees use K-shop as the online knowledge portal
available on the company intranet. Such a system is restricted to employees only.
We propose to extend it to users to incorporate their feedback in the knowledge
base. Every project should have an integrated model with a knowledge management system in place to learn from the current implementation.
For example, in our case study in this chapter, the government can learn a lot
from the user feedback available on social media. We believe that the policy-makers
had brainstormed before introducing the demonetization move to understand the
impact of the policy on the general public. But if we look at the rule changes intro-
duced by the government after the demonetization, it seems that the policy-makers
learned a lot later when people actually started facing the problems. There were
several changes in rules regarding cash deposits, withdrawals, and exchanges until
the demonetization drive ended on March 31, 2017. Interestingly, the time period
around November 8 is considered auspicious for weddings in India. Several users mentioned the word in their Tweets on the very first day (see the word cloud in Figure 4.4). But an exception rule was issued on November 17 (a week later) to
allow families with weddings scheduled to withdraw up to Rs 2,50,000 from their
bank accounts. It shows that policy-makers were either in a hurry or could not
foresee the hardships the genuine public was going to face.
This makes a case for the need of a knowledge management system as a compo-
nent of the complete process that caters to the extraction of knowledge from user
feedback, and then manages this knowledge for future use. Policy-makers can learn
from and preserve the commonly used words in the user feedback (see Figure 4.4).
In the future, when such a drive takes place again, they can consult the knowledge
base and avoid repetition of the problems in future implementations.
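A minimal sketch of deriving such a word summary with the third-party wordcloud package; the Tweet texts are placeholders.

from wordcloud import WordCloud

# Placeholder corpus; in practice this would be the day's collected Tweets.
tweets = ["no cash at the ATM again",
          "weddings postponed because of the cash crunch",
          "bold step against black money"]

wc = WordCloud(width=400, height=200).generate(" ".join(tweets))
wc.to_file("demonetization_wordcloud.png")   # an image for the knowledge base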
In addition to this, expert users can also be identified on the basis of their
involvement in topics and their understanding of the issues. When policy-makers
work on suggestions provided in user feedback for current policies or for simi-
lar policies in the future, such users can be included in the discussion group. To
select expert users as knowledge assets for a topic, there is a need to identify poten-
tial users who can contribute to that topic. For example, to discuss the impact of
demonetization on “weddings,” a set of suitable users can be identified and then
ranked on the basis of their involvement in the topic. The most suitable users can then be involved in the discussion.
As an example of an expert user, a news editor named Narayan Ammachchi intuited the government's demonetization move almost a year in advance and Tweeted about it (source: Twitter).
Conclusions and Future Work
Most existing studies of user feedback focus on reviews of products and services such as hotels, restaurants, movies, and so on. No study, to the best of our knowledge, focuses on user feedback regarding public policies to gauge the involvement of the general public in a government's policy decisions.
For our analysis, we chose the Indian government's demonetization move of November 2016 as a policy decision that affected almost the entire population of the country. There was a lot of anger on the microblogging site Twitter as people complained about the poor implementation of the policy. However, the same government swept state elections post-demonetization, which indicates the popularity of the government in a country that follows a democratic system of governance. Perhaps it points toward the digital divide in the country. Low levels of literacy combined with low penetration of smartphones keep a significant portion of the population away from the digital mainstream. Furthermore, women and girls in rural areas are deprived of digital technology due to the dominant ideology of patriarchy.
This study is limited to user feedback available only on the microblogging site
Twitter. However, it shows a promising direction in which work can be further
extended to integrate knowledge derived from user feedback with the decision-
making process in a public organization.
References
Abookire, S., Martin, M., Teich, J., Kuperman, G., and Bates, D. (2000). Analysis of user-
feedback as a tool for improving software quality. In Proceedings of the AMIA Symposium.
Agarwal, N., and Liu, H. (2009). Modeling and data mining in blogosphere. Synthesis
Lectures on Data Mining and Knowledge Discovery (Vol. 1), pp. 1–109. https://doi.
org/10.2200/S00213ED1V01Y200907DMK001.
Appel, O., Chiclana, F., and Carter, J. (2015). Main concepts, state of the art and
future research questions in sentiment analysis, Acta Polytechnica Hungarica,
12(3), 87–108.
Barbier, G. and Liu, H. (2011). Data mining in social media. In C. Aggarwal (Ed.), Social
Network Data Analytics, Springer, New York.
Botha, A., Kourie, D., and Snyman, R. (2008). Coping with Continuous Change in the
Business Environment: Knowledge Management and Knowledge Management
Technology, Chandice Publishing, Oxford, UK.
Bower, B. (2017). Online reviews can make over-the-counter drugs look way too effective:
“Evidence-based hearsay” and personal anecdotes overshadow clinical trial results,
March 14, 2017, Science and the Public. Available at https://www.sciencenews.org/
accessed on April 14, 2017.
Dong, L., Wei, F., Zhou, M., and Xu, K. (2014). Adaptive multi-compositionality for
recursive neural models with applications to sentiment analysis. In Twenty-Eighth
AAAI Conference on Artificial Intelligence (AAAI), Québec City, Québec.
Han, J. (2006). Data Mining Concepts and Techniques. Morgan Kaufmann, San Diego, CA.
Hridoy, S., Ekram, M. et al. (2015). Localized Twitter opinion mining using sentiment
analysis. Decision Analytics 2:8, 1–19.
Joshi, A., Ahir, S., and Bhattacharyya, P. (2015). Sentiment resources: Lexicons and
datasets. In D. Das and E. Cambria (Eds.), A Practical Guide to Sentiment Analysis,
Springer, Cham, Switzerland.
Khan, K., Baharudin, B., Khan, A., and Ullah, A. (2014). Mining opinion components
from unstructured reviews: A review. Journal of King Saud University—Computer and
Information Sciences 26, 258–275.
Labrinidis, A., and Jagadish, H. (2012). Challenges and opportunities with big data.
Proceedings of the VLDB Endowment, 5(12), 2032–2033.
Liu, B. (2011). Sentiment Analysis Tutorial, AAAI-2011, University of Illinois, Chicago, IL.
Mohammad, S. M. (2015). Challenges in sentiment analysis. In E. Cambria, D. Das,
S. Bandyopadhyay, and A. Feraco (Eds.), A Practical Guide to Sentiment Analysis.
Springer International Publishing, Cham, Switzerland.
Mohammad, S. M. (2016). A practical guide to sentiment annotation: Challenges and
solutions. In Proceedings of the Workshop on Computational Approaches to Subjectivity,
Sentiment and Social Media Analysis. San Diego, CA.
Pawar, A. B., Jawale, M. A., and Kyatanavar, D. N. (2016) Fundamentals of sentiment anal-
ysis: Concepts and methodology. In W. Pedrycz and S. M. Chen (Eds.), Sentiment
Analysis and Ontology Engineering: Studies in Computational Intelligence, Vol. 639.
Springer, Cham.
Qiu, B., Zhao, K., and Mitra, P. (2011). Get Online Support, Feel Better—Sentiment
Analysis and Dynamics in an Online Cancer Survivor Community. IEEE.
Salameh, M., Mohammad, S. M., and Kiritchenko, S. (2015). Sentiment after translation:
A case study on Arabic social media posts. In Proceedings of the North American
Chapter of Association of Computational Linguistics, Denver, CO.
Sun, C. (2016). Predict Movie Rating. Available at https://blog.nycdatascience.com/
student-works/github-profiler-tool-repository-evaluation/ retrieved on April 10, 2017.
Van der Meer, J., Boon, F., Hogenboom, F., Frasincar, F., and Kaymak, U. (2011). A frame-
work for automatic annotation of web pages using the google rich snippets vocabu-
lary. In 26th Symposium On Applied Computing (SAC 2011), Web Technologies Track,
pp. 765–772. Association for Computing Machinery.
Vergne, M., Morales-Ramirez, I., Morandini, M. et al. (2013). Analysing user feedback and
finding experts: Can goal-orientation help? In Proceedings of the 6th International i*
Workshop (iStar 2013), CEUR, Valencia, Spain, Vol. 978.
Yang, J., and Leskovec, J. (2011). Patterns of Temporal Variation in Online Media. WSDM’11,
February 9–12, 2011, Hong Kong, China.
Chapter 5

Relating Big Data and Data Science to the Wider Concept of Knowledge Management
Contents
Introduction ......................................................................................................142
The Wider Concept of Knowledge Management ...............................................143
The Shift in Data Practices and Access ...............................................................144
Data Science as a New Paradigm .......................................................................146
Big Data Cost and Anticipated Value ................................................................147
Information Visualization..................................................................................148
Data Analytics Tools .......................................................................................... 152
Applications and Case Studies ........................................................................... 157
Emerging Career Opportunities ........................................................................160
Conclusion ........................................................................................................163
References .........................................................................................................163
Introduction
The exponential growth in data and information generated on a daily basis has impacted
both personal and organizational work environments. The acquisition, analysis, and
subsequently the transformation of data into information and actionable knowledge
is dependent to a large extent on advances in technology and the development of
highly effective and efficient data-driven analytical tools. Analytical data-driven tools
have become paramount in the sorting and deciphering of large amounts of digital
information. The general consensus today is that we are living in the information age
and are overwhelmed by Big Data and the amount of information generated on a daily basis, whether at the personal or the organizational level, a condition also known as information overload. Information overload is a reality, and it has potential negative implications for both the health of the individual and the health of the organization.
It is becoming increasingly clear that the acquisition, analysis, and transfor-
mation of Big Data into actionable knowledge will depend to a large extent on
creating cutting-edge technologies, advancing analytical and data-driven tools,
and developing advanced learning methodologies (McAfee & Brynjolfsson, 2012;
Waller & Fawcett, 2013; Chen, Chiang, & Storey, 2012). The application of these
tools will also depend on the level of investment made in training individuals as
well as organizations to manage knowledge with purpose.
Technology is the driving force behind the creation and transformation of Big
Data and data science. Today, there are many tools available for working with Big
Data and data intensive operations, but the most important and significant challenge
facing the industry is the lack of expertise and knowledge needed for implementation. We must ask: how good are these tools, and who has the skill set to operate them? But most importantly, who has the knowledge and expertise to
interpret the results and transform the information into actionable knowledge?
The answer to these questions depends on two types of knowledge-oriented com-
ponents that must work together in tandem. The first component is knowledge in
the form of data and information (explicit knowledge) and the second is knowledge
in the form of skills and competencies (tacit knowledge), both of which are needed
for research, development, and implementation. In the current era of Big Data and
data science-driven initiatives, many CEOs will readily attest that their competitive
advantage is the expertise of their employees, or human capital, but this in itself pres-
ents a challenge as this type of knowledge is not easily captured, stored, or transferred
(Nonaka & Takeuchi, 1995; Liebowitz, 1999; McAfee & Brynjolfsson, 2012). There
is a body of knowledge in the literature concerning the relationship between explicit, or documented, knowledge and tacit knowledge, which presents itself in the form of skills and competencies that one would consider almost intuitive or
nonverbal (Hedlund, 1994; Orlikowski & Baroudi, 1991). However, simply being in
agreement that both of these components are necessary in the furthering of knowledge
will most likely not substantiate actual growth, but rather requires an active state of
ownership to transform tacit competencies to documented and shareable knowledge.
Figure: The interacting components of KM (business practices, process, and people).
new systems and workflows, with much of the automation coming about through
normalizing a sequence of events that have proven to be efficient and effective. The
intersection of information and technology has been the cornerstone in the way
that data and information are generated and used. It can also be viewed as the tool needed in this case for data and text analytics. At the middle of this intersection lie the KM processes and practices needed to create the paramount transformation of tacit knowledge into explicit knowledge. For KM to work, the various components must be put together to enable us to see the big picture and connect the dots.
Another aspect of KM is the ability to process and integrate information from
multiple sources and in varying formats, inclusive of information in such magnitude
that we now call it “Big Data,” as well as one’s reliance on data-driven analytics from
differing perspectives (Lamont, 2012). This goal is quite similar, if not synonymous,
with the idea of data science, but what is data science? Many individuals grapple
with defining data science based on its interdisciplinary nature, as it involves the
application of statistics, collection and storage of Big Data, creation of visualizations,
and so forth, but at the same time each of these items can also be housed in informa-
tion science, computer science, business, or even graphic design and art.
Data science uses automated tools and mechanics to extract and relay informa-
tion, often from very large amounts of data, to assist individuals and organizations
in analyzing their own data for better decision-making processes. These tools have
been applied to almost every realm of business and industry, as powerful insights
can be derived to then generate critical knowledge and action plans. Impacting
scientific infrastructures, workflows, processes, scholarly communication, and
certainly one’s health and well-being, the application of measurements to over-
whelming amounts of information is at the core of data science. It’s important to
keep in mind that data itself isn’t that important without a context in which to
apply it, as each organization has different needs and curiosities, meaning that a
wide variety of tools are now available to accommodate these demands for more
information.
The stock market is a prime example of business activities that rely heavily on
analytics and analytics-driven tools. Traditionally one could not trade successfully
without relying on the knowledge and expertise of a stock broker who would have
access to information relating to trading stocks. All of the information regarding
the health of the companies that were being traded and changes within the market,
such as an initial public offering (IPO), mergers, or bankruptcy, was handled and
disseminated through a broker. With information and data about the stock market
quickly becoming available online in the late 1990s, the end user suddenly had the
ability to become an actor in what is today termed the knowledge economy.
The knowledge economy is a concept that has revolutionized the way we do business and demolished traditional geographical market boundaries. It's difficult to
imagine a time period where we were so dependent on local markets and how
we’ve quickly become accustomed to the culture of instant access to information.
Whether it’s retrieving information via Google, the buying and selling of products
and services online, or conducting one’s banking online in real time, this depen-
dency on instant access to information is a by-product of the increased values of
knowledge, the knowledge economy, and knowledge management.
Smartphones have revolutionized the way we access information and have given
us the ability to conduct searches and perform online transactions 24 × 7, removing
the dependency on middlemen, increasing autonomy, and reducing the associated
waiting time of many transactional processes. This technology has not only reduced
one’s waiting time but has also removed many of the physical barriers that were
once in place, thereby increasing productivity. Online banking through mobile
applications is a good example, as it minimized the need to go to a physical location
to deposit checks, and provides real-time transaction history, including deposits,
withdrawals, and pending items, so that the consumer has the most accurate view
of his or her account. It not only helps the consumer feel more informed but also
assists in establishing a trusting relationship between the consumer and the bank
provider. The elements of trust and privacy were two of the main concerns to many
consumers in the past, especially in the context of online transactions, and while
this issue still remains a concern, as more and more people use online transac-
tion processes, one’s increased access to information minimizes those fears and
fosters a trusting relationship between the consumer and the brand (Rittinghouse &
Ransome, 2016; Chiregi & Navimipour, 2017).
Access to information in the medical field has long been seen as an issue of great concern. However, the medical field has seen tremendous change in the
last decade, with almost all hospitals and doctors’ offices transitioning from
paper charts to digital electronic records. This shift to a digital format not only
minimizes errors in record keeping, but it also allows the patient to access their
own records and take charge of their care under their own terms. Interacting
with one’s clinic as well as a pharmacy in an online environment has become
common practice. Online patient portals are becoming commonplace, espe-
cially for specialists who see the same patients on a somewhat frequent basis.
This portal not only allows users to schedule appointments and find helpful
resources, but it can also work as a receptacle for storing test results, prescrip-
tion dosage amounts, and can even facilitate as a record of one’s entire medical
history. Research pertaining to patient portal use shows a significant increase in
dialog between the patient and medical staff, increased attendance in follow-up
visits, and increased recovery, with patients attributing these successes to the
feeling of being autonomous and in control of their level of care and communi-
cation (Zarcadoolas, Vaughon, Czaja, Levy, & Rockoff, 2013). Studies show that
when people feel autonomous and are made to believe that they are in control
of their choices and actions, especially when they have a medical condition that
needs their continued attention, that their adherence to a regimented protocol
improves as their access to relevant information is increased (Kipping, Stuckey,
Hernandez, Nguyen, & Riahi, 2016).
The notion of increased access to data and information is supported by the
advancement of technologies, such as the advancement of data analytics tools,
which enhances one’s knowledge and awareness. The ability to access information
at the right time within a particular situation, without the concern of one’s privacy
being violated, empowers people to make better decisions and to take control of
their lives.
cloud computing, and database management are highly sought after by employers, as
they want to be reassured that the business’s information will not be compromised.
Companies today value intellectual capital including the employees’ expertise, cus-
tomer profiles, and organizational memories as some of their greatest competitive
advantages (Duhigg, 2012). It’s commonplace in online sales and marketing for orga-
nizations to seek clients’ historical information, an action that is usually included in
the barely legible fine print of the contract that you agree to when making any large
purchases, such as a new car, or even signing up for a credit card (Campbell & Carlson,
2002). The compilation of this information alone isn't of great value until one puts forth the effort of analyzing the data, with the purpose of making sense of it and finding patterns among the records. The activity of analysis is often very time-intensive, especially for a scientist or researcher who does not have the skill set for
the available technologies. We detail some of the currently popular tools later in this
chapter that have become mainstream in the realm of data science, particularly ones
that provide increased levels of analysis applicable to the needs of all organizations,
even those who do not see themselves in the business of data.
Variety is pretty easy to define, as it’s the inclusion of data in many different for-
mats, from traditionally structured data, such as numeric data, to qualitative data,
such as videos and audio that add complexity and robustness to one’s knowledge
of a subject.
Variability refers to the volatility in data and datasets. It is different from variety
in the sense that variability deals with the changing nature of the datasets due to
uncontrolled variables. We see changes in data over a period of time when analyz-
ing large datasets such as stock market information. The types of data collected
and used in stock market-forecasting changes daily, and is based on a wide range
of variables that include, human, social, political, and economical value. The vari-
ability of that data sometimes makes it complex and difficult to predict or forecast.
Scalability is another important capability that is necessary when dealing with Big Data. Scalability refers to the capabilities we build to handle Big Data across computer networks, information systems, databases, analytics tools, and so on. A tool's
scalability means that it can handle changing amounts of data depending on the
task, giving flexibility to the user to apply one technology to many situations. Some
of the popular tools for the handling of Big Data to be discussed in further sections
include Cloudera, Hadoop, and Spark.
The value of Big Data is not only in the data itself but rather in what an orga-
nization can do with it. Therefore, knowledge discovery from Big Data using
advanced data science and data analytics tools is critical to an organization’s overall
intellectual capital. But without the infrastructure and the technological capabilities to make use of Big Data, the question of how to handle the data could become a burden to an organization. Organizations will have to make significant investments
in capturing, processing, and managing large datasets, and the ROI can only be
measured by the value of the knowledge that can be harvested from that data. This
could be a challenge for an organization given the tacit nature of that knowledge.
Therefore, an organization must have an intellectual capital and KM strategy in
place that values tacit KM, and should be able to conduct knowledge audits from
time to time to justify the organization’s ROI.
Information Visualization
Data and information visualization is key to the process of making sense of Big Data and data analytics. For the knowledge discovery from a dataset (large or small) to be complete, there must be a definitive point of application in which the outcomes convey meaningful action. Data themselves are granular in the large scheme of knowledge, especially if one hopes that these data will precipitate a change in human behavior. Let's think of this within the scope of an item that we are all
familiar with, which is the nutrition facts panel as seen on all prepackaged foods
in the supermarket. This label itself is not dramatic or enticing, probably because
we are so accustomed to these labels being present that the majority of individuals
don’t actually reference the data contained within the panel unless there is a health
concern (situation) caused by something such as diabetes or a desire to lose weight,
and so on.
The panel contains at minimum the number of calories per recommended serving and the grams of carbohydrate, fat, and protein, as well as the daily percentages of these macronutrients needed by the average healthy individual. Making sense of the data depends on one's frame of reference and one's understanding of the metrics used within the panel (Reber, 1989; Velardo, 2015). Knowledge acquisition and utilization happen when users are able to understand the information conveyed to them easily, but very few packaged products provide visual aids that give users such a frame of reference for making decisions; therefore, the data included in the panel often go unreferenced.
While some visualization techniques are simpler to understand, such as the
nutrition facts label panel example above, not everything visual is easy to compre-
hend without a certain degree of training and expertise. Viewing medical images
without the proper training needed to contextualize the data to understand the
results can be problematic. To perform any level of analysis through visualization (sight is reported to be the most personally valued of the five senses in the Western world; Kambaskovic & Wolfe, 2014), one must first contextualize the data and have a degree of familiarity with the attributes of data visualization.
Data visualization is a hot topic at the moment, with entire university courses devoted to its discussion, as its amorphous nature includes the joining of both quantitative and qualitative data while dancing in the realms of computer science, art, graphic design, and psychology (Healy & Moody, 2014; Telea, 2014). It's difficult to assign data visualization to one discipline: the recently built tools rely on the finesse of computer scientists who have the skillsets to build software that renders an image from code, such as the outputs of Python or R, but these tools also have a holistic presence that relies on the expertise of artists and graphic designers. The visualization tool Tableau is especially popular in the business sector, providing aesthetically pleasing graphs and diagrams appropriate for many different types of users and tasks, from analyzing a client's data to presenting the final analysis to the client, as its functionality and use of color and dimension are almost limitless. The selection of colors used to relay information shouldn't be considered secondary to the importance of the information itself; we have been conditioned to associate colors with particular meanings or feelings, and these associations can sometimes be more blatant than the information itself if not appropriately assigned.
In the United States, people associate the color red with centrality or warning, with a heat map usually assigning the deepest shade of red to the area of most importance. When looking at a weather heat map for a storm, the location with the most severe conditions is almost always assigned red, with the intensity fading to orange and yellow toward the perimeter of the storm. The color red has also been used for stoplight labeling as a definitive warning to not proceed, and some manufacturers have even experimented with this three-color system (inclusive of yellow and green) to help consumers more easily understand the nutritional value of prepackaged items: desserts carry a red label, meaning take caution, while milk and eggs carry a green label, which gives permission to proceed in purchasing and consuming. On the theme of nutrition, these color-coded labels have proven especially helpful for individuals making health-conscious food choices, especially those with limited education (including youth and adults) who lack certain literacy skills needed to analyze the data included on the nutrition facts label panel or the list of ingredients on the back of the packaged item (Sonnenberg et al., 2013).
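To make the color convention concrete, the following minimal Python sketch (using matplotlib, with a hypothetical intensity grid) maps the highest values of a severity heat map to the deepest red, fading to yellow at the low end, in line with the warning convention described above.

```python
# A minimal sketch of meaningful color assignment in a heat map,
# assuming a hypothetical "storm intensity" grid.
import numpy as np
import matplotlib.pyplot as plt

intensity = np.random.default_rng(0).random((10, 10))  # hypothetical data

fig, ax = plt.subplots()
# "YlOrRd" runs yellow -> orange -> red, so the deepest red marks the
# highest values, matching the warning convention described above.
im = ax.imshow(intensity, cmap="YlOrRd")
fig.colorbar(im, ax=ax, label="Storm intensity")
plt.show()
```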
Producing visualizations is especially popular because one can summarize the importance of the information presented in far less time than it would take each viewer to comb through the data itself in search of its meaning (Fayyad, Wierse, & Grinstein, 2002). Visualizations, by industry's definition, are simply pictorial or graphical images created for the purpose of presenting data in a summarized format, and they range anywhere from a simple bar graph or pie chart to a 3D color-saturated heat map with moving elements (https://www.sas.com/en_us/insights/big-data/data-visualization.html). With this definition in mind, we challenge the reader to consider all of the visualizations that one encounters on a daily basis, as in our quick analyses we take for granted many items that represent quite complex amounts of information. One of the most popular visualizations summarizing a large amount of data in a single image is Napoleon's successive losses as his French troops crossed Russia in 1812–1813, as drawn by Charles Minard in 1869 (Tufte, 2001; Friendly, 2002). Minard is considered the first to use graphics in combination with engineering and statistics, specifically numeric data represented on graphical maps.
The map below combines six different types of data: the number of soldiers within the troops; the direction of travel (it changes as they advance and then retreat); the distance traveled across Russia; the temperatures encountered; latitude and longitude; and the location of each report. The thick beige line represents the number of soldiers in the troop at the start of their journey, and it visibly diminishes as the line extends from left to right. The black line then represents the number of soldiers still alive as they began the second half of their journey, the return to the western-most point of the map, as seen when the black line diminishes from right to left. It is evident from the scale of these two lines that the majority of the troops did not complete the entire march across Russia and back again. This image is often included within the coursework of visualization classes, as it compels the viewer to consider all of the different aspects and the amount of information included in such a simple graphic.
Minard’s carte figurative of Napoleon’s 1812 campaign. (From Tufte, E., The
Visual Display of Quantitative Information, 2nd ed., Graphics Press, Cheshire,
CT, 2001.)
SAS EM (SAS Enterprise Miner) is powerful data mining software whose makers claim that it offers more predictive modeling techniques than any other data mining or data analytics tool currently on the market. Within the interface, models are built with
a visually pleasing process flow diagram environment, giving users the ability to test
the accuracy of their predictions using graph-oriented displays. The appeal to users
in both business and academic sectors is that the software is scalable, meaning that
it can be utilized by an individual as well as an entire team of people, can operate
on datasets of varying sizes, and that it is available in a cloud-based environment.
Built on the SEMMA methodology, which stands for sampling, exploring, modi-
fying, modeling, and assessing, SAS EM provides users the ability to find patterns
and relationships in Big Data. With user-friendliness always a concern, SAS EM offers a clean interface simple enough not to overwhelm an individual with minimal statistical knowledge, but complex enough to still be of use to someone well-versed in modeling, designing decision trees and neural networks, and performing linear or logistic regression (Matignon, 2007).
In 2008, researchers used the neural network method of SAS EM to diagnose heart disease, based on the patient profiles of individuals with and without heart disease, and produced an 89% accuracy in their predictions (Das, Turkoglu, & Sengur, 2009). This ensemble-based methodology was made possible by splitting the dataset into two parts, one containing 70% of the data for training and the remaining 30% for validation of the proposed system. Such a split is an integral part of neural network modeling, as it is impossible to confirm the accuracy of a model unless it is applied to comparable data that was not used in its creation.
Visualizations have come a long way in a very short amount of time, combining the resources of art and graphic design with analytics to create concise images that represent much more complex amounts of information (Fox & Hendler, 2011). For example, almost all of the Fortune 500 companies utilize infographics, which are visual images that typically include text, charts, and diagrams. Tableau uses drag-and-drop dimensions and offers users simple data mining capabilities that have historically only been available to those with a complex coding skillset (Tableau.com, 2017). Tableau gives users the ability to create time series-driven visualizations, such as a moving depiction of the path of Hurricane Katrina across the Gulf of Mexico in 2005. Based on the data provided, the system showed not only the movement but also the change in the intensity of the storm through the use of color as well as varying degrees of opacity. Users can easily use the software interface to create visualizations and save the results in a shareable format, such as a PDF, JPG, or PNG.
To spark interest and plant a seed of inspiration, Tableau's website offers thousands of examples of visualizations built using real-world data. To incentivize educators to use Tableau within their classrooms, instructors as well as students are given a free annual license, and instructors have the added benefit of access to course materials created by Tableau that can be integrated within their own course and research objectives. Librarians at Ohio State University
(OSU) utilized Tableau to visualize library data on three very different datasets,
all of which produced results that better enabled them to meet the needs of their
university patrons. The datasets illustrated the library's potential to promote library collections in tandem with marketing the library's programming, showed how combining data from multiple sources could support the prioritization of digitizing special collections, and lastly demonstrated the ability to better understand user satisfaction from survey data through Tableau's filtering functions (Murphy, 2015).
While these tools are considered to be cutting edge, their accessibility comes at
a cost, most commonly seen as an annual licensing fee either by person or institu-
tion. For parties who desire the same results, such as text analysis and data mining,
but don’t have the resources to purchase the aforementioned software, there are a
number of free alternatives. Users should be reminded, however, that anything advertised as "free" usually has hidden costs, an idea popularized with the acronym TANSTAAFL, meaning "There ain't no such thing as a free lunch" (Grossmann, Steger, & Trimborn, 2013). Within the context of tools and technologies, this cost is invariably the user's time and attention, required at the beginning to learn how to navigate the tool and generate the desired results. A prime example of this would be the open-source and very popular coding languages R and Python.
Python is a multifaceted language and offers everything from data cleaning to graph-
based visualizations, with over 40% of data scientists reporting it as their “go-to” for
problem solving, per the 2013 survey by O’Reilly (http://www.oreilly.com/data/free/
stratasurvey.csp). It is no wonder that this language has been so widely accepted, as it supports all kinds of data formats, from CSV files to SQL tables, and the online community is robust in offering examples and inquiry support (Perkel, 2015). R, on the other hand, while also open source, is considered by many users to have a steeper learning curve than Python due to its complicated syntax, but it is built with the data scientist in mind, as its language is focused on solving statistical problems and generating scientific visualizations (Hardin et al., 2015; Edwards, Tilden, & Allevato, 2014).
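As a taste of the Python workflow described above, the following minimal sketch reads a hypothetical CSV file with pandas, performs a basic cleaning step, and renders a simple visualization with matplotlib; the file name and column names are assumptions for illustration.

```python
# A minimal sketch of a Python data workflow: load, clean, visualize.
# "sales.csv" and its columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                # load structured data
df = df.dropna(subset=["month", "revenue"])  # basic data cleaning

df.plot(x="month", y="revenue", kind="bar", legend=False)
plt.ylabel("Revenue")
plt.title("Monthly revenue")
plt.tight_layout()
plt.show()
```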
The results produced by writing code in either of these languages can assist the user and spark hidden discovery, but to arrive at this point, the user will first need to spend a great deal of time becoming acquainted with the language's syntax. Almost all undergraduate computer science programs offer courses in programming languages, with Python currently the most popular introductory language offered at colleges within the United States, compared to business schools, which advertise their access to the proprietary tools mentioned prior (Brunner & Kim, 2016). As discussed more deeply in other sections of this chapter, there is an apparent disconnect in the data analytics-driven tools introduced to students within the collegiate environment depending on their registered course of study, which leads these areas of expertise to become silos rather than being shared for the mutual benefit of all parties who have the same interests (Thompson, 2017).
Additional platforms and tools for mining large datasets and creating models include Cloudera and Apache's Hadoop and Spark. Hadoop has made its name in the software arena in two ways: through its unconventional storage capabilities (splitting files into large blocks that are distributed across multiple storage systems, with the data processed in parallel across multiple nodes), and through its fast processing procedures built upon the MapReduce framework (Padhy, 2013). Similar to Hadoop, Spark can handle large datasets, and its speed is marketed as being up to 10 times faster than that of Hadoop. Both of these tools have a learning curve, but they are considered some of the more user-friendly options for tackling gigantic datasets using open-source packages. These tools are useful to companies that have large and complex datasets, and they can also be used in tandem, as they are not mutually exclusive; the size of the dataset and the needs of the end users determine which option is more appropriate at a given time (Lin, Wang, & Wu, 2013).
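For a sense of what working with Spark looks like in practice, here is a minimal PySpark sketch, assuming a local Spark installation and a hypothetical transactions.csv with customer_id and amount columns; the aggregation is distributed across the cluster's nodes rather than run on a single machine.

```python
# A minimal PySpark sketch; the input file and column names are
# hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BigDataExample").getOrCreate()

df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# The aggregation runs in parallel across the cluster's nodes.
totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total_spent"))
totals.orderBy(F.desc("total_spent")).show(10)

spark.stop()
```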
A precursor to the age of Big Data was the advent of e-commerce, the selling of products and services online, which we can see as early as the late 1990s (Kohavi, 2001); it was in this online environment that the transactional process became completely trackable. Google has made this tracking process almost too easy not to take advantage of, especially for small businesses with limited budgets, through its free Google Analytics dashboard (Clifton, 2012). To benefit from the bevy of data available for collection through Google Analytics, snippets of code are placed on each page of the specified website, and within seconds, the user can monitor the engagement, activity, and trends forming based on visitors to the site (Fang, 2007). These analytics are available in real time in an interactive dashboard that is extremely user-friendly, and they can even be integrated through APIs to populate third-party dashboards. While there are many other offerings in the Google suite of services, we focused on this one due to its market share, with estimates that it has been used on more than 90% of the websites available today.
In 2008, Google launched an effort to help predict trends in illness, specifically the flu, based on users' search queries. The Centers for Disease Control and Prevention (CDC) has a lag time of up to three weeks in collecting data and then making the formal announcement of an epidemic, but Google was able to predict trends in almost real time through its data collection mechanisms (Lazer, Kennedy, King, & Vespignani, 2014). When initially introduced, the idea that illness and disease could be minimized by pinpointing the volume and location of searches for similar symptoms was met with great fanfare and excitement. This venture gave hope not only to the medical communities in the United States, but also to much poorer countries, where medical resources are often limited and time is of the essence.
However, in August 2015, Google quietly terminated this initiative due to an increasing number of inaccurate predictions, as Google's model failed to adequately account for changes in user search behavior (Pollett et al., 2016). As users, we've become much more adept and accurate in our search queries, as the popular search engines' algorithms now all rely on aspects of semantic analysis, that is, the meaning and context behind the entire search phrase rather than just the meaning of each individual word used within it. This improvement in user search behavior, coupled with the fact that not everyone searching with flu-related terms actually had the flu, produced inaccurate results from the model. Someone may have searched because of a simple headache, while another user may have entered a query to learn more about flu season itself, or about where to get a preventative flu shot. Researchers are hopeful that in time the model will be fine-tuned and able to deliver accurate analyses and predictions based on real-time search data, as originally promised.
While not normally called "data analysts," those who monitor any kind of technology that records data to be retrieved and used at a later date certainly utilize a similar skillset. The world of medicine and medical devices has been positively impacted by the integration of analytics, namely digital analytics, as healthcare professionals now have greater precision and clarity with which to provide the safest and most efficient care possible (Alyass, Turcotte, & Meyre, 2015). There is a long list of technologies that have transformed the way we conduct business, but the foundational element of all of these tools is that the user is now in control of generating new information that will in turn become knowledge, and through its application, the promise of progressive change is possible (Howe et al., 2008).
Every aspect of business has been impacted by increased access to data analyt-
ics, even including the wine industry, which has traditionally been perceived as a
more manual field, steeped in the romanticized idea of picking and sorting grapes
by hand to preserve a certain level of product quality (Dokoozlian, 2016). Many wineries are starting to integrate optical sorting machines in their facilities, the leading brand currently being WECO (http://www.weco.com), whose system uses a camera with LED lights to illuminate each individual grape during the sorting process, measuring the fruit's surface for defects, sorting based on the desired size of the grape, and discarding items that don't meet the set standards for optimal wine production, including stray twigs, leaves, and debris.
The system can even be calibrated for color and density, selecting grapes that
have particular hues that represent the ripeness or saturation of sugars within the
grape, as each winemaker desires different levels of these components. While one may argue that this is a stretch as an example of data analytics, analytics themselves don't have to be a physical printable image or graph; they can be any information derived from the application of measurements or statistics. These same machines are also used within the farming and food manufacturing channels, revolutionizing entire industries by minimizing manual labor and risk while increasing accuracy for a wide gamut of items that we consume on a daily basis, including but not limited to nuts, olives, berries, and grains. This type of optical sorting has been used for some time now, specifically within the grain industry, as a method of removing grains that are infected or contaminated, minimizing the risk of illness in both humans and livestock (Dowell, Boratynski, Ykema, Dowdy, & Staten, 2002). While a data-driven tool may have been originally created to facilitate the needs of more traditional or scientific disciplines, with time, most industries have started to integrate and adapt these practices.
Even if you don’t live in a smart city (yet), you probably go to the grocery store,
or have a membership to one of the bulk-item retailers, such as Costco or Sam’s
Club, and the chances are pretty good that you have a loyalty card tied to one of
these merchants, either physically presented or keyed-in using your name or phone
number at the point of transaction. Users are generally incentivized to use these
cards to receive discounts, and at the same time the retailer collects data on which
items were purchased together and at what frequency, and so on. This process has
become the norm and most of us no longer give the sharing of this information
a second thought, but what if the data contained within these historical records
could actually be of a much more beneficial use than originally estimated?
A prime example of harnessing this innocent, some would even say unnecessary, data for knowledge utilization and the betterment of humanity is the research conducted at the University of Oxford, with a keen interest in both Big Data and
dementia (Deetjen, Meyer, & Schroeder, 2015). Dementia is a good example of a chronic disease that often causes an individual to exhibit particular behaviors for quite a long period before diagnosis, and, like many chronic diseases, it is speculated to be heavily influenced by diet.
The researchers posit that medical professionals should utilize grocery store loyalty card data to gather more information regarding the prior dietary choices of individuals who have a current diagnosis of dementia, as well as to be proactive in spotting the early signs of the disease. One of the behaviors exhibited in the early stages of dementia is forgetfulness or decreased memory, and this is especially prevalent when grocery shopping, with some people buying the exact same item over and over again with increased frequency. Multiple grocery store brands have already volunteered to share this data with the medical profession, and if approved by legislators, the release of this longitudinal data for analysis could prove to be quite life-altering.
Another topic of recent debate is the 2016 United States presidential election. The outcome of the election surprised both analysts and statisticians who analyzed and presented data, which had led the majority to believe that the Democratic candidate would win by a landslide. A popular website predicting the outcome of the election was http://www.FiveThirtyEight.com, hosted by Nate Silver, who used vast amounts of collected data, paired with rigorous methodology, to provide a quantitative analysis of the discussions and opinions heard across a broad audience. The site correctly predicted the 2008 election won by Barack Obama, with its final forecast accurately predicting the winner in 49 of the 50 states (missing only Indiana) as well as the District of Columbia. Silver again shocked the nation by accurately predicting the outcomes of all 50 states in 2012, including the 9 swing states, but in 2016 the model failed. FiveThirtyEight's final forecast gave Hillary Clinton a 71% chance of winning and Donald Trump only a 29% chance.
The data analyzed included a wide range of polling methods, an approach similar to the data collected for the two previous elections. A reason given for this error in accuracy was not the method used, but rather the reliance on the final data collected, an example of how the variability of Big Data can change a predictive outcome. Social media platforms played a large role in this false feeling of security, as users' posts so often lack independence in this age of digital transparency, and while a single Tweet can be viewed hundreds of thousands of times, it really only represents the sentiment and opinion of a single person (Rossini et al., 2017). The election has brought increased attention to the information gathered, and the subsequent knowledge formed, by users who rely solely on data presented online, specifically from social media sites such as Facebook and Twitter.
Social media as a channel for communication has reshaped the way that we
form our social circles, with conversations never actually dying as online banter
and behavioral data are constantly tracked and stored. Businesses are encouraged to
take advantage of the immense amount of data that we have chosen to share about
ourselves through our social media profiles to improve their marketing strategies,
with the most popular and robust platform currently being Facebook. The behavioral tracking and data collection efforts of Facebook cover behaviors both online and offline, connecting user profiles by using credit card information as well as email addresses and phone numbers (Markovikj, Gievska, Kosinski, & Stillwell, 2013). Advertisements can then be tailored and shown only to the intended online audience, saving the business both time and financial resources while maximizing its ROI.
Researchers and academics are also benefiting from stored information, as it provides a more concrete method of discovering communicative patterns exhibited across networks, including informal peer-to-peer interactions as well as business-to-business (B2B) or business-to-consumer (B2C) data. Twitter data has been especially useful for gaining a deeper understanding of users' thoughts and opinions through the inclusion of hashtags, which signal the topic of the conversation and connect users of the same verbiage to one another. For example, if I tweeted about the 2016 presidential election and used one of the presidential campaign hashtags, such as #MAGA (Make America Great Again), my Tweet would be added to the list of all other Tweets that used the same hashtag, essentially creating a conversation around this topic. It is also possible to download all of the Tweets using this hashtag and perform a sentiment analysis to determine the positive or negative feelings associated with the topic.
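As a minimal sketch of such a sentiment analysis, assuming the Tweets have already been downloaded (for example, through the Twitter API), NLTK's VADER analyzer is one common open-source choice; the sample Tweets below are hypothetical.

```python
# A minimal sentiment-analysis sketch for Tweets sharing a hashtag.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

tweets = [  # hypothetical Tweets collected under one hashtag
    "Great rally today! #MAGA",
    "This campaign is a disaster. #MAGA",
]

analyzer = SentimentIntensityAnalyzer()
for tweet in tweets:
    # The compound score ranges from -1 (most negative) to +1 (most positive).
    score = analyzer.polarity_scores(tweet)["compound"]
    print(f"{score:+.2f}  {tweet}")
```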
Twitter’s API allows for batch downloads of Tweets, currently up to 10,000 or
more Tweets at once, and includes the user’s location as well as the dissemination
and breadth of share across the Twitter network. Technologies are abundant in
fulfilling the need to visualize this type of data. Within the context of the afore-
mentioned example of downloading Twitter data, NodeXL could be an appropriate
tool for visualization as it provides a visual representation of the network breadth
and sharing of a Tweet or hashtag topic, and is user-friendly as it’s an Excel plugin,
found at http://www. NodeXL.com. This plugin also provides a list of the top 10
influencers within these conversations based on their centrality, including likes and
retweets, as well as the top URLs included in the Tweets, making it much more
evident as to who is leading and participating in certain conversations.
Many graduate programs acknowledge that their students will often be at least partially employed while completing their master's degrees and will need flexibility in the amount of coursework taken each semester. Many of the degrees also include
completing an internship or team-based project that is then presented to a panel
of executives, greatly emphasizing the importance of communication and team-
oriented skills within the workplace.
Positions that one can apply for upon earning a Master of Business or Science degree with an emphasis on data analytics are on the rise, as more and more companies are creating the roles of data scientist, manager, analyst, consultant, or software or project engineer in order to stay competitive (De Mauro et al., 2016, 2017; Gardiner et al., 2017). Upon reviewing job postings that included the keywords "data analytics," "data analyst," and "data scientist" on three popular recruitment sites, Glassdoor.com, LinkedIn.com, and Indeed.com, we identified a list of the common minimal qualifications desired by employers for their future hires:
Optional but preferred skills that help candidates stand out from the crowd:
Access to particular tools learned in academia differs greatly by employer, and it's important for students to realize that although they may be proficient in one software package, they will almost certainly need to learn new tools and new systems upon employment. For example, it's rare that a small employer in the beginning stages of handling large datasets would have the budget for a yearly Tableau or SAS license; such an employer would instead prefer free or budget-friendly resources. In this case, the open-source options discussed earlier would most likely be the more attractive choice; therefore, a data analyst must be agile and flexible in learning new tools that are similar to the ones taught within the classroom. As with any technology, the industry that makes these data-driven tools is advancing quickly and products are constantly being upgraded. This means that a data analyst must be open to learning and keeping up with new market trends (McAfee & Brynjolfsson, 2012).
Regarding the desired experience, students should utilize internships and become involved in research projects that use real datasets of interest to future employers (see, e.g., http://data.gov), to establish a portfolio of work experience. Certifications can also add credibility and enhance one's resume, especially for programs or software that have multiple levels of proficiency. For example, Lynda.com, a leading online learning platform, offers courses and tutorials for learning Python, while Tableau, holding a tighter rein true to its proprietary nature, is the only body that offers the two-level certification for its software. Google's Analytics and AdWords platforms, which measure one's proficiency in analyzing a website's online presence and health and in running an online advertisement campaign, have multiple proficiency exams, but to stay active and up-to-date, these must be retaken every 18–24 months. And while no one really enjoys being recertified on a continual basis, Google has incentivized users to stay current by eliminating the certification fee, making all of the exams free. Coursera.com offers short online classes that cover many of the concepts considered part of data analytics, with the lessons often offered free of charge and the option to pay if the user wants to take the exams and earn a certificate of completion.
Data-related internships, boot camps, hackathons, and incubators are other
options to gain relevant real-world experience outside of the classroom (Anslow,
Brosz, Maurer, & Boyes, 2016). For example, the Data Incubator program is
funded by Cornell’s data science training organization (http://thedataincubator.
com), and offers a free advanced 8-week fellowship for PhD students and grad-
uates who are about to start searching for industry positions. Companies that have partnered with the Data Incubator for their hiring needs include LinkedIn, Genentech, Capital One, and Pfizer, with alumni now employed by companies such as Verizon, Facebook, KPMG, Cloudera, and Uptake. The fellowship carries time requirements similar to those of a Monday-to-Friday, 8 a.m. to 5 p.m. job, and is held in person in the New York City, San Francisco, Seattle, Washington DC, and Boston metros. There is also an online option that typically requires participants to dedicate 2–3 months of time, but at a part-time commitment. The technical training curriculum in the fellowship includes software engineering and numerical computation, NLP, statistics, data visualization, database management, and parallelization. The soft skills gained from the training include ramping up one's communication skills, as academics and those in industry communicate in very different ways, along with face-to-face networking and practice interviews to help fellows be as prepared as possible when applying for jobs upon completion of the program.
Conclusion
This chapter has reviewed some of the pertinent areas of analytics (data science, data analytics, and Big Data) and related them to the wider concept of KM. The value and importance of these concepts is linked directly to the shift in the economy and the increased emphasis on knowledge and intellectual capital as key drivers of the knowledge-based economy. The currently available tools and applications that we've outlined will soon be considered ancient history, but the fact remains that the acquisition of knowledge and its systematic management will continue to challenge both scientists and practitioners. We would be remiss not to stress the increasing importance of research and academia and their role in creating more consistent theories, models, and frameworks to guide the development and implementation of data science and data analytics. It is also important to create curricula that prepare students who desire to be part of the ever-present data and digital revolution. The task of managing knowledge effectively is built on the collective wisdom and knowledge of each of the contributing people within the organization.
References
Alavi, M., & Leidner, D. E. (2001). Knowledge management and knowledge manage-
ment systems: Conceptual foundations and research issues. MIS Quarterly, 25,
107–136.
Al-Hawamdeh, S. (2002). Knowledge management: Re-thinking information manage-
ment and facing the challenge of managing tacit knowledge. Information Research,
8(1), 143.
Alyass, A., Turcotte, M., & Meyre, D. (2015). From big data analysis to personalized medi-
cine for all: Challenges and opportunities. BMC Medical Genomics, 8(1), 33.
Anslow, C., Brosz, J., Maurer, F., & Boyes, M. (2016, February). Datathons: An experience
report of data hackathons for data science education. In Proceedings of the 47th ACM
Technical Symposium on Computing Science Education (pp. 615–620). ACM.
Brunner, R. J., & Kim, E. J. (2016). Teaching data science. Procedia Computer Science, 80, 1947–1956.
Campbell, J. E., & Carlson, M. (2002). Panopticon.com: Online surveillance and the commodification of privacy. Journal of Broadcasting & Electronic Media, 46(4), 586–606.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From
big data to big impact. MIS Quarterly, 36(4), 1165–1188.
Chiregi, M., & Navimipour, N. J. (2017). A comprehensive study of the trust evaluation
mechanisms in the cloud computing. Journal of Service Science Research, 9(1), 1–30.
Clifton, B. (2012). Advanced web metrics with Google Analytics. Hoboken, NJ: John
Wiley & Sons.
Das, R., Turkoglu, I., & Sengur, A. (2009). Effective diagnosis of heart disease through
neural networks ensembles. Expert Systems with Applications, 36(4), 7675–7680.
Davenport, T., & Patil, D. (2012, October). Data scientist: The sexiest job of the 21st century. Harvard Business Review. Retrieved from hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.
De Mauro, A., Greco, M., Grimaldi, M., & Nobili, G. (2016). Beyond data scientists:
A review of big data skills and job families. Proceedings of IFKAD, 1844–1857.
De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2017). Human resources for big data
professions: A systematic classification of job roles and required skill sets. Information
Processing & Management, 54(3).
Deetjen, U., Meyer, E.T., & Schroeder, R. (2015). Big data for advancing dementia research:
An evaluation of data sharing practices in research on age-related neurodegenerative
diseases. OECD Digital Economy Papers, No. 246, Paris: OECD Publishing.
Dokoozlian, N. (2016, February). Big data and the productivity challenge for wine grapes.
In Agricultural Outlook Forum 2016 (No. 236854). United States Department of
Agriculture.
Dowell, F. E., Boratynski, T. N., Ykema, R. E., Dowdy, A. K., & Staten, R. T. (2002). Use
of optical sorting to detect wheat kernels infected with Tilletia indica. Plant Disease,
86(9), 1011–1013.
Duhigg, C. (2012). How companies learn your secrets. The New York Times, 16, 2012.
Edwards, S. H., Tilden, D. S., & Allevato, A. (2014, March). Pythy: Improving the intro-
ductory python programming experience. In Proceedings of the 45th ACM technical
symposium on Computer science education (pp. 641–646). New York, NY: ACM.
Fang, W. (2007). Using Google analytics for improving library website content and design:
A case study. Library Philosophy and Practice (e-journal), 121.
Fayyad, U. M., Wierse, A., & Grinstein, G. G. (Eds.). (2002). Information visualization in
data mining and knowledge discovery. San Francisco, CA: Morgan Kaufmann.
Fox, P., & Hendler, J. (2011). Changing the equation on scientific data visualization.
Science, 331(6018), 705–708.
Friendly, M. (2002). Visions and re-visions of Charles Joseph Minard. Journal of Educational
and Behavioral Statistics, 27(1), 31–51.
Gardiner, A., Aasheim, C., Rutner, P., & Williams, S. (2017). Skill requirements in big
data: A content analysis of job advertisements. Journal of Computer Information
Systems, 35, 1–11.
Grossmann, V., Steger, T. M., & Trimborn, T. (2013). The macroeconomics of
TANSTAAFL. Journal of Macroeconomics, 38, 76–85.
Guidi, G., Miniati, R., Mazzola, M., & Iadanza, E. (2016). Case study: IBM Watson ana-
lytics cloud platform as analytics-as-a-service system for heart failure early detection.
Future Internet, 8(3), 32.
Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., ... & Ward,
M. D. (2015). Data science in statistics curricula: Preparing students to “think with
data”. The American Statistician, 69(4), 343–353.
Healy, K., & Moody, J. (2014). Data visualization in sociology. Annual Review of Sociology,
40, 105–128.
Hedlund, G. (1994). A model of knowledge management and the N‐form corporation.
Strategic Management Journal, 15(S2), 73–90.
Hey, A. J., Tansley, S., & Tolle, K. M. (2009). The fourth paradigm: Data-intensive scientific
discovery (1st ed.). Redmond, WA: Microsoft Research.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., ... & Twigger, S.
(2008). Big data: The future of biocuration. Nature, 455(7209), 47–50.
Hoyt, R. E., Snider, D., Thompson, C., & Mantravadi, S. (2016). IBM Watson analytics:
Automating visualization, descriptive, and predictive statistics. JMIR Public Health
and Surveillance, 2(2), e157.
Jarvis, J. (2011). What would Google do?: Reverse-engineering the fastest growing company in
the history of the world. New York, NY: Harper Business.
Relating Big Data and Data Science to the Wider Concept of KM ◾ 165
Kambaskovic, D., & Wolfe, C. T. (2014). The senses in philosophy and science: From the
nobility of sight to the materialism of touch. A Cultural History of the Senses in the
Renaissance. Bloomsbury, London, 107–125.
Kipping, S., Stuckey, M. I., Hernandez, A., Nguyen, T., & Riahi, S. (2016). A web-based
patient portal for mental health care: Benefits evaluation. Journal of Medical Internet
Research, 18(11), e294.
Kohavi, R. (2001, August). Mining e-commerce data: The good, the bad, and the ugly.
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discov-
ery and data mining (pp. 8–13). San Francisco, CA: ACM.
Lamont, J. (2012). Big data has big implications for knowledge management. KM World,
21(4), 8–11.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu:
Traps in big data analysis. Science, 343(6176), 1203–1205.
Liebowitz, J. (Ed.). (1999). Knowledge management handbook. Boca Raton, FL: CRC Press.
Lin, X., Wang, P., & Wu, B. (2013, November). Log analysis in cloud computing environment with Hadoop and Spark. Broadband network & multimedia technology (IC-BNMT), 2013 5th IEEE international conference on (pp. 273–276). IEEE.
Lyon, L., & Mattern, E. (2017). Education for real-world data science roles (Part 2):
A translational approach to curriculum development. International Journal of Digital
Curation, 11(2), 13–26.
Markovikj, D., Gievska, S., Kosinski, M., & Stillwell, D. (2013, June). Mining
Facebook data for predictive personality modeling. Proceedings of the 7th interna-
tional AAAI conference on Weblogs and Social Media (ICWSM 2013) (pp. 23–26).
Boston, MA.
Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable realtime data
systems. Shelter Island, New York: Manning Publications.
Matignon, R. (2007). Data mining using SAS enterprise miner (Vol. 638). Hoboken, NJ:
John Wiley & Sons.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard
Business Review, 90(10), 60–68.
Murphy, S. A. (2015). How data visualization supports academic library assessment: Three
examples from the Ohio State University Libraries using Tableau. College & Research
Libraries News, 76(9), 482–486.
Nonaka, I., & Takeuchi, H. (1995). The Knowledge-creating company: How Japanese compa-
nies create the dynamics of innovation. Oxford, UK: Oxford University Press.
O’Leary, D. E. (2017). Emerging white-collar robotics: The case of Watson Analytics. IEEE Intelligent Systems, 32(2), 63–67.
Orlikowski, W. J., & Baroudi, J. J. (1991). Studying information technology in organiza-
tions: Research approaches and assumptions. Information Systems Research, 2(1), 1–28.
Padhy, R. P. (2013). Big data processing with Hadoop-MapReduce in cloud systems.
International Journal of Cloud Computing and Services Science, 2(1), 16.
Perkel, J. M. (2015). Pick up Python. Nature, 518(7537), 125.
Pollett, S., Boscardin, W. J., Azziz-Baumgartner, E., Tinoco, Y. O., Soto, G., Romero,
C., ... & Rutherford, G. W. (2016). Evaluating Google flu trends in Latin America:
Important lessons for the next phase of digital disease detection. Clinical Infectious
Diseases, 64(1), ciw657.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-
driven decision making. Big Data, 1(1), 51–59.
Raskin, J. (2000). The humane interface: New directions for designing interactive systems.
Boston, MA: Addison-Wesley.
Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of Experimental
Psychology: General 118(3), 219.
Rittinghouse, J. W., & Ransome, J. F. (2016). Cloud computing: Implementation, manage-
ment, and security. Boca Raton, FL: CRC Press.
Rossini, P. G., Hemsley, J., Tanupabrungsun, S., Zhang, F., Robinson, J., & Stromer-Galley, J. (2017, July). Social media, US presidential campaigns, and public opinion polls: Disentangling effects. Proceedings of the 8th International Conference on Social Media & Society (p. 56). New York, NY: ACM.
Russom, P. (2011). Big data analytics. TDWI best practices report, fourth quarter, 19, 40.
Sonnenberg, L., Gelsomin, E., Levy, D. E., Riis, J., Barraclough, S., & Thorndike, A. N. (2013).
A traffic light food labeling intervention increases consumer awareness of health and
healthy choices at the point-of-purchase. Preventive Medicine, 57(4), 253–257.
Stark, H., Habib, A., & al Smadi, D. (2016). Network engagement behaviors of three
online diet and exercise programs. Proceedings from the Document Academy, 3(2), 17.
Tableau.com. (2017). www.tableau.com, accessed 2017.
Tal, A., & Wansink, B. (2016). Blinded with science: Trivial graphs and formulas increase
ad persuasiveness and belief in product efficacy. Public Understanding of Science,
25(1), 117–125.
Tang, R., & Sae-Lim, W. (2016). Data science programs in US higher education: An
exploratory content analysis of program description, curriculum structure, and
course focus. Education for Information, 32(3), 269–290.
Telea, A. C. (2014). Data visualization: principles and practice. Boca Raton, FL: CRC Press.
Tenopir, C., Allard, S., Sinha, P., Pollock, D., Newman, J., Dalton, E., ... & Baird, L.
(2016). Data management education from the perspective of science educators.
International Journal of Digital Curation, 11(1), 232–251.
Thompson, G. (2017). Coding comes of age: Coding is gradually making its way from club to curriculum, thanks largely to the nationwide science, technology, engineering and mathematics phenomenon embraced by so many American schools. The Journal (Technological Horizons in Education), 44(1), 28.
Tufte, E. (2001) The visual display of quantitative information (2nd ed.). Cheshire, CT:
Graphics Press.
Tuomi, I. (1999). Corporate knowledge: Theory and practice of intelligent organizations
(pp. 323–326). Helsinki, Finland: Metaxis.
Velardo, S. (2015). The nuances of health literacy, nutrition literacy, and food literacy.
Journal of Nutrition Education and Behavior, 47(4), 385–389.e1.
Vilajosana, I., Llosa, J., Martinez, B., Domingo-Prieto, M., Angles, A., & Vilajosana, X.
(2013). Bootstrapping smart cities through a self-sustainable model based on big data
flows. IEEE Communications Magazine, 51(6), 128–134.
Walker, S. (2014). Big data: A revolution that will transform how we live, work, and think.
International Journal of Advertising, 33(1), 181–183.
Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data:
A revolution that will transform supply chain design and management. Journal of
Business Logistics, 34(2), 77–84.
Zarcadoolas, C., Vaughon, W. L., Czaja, S. J., Levy, J., & Rockoff, M. L. (2013).
Consumers’ perceptions of patient-accessible electronic medical records. Journal of
Medical Internet Research, 15(8), 284–300.
Chapter 6
Fundamentals of Data
Science for Future
Data Scientists
Jiangping Chen, Brenda Reyes Ayala,
Duha Alsmadi, and Guonan Wang
Contents
Data, Data Types, and Big Data ........................................................................168
Data Science and Data Scientists .......................................................................170
Defining Data Science: Different Perspectives ...............................................170
Most Related Disciplines and Fields for Data Science ...................................172
Data Scientists: The Professions of Doing Data Science ................................173
Data Science and Data Analytics Jobs: An Analysis ........................................... 174
Purposes of Analysis and Research Questions ................................................175
Data Collection ............................................................................................175
Data Cleanup and Integration ...................................................................... 176
Tools for Data Analysis ................................................................................. 176
Results and Discussion..................................................................................177
Characteristics of the Employers...............................................................177
Job Titles ..................................................................................................178
Word Cloud and Clusters on Qualifications and Responsibilities .............178
Summary of the Job Posting Analysis ............................................................182
Data can also be classified along different facets depending on one's perspective. For example, scientific data can be numeric, textual, or graphical. Or, we can categorize data as structured, semistructured, or unstructured based on whether they have been organized.
Data type is an important concept in this regard. It has long been a basic concept in computer science: a programmer knows that one needs to determine the data type of a particular variable before applying appropriate operations to it. Computer languages can usually handle different types of data. For example, Python, one of the most popular programming languages for data analysis, contains simple data types, including numerical types, strings, and bytes, and collection data types such as tuples, lists, sets, and dictionaries. Data might need to be converted to a different type to be processed properly.
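As a concrete illustration, the following minimal Python sketch shows the simple and collection types named above, along with a type conversion; the values are hypothetical.

```python
# Simple types, collection types, and a type conversion in Python.
age_text = "42"          # string (often how raw data arrives)
age = int(age_text)      # converted to a numerical type before arithmetic
raw = b"\x00\x01"        # bytes

point = (3.0, 4.0)                    # tuple: immutable sequence
scores = [88, 92, 79]                 # list: mutable sequence
tags = {"python", "data"}             # set: unique, unordered items
record = {"name": "Ada", "age": age}  # dictionary: key-value pairs

print(age + 1, sum(scores) / len(scores), record["name"])
```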
Most data we talk about in this chapter are digital data in electronic form. Data are generated every day and at every moment. One characteristic of digital data is that they can be copied easily and precisely, or transferred from one medium to another with great speed and accuracy. However, that is not always true, especially when we start to deal with Big Data. Big Data refer to digital data with three Vs: (1) volume, (2) variety, and (3) velocity (Laney, 2011). Volume refers to the rapid rate at which data grow, variety refers to the many types in which data exist, and velocity refers to the speed of data delivery and processing. Dealing with Big Data requires theories, methods, technologies, and tools, and this need has led to the emergence of data science, a new discipline focused on data processing, which will be discussed in the remaining sections.
Data are an important resource for organizations and individuals who use them
to make decisions. When Big Data becomes publicly available, no one can ignore
the huge potential impact of Big Data on business, education, and personal lives.
Therefore, data science as a new discipline has drawn great attention from govern-
ments, industry, and educational institutions.
◾ The center of data science is data. In particular, Big Data has become the subject of investigation.
◾ The purpose of data science is to obtain information or knowledge from data. The information will help to make better decisions, and the knowledge may help an organization, a state, a country, or the whole of humanity to better understand the development or change of nature or society.
◾ Data science is a multidisciplinary field that applies theories and technologies from a number of disciplines and areas, such as mathematics, statistics, computer science, information systems, and information science. It is expected that data science will in turn bring change to these disciplines.
[Figure 6.1 is a layered diagram of the knowledge process: Step 2 (data collection, acquisition, and content management), Step 3 (data/information processing, analysis, and presentation), Step 4 (information management and knowledge sharing), and Step 5 (use of information and knowledge), culminating in value; information content management, information science, data science, and knowledge management are shown spanning successive, overlapping layers.]
Figure 6.1 Data science in the context of knowledge process. (Adapted from
Hawamdeh, S., Knowledge Management: Cultivating Knowledge Professionals,
Chandos Publishing, Oxford, UK, 2003.)
other disciplines for the purpose of educating knowledge professionals (p. 168).
He considered knowledge management a multidisciplinary subject. Knowledge
management professionals need to acquire information technology and information science skills, which serve as a foundation for higher-level knowledge work such as knowledge management and sharing. We expand on his framework by including an additional step, shown in Figure 6.1, called data processing, analysis, and presentation, between the original step 2 (information acquisition and content management) and step 3 (information and knowledge sharing). In this view, information science, data science, and knowledge management are closely related and overlapping disciplines or areas. Typically, data science centers on activities in step 3, but it should also include activities and tasks in steps 2, 4, and 5.
This section will present the purposes of the analysis, the research questions, the
process of data collection and analysis, and the results.
Data Collection
We recruited the students of a graduate-level information science class at the University of North Texas to collect the data. INFO 5717 is a course in which students learn to build web-based database systems. The students were assigned to five teams of three members each. Each team was asked to collect, process, and upload at least 60 job postings pertaining to data analysts, information management specialists, data scientists, and business intelligence analysts to a MySQL database.
Prior to data collection, we provided a list of metadata elements to guide the
process. Students were required to collect information for at least the following
elements for each job posting:
Students were instructed to collect the above data on the Internet using methods
they considered appropriate. The data were collected from job aggregator websites that host job postings. We also wrote a simple program using the R language to extract the most
frequent keywords and produced a visual representation for the results called a word
cloud (Heimerl, Lohmann, Lange, & Ertl, 2014). Manual data transformation and adjustment were performed so that these tools could be used to answer the research questions proposed earlier in this chapter.
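As an illustrative sketch of the same idea (the analysis above used R), the snippet below builds a word cloud in Python with the third-party wordcloud package, assuming the job-posting text has been concatenated into a single, hypothetical string.

```python
# A minimal word-cloud sketch in Python; the postings text is a
# hypothetical stand-in for the concatenated job descriptions.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

postings_text = "data analyst SQL Python statistics machine learning data"

cloud = WordCloud(width=800, height=400,
                  background_color="white").generate(postings_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```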
[Figure: Number of job postings by state. Counts range from 11 down to 1 across states including Arizona, California, Colorado, Florida, Georgia, Illinois, Kansas, Michigan, Minnesota, Missouri, New Jersey, New York, North Carolina, Tennessee, Texas, Virginia, and Washington, plus postings listed as "Anywhere, United States."]
Both graduates and undergraduates were possible candidates for data science jobs. The
preferred levels of experience ranged from two to more than 10 years. These results
portrayed an image of the job market for data science.
Job Titles
A frequency analysis of the 298 job postings showed that the top job titles used
by the employers were data scientist, data analyst, business intelligence analyst
or intelligence analyst, information management specialist or analyst, and data
engineer. For data scientists and data analysts, the job title could specify different
levels, such as lead or principal, senior, junior, or intern. Some job titles, such as
data analytics scientist and data science quantitative analyst, mixed the terms data
analyst, data science, and data scientist. Other titles included Big Data architect, data
integrity specialist, data informatics scientist, marketing analyst, MIS specialist,
project delivery specialist, and consultant on data analytics (Table 6.1).
[Table 6.1 remnant: Other, 30 (10%); Total, 298.]
We also conducted cluster analysis on these two fields. The purpose of the cluster
analysis was to identify concepts that are implied in the dataset. The process starts
by exploring the textual data content, using the results to group the data into mean-
ingful clusters, and reporting the essential concepts found in these clusters (SAS
Institute, 2014). Using the SAS Enterprise Miner (SAS EM) cluster analysis function, which is based on
the mutual information weighting method, we obtained five clusters as presented
in Table 6.2. According to the terms in each cluster, we could name cluster 1 as
project management; cluster 2 as machine learning and algorithmic skills; cluster 3
as statistical models and business analytics; cluster 4 as database management and
systems support; and cluster 5 as communication skills. Among them, cluster 2 is
the largest cluster with a relative frequency of 47% among all the terms in these
two fields.
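The clustering itself was run in SAS EM; as a rough, tool-agnostic illustration of the idea, the following R sketch clusters postings with k-means over TF-IDF weights. This is an analogous but simpler weighting than the mutual information method above, and postings is a hypothetical corpus vector.

    # Rough R analogue of the SAS EM text clustering: k-means over TF-IDF
    # weights (a simpler scheme than SAS EM's mutual information weighting).
    # postings is a hypothetical character vector of job-posting text.
    library(tm)

    corpus <- VCorpus(VectorSource(postings))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             stopwords = TRUE,
                                             weighting = weightTfIdf))

    set.seed(42)                         # reproducible cluster assignment
    km <- kmeans(as.matrix(dtm), centers = 5)

    # Inspect the top-weighted terms per cluster centroid so the clusters
    # can be named manually, as was done for Table 6.2.
    for (k in 1:5) {
      centroid <- km$centers[k, ]
      print(names(sort(centroid, decreasing = TRUE))[1:10])
    }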
The text filter node functionality in SAS enabled us to invoke a filter viewer to
explore interactive relationships among terms (SAS Institute, 2014). To preview the
concepts that are highly correlated with the term data, a concepts link map was
generated, as depicted in Figure 6.4. The five clusters in Table 6.2 become the six
nodes (two nodes for analysis/analyze) with links to other outer nodes.
The word cloud and cluster analysis showed a broader view of the terms from the
job market perspective. Also, further exploration of the concepts link map revealed
that data scientists need to have substantial skills in Hadoop, Hive, Pig, Python,
machine learning, C++ and Java programming, R, and SQL, in addition to being
proficient with statistical and modeling packages such as SPSS and SAS EM.
[Figure 6.4 residue: a concept link map centered on the node data, linked to terms such as analysis, analyze, model, data model, predictive model, algorithm, statistical, insight, actionable insight, business insight, data scientist, team, project, communicate, present, and tool.]
Table 6.3 (Continued) PhD Programs in Data Science
University | School or Department | Degree, Online or Campus, and URL
Colorado Technical University | Computer Science Department | Doctor of Computer Science with a concentration in Big Data Analytics, online, http://www.coloradotech.edu/degrees/doctorates/computer-science/big-data-analytics
Indiana University Bloomington | School of Informatics and Computing | PhD minor in Data Science, campus or online, http://www.soic.indiana.edu/graduate/degrees/data-science-minor.html
Chapman University | Schmid College of Science and Technology | PhD in Computational and Data Sciences, campus, http://www.chapman.edu/scst/graduate/phd-computational-science.aspx
Georgia State University | Department of Computer Science | PhD (Bioinformatics Concentration), campus, http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/
Indiana University-Purdue University-Indianapolis | School of Informatics and Computing | PhD in Data Science, campus, https://soic.iupui.edu/hcc/graduate/data-science-phd/
Kennesaw State University | Department of Statistics and Analytical Sciences | PhD in Analytics and Data Science, campus, http://csm.kennesaw.edu/datascience/
New York University | Tandon School of Engineering | PhD in Computer Science with Specialization in Visualization, Databases, and Big Data, campus, http://engineering.nyu.edu/academics/programs/computer-science-phd
Newcastle University | School of Computer Science | EPSRC CDT in Cloud Computing for Big Data, campus, http://www.bigdata-cdt.ac.uk/
Oregon Health and Science University | Department of Medical Informatics and Clinical Epidemiology | PhD in Biomedical Informatics (Bioinformatics and Computational Biology Track; Clinical Informatics Track), campus, http://www.ohsu.edu/xd/education/schools/school-of-medicine/departments/clinical-departments/dmice/educational-programs/clinical-informatics.cfm
University of Southern California | Marshall School of Business | PhD in Data Sciences and Operations, campus, https://www.marshall.usc.edu/index.php/departments/data-sciences-and-operations
University of Washington-Seattle | eScience Institute | PhD in Big Data and Data Science, campus, http://escience.washington.edu/education/phd/igert-data-science-phd-program/
Worcester Polytechnic Institute | College of Arts and Sciences | PhD in Data Sciences, campus, https://www.wpi.edu/academics/study/data-science-phd
Table 6.4 (Continued) Master Programs in Data Science
University | School/College | Degree, Online Option, URL
Southern Methodist University | Dedman College of Humanities and Sciences, Lyle School of Engineering, and Meadows School of the Arts | Master of Science in Data Science, online option, https://requestinfo.datascience.smu.edu/index10.html
University of California, Berkeley | School of Information | Master of Information and Data Science, online option, https://requestinfo.datascience.berkeley.edu/index3.html
Arizona State University | W.P. Carey School of Business | Master of Science in Business Analytics, online option, https://programs.wpcarey.asu.edu/masters-programs/business-analytics
Carnegie Mellon University | School of Computer Science | Master of Computational Data Science, no online option, https://mcds.cs.cmu.edu/
[Row truncated in the source; only the URL fragment cornell.edu/academics/mps survives.]
Illinois Institute of Technology | College of Science | Master of Data Science, online option, http://science.iit.edu/programs/graduate/master-data-science
Indiana University, Bloomington | School of Informatics and Computing | Master of Science in Data Science, online option, http://www.soic.indiana.edu/graduate/degrees/data-science/index.html
New York University | Center for Data Science | Master of Science in Data Science, no online option, http://cds.nyu.edu/academics/ms-in-data-science/
North Carolina State University | Institute for Advanced Analytics | Master of Science in Analytics, no online option, http://analytics.ncsu.edu/
Northwestern University | McCormick School of Engineering and Applied Science, and School of Continuing Studies | Master of Science in Analytics, and Master of Science in Predictive Analytics, http://www.mccormick.northwestern.edu/analytics/
Rutgers, The State University of New Jersey | Computer Science Department | Master of Science in Data Sciences, no online option, https://msds-cs.rutgers.edu/msds/aboutpage
University of California, San Diego | Departments of Computer Science and Engineering | Master of Advanced Study in Data Science and Engineering, no online option, http://jacobsschool.ucsd.edu/mas/dse/
University of Minnesota-Twin Cities | College of Science and Engineering, College of Liberal Arts, and School of [truncated in source] | Master of Science in Data Science, no online option, https://datascience.umn. [URL truncated in source]
University of San Francisco | College of Arts and Sciences | Master of Science in Analytics, no online option, https://www.usfca.edu/arts-sciences/graduate-programs/analytics
◾ Data collection: Extracting data from the web, from APIs, from databases,
and from other sources.
◾ Data cleaning and managing: Manipulating and organizing the data collected
to make them useful for data analysis tasks.
◾ Exploratory data analysis: Exploring data to understand the data’s underlying
structure and summarizing the important characteristics of a dataset.
◾ Data visualization: Using plotting systems to construct data graphics, analyz-
ing data in graphical format, and reporting data analysis results.
◾ Machine learning: Building prediction functions and models by using super-
vised learning, unsupervised learning, and reinforcement learning.
Bootcamps
Bootcamp programs are non-traditional educational paths. Compared with traditional
degrees, these programs are intense and offer faster routes to the workplace.
They are another education option for those considering a career as a data scientist.
DataScience.Community (n.d.) maintains a list of bootcamps available for
data science and data engineering.
In summary, data science programs at the undergraduate, master's, and doctoral
levels have been developed in the United States. These programs provide a
wide range of choices to students who want to obtain knowledge and skills in data
science. Still, more data science programs are being developed. It may be time
to explore the characteristics of a competitive data science program.
◾ Fundamental concepts, disciplines, and the profession: The program should offer
one or two courses to introduce the student to basic concepts, disciplines,
and the profession of data science. These courses set up a solid foundation for
students to learn more advanced concepts and applications. They may teach
not only concepts and characteristics of data, information, knowledge, and
the data lifecycle, but also related mathematical concepts, functions, models,
and theorems. Students should develop an affection for data and be willing
to do data science.
◾ Statistical analysis and research methodology: The program should offer multi-
ple courses to teach data collection and data analysis skills, which are usually
taught in master or doctoral level research methodology classes.
Each student can choose different courses or focus on different areas to improve
their knowledge and skills based on their backgrounds and personal interests. For
example, a student with a business background may want to take mainly courses
on data management and programming languages, while an information science
student should study business logic in addition to statistical analysis and data visu-
alization through coursework or the practicum.
A course may also teach a student knowledge and skills in different areas. For
example, a database course may teach students data collection, cleaning, analysis,
and report writing through its term project. The instructor can also design projects
that carry research value or reflect real industry needs.
In general, a flexible curriculum and close connections with for-profit and
nonprofit organizations provide students with the flexibility and opportunities to
become successful data workers.
References
Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3–9.
Agarwal, R., & Dhar, V. (2014). Editorial—Big data, data science, and analytics: The oppor-
tunity and challenge for IS research. Information Systems Research, 25(3), 443–448.
Banafa, A. (2014). What is data science? Available at: http://works.bepress.com/
ahmed-banafa/15/.
Boston University Libraries. (n.d.). Data life cycle. Available at: https://www.bu.edu/
datamanagement/background/data-life-cycle/.
Buckland, M. K. (1991). Information as thing. Journal of the American Society for Information
Science, 42(5), 351–360.
Burtch Works Executive Recruiting. (2017). Data science salary study. Available at http://
www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/.
Cao, L. (2017). Data science: Challenges and directions. Communications of the ACM, 60(8), 59–68.
Carter, D., & Sholler, D. (2016). Data science on the ground, hype, criticism, and every-
day work. Journal of the Association for Information Science & Technology, 67(10),
2309–2319.
Chang, C. (2012). Data life cycle. Available at: https://blogs.princeton.edu/
onpopdata/2012/03/12/data-life-cycle/
Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of
the field of statistics. International Statistical Review, 69(1), 21–26.
Coronel, C., Morris, S., and Rob, P. (2012). Database systems: Design, implementation, and
management, 10th ed. Boston, MA: Course Technology, Cengage Learning.
DataScience.Community. (n.d.). Data science bootcamps. Available at: http://datascience.
community/bootcamps.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century.
Harvard Business Review, 90(5), 70–77.
Davenport, T., & Prusak, L. (1998). Working knowledge: How organizations manage what
they know. Cambridge, MA: Harvard University Press.
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
Doherty, R. (2010). Getting social with recruitment. Strategic HR Review, 9(6), 11–15.
Donoho, D. (2015). 50 years of data science. Unpublished manuscript. Available at:
https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf.
Hawamdeh, S. (2003). Knowledge management: Cultivating knowledge professionals. Oxford,
UK: Chandos Publishing.
Hayashi, C. (1998). What is data science? Fundamental concepts and a heuristic example. In
C. Hayashi, K. Yajima, H. H. Bock, N. Ohsumi, Y. Tanaka, & Y. Baba (Eds.), Data
science, classification, and related methods (Studies in Classification, Data Analysis,
and Knowledge Organization). Tokyo: Springer.
Heimerl, F., Lohmann, S., Lange, S., & Ertl, T. (2014). Word cloud explorer: Text analytics
based on word clouds. Proceedings of the Annual Hawaii International Conference on
System Sciences (pp. 1833–1842). doi:10.1109/HICSS.2014.231.
Kandel, S., Paepcke, A., Hellerstein, J. M., & Heer, J. (2012). Enterprise data analysis and
visualization: An interview study. IEEE Transactions on Visualization and Computer
Graphics, 18(12), 2917–26. doi:10.1109/TVCG.2012.219.
Laney, D. (2011). 3D data management: Controlling data volume, velocity, and variety.
META Group: Application Delivery Strategies, Available at: https://blogs.gartner.com/
doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-
Velocity-and-Variety.pdf.
Leopold, G. (2017). Demand, salaries grow for data scientists. Datanami, January 24, 2017.
Available at: https://www.datanami.com/2017/01/24/demand-salaries-grow-data-
scientists/.
Losee, R. M. (1997). A discipline independent definition of information. Journal of the
American Society for Information Science, 48(3), 254–269.
Loukides, M. (2011). What is Data Science? Sebastopol, CA: O’Reilly Media.
Madden, A. D. (2000). A definition of information. Aslib Proceedings, 52(9), 343–349.
McKinsey & Company. (2011). Big data: The next frontier for innovation, competition, and
productivity. McKinsey Global Institute, (June), 156. doi:10.1080/01443610903114527.
Michigan Institute for Data Science (MIDAS). (2017). About MIDAS. Available at: http://
midas.umich.edu/about/.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-
driven decision making. Big Data 1(1), BD51–BD59.
Saracevic, T. (1999). Information science. Journal of the American Society for Information
Science, 50(12), 1051–1063.
SAS Institute. (2014). Getting Started with SAS® Text Miner 13.2. Cary, NC: SAS.
Shum, S. B., Hall, W., Keynes, M., Baker, R. S. J., Behrens, J. T., Hawksey, M., &
Jeffery, N. (2013). Educational data scientists: A scarce breed. Available at:
http://simon.buckinghamshum.net/wp-content/uploads/2013/03/LAK13Panel-Educ-Data-Scientists.pdf.
Spirion. (n.d.). Data lifecycle management. Available at: https://www.spirion.com/
data-lifecycle-management/.
Stodder, D. (2015). Chasing the data science unicorn. Available at: https://tdwi.org/
articles/2015/01/06/chasing-the-data-science-unicorn.aspx.
Chapter 7
Social Media Analytics
Contents
Introduction ......................................................................................................196
Historical Perspective of Social Networks and Social Media...............................199
Evolution of Analytics .......................................................................................201
Social Media Analytics.......................................................................................203
Defining Social Media Analytics ...................................................................203
Processes of Social Media Analytics .............................................................. 204
Social Media Analytics Techniques ....................................................................207
Identifying Data Sources...............................................................................207
Data Acquisition .......................................................................................... 208
Data Analysis Techniques............................................................................. 208
Sentiment Analysis .................................................................................. 208
Topic Modeling........................................................................................209
Visual Analytics ........................................................................................210
Stream Processing ..................................................................................... 211
Social Media Analytics Tools ............................................................................. 211
Scientific Programming Tools ....................................................................... 211
Network Visualization Tools .........................................................................212
Business Applications....................................................................................212
Social Media Monitoring Tools .....................................................................213
Text Analysis Tools........................................................................................213
Data Visualization Tools ...............................................................................213
Social Media Management Tools................................................................... 214
Representative Fields of Social Media Analytics ................................................. 214
Conclusions ...................................................................................................... 215
References .........................................................................................................216
Introduction
Social media analytics (SMA) has become one of the core areas within the exten-
sive field of analytics (Kurniawati, Shanks, & Bekmamedova, 2013). Generally
speaking, SMA applies appropriate analytic techniques and tools to analyze user-
generated content (UGC) for a particular purpose. The content of social media
comes in different forms of repositories (Sinha, Subramanian, Bhattacharya, &
Chaudhary, 2012). For instance, the repositories could be blogs (Tumblr), microb-
logs (Twitter), wikis (Wikipedia), social networking sites (Facebook and LinkedIn),
review sites (Yelp), and multimedia sharing sites (YouTube) (Holsapple, Hsiao, &
Pakath, 2014).
SMA has involved many techniques and tools to mine a variety of UGC,
including sentiment analysis, topic modeling, data mining, trend analysis, and
social network analysis. A primary reason for growing interest in SMA is the real
time accessibility regarding size and diffusion speed of the UGC on social media
(Holsapple et al., 2014). For example, a YouTube video showing a Coke bottle
exploding because of Mentos became the most popular video on YouTube and
subsequently in news shows in 2009 (Kaplan & Haenlein, 2010). Despite the bur-
geoning attention to SMA, social media analysts need to deal with arduous tasks
in the given context to accomplish a specific analytical objective because the UGC
from social media is generally improvised, freeform, and a mixture of relevant and
irrelevant resources.
During the 1990s, the Internet and the World Wide Web (WWW) were
adopted to facilitate social communication. The development and quick diffusion
of Web 2.0 technologies marked a revolutionary leap in the social dimension of
Internet applications. Social media users can take advantage of user-centered
platforms built around UGC, while a varied set of possibilities exists to connect these
cyberspaces into online social networks. Social media platforms are adopted for
diverse purposes, such as social interaction, marketing, digital education,
disaster management, and civic movements, and by diverse user groups, including businesses,
governments, nongovernment organizations, politicians, and celebrities. For example,
Psy's music video Gangnam Style, released on July 15, 2012, became the most
watched video in YouTube history with more than 2.8 billion views,
which also tremendously contributed to making the singer a world star.
Facebook, which started its service in 2004, had reached more than 2 billion
active users worldwide as of June 2017. Launched in 2006, Twitter had
319 million active users as of April 2017, generating 500 million Tweets every day.
Though Twitter has a smaller number of active accounts than YouTube, WhatsApp,
or Facebook Messenger, Facebook and Twitter have been the most visible social
media platforms initiating new functional services by incorporating Web 2.0 com-
ponents into other web-based applications (Obar & Wildman, 2015). When com-
pared, the 2015 and 2017 lists of popular social media platforms are overall
similar, apart from the leap of WhatsApp and Facebook Messenger. Figure 7.1 presents
[Figure 7.1 residue: a bar chart of percentage user growth for social media services including Tumblr, Instagram, WeChat, Sina Weibo, WhatsApp, Snapchat, Baidu Tieba, QQ, Qzone, Skype, Twitter, and Facebook.]
Figure 7.1 The percentage growth between 2015 and 2017 among the social
media services with the most active users.
the significant growth of users of Tumblr, Instagram, WeChat, Sina Weibo, and
WhatsApp between 2015 and 2017.
An official definition of social media requires distinguishing the two interre-
lated concepts, Web 2.0 and UGC (Kaplan & Haenlein, 2010). The term Web 2.0
was first adopted in 2004 to describe the new web environment where software
developers and end users began to mash up the content and applications coopera-
tively and innovatively. Though Web 2.0 does not denote any particular technical
update of the WWW, its functionality relies on a few essential technologies
such as Adobe Flash, Really Simple Syndication (RSS), and Asynchronous
JavaScript and XML (AJAX) (Kaplan & Haenlein, 2010).
While considering Web 2.0 as the platform of the social media development and
technological foundation, UGC refers to the collective outcomes of user interac-
tions with social media. Extensively employed in 2005, UGC represents the diverse
formats of content in social media (Kaplan & Haenlein, 2010). The Organization
for Economic Cooperation and Development specified three requirements of UGC:
public accessibility of the content, creativity, and amateurism (Peña-López, 2007).
Drawing upon these descriptions of Web 2.0 and UGC, Kaplan and Haenlein
(2010) define social media as a “group of Internet-based applications that build on
the ideological and technical foundations of the Web 2.0, and that allow the cre-
ation and exchange of User Generated Content” (p. 61). Extending this perspective,
Batrinca and Treleaven (2015) delineate social media as “web-based and mobile-
based Internet applications that allow the creation, access and exchange of UGC
that is ubiquitously accessible” (p. 89).
To characterize social media, Kaplan and Haenlein (2010) categorize social
media platforms by two dimensions: social presence and media richness, and
self-presentation and self-disclosure. Table 7.1 presents the six different types of social
media. Regarding social presence and media richness, collaborative projects,
such as Wikipedia, and blogs score lowest because these are
usually text-based services limited to relatively simple exchanges of shared content.
The middle level presents content communities, such as YouTube, and social
networking sites, such as Facebook and Instagram, which allow sharing images
and multimedia content as well as text-based communication. The highest level
includes virtual game and social worlds, such as World of Warcraft, Second Life,
and virtual reality role-playing games, which pursue any possible aspects of direct
interactions in a cyber world (Kaplan & Haenlein, 2010). As for self-presentation
and self-disclosure, blogs ranked higher than collaborative projects because they are
more focused on particular content areas. Likewise, social networking sites present
more self-disclosure than content communities. Lastly, virtual social worlds demand
more self-disclosure than virtual game worlds, where disclosure is strictly
bounded by game rules (Kaplan & Haenlein, 2010).
Obar and Wildman (2015) focus on the importance of user profile elements in
defining social media. Despite the substantial differences in the options of identi-
fying users and information requested depending on the social media platforms,
they usually require creating a user name along with contact details and a profile
picture. This requirement enables social media users to make connections and share
their content with confidence, because without verified user information,
discovering and linking to other users could be a challenge. Obar and Wildman
(2015) describe the functions of social media platforms as connecting the online
social networks. Along this line, boyd and Ellison (2010) considered user profiles as
the backbone of social media platforms.
A majority of Americans now state that they increasingly obtain news via
social media (Kohut et al., 2008). According to a recent study, 60% of all Americans
use social media to get news (Gottfried & Shearer, 2016), whereas approximately
80% of online Americans are currently Facebook users and among these users, 66%
obtain news on the site (Greenwood, Perrin, & Duggan, 2016). On Twitter, around
60% of the users receive news on the site, which results in a bigger percentage with
a smaller user foundation (16% of American adults). In addition, research studies
have continuously discovered that the more people read news media, the more likely
they are civically and politically involved across various measures (Wihbey, 2015).
During the 1980s, the Whole Earth Lectronic Link (WELL), General Electric
Network for Information Exchange (GEnie), Listserv, and IRC were introduced.
The WELL started in 1985 as a conversation-based community through BBS and
is one of the longest-standing online communities. GEnie was a text-based (ASCII)
online service that competed with CompuServe. Launched in 1986, Listserv was the
first electronic mailing-list application: software installed on the server automated
e-mail distribution, allowing a single e-mail to reach the group
of people registered on the mailing list. IRC (Internet relay chat) was created for
group conversations as a type of real time chat, online text messaging, concurrent
conferencing, and data transferring between two people (Ritholz, 2010).
During the 1990s, many social networking sites, such as SixDegrees.com,
MoveOn.org, BlackPlanet.com, and AsianAve.com, were created. These web-
sites were Internet niche social networking sites with which users can interact.
Additionally, blogging sites were created, such as Blogger and Epinions where
users can find and write review comments about commodities (Edosomwan et al.,
2011). In this period, ThirdVoice, a free plug-in user-commenting service on web
pages, and Napster, a peer-to-peer music file sharing software application, were
generated. Opponents of ThirdVoice criticized that the comments were frequently
insulting. Through Napster, users could exchange music files by deviating from
the legal distribution methods; this eventually determined the end user to be a
violator of copyright laws (Ritholz, 2010).
In 2000, many social networking sites sprang up, including LunarStorm, Cyworld,
and Wikipedia. LunarStorm (www.lunarstorm.se), a Swedish commercial virtual site
and Europe’s first Internet community, was specially designed for teenagers. In 2001,
Fotolog, Skyblog, and Friendster started their services. In 2003, MySpace, LinkedIn,
tribe.net, and Last.fm launched, and in 2004, Facebook (then limited to Harvard
students), Mixi, and Dogster emerged.
In 2005, Yahoo! 360, BlackPlanet, and YouTube evolved (Junco, Heibergert, &
Loken, 2011). Though BlackPlanet was created in the 1990s, it became popular and
evolved in 2005. Twitter launched its service in 2006, and the number of users rapidly
increased because of its microblogging style and celebrity adoption (Jasra, 2010).
2010 was Facebook’s year. Facebook took over Google’s position as in the big-
gest website regarding market share in July 2010, and Time magazine recognized
Facebook CEO Mark Zuckerberg as Man of the Year (Cohen, 2010). In the Harvard
Business Review, Armano (2014) describes six social media trends for 2010. He
points out that there will be more scaled social initiatives beyond one-off market-
ing or customer relations efforts. For instance, Best Buy’s employees were directed
to participate in customer support on Twitter through a company-built system that
monitors their participation. With approximately two-thirds of organizations pro-
hibiting access to social network applications from corporate-owned devices, while
simultaneously the sales of smartphones have skyrocketed, employees are likely to
feed their social media cravings on their mobile devices (Armano, 2014). Sharing
content through online social networks rather than by email became mainstream
Social Media Analytics ◾ 201
for people. For example, the New York Times created a content sharing application
that enables them to quickly distribute an article across social media networks such
as Facebook and Twitter (Armano, 2014).
Communication methods, such as face-to-face meetings, phone calls, and
email, have restrictions. For example, people can easily forget and lose the content
of a conversation because of memory loss or missing notes. Social media aids inter-
action and communication among individuals and helps them to reach a larger
audience. Adopting social media has effectively expanded the channels of communica-
tion (Edosomwan et al., 2011) because it has become much easier for people to send
timely messages via a Tweet or an instant message, saving the time and energy
of interacting in person. In addition, engaging
social media reinforces brand experience, which helps to establish a brand reputa-
tion. Customer brand awareness begins with the employees’ experiences about the
company (Carraher, Parnell, & Spillan, 2009).
Evolution of Analytics
The area of analytics started in the mid-1950s along with new methods and tools
that created and captured a significant amount of information and distinguished
patterns in it faster than unaided human intelligence ever could. The concepts and
techniques of business analytics have continuously changed based on analytical
circumstances. The era of Analytics 1.0 can be described as the era of “business
intelligence (BI).” In this period, an analytic technique made progress on providing
an impartial comprehension of major business developments and provided manag-
ers with fact-based understanding for better insights to make marketing decisions.
From this period on, production data, including processes, sales, and customer care,
was recorded and accumulated, and innovative computing techniques played a critical
role. For the first time, organizations made considerable investments in customized,
large-scale information systems. The analysis process required far more time to
obtain data than to analyze it. The process, which took weeks or months to conduct
an analysis, was slow and meticulous. Therefore, it was critical to discover the most
important questions and concentrate on them. In this era, analytics focused on
better decision-making opportunities to enhance predictions on specific primary
components.
The first generation of analytics was followed by the era of Analytics 2.0, known
as the era of Big Data. Analytics 1.0 continued until the mid-2000s, when inter-
net-based companies, such as Google, Facebook, Twitter, and LinkedIn, started
to collect and analyze new types of information (Davenport & Court, 2014). In
this Analytics 2.0 phase, the industry adopted new tools to have innovative ana-
lytic techniques and attract customers. For instance, LinkedIn built many big data
products, such as Network Updates, People You May Know, Groups You May Like,
Jobs You May Be Interested In, Companies You May Want to Follow, and Skills
and Expertise (Davenport & Court, 2014). This improvement required powerful
infrastructure and analytics talents who can develop, obtain, and master different
types of analytics tools and technologies. A single server could not perform analysis
of big data.
Hadoop, an open-source framework for fast, parallel batch data processing
across clusters of servers, was introduced. NoSQL database systems were employed to manage
relatively unstructured data. Cloud computing was adopted to manage
and analyze large volumes of information. Machine-learning (ML) techniques were also
applied to quickly create models from the rapidly shifting data. Data analysts
had to organize data for statistical analysis with scripting languages, such as
Python, Hive, and Pig. Spark, an open-source cluster computing system especially
suited to streaming data, and R, a language and environment for statistical computing and graphics,
became popular as well. The required capabilities for Analytics 2.0 were much
higher than those for Analytics 1.0. The quantitative analysts of Analytics 2.0 were
named data scientists, who had both computational and analytical skills.
The revolutionary companies in Silicon Valley started to pay attention to ana-
lytics to promote shopper-facing products and services. They drew consumers’
attention to their websites through certain computational algorithms, recommen-
dations from other users, product reviews, and targeted advertisements, which are
driven by analytics drawn from enormous quantities of data.
When analytics entered the 3.0 phase, almost every company in every industry
began making decisions on products and services based on data analytics. An enormous
amount of data has been created whenever companies produce, deliver, and do
marketing for products and services or interact with customers. For example, UPS
entirely reorganized its delivery routes by implementing big data analytics. UPS
has traced and recorded package shipment and delivery since the 1980s, accumulating
information on an average of 16.3 million package transactions and
movements daily and handling 39.5 million daily tracking requests. The telematics
sensors installed in more than 46,000 UPS trucks trace metrics such as speed,
braking, direction, and drivetrain performance. This tracked data captures daily
performance and also informed a substantial reconstruction of delivery
routes. UPS's On-Road Integrated Optimization and Navigation (ORION) is a
proprietary route optimization software that relies greatly on online map data and
optimization algorithms.
This analytics solution will eventually be able to redesign a UPS driver’s pick-
ups and distributions in real time. It is also known as “the world’s largest opera-
tions research project” due to its scale and scope (Davenport & Court, 2014). In
2015, UPS deployed ORION on 70% of the U.S. routes identified as part of the initial
deployment, which helped UPS reduce driving miles, fuel consumption, and
car emissions. With the full deployment of the ORION system in 2016, UPS
expected to save $300–$400 million by saving 10 million gallons of fuel, reducing
100 million miles, and reducing 100,000 metric tons of carbon dioxide production
annually, which is equal to taking more than 20,000 passenger vehicles off the road
each year. ORION cost about $250 million at full deployment, but it had already saved
UPS more than $320 million by the end of 2015 (Informs.org, 2017).
The ORION project demonstrates that a company’s application of analytics
not only improved internal business decisions but also generated values, including
environmental protection. This is the core of Analytics 3.0. Companies employed
analytics to assimilate information and provide examples while hinting at other
possible applications. Internet-based firms, such as Google, Facebook, Amazon,
eBay, and Facebook, processing enormous amounts of streaming data, have become
frontiers of this approach. These online companies have bloomed mainly by assist-
ing their customers to make effective decisions and actions, and not by providing
simple information for them. This strategic shift signifies an advanced function for
analytics within institutions. Companies need to identify a division of related tasks
and react to new competence, environments, and preferences.
social media data, analyzing the gathered data, and disseminating findings as
appropriate to support business activities such as intelligence gathering, insight
generation, sense-making, problem recognition/opportunity detection, problem
solution/opportunity exploitation, and/or decision-making undertaken in response
to sensed business needs” (p. 4).
tools (e.g., Google Trends). Lastly, data are available through an application
programming interface (API). An API exposes data over HTTP for system-
to-system data transfer. Examples include Wikipedia’s DBpedia, Twitter, and
Facebook (Batrinca & Treleaven, 2015).
Raw data are the data as the source creates it. It may contain errors and may
have never been fixed. Cleaned data are raw data that has had some amount of pre-
processing to remove errors such as typos, wrong facts, outliers, or missing informa-
tion. Value-added data are cleaned data that have been augmented with additional
knowledge gleaned from the analysis (Batrinca & Treleaven, 2015).
As for data formats, data is most commonly encoded into hypertext markup
language (HTML), extensible markup language (XML), JavaScript object nota-
tion (JSON), or comma-separated value (CSV) files. HTML is a markup language
for authoring web pages. It is responsible for delivering page structure directives
(tables, paragraphs, and sections) and content (text, multimedia, and images) to
a web browser. XML is a markup language used to create structured textual data.
XML markup functions as metadata that describes the content contained within
the markup. Both XML and HTML wrap content between start and end tags
(e.g., <div> and </div>). JSON is a lightweight data structure derived from a subset of the
JavaScript programming language. It has a simple key-value structure that is simul-
taneously a machine- and human-readable format. Because JSON uses conventions
familiar to programmers of C-family languages, it is used in many data feeds. CSV refers to any
file that has a single record per line, fields that are separated by a comma, and text
that is encoded using ASCII, Unicode, or EBCDIC (Batrinca & Treleaven, 2015).
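As a concrete illustration, the sketch below encodes one hypothetical social media record as JSON and as CSV and parses both in R, assuming the jsonlite package for the JSON side.

    # One hypothetical record in JSON and CSV form, parsed in R
    # (jsonlite assumed for JSON; read.csv is base R).
    library(jsonlite)

    json_record <- '{"user": "alice", "text": "Great product!", "retweets": 3}'
    csv_record  <- "user,text,retweets\nalice,Great product!,3"

    as_json <- fromJSON(json_record)                 # named list (key-value)
    as_csv  <- read.csv(text = csv_record, stringsAsFactors = FALSE)

    as_json$retweets      # 3
    as_csv$retweets[1]    # 3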
After data are captured, analysts can proceed to the understanding phase. In
this step, the actual value of SMA is revealed. Value is defined through the lens
of the analyst. An analyst working under the auspices of a corporation has a dif-
ferent concept of value than an analyst working to support a political candidate.
Holsapple et al. (2014) offer the following benefits of business SMA: improved
marketing strategy, better customer engagement, better customer service, reputa-
tion management, and new business opportunities. Review websites like Epinions.com
and Amazon allow customers to post comments about an experience with a
product or service. SMA can provide insights regarding these experiences that in
turn inform marketing strategy.
Customer engagement improves when businesses target customer values and
provide additional channels for two-way communications. Customer service
improves as companies become more in tune with the needs of their clients. As
an example, SMA identified stock-picking experts on the Motley Fool's CAPS
voting site. Picks by these experts consistently outperformed other voters'. These
stock picks were provided to Motley Fool customers, creating a better product
and improving customer service. SMA assists with reputation management
by allowing a business to extract sentiment insight from social media content
related to the company. Finally, SMA can identify new business opportunities by
monitoring distinctive phrases that spread rapidly on social media applications.
These phrases offer insights that can help a firm’s decision on providing a new
product or what new features to add to an existing product. That is because
SMA has played a role in each stage of the product development lifecycle: design
development, production, utilization, and disposal (Fan & Gordon, 2014). In the
design development phase, trend analysis can identify a change in customer atti-
tude and desires for a product. Application development firms regularly release
multiple product versions and then solicit feedback directly from potential end
users. Responses then drive the next product feature sets. In extreme cases, end
users and software developers cocreate products using social media as a collabora-
tion platform. In the production phase, SMA allows firms to anticipate changing
demand by scaling production either up or down. A company can also use social
media data to track issues posted by competing companies regarding a supplier,
thus allowing the company to avoid the same problem.
Furthermore, SMA is most commonly leveraged during the product adoption
phase. Firms closely monitor brand awareness (introduction to the brand), brand
engagement (connection to the brand), and word of mouth (customer chatter about
a brand). Metrics, like the number of Tweets and followers, indicate an excite-
ment (positive or negative) for a brand. Customer segmentation (grouping custom-
ers based on shared attributes) allows a firm to tailor marketing messages for a
particular group. In the product disposal stage, SMA can follow, and companies
can directly communicate with, consumers on disposal strategy. This is especially
useful when product disposal creates an environmental risk (e.g., toxic batteries).
Because product disposal is often accompanied with product replacement, firms
can use SMA to market replacement products selectively.
Similar to people in business, politicians benefit from SMA as well. The ben-
efit, however, is different as politicians have wildly divergent needs. Reputation
and impression management are of utmost importance politically. Politicians are
interested in how people discuss them, new topics that might trigger crises or
sway sentiment against them, and measuring the degree of influence they exert.
Additionally, SMA is applied in an exploratory fashion to monitor for “late-
breaking” topics thus gaining advanced notification and a longer preparation
period (Stieglitz & Dang-Xuan, 2013). Politicians use social media to gain sup-
port, encourage civic engagement, and promote themselves. President Obama’s
2008 campaign is widely considered the first campaign where social media and
SMA had a measurable and significant effect on the election outcome (Grubmüller,
Götsch, & Krieger, 2013).
Stieglitz and Dang-Xuan (2012) propose an analytics framework for con-
ducting SMA in the political realm consisting of two major parts: data tracking
and monitoring, and data analysis. Data tracking and monitoring are concerned
with techniques for acquiring social media data while data analysis pertains to
analysis methods and reporting. For data analysis, they suggest three approaches:
topic or issue-related, opinion or sentiment-related, and structural. Topic or issue-
related analysis or issue management uses text mining techniques to identify topics
(issues) that might become a crisis or scandal and damage reputation. Opinion or
sentiment-related analysis is an attempt to identify citizen (voter) opinion regarding
a topic. Politicians use such information to make informed decisions about upcom-
ing votes, which issues to address or refute, and decision-making. The structural
analysis attempts to identify key players in a community network. This information
allows a politician to seek favor with influential people. The same approach can be
extended to entire communities if it appears that individual communities exert
more influence on an issue.
◾ Historical data: Previously amassed and stored social, news, and business-related data.
◾ Real-time data: Streaming data from social media, news agencies, financial
trading, telecommunication services, and global positioning system (GPS)
devices.
Data also subdivides into raw, cleaned, and value-added data based on the level of
the process. Raw data are primary, unprocessed data directly from the source,
which may include errors. Cleaned data are raw data processed to remove
erroneous parts. Value-added data are cleaned data that have been analyzed, tagged,
or augmented with additional information. Analysts need access to historical and real-time social media
data, particularly the primary sources, to conduct comprehensive research. In many
cases, it is beneficial to combine different data sources with social media data. For
example, opinions about negative financial news, such as a downslide of the stock
market, might be presented in social media. Thus, when conducting textual data
analysis of social media, considering multiple data sources will potentially help
detect the context and perform deeper analysis with a better understanding of
the data. The aggregation of data sources is certainly the focus of future analytics
(Batrinca & Treleaven, 2015).
Data Acquisition
There are four ways to directly acquire data from social media, including scraping,
application programmable interfaces (API), RSS feed, and file-based data acquisi-
tion. Scraping is gathering social media data from the Internet, and this is usually
unstructured text data. Scraping is also known as site scraping, web data extraction,
web harvesting, or web data mining. Analysts can collect social media data
systematically if social media data repositories provide programmatic HTTP-
based access to the data through APIs. Facebook, Twitter, and Wikipedia pro-
vide access via APIs. An RSS feed is a systematic method of streaming social media
data to deliver to its subscribers. An RSS feed is considered semi-structured data.
File-based data is usually acquired from spreadsheets and text files.
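To make the API route concrete, the sketch below issues an HTTP GET request against a hypothetical JSON endpoint using the httr and jsonlite packages in R. The URL and query fields are illustrative placeholders; real social media APIs, such as Twitter's, additionally require authentication.

    # Sketch of API-based acquisition in R (httr and jsonlite assumed).
    # The endpoint and query fields are hypothetical placeholders; real
    # APIs such as Twitter's also require OAuth authentication.
    library(httr)
    library(jsonlite)

    resp <- GET("https://api.example.com/v1/posts",
                query = list(q = "data science", count = 100))

    if (status_code(resp) == 200) {
      posts <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
      # posts now holds structured records ready for cleaning and analysis
    }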
Sentiment Analysis
Web 2.0 users can form, participate in, and collaborate with virtual online commu-
nities. Microblogs, such as Twitter, became user-generated information-abundant
resources, because users began sharing information and opinions in diverse online
events and domains, ranging from breaking news to celebrity gossip, to product
reviews, and to discussions about recent incidents, such as the Orlando massacre in
June 2016, the U.S. Presidential election, and hurricanes Harvey and Irma in 2017.
Social media includes a copious amount of sentiment-embodied sentences.
Sentiment refers to “a personal belief or judgment that is not founded on proof
or certainty” (WordNet 2.1 definitions), which may depict the emotional state of
the user, such as happy, sad, angry, or the author’s viewpoint on a topic. Sentiment
analysis, an important aspect of opinion mining, aims to discover whether the
polarity of a textual corpus, a collection of written texts, leans towards positive,
negative, or neutral sentiments.
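A minimal lexicon-based polarity scorer in R illustrates the basic idea; the two short word lists here are illustrative stand-ins for a full sentiment lexicon.

    # Minimal lexicon-based polarity scoring in R. The word lists are tiny
    # illustrative stand-ins for a real sentiment lexicon.
    positive <- c("good", "great", "happy", "love", "excellent")
    negative <- c("bad", "sad", "angry", "hate", "terrible")

    score_sentiment <- function(text) {
      tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
      score  <- sum(tokens %in% positive) - sum(tokens %in% negative)
      if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
    }

    score_sentiment("I love this great phone")      # "positive"
    score_sentiment("terrible service, very sad")   # "negative"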
During the past decade, sentiment analysis has been a popular research area in
SMA, particularly on Twitter because of the accessibility to diverse fields. Scholars
have applied sentiment analysis to business predictions (Liu, Huang, An, & Yu,
2007), politics (Park, Ko, Kim, Liu, & Song, 2011), and finances (Dergiades, 2012), among other domains.
Topic Modeling
According to Blei, Ng, and Jordan (2003), “topics” are probability distributions
across all terms in the dictionary. For example, the topic “education” can be associ-
ated with terms such as “students,” “teachers,” “schools,” and so on, with high prob-
ability, and the likelihood of the rest of the words in the dictionary—which typically
will include hundreds or thousands of terms—will be near 0. Latent Dirichlet allocation (LDA) assumes that the
author of a document produces text by following a generative model, according
to which, given the document, first a topic is selected from a corresponding con-
ditional multinomial distribution of topics, and then, given the topic, words are
chosen from the multinomial distribution of terms that corresponds to that topic.
Since the problem of estimating the parameters of the multinomial (Dirichlet) dis-
tributions of documents across topics and topics across terms is intractable, they are
estimated using Markov chain Monte Carlo simulations. LDA-based topic modeling
has two advantages. First, utilizing clues from the context, the topic models connect
words with related meanings and separate uses of words from multiple meanings
(McCallum, 2002). Second, since the topic modeling method is performed auto-
matically based on a mathematical algorithm, subjective bias in analyzing data is
minimized. A tutorial for LDA is available in ConText (Diesner et al., 2015).
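For illustration, an LDA topic model of this kind can be estimated in R; the sketch below assumes the tm and topicmodels packages, a hypothetical docs vector, and an analyst-chosen number of topics k.

    # Sketch of LDA topic modeling in R (tm and topicmodels assumed).
    # docs is a hypothetical character vector of documents.
    library(tm)
    library(topicmodels)

    corpus <- VCorpus(VectorSource(docs))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             stopwords = TRUE))

    # Fit a five-topic model; k is the analyst's choice, and the seed
    # makes the estimation reproducible.
    lda <- LDA(dtm, k = 5, control = list(seed = 1234))

    terms(lda, 10)     # top ten terms per estimated topic
    topics(lda)[1:5]   # most likely topic for the first five documents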
Latent semantic analysis (LSA) facilitated topic extraction as well as document similarity and was introduced
as a novel way of automatic indexing and information retrieval in library systems
in the early 1990s (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).
According to Deerwester et al. (1990), “this approach takes advantage of implicit
higher-order structure in the association of terms with documents (‘semantic struc-
ture’) in order to improve the detection of relevant documents on the basis of terms
found in queries” (p. 1). LSA has been extended as a theory of meaning where topics
are represented in two equivalent forms: as linear combinations of related terms and
as linear combinations of relevant documents (Foltz, Kintsch, & Landauer, 1998). As
a statistical estimation method, LSA quantifies a collection of records by applying the
vector space model (VSM), which arranges text-based data into a term-by-document
matrix where term frequencies are recorded. LSA then employs singular value decom-
position (SVD), which is an extension of principal component analysis and quantifies
patterns of term-document co-occurrence using least squares estimation. After SVD,
an analysis similar to what is done in numerical factor analysis can produce interpre-
table topics (Evangelopoulos, Zhang, & Prybutok, 2012; Sidorova, Evangelopoulos,
Valacich, & Ramakrishnan, 2008). Despite the more rigorous statistical estimation
in LDA, LSA has higher computational efficiency, provides reproducible results, and
is readily available in several implementation packages (Anaya, 2011).
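The core LSA computation can be sketched in base R: build a term-by-document matrix, apply SVD, and read the retained singular dimensions as latent topics. Here tdm is assumed to be a term-document matrix built with tm, as in the earlier sketches.

    # Core of LSA in base R: SVD of a term-by-document matrix (tdm assumed
    # built with tm, as in the earlier sketches).
    m <- as.matrix(tdm)          # terms in rows, documents in columns
    s <- svd(m)                  # singular value decomposition
    k <- 3                       # number of latent dimensions retained

    # Each retained left-singular vector is a latent "topic": a linear
    # combination of terms. Print the highest-loading terms per dimension.
    for (j in 1:k) {
      loadings <- s$u[, j]
      names(loadings) <- rownames(m)
      print(names(sort(abs(loadings), decreasing = TRUE))[1:8])
    }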
Visual Analytics
Visual analytics (Cook & Thomas, 2005) draws on methods from information
visualization (Card et al., 1999) and computational modeling. Visual analytics has
considerably contributed to SMA (Diakopoulos et al., 2010; Hassan et al., 2014).
Visual analytics employs interactive visualization to support analytical reasoning,
rather than static displays of analysis results (Cook &
Thomas, 2005). The analytic process starts with a high-level assessment that leads
analysts to interesting features of the data. Then the analysts can reconstruct the
perspective by filtering or generating brand-new visualizations, which help them
explore better via the qualitative data analysis.
Brooker, Barnett, and Cribbin (2016) address the advantages of the visual analytic
approach to social media data. By integrating data collection and data analysis
into a single process, exploring a dataset may inspire innovative ideas that eventually
result in new rounds of data collection informed by findings discovered during the
analysis.
Stream Processing
Data analytics of real-time social media involves large quantities of temporal data
arriving with little latency. This process requires applications that support online analysis of
data streams. Traditional database management systems (DBMSs), however, lack a
built-in concept of time and cannot manage online data in real time, which
led to the development of data stream management systems (DSMSs) (Hebrail, 2008).
DSMSs can process data in main memory without first saving it to the system, which enables
them to deal with transient online data streams and to run continuous queries over
the streaming data (Botan et al., 2010). Commercial DSMSs include the CEP engine
(Oracle), StreamBase, and StreamInsight (Microsoft) (Chandramouli et al., 2010).
Taking Twitter as an example, Tweets generated from public accounts, which represent
more than 90% of Twitter accounts, can be retrieved in JSON format, including replies
and mentions. The Twitter Search API is used to request past data
on Twitter, and the Streaming API, filtered by user ID, search keyword, or geospatial
location, is used to request a real-time stream of Tweets (Batrinca & Treleaven, 2015).
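As a toy illustration of a continuous query, the sketch below reads JSON records line by line, in the JSON-lines form in which Tweets are commonly delivered, and emits a keyword count for each tumbling window of 100 records. The file connection is a stand-in for a live feed.

    # Toy continuous query over a JSON-lines stream in R (jsonlite assumed):
    # count keyword mentions in tumbling windows of 100 records. The file
    # connection stands in for a live feed such as the Streaming API.
    library(jsonlite)

    con <- file("tweets.jsonl", open = "r")   # hypothetical stream source
    window_size <- 100
    count <- 0
    seen  <- 0

    while (length(line <- readLines(con, n = 1)) > 0) {
      tweet <- fromJSON(line)
      if (grepl("analytics", tweet$text, ignore.case = TRUE)) count <- count + 1
      seen <- seen + 1
      if (seen == window_size) {              # window closes: emit and reset
        cat("mentions in this window:", count, "\n")
        count <- 0
        seen  <- 0
      }
    }
    close(con)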
Business Applications
Business applications refer to commercial software tools with which users can collect, search,
and analyze text for business purposes. For example, SAS Sentiment Analysis Manager,
included in the SAS Text Analytics program, allows users to access source content,
such as social media outlets, text data inside the organization, and websites, and to generate
reports about customers and competitors in real time (Batrinca & Treleaven, 2015).
RapidMiner is also a popular tool providing an open-source community edition as
well as a fee-based enterprise edition (Hirudkar & Sherekar, 2013). RapidMiner
offers data mining and ML procedures. These procedures include data extraction,
transformation, and loading (ETL), data visualization, structuring, assessment, and
deployment. RapidMiner, written in Java, adopts learning schemes and attribute
evaluators from the Weka ML environment, and the R project is employed for sta-
tistical modeling schemes in RapidMiner. Similar to SAS Enterprise Miner, IBM
SPSS Modeler is one of the most popular applications that support various data
analytics tasks.
Examples of data visualization tools include SAS Visual Analytics, Tableau, Qlik
Sense, Microsoft Power BI, and IBM Watson Analytics.
In marketing, the UGC of social media is used for detecting customers' online and offline opinions. Anderson and Magruder (2012) applied a regression discontinuity approach to identify a causal relationship between Yelp ratings and restaurant dining reservations. The study found that restaurants with higher star ratings obtained more reservations, with top-rated restaurants becoming fully booked. Srinivasan, Rutz, and Pauwels (2016) examined the influence of a mix of marketing activities on users' Internet activity and sales. Applying a vector autoregressive (VAR) model, the researchers investigated the relationships among price and marketing channels, TV advertisements, the number of commercial search clicks and website visits, Facebook activity, and sales. In addition, paid search was influenced by TV advertisements and the number of Facebook likes, while it affected marketing channels, website visits, likes on Facebook, and sales (Moe, Netzer, & Schweidel, 2017).
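For readers who want to see the class of model involved, the following sketch fits a small VAR with statsmodels; the series are synthetic stand-ins, so it illustrates the mechanics rather than reproducing Srinivasan et al.'s data or specification.

```python
# Illustrative VAR estimation with statsmodels, in the spirit of (not
# reproducing) Srinivasan et al. (2016); the weekly series are synthetic
# stand-ins for sales, paid-search clicks, and Facebook likes.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 120  # hypothetical number of weekly observations
levels = pd.DataFrame({
    "sales": rng.normal(100, 10, n).cumsum(),
    "search_clicks": rng.normal(50, 5, n).cumsum(),
    "fb_likes": rng.normal(20, 2, n).cumsum(),
})
data = levels.diff().dropna()  # difference the levels toward stationarity

results = VAR(data).fit(maxlags=2, ic="aic")  # lag order chosen by AIC
print(results.summary())

# Impulse responses, e.g., how a shock to likes propagates into sales:
irf = results.irf(8)
irf.plot(impulse="fb_likes", response="sales")
```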
In the field of biosciences, social media is used to support smoking cessation, weight management, and disease surveillance by gathering data on peers for behavioral change initiatives. Penn State University biologists (Salathe et al., 2012) created novel applications and methods to trace the transmission of contagious diseases through data from news websites, blogs, and social media.
In computational social science, social media has been adopted to observe pub-
lic responses to political issues, public agenda, events, and leadership. As an illustra-
tion, Lerman, Gilder, Dredze, and Pereira (2008) applied computational linguistics
to forecast the news impact on the public regarding political candidates. Others
explored how Twitter is used within the election context to predict election results
(DiGrazia, McKelvey, Bollen, & Rojas, 2013) by discovering candidates’ patterns
of political practice (Bruns & Highfield, 2013). These applications focused more on
the behaviors of the candidates and paid less attention to the behavior of the public.
Social media provides a new channel for political candidates to get closer to the
public, and these public spheres also open communication channels for the online
audience to connect with each other and get involved with antagonistic politics.
Conclusions
Social media is a fundamentally measurable information communication technology that extends web-based communications. Web 2.0 is the second transformative phase of the WWW, and the social nature of Web 2.0 makes it possible to foster social media platforms such as Facebook, Twitter, LinkedIn, Reddit, YouTube, and wikis. These social media platforms provide open environments for collective intelligence that creates collaborative content, and this cogenerated content increases in value with increasing adoption. Convenient access to the APIs of social media platforms has resulted in an explosion of social data creation and the use of SMA.
Besides identifying definitions and processes of SMA, this chapter focused on introducing major techniques and tools for SMA. The presented techniques include social media data scraping, sentiment analysis, topic modeling, visual analytics, and stream processing. Representative fields of social media analytics are business, bioscience, and computational social science. One critical issue regarding SMA is that social media platforms are increasingly limiting access to their data in order to profit from their content. Data scientists and researchers have to find ways to collect large-scale social media data for research purposes at reasonable cost. Otherwise, computational social science could become the privilege of big organizations, resourceful government agencies, and elite scholars that can afford costly social media data, and the studies they conduct would be hard to evaluate or replicate.
References
Anaya, L. H. (2011). Comparing latent Dirichlet allocation and latent semantic analysis as classifiers (Doctoral dissertation). University of North Texas, Denton, TX.
Anderson, M., & Magruder, J. (2012). Learning from the crowd: Regression discontinuity
estimates of the effects of an online review database. The Economic Journal, 122(563),
957–989.
Armano, D. (2014, July 23). Six social media trends for 2010. Retrieved July 7, 2017, from
https://hbr.org/2009/11/six-social-media-trends
Batrinca, B., & Treleaven, P. C. (2015). Social media analytics: A survey of techniques,
tools and platforms. AI & Society, 30(1), 89–116.
Bengston, D. N., Fan, D. P., Reed, P., & Goldhor-Wilcock, A. (2009). Rapid issue tracking:
A method for taking the pulse of the public discussion of environmental policy.
Environmental Communication, 3(3), 367–385.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of
Computational Science, 2(1), 1–8.
Borders, B. (2009). A brief history of social media. Retrieved December 5, 2010, from
http://socialmediarockstar.com/history-of-social-media.
Botan, I., Derakhshan, R., Dindar, N., Haas, L., Miller, R. J., & Tatbul, N. (2010).
SECRET: A model for analysis of the execution semantics of stream processing sys-
tems. Proceedings of the VLDB Endowment, 3(1–2), 232–243.
boyd, D., & Ellison, N. (2010). Social network sites: Definition, history, and scholarship. IEEE Engineering Management Review, 38(3), 16–31.
Brooker, P., Barnett, J., & Cribbin, T. (2016). Doing social media analytics. Big Data &
Society, 3(2), 2053951716658060.
Bruns, A., & Highfield, T. (2013). Political networks on Twitter: Tweeting the Queensland state election. Information, Communication & Society, 16(5), 667–691.
Bruns, A., & Liang, Y. E. (2012). Tools and methods for capturing Twitter data during
natural disasters. First Monday, 17(4), 1–8.
Card, S. K., Mackinlay, J. D., & Shneiderman, B. (Eds.). (1999). Readings in information
visualization: Using vision to think. San Francisco, CA: Morgan Kaufmann.
Carraher, S. M., Parnell, J., & Spillan, J. (2009). Customer service-orientation of small
retail business owners in Austria, the Czech Republic, Hungary, Latvia, Slovakia, and
Slovenia. Baltic Journal of Management, 4(3), 251–268.
Chandramouli, B., Ali, M., Goldstein, J., Sezgin, B., & Raman, B. S. (2010). Data stream
management systems for computational finance. Computer, 43(12), 45–52.
Chung, W. (2016). Social media analytics: Security and privacy issues. Journal of Information
Privacy and Security, 12(3), 105–106.
Cohen, A. H. (2010, December 27). 10 social media 2010 highlights. Retrieved July 8, 2017,
from https://www.clickz.com/10-social-media-2010-highlights-data-included/53386/
Cook, K. A., & Thomas, J. J. (2005). Illuminating the path: The research and development
agenda for visual analytics. Los Alamitos, CA: IEEE Computer Society.
Davenport, T. H., & Court, D. B. (2014, November 5). Analytics 3.0. Retrieved August 26,
2017, from https://hbr.org/2013/12/analytics-30
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6), 391.
Dergiades, T. (2012). Do investors’ sentiment dynamics affect stock returns? Evidence from
the US economy. Economics Letters, 116(3), 404–407.
Diakopoulos, N., Naaman, M., & Kivran-Swaine, F. (2010, October). Diamonds in the
rough: Social media visual analytics for journalistic inquiry. In Visual analytics science
and technology (VAST), 2010 IEEE symposium on (pp. 115–122). IEEE.
Diesner, J., Aleyasen, A., Chin, C., Mishra, S., Soltani, K., & Tao, L. (2015). ConText: Network
construction from texts [Software]. Retrieved from http://context.lis.illinois.edu/
DiGrazia, J., McKelvey, K., Bollen, J., & Rojas, F. (2013). More tweets, more votes: Social
media as a quantitative indicator of political behavior. PloS One, 8(11), e79449.
Edosomwan, S., Prakasan, S. K., Kouame, D., Watson, J., & Seymour, T. (2011). The his-
tory of social media and its impact on business. Journal of Applied Management and
Entrepreneurship, 16(3), 79.
Evangelopoulos, N., Zhang, X., & Prybutok, V. (2012). Latent semantic analysis: Five meth-
odological recommendations. European Journal of Information Systems, 21(1), 70–86.
Fan, W., & Gordon, M. D. (2014). The power of social media analytics. Communications
of the ACM, 57(6), 74–81.
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence
with latent semantic analysis. Discourse Processes, 25(2–3), 285–307.
Gottfried, J., & Shearer, E. (2016, May 26). News use across social media platforms 2016.
Pew Research Center.
Greenwood, S., Perrin, A., & Duggan, M. (2016, November 11). Social media update 2016:
Facebook usage and engagement is on the rise, while adoption of other platforms
holds steady. Pew Research Center.
Grubmüller, V., Götsch, K., & Krieger, B. (2013). Social media analytics for future oriented policy making. European Journal of Futures Research, 1(1), 20.
Hansen, D., Shneiderman, B., & Smith, M. A. (2011). Analyzing social media networks with
NodeXL: Insights from a connected world. Burlington, MA: Morgan Kaufmann.
Hassan, S., Sanger, J., & Pernul, G. (2014, January). SoDA: Dynamic visual analytics of big
social data. In Big data and smart computing (BIGCOMP), 2014 international confer-
ence on (pp. 183–188). Piscataway, NJ: IEEE.
Hebrail, G. (2008). Data stream management and mining. Mining Massive Data Sets for
Security, IOS Press, 89–102.
Hirudkar, A. M., & Sherekar, S. S. (2013). Comparative analysis of data mining tools and
techniques for evaluating performance of database system. International Journal of
Computational Science and Applications, 6(2), 232–237.
Holsapple, C., Hsiao, S. H., & Pakath, R. (2014). Business social media analytics: Definition,
benefits, and challenges. Twentieth Americas Conference on Information Systems,
Savannah, GA.
Informs.org. UPS On-road integrated optimization and navigation (ORION) project.
Retrieved September 6, 2017, from https://www.informs.org/Impact/O.R.-Analytics-
Success-Stories/UPS-On-Road-Integrated-Optimization-and-Navigation-ORION-
Project
Jasra, M. (2010, November 24). The history of social media [Infographic]. Retrieved
December 4, 2010, from Web Analytics World: http://www.webanalyticsworld.
net/2010/11/history-of-social-media-infographic.html
Junco, R., Heibergert, G., & Loken, E. (2011). The effect of Twitter on college student
engagement and grades. Journal of Computer Assisted Learning, 27, 119–132.
Kaplan, A. M., & Haenlein, M. (2010). Users of the world, unite! The challenges and
opportunities of Social Media. Business Horizons, 53(1), 59–68.
Kevthefont (2010). Curse of the Nike advert-it was written in the future. Bukisa, p. 1.
Kohut, A., Keeter, S., Doherty, C., & Dimock, M. (2008). Social networking and online
videos take off: Internet’s broader role in campaign 2008. TPR Center, The PEW
Research Center.
Kontopoulos, E., Berberidis, C., Dergiades, T., & Bassiliades, N. (2013). Ontology-based sen-
timent analysis of Twitter posts. Expert Systems with Applications, 40(10), 4065–4074.
Kumar, A., & Sebastian, T. M. (2012). Sentiment analysis on Twitter. IJCSI International
Journal of Computer Science Issues, 9(4), 372.
Kurniawati, K., Shanks, G. G., & Bekmamedova, N. (2013). The business impact of social
media analytics. ECIS, 13, 13.
Lerman, K., Gilder, A., Dredze, M., & Pereira, F. (2008, August). Reading the markets:
Forecasting public opinion of political candidates by news analysis. Proceedings of the
22nd International Conference on Computational Linguistics, 1, 473–480.
Liu, Y., Huang, X., An, A., & Yu, X. (2007). ARSA: A sentiment-aware model for predict-
ing sales performance using blogs. In Proceedings of the 30th annual international
ACM SIGIR conference on research and development in information retrieval (pp. 607–
614). Amsterdam, the Netherlands, July 23–27.
McCallum, A. K. (2002). MALLET: A machine learning for language toolkit: Topic mod-
eling. Retrieved November 4, 2016, from http://mallet.cs.umass.edu/topics.php
Mejova, Y. (2009). Sentiment analysis: An overview. Comprehensive exam paper.
Retrieved February 3, 2010, from http://www.cs.uiowa.edu/~ymejova/publications/
CompsYelenaMejova.pdf
Moe, W. W., Netzer, O., & Schweidel, D. A. (2017). Social media analytics. In
B. Wierenga and R. van der Lans (Eds.), Handbook of marketing decision models
(pp. 483–504). Cham, Switzerland: Springer. https://www.springerprofessional.de/
en/handbook-of-marketing-decision-models/13301802
Obar, J. A., & Wildman, S. S. (2015). Social media definition and the governance challenge:
An introduction to the special issue. Telecommunications Policy, 39(9), 745–750.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and
Trends® in Information Retrieval, 2(1–2), 1–135.
Park, S., Ko, M., Kim, J., Liu, Y., & Song, J. (2011). The politics of comments: Predicting
political orientation of news stories with commenters sentiment patterns.
In Proceedings of the ACM 2011 conference on computer supported cooperative work
(CSCW ’11) (pp. 113–122). New York, NY: ACM.
Peña-López, I. (2007). Participative web and user-created content: Web 2.0, Wikis, and social
networking. Paris: OECD. Retrieved October 24, 2007, from http://213.253.134.43/
oecd/pdfs/browseit/9307031E.pdf
Rimskii, V. (2011). The influence of the Internet on active social involvement and the for-
mation and development of identities. Russian Social Science Review, 52(1), 79–101.
Ritholz, B. (2010). History of social media. Retrieved December 5, 2010, from http://www.
ritholtz.com/blog/2010/12/history-of-social-media/
Ruggiero, A., & Vos, M. (2014). Social media monitoring for crisis communication: Process,
methods and trends in the scientific literature. Online Journal of Communication and
Media Technologies, 4(1), 105.
Salathe, M., Bengtsson, L., Bodnar, T. J., Brewer, D. D., Brownstein, J. S., Buckee, C., …
Vespignani, A. (2012). Digital epidemiology. PLoS Computational Biology, 8(7),
e1002616.
Sidorova, A., Evangelopoulos, N., Valacich, J.S., & Ramakrishnan, T. (2008). Uncovering
the intellectual core of the information systems discipline. MIS Quarterly, 32(3),
467–482, A1–A20.
Sinha, V., Subramanian, K. S., Bhattacharya, S., & Chaudhary, K. (2012). The contem-
porary framework on social media analytics as an emerging tool for behavior infor-
matics, HR analytics and business process. Management: Journal of Contemporary
Management Issues, 17(2), 65–84.
Srinivasan, S., Rutz, O. J., & Pauwels, K. (2016). Paths to and off purchase: Quantifying
the impact of traditional marketing and online consumer activity. Journal of the
Academy of Marketing Science, 44(4), 440–453.
Stavrakantonakis, I., Gagiu, A. E., Kasper, H., Toma, I., & Thalhammer, A. (2012). An
approach for evaluation of social media monitoring tools. Common Value Management,
52(1), 52–64.
Stieglitz, S., & Dang-Xuan, L. (2013). Social media and political communication: A social
media analytics framework. Social Network Analysis and Mining, 3(4), 1277–1291.
Teufl, P., Payer, U., & Lackner, G. (2010, September). From NLP (natural language pro-
cessing) to MLP (machine language processing). In International conference on mathe-
matical methods, models, and architectures for computer network security (pp. 256–269).
Berlin: Springer.
Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of
the American Society for Information Science and Technology, 62(2), 406–418.
Westerski, A. (2008). Sentiment analysis: Introduction and the state of the art overview
(pp. 211–218). Madrid, Spain: Universidad Politecnica de Madrid.
Wihbey, J. (2015). How does social media use influence political participation and civic
engagement? A meta-analysis. Journalist Resource. Retrieved February 17 2016.
Wittwer, M., Reinhold, O., & Alt, R. (2016). Social media analytics in social CRM-
towards a research agenda. In Bled eConference, p. 32.
Wolfram, M. S. A. (2010). Modelling the stock market using Twitter. Scotland, UK:
University of Edinburgh.
Zeng, D., Chen, H., Lusch, R., & Li, S. H. (2010). Social media analytics and intelligence.
IEEE Intelligent Systems, 25(6), 13–16.
Chapter 8
Transactional Value Analytics in Organizational Development
Christian Stary
Contents
Introduction ..................................................................................................... 222
Value Network Analysis .....................................................................................223
Organizations as Self-Adapting Complex Systems .............................................224
Patterns of Interaction as Analytical Design Elements....................................... 226
Tangible and Intangible Transactions ................................................................ 226
Value Network Representation of Organizations ...............................................227
Analyzing the Value Network ............................................................................231
How to Ensure Coherent Value Creation ..........................................................238
Eliciting Methodological Knowledge .................................................................238
Externalizing Value Systems through Repertory Grids .......................................241
Developing Commitment for an Organizational Move .....................................245
Conclusive Summary ....................................................................................... 246
References .........................................................................................................247
Introduction
Stakeholders are increasingly involved in organizational change and development,
and thus, value creation processes (Tantalo & Priem, 2016). However, stakeholders
whose values are incongruent with the values of their organization are likely to be
disengaged in work (Rich, Lepine, & Crawford, 2010), and finally, organizational
development processes (Vogel, Rodell, & Lynch, 2016). Understanding value con-
gruence, that is, the relationship between stakeholder values and values created on
the organizational level, is therefore key to developing organizations (Edwards &
Cable, 2009), and a prerequisite to developing a culture for high performance
(Posner & Schmidt, 1993).
To avoid the emergence of behavior patterns such as pretending to fit into an organization of work (Hewlin, Dumas, & Burnett, 2017) when stakeholders feel that their values do not match those of the organization, a deeper analysis of values grounded in self-knowledge can help. Such analysis allows addressing relationships between value congruence, perceived organizational support, core self-evaluation, task performance, and organizational citizenship behavior (Strube, 2012). In the following, self-knowledge analytics based on individual stakeholder transactions is introduced as an integral part of an enriched value network analysis (VNA) approach (Allee, 2008). Thereby, interactions between stakeholder roles relevant for operation are analyzed. They are represented in diagrammatic
networks (termed holomaps) and can be tangible or intangible. Tangible interac-
tions encode role-specific deliverables that are created by persons acting in a cer-
tain role and need to be exchanged for task completion. Intangible interactions
encode deliverables facilitating business operations that are not part of formal role
specifications. However, they are likely to influence the successful completion of
business operations.
Since VNA targets organizational change as considered possible by stake-
holders, individual value systems need to be analyzed before changes can become
effective. This organizational transaction analysis ensures interactional coherence
based on individual value systems. In business-to-customer relationships, value
cocreation is well established (Payne, Storbacka, & Frow, 2008; Vargo, Maglio, &
Akaka, 2008). The subjective analytics in value alignment help to identify indi-
vidual capabilities and needs, which otherwise might be overlooked or unrecog-
nized (Jankowicz, 2001). In large organizations, software repositories may help
in understanding and evaluating requirements and encoding needs for change
(Selby, 2009). Dashboards have been designed not only to display complex information and analytics but also to track indicators with respect to volatility. The latter requires stakeholder judgment, an approach also termed user- or demand-driven analytics (Lau, Yang-Turner, & Karacapilidis, 2014), which focuses on participative elicitation processes, recognizing that any data analysis needs to make sense for the involved stakeholders.
Studies in that context reveal that decision making on variants or proposals not
only requires individual intellectual capabilities but also collective knowledge and
active sharing of expertise (Lau et al., 2014). In particular, in complex work settings,
people with diverse expert knowledge need to work together toward a meaning-
ful interpretation of interactional change (Treem & Leonardi, 2017). Thus, any
consensus-building mechanism needs to accommodate input from multiple human
experts effectively. This work proposes to enrich VNA methodologically by using
repertory grids (Easterby-Smith, 1976; Fransella & Bannister, 1977; Senior, 1996;
Boyle, 2005) to embody individual value systems into organizationally relevant
ones. This enrichment is intended to facilitate decision making in organizational
change management processes.
The chapter starts with an introduction to the methodological challenge by
providing the relevant background on VNA. It identifies those elements in VNA
that may cause fragmentation of value-driven alignment of change proposals,
which should become part of a more structured consensus-building process, in par-
ticular when business processes should be developed on that ground (Stary, 2014).
Several interviews have been conducted with experts dealing with aligning value
systems in stakeholder-driven change management, to identify suitable candidates
to enable value-system-based consensus building. A report on the interviews and
the evaluated candidates is given. Based on the results, the consensus-building
process can be supported by individually-generated repertory grids, which are
introduced subsequently. They allow individuals in certain roles to reflect on
their value system when acting in that role. In this way, individual stakeholders
can rethink organizational behavior changes offered by or to other stakeholders
and assess them according to their role-specific value system. An exemplary case
demonstrates the feasibility of this methodological enrichment and reveals first
insights from a typical use scenario, namely organizational sales development.
We conclude the chapter by summarizing the achievements and providing topics
for further studies.
Value Network Analysis
perspective, as they influence effectiveness and efficiency, and cause possible friction in operational processes (Allee, 2008).
VNA is meant to be a development instrument beyond engineering, as it aims to understand organizational dynamics, and thus to manage structural knowledge from a value-seeking perspective, for individuals and the organization as a whole. However, it is based on several fundamental principles and assumptions (Allee, 1997, 2003, 2008). (For the sake of preciseness, the VNA description closely follows the original texts provided by Allee, 2003.)
Organizations as Self-Adapting Complex Systems
Under formal reporting relationships, inherent dynamics may get overlaid and become completely controlled externally. In rapidly changing settings, self-organization of the concerned stakeholders is an effective way to handle requirements for change. However, for self-organization to happen, stakeholders need access to relevant information and, more importantly, an understanding of the organization and the situation as a whole. Both are required to make informed decisions and initiate socially effective action. Since the behavior of autonomous stakeholders cannot be fully predicted, organizations need rules to guide behavior management according to the understanding of stakeholders and their capabilities to change their behavior.
This need can be adeptly demonstrated in customer service. When there are
too many rules, customers are locked into a bureaucracy that seems unresponsive
to their needs. When there are too few rules, inconsistency and chaos are likely
in complex business cases. Stakeholders need to develop guiding principles that
effectively support them in organizational design and the respective decision mak-
ing through information and technology provision (Firestone & McElroy, 2003).
These principles need to tackle both tangible and intangible relationships and stakeholder interaction. Stakeholders should be qualified to reflect on their tangible and intangible exchanges, and finally to negotiate their own "protocols," that is, activities with those with whom they interact (Allee, 2008).
Although no one person or group of people can manage a complex system, the
stakeholders can self-organize their inputs and outputs and negotiate exchanges
with others in the organizational system as necessary. Modeling work and business relations as dynamic patterns of tangible and intangible exchanges helps stakeholders to identify individually consistent roles and understand the system. It also allows them to make the system transparent, and therefore communicable to coworkers, business partners, and the economic ecosystems of which they are part. According to Allee (2003), all organizational layers are concerned with the following:
Above all, stakeholders need to accept the dual nature in interaction in networked
ecosystems through tangibles and intangibles in order to learn how to engage in
conversations that matter.
Patterns of Interaction as Analytical Design Elements
In line with the living system perspective, VNA assumes that the basic pattern of organizing a business is that of a network of tangible and intangible exchanges.
Tangible exchanges correspond to flows of energy and matter, whereas intangible
exchanges point to cognitive processes. Describing a specific set of participating
stakeholders and exchanges allows a detailed description of the structure of any
specific organization or a network of organizations.
Although VNA considers the act of exchange to be a fundamental activity,
it goes beyond traditional economic understanding of stakeholder interactions.
Exchange includes goods, services, and revenue, but also considers the transaction
between stakeholders as representative of organizational intelligence, thus as a cog-
nitive interaction process. Transactions ensure successful task accomplishment and
business through cognitively reflected exchanges of information and knowledge
sharing, opening pathways for informed decision making. Hence, exchanges not
only have value per se, but encode the currently available collective intelligence
(determining the current economic success).
role provide to others to help keep business operations running. For instance, a
service organization asks sales experts to volunteer time and knowledge on organi-
zational development in exchange for an intangible benefit of prestige by affiliation.
Stakeholders involved in intangible transactions help to build relationships by
exchanging strategic information, planning knowledge, process knowledge, and
technical know-how, and in this way they share collaborative design work, perform-
ing joint planning activities and contributing to policy development. Intangibles,
like other assets, are increased and leveraged through deliberate actions. They affect
business relationships, human competence, internal structure, and social culture.
VNA considers intangibles as assets and negotiables that can actually be delivered
by stakeholders engaged in a knowledge exchange. They can be held accountable
for the effective execution of that exchange, as they are able to articulate them
accordingly when following the VNA’s structured procedure.
Although there are various attempts to develop new measures and analytical
approaches for calculating knowledge assets and for understanding intangible value
creation, traditional scorecards need to move beyond considering people as liabilities,
resources, or investments. Responsible stakeholders need to understand how intan-
gibles create value and, most importantly, how intangibles go to market as negotiables
in economic exchanges. As a prerequisite, they need to understand how intangibles
act as deliverables in key transactions with respect to a given business model.
Value Network Representation of Organizations
Value exchanges are modeled in a special type of concept map (Novak & Cañas, 2006), termed a holomap. Concept maps have turned out to be an effective means of articulating and representing knowledge (cf. Trochim & McLindon, 2017). They have been in use since the 1980s for acquiring mental models while graphically generating a coherent conceptual model, supporting both individuals and groups in sharing and planning processes (cf. Goldman & Kane, 2014). Recently, concept maps have served as a baseline for value analytics (cf. Ferretti, 2016) and have been used effectively for congruence analysis of individual cognitive styles (Stoyanov et al., 2017). Their ease of use allows triggering learning processes in distributed settings, for example, supported by web technologies (cf. Wang et al., 2017).
The VNA mapping from the observed reality to a role-specific concept map
(holomap) is based on the following elements:
When modelers create holomaps, they think of participants as persons they know
carrying out one or more roles in the organizational system at hand. Holomapping
is based on the assumption that only individuals or groups of people have the power
to initiate action, engage in interactions, add value, and make decisions. Hence,
VNA participants can be individuals, small groups or teams, business units, whole
organizations, collectives such as business networks or industry sectors, communi-
ties, or even nation-states. VNA does not consider databases, software, or other
technology to be a participant. It is the decision-making capability about which
activities to engage in that qualifies only humans as VNA participants.
Transactions or activities are represented by an arrow that originates with one
participant and ends with another. The arrow represents movement and denotes the
direction of addressing a participant. In contrast with participants, which tend to
be stable over time, transactions are temporary and transitory in nature. They have
a beginning, a middle, and an end point.
Deliverables are those entities that move from one participant to another. A deliv-
erable can be physical or tangible, like a document or a physical object, but it can also
be nonphysical, such as a message or request that may only be delivered verbally. It
can also be an intangible deliverable of knowledge about something, or a favor.
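One convenient way to hold such a holomap in code, a representational choice of ours rather than part of the VNA method, is a directed multigraph whose nodes are participants and whose edges carry the deliverable and its tangible or intangible nature; the roles below echo the chapter's sales example.

```python
# A holomap as a directed multigraph: participants are nodes, and each
# transaction is an edge carrying its deliverable and whether it is tangible.
# The representation (networkx) is our choice; roles echo the sales example.
import networkx as nx

holomap = nx.MultiDiGraph()
holomap.add_edge("Product Development", "Sales",
                 deliverable="product information", tangible=True)
holomap.add_edge("Presales", "Sales",
                 deliverable="incomplete information", tangible=False)
holomap.add_edge("Customer Service", "Sales",
                 deliverable="order handling report", tangible=False)

# A first analytic question: how many of a participant's inputs are intangible?
incoming = list(holomap.in_edges("Sales", data=True))
intangible = [d["deliverable"] for _, _, d in incoming if not d["tangible"]]
print(f"{len(intangible)} of {len(incoming)} inputs are intangible:", intangible)
```

From this structure, role-specific views such as the share of intangible inputs per participant can be computed directly.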
[Figure: holomap excerpt in which participants exchange deliverables such as product information, customer data, customer report preparation, and a quality report.]
Table 8.1 Sample Impact Analysis

Each input to the Sales group is recorded with: the deliverable; the sending (From) and receiving (To) participants; the activities that are generated; the impact on financial resources (positive/negative); the impact on intangible assets; the overall cost/risk of this input (H = high, M = medium, L = low); the overall benefit of this input; and the perceived value in the view of the recipient.

Order handling report (intangible), from Customer Service to Sales. Activities generated: evaluation of report. Financial impact: extra time and effort to be calculated. Intangible assets: knowledge on calculation schema and accounting; contact to presales; knowledge on competition and market. Cost/risk: H (2 hours), M (recalculation required). Benefit: M (documentation of order handling). Perceived value: neutral.

Incomplete information (intangible), from Product Development to Sales. Activities generated: additional information from product development to be collected. Financial impact: effort to be spent for information collection. Intangible assets: knowledge on product; expert interviews and knowledge; technical skills of experts. Cost/risk: H (3 hours), H (availability of experts required). Benefit: M (completeness of product information). Perceived value: neutral.

Delayed delivery (tangible), from Presales to Sales. Activities generated: delay in customer service to be communicated. Financial impact: loss of service time. Intangible assets: presales contact; presales reminder. Cost/risk: H (extension of loss of service time), H (loss of loyalty). Benefit: L. Perceived value: −2.

Incomplete information (intangible), from Presales to Sales. Activities generated: additional information from presales to be collected. Financial impact: effort to be spent for information collection. Intangible assets: knowledge on sales transaction; presales contact; appointment. Cost/risk: M (1 hour), M (availability of presales required). Benefit: M (completeness of product information). Perceived value: −1.

Product information (tangible), from Product Development to Sales. Activities generated: material to be studied. Financial impact: time and effort to be studied. Intangible assets: knowledge acquisition; informed external contacts; qualification. Cost/risk: H (2 hours per feature), M (request for clarification). Benefit: H (informed collaboration). Perceived value: +1.

Updates (tangible), from Product Development to Sales. Activities generated: feature(s) to be studied. Financial impact: time and effort for studying. Intangible assets: knowledge acquisition; informed method application; qualification. Cost/risk: H (1 hour per feature), M (request for clarification). Benefit: H (informed collaboration). Perceived value: +2.

Leads (tangible), from Presales to Sales. Activities generated: update of CRM database. Financial impact: time and effort for update. Intangible assets: knowledge on opportunities; CRM timeliness; data accuracy. Cost/risk: M (5 minutes per record), M (clarification required). Benefit: H (documentation of opportunities). Perceived value: +2.
in time. The Presales group needs to be contacted for availability to overcome short-
comings in customer relationship management.
The value creation analysis not only reflects the situation as it is but also supports proposing changes to the way a participant is committed to a delivery. Although the structure of this analysis is similar to that of the impact analysis, it focuses instead on a participant's capability to extend value to other participants represented in the holomap. It analyzes the tangible and intangible costs (or risks) and gains for each value output, since each value output could add new tangible or intangible value and thus extend value to other participants. By assessing each value output, a participant needs to determine the activities, resources, and processes required, as well as the costs and anticipated benefits of each value-creating activity.
In Table 8.2 a sample analysis is given for the Sales group. An important value
creation for the Sales group is to add value to product features received from the
Product Development group by providing timely information for interviewers and
the Customer Service group (in terms of what is actually ordered). Their role in
contributing to timely information in cooperation with the Product Development
group is actually very small, being of both low cost for them and also low benefit
in case of not actively requesting updates. Hence, they could be an active agent
to extend value through targeted requests and other efforts to reach the Product
Development group. They would need to engage in a value conversion process in
which they convert one relevant type of value to another with respect to a work-
ing product lifecycle. Engaging in increased value creation actually explores new strategic possibilities and changes costs and benefits for the organization. It is the
moment where a member of the organization develops an offer to others by indi-
vidual behavior change instead of elaborating the expected behavior of others.
Looking at the value outputs that go to the Product Development group, the
Sales group can open a tangible communication channel to support the process
used by the Product Development group, and an intangible transaction of feeding
back product-relevant information. In this way, the Sales group can leverage their
intangible value outputs (information about product use) into more advanced prod-
uct development that could be turned into a revenue stream. In the example, the
communication channels were mostly one way, conveying product knowledge from
the Product Development group to the Sales group. By converting that expected communication channel into another type of value gain, the Sales group enables two-way communication, returning feedback to the Product Development group. Both converting an intangible value to gain a tangible value and establishing an intangible deliverable with product feedback support the strategic intent of a rapid response to the changing needs of customers.
Although value creation analysis can become very rich due to anticipated
changes in the flow of deliverables, the participants need to understand what
impact a particular output has on the participant who receives it.
Table 8.2 Sample Value Creation Analysis

For each deliverable (with its From and To participants), the value creation analysis records: whether the recipient highly values this deliverable, rated from strongly agree (+2), agree (+1), neutral (0), and disagree (−1) to strongly disagree (−2); tangible asset utilization (H = high, M = medium, L = low); the tangible costs (financial and physical resources); the risk factor in providing this output (H/M/L); intangible asset utilization across human competence, internal structures, and business relationships (H/M/L); other intangible benefits (industry, society, environment); how value is added to, enhanced, or extended; the overall combined cost/risk for this input; and the overall benefit of providing this input.

In the sample entry for the Sales group, intangible asset utilization is rated high (H) throughout, other intangible benefits include customer care and customer satisfaction of data, and the overall benefit of providing the input is rated high.
From an organizational development perspective, a participant could maximize the effectiveness and efficiency of a certain business operation by following the created value. The overall cost-benefit analysis could yield excellent figures. However, a closer analysis of the proposed value creation could reveal inconveniences for the involved participants, in particular business partners and customers (Augl & Stary, 2015, 2017). For instance, collecting context data from interviewers when applying the product in the field could easily lead to rejection of the proposed value creation once they consider a request for preparing such a report a negative value input.
Before proposed value creations can become effective in business operations,
they need to be acknowledged by the involved participants as well as by respon-
sible management. To allow the constructive elaboration of value creations affect-
ing the collective, each member of the organization should be empowered with an
instrument enabling him or her to reflect on individual values. Applying such an
instrument requires stepping out of the VNA logic while providing a baseline for
discussing proposals for change. Once value-creation proposals are congruent with
individual value systems, the resulting business operation becomes coherent for the
involved stakeholders.
Eliciting Methodological Knowledge
The methods were picked from a selected collection of practically tested methods for
gaining and analyzing stakeholder knowledge (Stary, Maroscher, & Stary, 2012). The
repertory grid technique (Fransella & Bannister, 1977; Fromm, 1995; Goffin, 2002)
supports an understanding of how individuals perceive a certain phenomenon in a
structured way, namely by identifying the most relevant characteristics of the carriers
of those characteristics (e.g., roles, persons, objects, situations) on an individual scale.
The technique enables not only identifying attributes that relate to the carriers but also eliciting explanations of these characteristics. Repertory grids were introduced by Kelly (1955) based on the personal construct theory in psychology, and
have been used successfully in a variety of contexts including expert, business, and
work knowledge elicitation (Ford, Perty, Adams-Webber, & Chang, 1991; Gaines &
Shaw, 1992; Hemmecke & Stary, 2006; Stewart, Stewart, & Fonda, 1981), strategic
management (Hodgkinson, Wright, & Paroutis, 2016), supplier-manufacturer rela-
tionship management (Goffin, Lemke, & Szwejczewski, 2006), learning support
(Stary, 2007), attitude studies (Honey, 1979), team performance (Senior, 1996;
Senior & Swailes, 2004), project management (Song & Gale, 2008), and consumer
behavior (Kawaf & Tagg, 2017).
The critical incident technique (Flanagan, 1954) is an exploratory qualitative
method that has been shown to be both reliable and valid in generating a compre-
hensive and detailed description of a content domain. The technique consists of
asking eyewitnesses for factual accounts of behaviors. The critical incident tech-
nique consists of a set of procedures for collecting direct observations of human
behavior in such a way as to facilitate their potential usefulness in solving practical
problems and developing psychological principles. By “incident,” what is meant
is any observable human activity that is sufficiently complete in itself to permit
inferences and predictions to be made about the person performing the act. To be
critical, an incident must occur in a situation where the purpose or intent of the act
seems fairly clear to the observer and where its consequences are sufficiently definite
240 ◾ Analytics and Knowledge Management
to leave little doubt concerning its effects. In this way, role-specific behavior can be
valued by observers, including their transactions with other stakeholders.
The critical incident technique outlines procedures for collecting observed incidents that have special significance and meet systematically defined criteria, for example, behavior considered to be most effective or efficient in a certain situ-
ation (Butterfield, Borgen, Amundson, & Maglio, 2005). The many and varied
applications of the technique comprise counseling (Hrovat & Luke, 2015), tourism
(Callan, 1998), service design (Gremler, 2004), organizational merger (Durand,
2016), and healthcare (Eriksson, Wikström, Fridlund, Årestedt, & Broström,
2016). In contrast to repertory grids, which are grounded on introspection, the
critical incident technique relies on making observations about other people, thus
valuing the behavior of others. In doing so, people make notes when observing
others’ behavior to reconstruct the meaning later.
Storytelling has evolved from systemic representations of narratives (Livo & Rietz, 1986) into what is today mainly termed digital storytelling, as personal stories are kept in digital form to preserve relevant happenings and to be shared via the web (Lundby, 2008).
A story represents the world around us as perceived by specific individuals (story-
teller as creator and filter). Depending on its purpose, a story can be more like an
objective or fact-driven report or a dedicated trigger for change addressing persons
in certain social or functional roles (Denning, 2001). Digital storytelling, due to its
narrative and social structure, is used to capture not only individual characteristics
(Cavarero, 2014) but also workplace specifics (Swap, Leonard, Shields, & Abrams,
2001) and organizational issues (Boje, 1991). Similar to repertory grids, stories are
created from introspection, although they result in a linear text in contrast to a
matrix representing individual value systems.
The interviews with the value management experts lasted around 30 minutes each. Since they had various backgrounds (marketing, information systems, psychology, linguistics, etc.) and diverse work experience, most of the analysts could identify methods not included in the selection offered in item 5 of the interview guide. However, they agreed that repertory grids are the most effective, since they constitute a noninvasive method for eliciting value systems: the externalization procedure (see next section) does not direct the way respondents answer or think. A personal construct can be understood as the underlying mechanism of a personality that enables an individual to interpret a situation. In addition to individual attitudes, repertory grids enable the expression of how individuals interpret a task procedure or technical system feature, or perceive others' behavior in a certain situation. Recording an individual's point of view concerning an object or phenomenon, as well as the individual's contrasting viewpoint, constitutes an individual's value system. Hereby, a phenomenon is likely to have a distinct meaning for each individual expressing it, which enables even varying experiences of situations to be represented. Yielding insights in such a way into an individual's value system may prompt a discussion regarding whether a certain behavior or fact can be accepted by members of an organization. The repertory grid technique therefore allows for the maximum bandwidth of responses.
The result is a validated value system in a given context that can be applied to reflect
on value creations, in the test case in the participants’ context of sales activities.
Furthermore, the participants need to be told that elements should have cer-
tain properties to support the elicitation of meaningful constructs. Ideally, they
should be: (1) discrete—the element choice should contain elements on the same
level of hierarchy, and should not contain sub-elements; (2) homogeneous—it
should be possible to compare the elements, for example, things and activities
should not be mixed within one grid; (3) comprehensible—the person from
whom the constructs are elicited should know and understand the elements,
otherwise the results will be meaningless; and (4) representative—the elicited
construct system will reflect the individually perceived reality, once the element
choice is representative for the domain of investigation.
The focus of the field test was work behavior and thus stakeholders acting
in particular roles, particularly those involved in the setting addressed by the
VNA holomapping. These are likely to form the repertory grid’s elements. In
the field test, the four participants were asked to identify between five and
eight elements as entities they could refer to in the business case addressed by
the holomap. These elements should allow naming certain properties of the
situations relevant for meeting the objectives of that business case. Since these
elements (e.g., roles, persons, objects, and task settings) stem from a running
business operation involving actual stakeholders, the participants were briefed
to use anonymous abbreviations, labels, or names when denoting the ele-
ments. The participants were also asked to include an element they consider to
be ideal in the addressed business case, to gain an understanding of the antici-
pated improvements of the transactional behavior through value creation.
Elicitation of constructs: When it comes to identifying the constructs (i.e., prop-
erties of the identified elements), we need to be aware that they are generated
through learning about the world when acting in the world. Hence, construct
systems, and thus value systems, depend on the experiences individuals make
during their lifetime and, moreover, depend on the common sociocultural
construct systems. Since construct systems change over time, they need to
be considered dynamic entities. As such, elicited knowledge through grids is
only viable for a certain person in a certain physical and social context.
Elicitation of constructs in a repertory grid interview in the field test was
concerned with understanding the participant’s perception or personal con-
struct system of work behavior. Constructs may be elicited by using one, two,
or three elements at a time. We used the triad form, that is, three out of
the specified five to eight elements during each elicitation iteration (see also
Table 8.3, presenting a repertory grid for Sales participant X who identified A,
B, C, D, and I (Ideal) as elements).
Our participants’ sample reflects an average age of 29 years (median of 33),
with the youngest being 24 and the oldest being 60. There were more men than women (three vs. one). According to the objectives, the participants
have a tight range of employment, from Sales to Presales. All participants
have lived in Austria for a substantial part of their life.
In the triad elicitation, three elements are compared with each other
according to their similarities and differences. The person is asked to specify
“some important way in which two of them are alike and thereby different
from the third” (Fransella & Bannister, 1977, p. 14). The elicited similarity
between the two elements is recorded as the construct. Subsequently, the
person has to detail the contrast with the following question: “In what way
does the third element differ from the other two?” The construct (pair) elici-
tation continues as long as new constructs can be elicited. Table 8.3 shows a
sample grid that was elicited from Sales participant X following the elicita-
tion steps 1 to 3.
Step 1: Select triad: Each element selected by the interviewed participant was
written on a single card. The front of each card showed the name of the
respective element. In the case at hand, to ensure that the participants
could remember the addressed elements when using pseudonyms, abbre-
viations, or symbols, they were asked to make a coding list before starting
the construct elicitation procedure. After that, the interviewer turned the
index cards upside-down and mixed them, and the participant was asked
to pick three of the index cards, at random. This move prevented the
participant from knowingly selecting a specific card.
Step 2: Elicitation of raw constructs: After the participant had selected three
index cards, the participant was asked: “In the context of sales, how are
two of these elements the same, but different from the third?” The par-
ticipant expressed his or her construct system concerning the character-
istic that differentiates the selected elements. The participant was then
asked to provide a short label to describe the emergent pole (how two
are the same). For instance, Sales participant X identified “openness” for
the A and B elements in the first elicitation round (data lines 1 and 2 in Table 8.3).
If you get a report without any context and personal purpose, you feel like
a mere fact checker without any background information on how a sales
process went so far and you are lost if you are supposed to arrange the next
step in customer relationship management. In particular, once the data
indicates you could have known more about a sales process when studying
the subsequent report from that person which indicates information you
should have received before launching the latest intervention.
Rating the constructs for each element: The third phase of a repertory grid elicitation session is the rating of the elements according to the elicited constructs. The mutual relations of elements and constructs can be explored by using a rating procedure, typically on a 5- to 7-point scale. In the case at hand, the rating ranged from +3 to −3 (excluding 0), with +3, +2, +1 rating the strength of the property (+3 is the strongest) for left-side constructs (left-hand side of Table 8.3), and −3, −2, −1 rating the strength of the property (−3 is the strongest) for right-side constructs, also termed contrasts (right-hand side of Table 8.3). For instance, "openness" was rated for element D as very strong (+3 in line 2), as was "formal reporting" for element I (−3, since it refers to the contrast, rated from −3 to −1).
Construct saturation: The overall aim of construct saturation is to obtain as many
unique constructs as required to externalize the value system of a participant.
Since even people acting in the same role (e.g., Sales) in a certain organiza-
tional context are likely to have different backgrounds, there is no common
rule regarding when to stop eliciting constructs. Once all possible combina-
tions of elements have been selected, a second round of selections could make
sense, in case the participant wants to add more constructs to the already
elicited ones in the grid. Typically, a repertory grid with four to six elements
has not more than 10 to 15 lines of entry. However, a dedicated check of
whether construct saturation has been reached is strongly advised. In the field
test this strategy was applied. We even asked the participants whether certain
constructs were considered unique to ensure that a new construct was elicited
when the participants gave a statement reflecting on a triad.
Analysis and synthesis of the result: The goal of the analysis of a repertory grid is
to represent the construct system in a way that the participant gets explicit
insights into his or her own view about the corresponding elements. Finally,
other individuals should come to an understanding about the participant’s
way of thinking. The most straightforward consideration of a grid is to look
at the ideal (I) element entries and interpret the pattern representing the ideal
case, object, or person. In addition, bi-plots derived from principal component analysis and dendrograms derived from cluster analysis can be generated (a minimal computational sketch follows this list).
In bi-plots the relations of elements and constructs can be displayed, whereas
in dendrograms only the relations of elements or the relations of constructs
can be visualized. However, it is not necessary to use computer programs for
analysis, especially when the results of a single person are subject to analysis.
Content analysis (e.g., as proposed by Fromm, 1995) allows not only determining the meaning of constructs but also clustering them according to a category system.
Testing plausibility: A check of the plausibility of the named constructs should
be conducted with independent third parties to ensure the reliability and
stability of the results, as well as the accuracy of the data (Jankowicz, 2004).
Initially, we verified the clarity of constructs by asking two domain experts not participating in the elicitation to read each construct carefully and explain its meaning in their own words in the context of the addressed setting (i.e., sales). The constructs were adjusted in cases where they were not self-explanatory. Statements from the laddering step in the elicitation phase served as a valuable input to that task.
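As referenced in the analysis step above, the following sketch shows how a rated grid can be analyzed computationally; the element ratings are hypothetical, and PCA plus hierarchical clustering merely stand in for the bi-plots and dendrograms mentioned in the text.

```python
# Analysis sketch for a rated repertory grid: elements A-D plus Ideal (I),
# rated +3..+1 toward the left pole and -1..-3 toward the right pole (no 0).
# All ratings here are hypothetical illustrations, not field-test data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram

elements = ["A", "B", "C", "D", "I"]
constructs = ["openness / formal reporting",
              "upfront information / reporting on demand",
              "method development / routine execution"]
# Rows are constructs, columns are elements.
grid = np.array([
    [ 2,  1, -2,  3,  3],
    [ 1,  2, -1,  2,  3],
    [-1,  3, -2,  1,  2],
])

# Bi-plot coordinates: project the elements onto two principal components.
coords = PCA(n_components=2).fit_transform(grid.T)
for name, (x, y) in zip(elements, coords):
    print(f"element {name}: ({x:+.2f}, {y:+.2f})")

# Dendrogram over constructs (transpose the grid to cluster elements instead).
dendrogram(linkage(grid, method="average"), labels=constructs)
plt.show()
```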
Developing Commitment for an Organizational Move
improvement is in line (since positively rated for the ideal Sales participant) with
the positively rated constructs “openness,” “upfront information,” and “interested in
method development.” The same holds for the participant’s role in contributing to
timely information in cooperation with the Product Development group. Becoming
an active agent, and thus extending value through targeted requests and other efforts
to reach the Product Development group, is consistent with all positively rated con-
structs of the Ideal element. In this way, Sales participant X can check whether the
value creation is in line with his or her individual value system, before actively pro-
moting this change in the ongoing organizational development process.
In receiving new input from others (e.g., the Presales group), the same approach can
be applied. In our case, a Presales participant in the value creation analysis converted
the intangible deliverable “incomplete information” to a tangible one. Hence, a pos-
sible value creation for this Presales participant is to add value to customer relations
and product features received from the Product Development group by providing
indicators for incomplete information, since it may affect the work of interviewers
and the Customer Service group. Again, by checking his or her repertory grid, such
an improvement is fully acceptable to Sales participant X, since it is in line (i.e.,
positively rated for the Ideal Sales participant) with the constructs of “openness” and
“upfront information.” Sales participant X can support this change proposal based on
his or her elicited and rated constructs in the context of sales processes.
As these two examples reveal, the individually-valid repertory grid can be uti-
lized to check incoming and outgoing transactions as part of the value creation
analysis, and ascertain whether they fit the individual value system. Based on this information, a value creation can either be promoted or opposed before it is implemented at the organizational level.
Conclusive Summary
Value congruence plays a crucial role in organizations once individual proposals for improving business operations are discussed. We have looked at the case of individual stakeholder transaction analyses, since VNA focuses on value creation while targeting stakeholder- and business-conform exchange patterns. As the approach considers the exchange of both tangible (contracted) and intangible (voluntarily provided) deliverables, value creations need to be aligned with individual value systems before becoming a collective practice.
Various methods, in particular the repertory grid technique, the critical inci-
dent technique, and storytelling, exist to elicit individual value systems. Several experts were interviewed, drawing on the listed candidate methods as well as the experts' experiential knowledge, to identify the best choice for eliciting value systems in stakeholder-driven change management. They agreed on a consensus-building process supported by individually generated repertory grids. These grids allow individuals in certain roles to reflect on their value system when acting in that role.
References
Allee, V. (1997). The knowledge evolution: Expanding organizational intelligence. Amsterdam,
the Netherlands: Butterworth-Heinemann.
Allee, V. (2003). The future of knowledge: Increasing prosperity through value networks.
Amsterdam, the Netherlands: Butterworth-Heinemann.
Allee, V. (2008). Value network analysis and value conversion of tangible and intangible
assets. Journal of Intellectual Capital, 9(1), 5–24.
Augl, M., & Stary, C. (2015). Communication- and value-based organizational develop-
ment at the university clinic for radiotherapy-radiation oncology. In S-BPM in the
wild (pp. 35–53). Berlin, Germany: Springer International Publishing.
Augl, M., & Stary, C. (2017). Adjusting capabilities rather than deeds in computer-
supported daily workforce planning. In M. S. Ackerman, S. P. Goggins, T. Herrmann,
M. Prilla, & C. Stary (Eds.), Designing healthcare that works. A sociotechnical approach
(pp. 175–188). Cambridge, MA: Academic Press/Elsevier.
Boje, D. M. (1991). The storytelling organization: A study of story performance in an office-
supply firm. Administrative Science Quarterly, 36(1), 106–126.
Boyle, T. A. (2005). Improving team performance using repertory grids. Team Performance
Management: An International Journal, 11(5/6), 179–187.
Butterfield, L. D., Borgen, W. A., Amundson, N. E., & Maglio, A. S. T. (2005). Fifty years
of the critical incident technique: 1954–2004 and beyond. Qualitative Research, 5(4),
475–497.
Callan, R. J. (1998). The critical incident technique in hospitality research: An illustration
from the UK lodge sector. Tourism Management, 19(1), 93–98.
Cavarero, A. (2014). Relating narratives: Storytelling and selfhood. New York, NY: Routledge.
Dalkir, K. (2011). Knowledge management in theory and practice. Cambridge, MA: MIT
Press.
Denning, S. (2001). The springboard: How storytelling ignites action in knowledge-era organi-
zations. New York, NY: Routledge.
Durand, M. (2016). Employing critical incident technique as one way to display the hid-
den aspects of post-merger integration. International Business Review, 25(1), 87–102.
Easterby-Smith, M. (1976). The repertory grid technique as a personnel tool. Management
Decision, 14(5), 239–247.
Edwards, J. R., & Cable, D. M. (2009). The value of value congruence. Journal of Applied
Psychology, 94(3), 654.
Eriksson, K., Wikström, L., Fridlund, B., Årestedt, K., & Broström, A. (2016). Patients’
experiences and actions when describing pain after surgery—A critical incident tech-
nique analysis. International Journal of Nursing Studies, 56, 27–36.
Trochim, W. M., & McLinden, D. (2017). Introduction to a special issue on concept map-
ping. Evaluation and Program Planning, 60, 166–175.
Vargo, S. L., Maglio, P. P., & Akaka, M. A. (2008). On value and value co-creation:
A service systems and service logic perspective. European Management Journal, 26,
145–152.
Vogel, R. M., Rodell, J. B., & Lynch, J. W. (2016). Engaged and productive misfits: How
job crafting and leisure activity mitigate the negative effects of value incongruence.
Academy of Management Journal, 59(5), 1561–1584.
Wang, M., Cheng, B., Chen, J., Mercer, N., & Kirschner, P. A. (2017). The use of web-
based collaborative concept mapping to support group learning and interaction in an
online environment. The Internet and Higher Education, 34, 28–40.
Chapter 9
Data Visualization
Practices and Principles
Jeonghyun Kim and Eric R. Schuler
Contents
Introduction ......................................................................................................251
Data Visualization Practice ................................................................................253
Multidimensional Visualization ....................................................................253
Hierarchical Data Visualization ................................................................... 260
Data Visualization Principles .............................................................................265
General Principles .........................................................................................267
Specific Principles: Text.................................................................................271
Specific Principles: Color ..............................................................................271
Specific Principles: Layout ............................................................................272
Implications for Future Directions ....................................................................273
Note ..................................................................................................................274
References .........................................................................................................274
Introduction
The ever-increasing growth of Big Data has impacted every aspect of our modern
society, including business, marketing, government agencies, health care, academic
institutions, and research in almost every discipline. Due to its inherent properties of excessive volume, variety, and velocity, Big Data is very difficult to store, process, and extract intended information from; that is, it becomes harder to distill a key message from this universe of data. Accordingly, the demand for data analytics,
which is the process of collecting, examining, organizing, and analyzing large data
sets for scientific understanding and direct decision-making, has increased in recent
years. Various forms of analytics, including predictive analytics, data mining, and
statistical analysis, have been applied and implemented in practice; this list can be
further extended to cover data visualization, artificial intelligence, natural language
processing, and database management to support analytics.
Among those capabilities that support analytics, data visualization1 has a critical role
in the advancement of modern data analytics (Bendoly, 2016). It has become an active
and vital area of research and development that aims to facilitate reasoning effectively
about information, allowing us to formulate and test hypotheses, to find patterns and
meaning in data, and to easily explore the contours of a data collection from different
perspectives and at varying scales. As Keim et al. (2008) asserted, the goal of visualiza-
tion is to make the way of processing data and information transparent for an analytic
discourse. The visualization of these processes provides the means for communicating about the processes themselves, rather than leaving us only with the results. Visualization fosters the construc-
tive evaluation, correction, and rapid improvement of our processes and models, and
ultimately, the improvement of the management of knowledge on all levels, including
personal, interpersonal, team, organizational, inter-organizational, and societal, as well
as enhancement of our decision-making process and management capabilities.
To achieve this goal, various visualization techniques have been developed and
proposed, evolving dramatically from simple bar charts to advanced and interactive
3D environments, such as virtual reality visualizations, for exploring terabytes of
data. A rapidly evolving landscape in the business intelligence market led to interest-
ing innovations in areas such as behavioral and predictive analytics; now visualiza-
tion is increasingly being integrated with analysis methods for visual analytics. This
integration offers powerful and immediate data exploration and understanding. The
number of tools and software solutions that integrate visualization and analysis,
including Power BI,2 Tableau,3 Qlik,4 and Domo,5 has grown rapidly over the past decade while keeping pace with the growth of data and data types. Toolkits
for creating high-performance, web-based visualizations, like Plotly6 and Bokeh7,
are becoming quite mature. Further, various web applications supporting visual-
ization features have emerged. For instance, web-based notebooks like Jupyter8
1 Elsewhere, other terms are being used, such as scientific visualization, information visualiza-
tion, and visual analytics in a more restricted sense (Keim et al., 2008; Tory & Möller, 2004).
In this chapter, the term data visualization will be used in a broad sense, referring to the use of
visual representations to explore, make sense of, and communicate data.
2 https://powerbi.microsoft.com
3 https://www.tableau.com
4 https://www.qlik.com
5 https://www.domo.com
6 https://plot.ly
7 http://bokeh.pydata.org
8 http://jupyter.org
are interactive interfaces that can be used to ingest data, explore data, visualize data,
and create small reports from the results.
In recent years, discussions have arisen around what constitutes effective data visualization. Important questions have been raised regarding principles for data visualization: What constitutes a good visualization? How can we ensure that we select the best visualizations to expose the value in the data sets? These questions are critical and deserve empirical investigation, because a good visualization can make the process of understanding data effective, while a bad visualization may hinder the process or convey misleading information. Thus, the purpose of this chapter is to address these questions by reviewing good data visualization practices as articulated by leading scholars and practitioners, and by discussing key data visualization principles to ensure that data visualizations are used effectively.
Data Visualization Practice
Multidimensional Visualization
Column charts, sometimes termed bar charts, are utilized to compare values from different groups (Ajibade & Adediran, 2016). More specifically, if the groups
are categorical, a bar chart can be utilized. The bars can be vertical or horizontal,
depending on the number of groups and what is being presented. Bar charts help
visualize differences in the frequencies among distinct categories. Specifically,
bar charts allow individuals to accurately read numbers off the chart and view
trends across the groups (Zoss, 2016). There are some weaknesses of bar charts
that need to be noted. For example, if the categories have long titles, they can
be difficult to read. Additionally, it is important to pay attention to the numeri-
cal values of the y-axis as any axis that does not start at zero could distort the
visualization.
Another useful visualization is the line chart. Line charts are used to visual-
ize the relationship among several items over time (Ajibade & Adediran, 2016).
Additionally, various categories could be added to the visual so that each color line
reflects a different group. This allows readers to look at changes over time and across
groups. Using line graphs can be advantageous over stacked bar charts in that the
exact values can be easily read and the exact dates do not need to necessarily match
on the x-axis (Zoss, 2016). Line charts can be difficult to read if there are numerous
groups and the lines often intersect.
For example, More Fun Toys wanted to know which of their recent toys was
the most popular across the country in the last year. Since they are only inter-
ested in a visual of what toy was most frequently bought, we can use a bar chart
(Figure 9.1). Based on Figure 9.1, we can see that Super Fast Racer was the most
popular toy across the nation, closely followed by Happy Hamster and Shuffle-
bot. Furthermore, More Fun Toys was curious how three of their new toys
sold over the last four fiscal quarters; the three new toys were Happy Hamster,
[Figure 9.1 image: y-axis, units sold (in thousands); x-axis categories include Action Man, Clown Shoes, Existential Robot, Meta-Book, Woofy, Happy Hamster, and Shufflebot.]
Figure 9.1 Bar chart that shows how many units of each toy were sold over the
2016 fiscal year.
[Figure 9.2 image: y-axis, unit sales (in thousands).]
Figure 9.2 Line graph that presents the trends in sales for three toys over each
2016 fiscal quarter.
Ultimate Table Top, and Super Fast Racer. Figure 9.2 presents the number of
units (in thousands) that each toy sold in the last four fiscal quarters. As can be
seen in Figure 9.2, the product Happy Hamster had high initial sales that dimin-
ished over time, whereas Super Fast Racer had a slow start, but sales increased
each quarter.
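To make the worked example concrete, the following minimal R sketch draws a comparable bar chart and line chart. It is written in the spirit of the chapter note that all figures were simulated in R, but it is our own illustration: the toy names follow the text, while the sales numbers are invented placeholders rather than the values behind Figures 9.1 and 9.2.

```r
# Fictitious unit sales, loosely mirroring Figures 9.1 and 9.2
toys <- c("Action Man", "Clown Shoes", "Existential Robot", "Meta-Book",
          "Woofy", "Happy Hamster", "Shufflebot", "Super Fast Racer")
sold <- c(8, 5, 7, 4, 6, 17, 16, 19)            # units sold, in thousands

# Bar chart comparing categorical groups (cf. Figure 9.1)
barplot(sold, names.arg = toys, las = 2, cex.names = 0.7,
        ylab = "Units sold (in thousands)")

# Line chart tracking three toys over four fiscal quarters (cf. Figure 9.2)
quarters <- 1:4
hamster  <- c(110, 95, 80, 70)                  # high start, declining
tabletop <- c(85, 90, 88, 92)
racer    <- c(60, 75, 95, 120)                  # slow start, rising
plot(quarters, hamster, type = "b", col = "blue", ylim = c(50, 130),
     xaxt = "n", xlab = "Fiscal quarter", ylab = "Unit sales (in thousands)")
axis(1, at = quarters)
lines(quarters, tabletop, type = "b", col = "darkgreen")
lines(quarters, racer, type = "b", col = "red")
legend("topleft", c("Happy Hamster", "Ultimate Table Top", "Super Fast Racer"),
       col = c("blue", "darkgreen", "red"), lty = 1)
```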
Usually, data analysts are interested in not only describing the data but also
making inferences from the data, termed inferential statistics (Tabachnick &
Fidell, 2007). Prior to running any sort of inferential statistics, it is important
to understand the nature of the distributions. The most insightful methods to
view the distributions are through basic data visualizations to assess for normal-
ity, outliers, and linearity of variables. Oftentimes, this starts with a histogram or
distribution graph. A histogram is akin to a bar chart in that it has each value of
a variable on the x-axis and frequencies or number of occurrences on the y-axis
(Soukup & Davidson, 2002). A histogram can inform us of how data is spread
out, whether the data is skewed positively or negatively, and whether there are
potential extreme scores or outliers within the data that could affect measures of
central tendency (i.e., mean). An additional visual for the distribution of values
for a variable is a density plot. The density plot is a smoothed histogram that can
have a normal curve superimposed over it to compare the actual distribution with
a hypothesized normal distribution. This type of visualization can help researchers
determine if the distribution of scores is normal or non-normal, as well as visually
assess for skewness. For example, More Fun Toys was interested in the distribution
of customer satisfaction scores, with scores ranging from 0 (unsatisfied) to 100
(satisfied). The distribution of scores (Figure 9.3) appears to be mostly above a score
of 50 (i.e., neutral).
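A comparable histogram and density plot can be produced in base R. The sketch below uses invented satisfaction scores; the simulation parameters are illustrative assumptions, not the authors' settings.

```r
# Invented customer-satisfaction scores on a 0-100 scale
set.seed(42)
satisfaction <- pmin(pmax(rnorm(200, mean = 65, sd = 15), 0), 100)

# Histogram of the distribution (cf. Figure 9.3)
hist(satisfaction, breaks = 20, main = "",
     xlab = "Customer satisfaction (scale of 0 to 100)", ylab = "Frequency")

# Smoothed density plot with a hypothesized normal curve superimposed
plot(density(satisfaction), main = "", xlab = "Customer satisfaction")
curve(dnorm(x, mean = mean(satisfaction), sd = sd(satisfaction)),
      add = TRUE, lty = 2)  # dashed reference curve for visual comparison
```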
The boxplot is another visualization to graphically depict the distribution of val-
ues for a single quantitative variable. This allows researchers to view the measure's central tendency (i.e., the median) and the spread of values around it
[Figure 9.3 image: x-axis, customer satisfaction (scale of 0 to 100); y-axis, frequency.]
Figure 9.3 Histogram that shows the frequencies of scores on a customer satis-
faction survey.
(i.e., the interquartile range) (Soukup & Davidson, 2002). A boxplot is designed so that the data between the first and third quartiles (the middle 50% of the data) are visualized in a box, with the median dividing the box. The remaining
upper and lower 25% of the data create “whiskers” to reflect how spread out the
data are. This can be helpful to determine if there are potential floor effects (i.e.,
if many of the cases have a score of zero on the variable), which would be noted
in the visual as there being no lower whisker. Alternatively, ceiling effects, or
cases that are clustered at the highest value, could be depicted by the lack of an
upper whisker. In our scenario, More Fun Toys was interested in looking at the
marketing budget for the various products across each of the three divisions (i.e.,
Atlantic, Midwest, and West Coast). The amounts spent to advertise each product
were collapsed across the division. As can be seen in Figure 9.4, the West Coast
[Figure 9.4 image: y-axis, marketing budget for 2017 (in thousands).]
Figure 9.4 Boxplot that represents the median, quartiles, and potential outliers for
the aggregated project marketing budgets for the three divisions of More Fun Toys.
division spent far less on the various marketing programs, whereas the Midwest division far exceeded the average budget figures.
Figure 9.5 Violin plot that shows the quartiles and distribution of the marketing budget for the Midwest division. The magenta color is the distribution plot that has been mirrored on both sides. The black rectangle represents the data between the first and third quartiles. The white circle is the median, and the lines extending from the first and third quartiles are the whiskers of the boxplot. The data are the simulated data for the Midwest division (Figure 9.4).
Related to boxplots are violin plots (Hintze & Nelson, 1998), which combine
a boxplot and a doubled kernel density plot. As presented in Figure 9.5, violin
plots combine the features of the density plot with the information of a boxplot
in one easy-to-interpret visualization (Hintze & Nelson, 1998). The information
about the quartiles is denoted by lines and a thin box within the mirrored density
plot. This provides more information than either plot alone; however, it is not available within some commercial software packages, with the exceptions of R and SAS.
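As a hedged illustration of both plot types, base R's boxplot() and the contributed vioplot package (assumed installed) can reproduce displays similar to Figures 9.4 and 9.5; the budget figures below are fabricated for demonstration.

```r
# Fabricated marketing budgets (in thousands) for the three divisions
set.seed(7)
budget <- data.frame(
  division = rep(c("Atlantic", "Midwest", "West Coast"), each = 40),
  spend    = c(rlnorm(40, 5.8, 0.5),   # Atlantic
               rlnorm(40, 6.3, 0.6),   # Midwest: larger, more spread out
               rlnorm(40, 5.2, 0.4))   # West Coast: smaller budgets
)

# Boxplots by division (cf. Figure 9.4): box spans the first to third
# quartile, the bar is the median, whiskers show spread, points are outliers
boxplot(spend ~ division, data = budget,
        ylab = "Marketing budget for 2017 (in thousands)")

# Violin plot for one division (cf. Figure 9.5)
library(vioplot)  # contributed package; install.packages("vioplot") if needed
vioplot(budget$spend[budget$division == "Midwest"], names = "Midwest")
title(ylab = "Marketing budget (in thousands)")
```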
A scatterplot can be used when there is an interest in determining if there is
a linear or non-linear relationship between two or more quantitative variables
(Tabachnick & Fidell, 2007). Simply put, the scores on one quantitative variable
(X) are plotted on the x-axis with the corresponding second variable scores (Y)
on the y-axis. If there is an ellipse or a line, then there could be a linear relation-
ship between the variables, whereas a U-shaped pattern could be indicative of a
curvilinear relationship (Howell, 2013; Tabachnick & Fidell, 2007). These visuals
are often used to assess for statistical assumptions of linearity and as a first step to
determine the relationship of two variables prior to running additional analyses
(Howell, 2013). An additional benefit to a scatterplot is that it can be used to
detect multivariate outliers, where a case has normal scores on each measure sepa-
rately, but when assessed together, the combination of both scores makes it further
removed from the other cases (Khan & Khan, 2011; Tabachnick & Fidell, 2007).
[Figure 9.6 image: x-axis, number of promotions; y-axis, units sold (in thousands).]
Figure 9.6 Scatterplot that visualizes the relationship between units sold and the
number of marketing promotions in the Midwest division.
If there are more than two variables of interest, a three-dimensional scatterplot can
be utilized. The three-dimensional scatterplot provides x, y, and z axes, and the corresponding three scores for each case are placed as an orb in the visual space. As shown in Figure 9.6, the Midwest division of More Fun Toys was
interested in seeing if there was a relationship between the number of units sold
and the number of promotional deals that were done in the last year. Additionally,
the Midwest division was interested in whether the units sold and promotions had
a relationship with the customer satisfaction survey that was completed by some
customers after their purchase. This was done to get an idea of whether there was
a relationship with promotions and the other two variables, and if more time and
money should be utilized on promotional offers, as presented in Figure 9.7.
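A basic two-dimensional scatterplot of this kind takes one line of base R. The sketch below uses simulated promotion and sales values (our own assumption, not the chapter's data) and notes the contributed scatterplot3d package as one option for the three-variable case.

```r
# Simulated Midwest data: promotions versus units sold (cf. Figure 9.6)
set.seed(1)
promotions <- sample(0:10, 60, replace = TRUE)
units_sold <- 1.5 + 0.6 * promotions + rnorm(60, sd = 1)

plot(promotions, units_sold, xlab = "Number of promotions",
     ylab = "Units sold (in thousands)")
abline(lm(units_sold ~ promotions), lty = 2)  # rough linear trend line

# For three variables (cf. Figure 9.7), one option is the contributed
# scatterplot3d package: scatterplot3d(x, y, z) places each case as a
# point in x-y-z space.
```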
When there are three variables, an alternative to the scatterplot, called a bubble
chart, can be used. Bubble charts are typically used to represent the relationships
among three variables (Ajibade & Adediran, 2016). Specifically, the relationship
between the outcome (Y) and the first variable (X1) are graphed on the x and y-axis,
like a scatterplot. Then the third variable’s relationship with the other two variables
is reflected by the size of the bubble. Like a scatterplot, each bubble represents a
single observation. Figure 9.8 shows the relationship among the number of units
sold, the number of promotions from the Midwest division, and the customer feed-
back survey that was completed by individuals after making a purchase.
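In base R, a bubble chart can be drawn with the symbols() function, as in the following sketch with invented values:

```r
# Bubble chart (cf. Figure 9.8): bubble size carries the third variable
set.seed(2)
promos  <- sample(0:10, 30, replace = TRUE)
units   <- 1 + 0.7 * promos + rnorm(30)
satisfy <- runif(30, 20, 100)   # invented customer-satisfaction scores

symbols(promos, units, circles = sqrt(satisfy), inches = 0.25,
        fg = "grey40", bg = "lightblue",
        xlab = "Number of promotions", ylab = "Units sold (in thousands)")
# sqrt() maps values to radii so that bubble area, not radius, scales
# with the data, which keeps size comparisons honest
```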
Although line graphs are useful to look for trends in a continuous outcome vari-
able over time and across groups, they do not visualize the relative amounts of each
group to the whole over time. An area chart can be utilized to show changes over
time and view the groups and the total values simultaneously (Soukup & Davidson,
2002). More specifically, area graphs are used to visually compare categorical values
or groups over a continuous variable in the y-axis and how each group relates to the
[Figure 9.7 image: three-dimensional scatterplot with axes for number of promotions, units sold (in thousands), and customer satisfaction.]
Figure 9.8 Bubble chart that shows the relationship among the units sold,
amount of marketing promotions, and customer satisfaction for the Midwest
division.
whole. In Figure 9.9, More Fun Toys wanted to determine whether it would be ben-
eficial to update their online store. The company was interested in whether the new
layout would appear more welcoming to potential customers. Furthermore, they
were curious if there were gender differences between the number of pages that were
viewed and the overall time spent on the site. To determine this, the company conducted
[Figure 9.9 image: x-axis, time; y-axis, count; separate areas for female and male.]
Figure 9.9 Area chart that visualizes the amount of time spent on searching the
retail website and the number of page views by gender.
a small pilot study and monitored how long individuals spent on the More Fun
Toys website and recorded the number of pages viewed. The amount of time and
pages clicked are shown by gender in the area chart in Figure 9.9. As can be seen
in Figure 9.9, men tended to spend less time on the website compared to women;
however, the number of pages viewed appeared to be consistent across gender.
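A stacked area chart of this kind can be sketched with the ggplot2 package (assumed available); the browsing data below are invented for illustration.

```r
# Stacked area chart (cf. Figure 9.9) with ggplot2
library(ggplot2)
set.seed(3)
web <- expand.grid(time = 0:50, sex = c("Female", "Male"))
web$count <- with(web, pmax(0, round(
  ifelse(sex == "Female", 18, 12) * exp(-time / 25) + rnorm(nrow(web)))))

ggplot(web, aes(x = time, y = count, fill = sex)) +
  geom_area() +                       # areas stack, showing group and total
  labs(x = "Time", y = "Count", fill = "Sex")
```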
The last frequently used multidimensional visualization technique is a pie chart
that contains wedges of data that correspond to categories or groups. Typically,
the wedges of data represent percentages of the data, with all wedges adding up to
100% (Khan & Khan, 2011). There are three subtypes of pie charts: the standard
pie chart, an exploding pie chart, and a multilevel pie chart. The exploding pie
chart separates selected wedges from the rest to draw attention to them (Khan & Khan, 2011). Multilevel pie charts are useful when the data are hier-
archical in nature; the inner circle represents the parent level (i.e., upper level of
the nested data), and the outer circle is the child level (i.e., lower level of the nested
data). Pie charts create a visual to compare parts to the whole of one variable across
multiple categories. However, it is important to note that pie charts can make it
difficult to infer the precise value of each wedge, especially if there are numerous
categories that are then turned into very small wedges (Zoss, 2016).
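For completeness, a standard pie chart is a one-liner in base R; the division shares below are made up for illustration.

```r
# Standard pie chart: each wedge is a share of the whole (here, percentages)
share <- c(Atlantic = 34, Midwest = 41, "West Coast" = 25)
pie(share, labels = paste0(names(share), " (", share, "%)"))
```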
[Figure 9.10 image: tree map of U.S. states grouped into the Atlantic, Midwest, and West Coast divisions.]
Figure 9.10 Tree map that shows the number of unit sales for each state within
the three regional divisions during the 2016 fiscal year. The data is hierarchi-
cal in that states are grouped within the three divisions (Atlantic, Midwest, and
West Coast divisions). The green shading depicts the amount of sales and the
size of each box is based on the number of stores that carry Super Fast Racer in
each state.
[Figure 9.11 image: map of the United States plotted by latitude and longitude.]
Figure 9.11 Map visualization that shows the units of toys sold by state.
Figure 9.12 Interactive map visualization that shows the relationship among
units of the toys sold by store, geographical location, and estimated 2017 per
capita income. The filters include the five stores that carry the Super Fast Racer
(i.e., La Rogue, Online Purchase, Price Hiker, Super Price, and The Toy Chest). An
additional data layer of 2017 per capita income was added to explore for patterns
in the data.
[Figure 9.13 image: radial tree with age groups (toddler, young, teenager, adult), toy types (doll, educational, table top, card game), and genres (fantasy, space, horror, wild west).]
Figure 9.13 Radial tree graph that shows the hierarchical relationships of the
product age group, toy type, and toy genre classifications.
toddlers are only educational type toys and there are only fantasy dolls for tod-
dlers. Furthermore, only the young age group had all three types of toys. Based
on these findings, More Fun Toys could look into a future line of educational
space-themed toys for toddlers, since there are currently no toys that fall within
that classification. Additionally, More Fun Toys could add additional genres to
their adult card game selections, since they only carry two genres: a fantasy card
game and a horror card game.
Parallel coordinates plots allow the visualization of multidimensional relation-
ships of several variables rather than finite points (Inselberg & Dimsdale, 1990).
More specifically, data elements are plotted across several dimensions, with each
dimension related to a single y-axis (Khan & Khan, 2011). If a line chart or scat-
terplot is utilized with multi-dimensional data, the relationships across multiple
[Figure 9.14 image: parallel axes for income, humor, age, and education, each scaled from minimum to maximum.]
Figure 9.14 Parallel coordinates plot that shows the relationships among educa-
tional level, age, humor level, and income.
variables can be obscured (Khan & Khan, 2011). Within a parallel coordinates
plot, individual cases can be viewed across numerous dimensions to detect patterns
in a large data set (Khan & Khan, 2011). Parallel coordinates plots can be used to
identify groups of individuals based on several variables that share the same pat-
terns (Johansson, Forsell, Lind, & Cooper, 2008). However, it is important to note that if the plotted lines overlap, it can be difficult to identify characteristics and
subsequent patterns (Ajibade & Adediran, 2016). More Fun Toys was interested
in classifying customers into distinct groups to better market their products based
on levels of education, age, humor levels, and income. Figure 9.14 presents the
relationships among these variables in a sample of 250 customers. There was a wide
spread of education levels among the customers, with most being older based on
the lines from education to age. There appeared to be three potential groups based
on humor levels denoted by the clusters of lines from age to humor. Interestingly,
from humor scores to income, it appeared that individuals who had lower levels of
humor had an associated higher income and individuals who had higher humor
scores had lower incomes.
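A parallel coordinates plot similar to Figure 9.14 can be drawn with parcoord() from the MASS package; the customer variables below are simulated stand-ins, not the chapter's data.

```r
# Parallel coordinates plot (cf. Figure 9.14) via MASS::parcoord; each
# line traces one customer across four axes rescaled to [min, max]
library(MASS)
set.seed(4)
customers <- data.frame(
  education = sample(1:5, 250, replace = TRUE),  # invented variables
  age       = rnorm(250, 45, 12),
  humor     = rnorm(250, 50, 20),
  income    = rnorm(250, 60, 15)
)
parcoord(customers, col = rgb(0, 0, 1, alpha = 0.2))
# translucent lines (alpha = 0.2) ease the overlap problem noted above
```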
considered the problem of how to construct effective data displays for a long
time. Educational researchers who work with graphical displays are mainly
concerned with the effectiveness of charts and diagrams as instructional tools.
Human factors researchers are principally concerned with psychophysical prop-
erties of displays, especially dynamic displays, rather than the interpretation of
static displays of quantitative data (Sanders & McCormick, 1987). Designers,
statisticians, and psychologists have developed rules for constructing good
graphs and tables.
Yau (2009) asserted, “Data has to be presented in a way that is relate-able; it has to be humanized. Oftentimes we get caught up in statistical charts and graphs, which are extremely useful, but at the same time we want to engage users so that they stay interested, continue collecting data, and keep coming back to the site to gauge their progress in whatever they are tracking. Users should understand that the data is about them and reflect the choices they make in their daily lives” (p. 7).
In this context, good data visualizations should engage the audience as well as
improve both the accuracy and the depth of their understanding; both are critical
to making smarter decisions and improving productivity. Poorly created visualiza-
tions, on the other hand, can be misleading and undermine your credibility with
your audience, and they make it more difficult for the audience to cope with the daily data onslaught.
The most important principle of visualizing data is to address the following two
questions: What is the message you would like to convey? and Who is your audi-
ence? You should construct data visualization so it conveys the message in the most
efficient way, and the message one wants it to convey. You also need to work closely
with your intended audience to understand what actions they want to be able to
take based on the data they consume; thus, you need to identify the needs of your
audience. If the visualization doesn’t portray what the audience needs, it does not
work. Then the next step is to choose the right visualization type.
Different types of data require different visualizations. As discussed in the pre-
vious section, line charts, for example, are most useful for showing trends over time
or a potential correlation between two variables. When there are many data points
in your dataset, it may be easier to visualize the data using a scatterplot instead.
Histograms, on the other hand, show the distribution of the data, and the shape of
the histogram can change depending on the size of the bin width.
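As a quick illustration of the bin-width point (a sketch with invented data):

```r
# The same data can look quite different under different bin widths
set.seed(6)
x <- rnorm(500)
par(mfrow = c(1, 2))          # side-by-side comparison
hist(x, breaks = 5,  main = "Wide bins")
hist(x, breaks = 50, main = "Narrow bins")
par(mfrow = c(1, 1))          # restore the default layout
```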
In the next section, a general vision of graphing data, often considered a set of philosophical guidelines, is presented. The following section describes some detailed rules and best practices to ensure data visualization is efficient. It should be noted that the set of principles and rules for designing data visualizations presented in this chapter is neither complete nor does it address all the issues relevant to visualization design; however, it is expected that this set will provide some guidelines for visualization designers in selecting various visualization tools and primitives to effectively display the given data.
General Principles
A number of scholars have published lists of principles governing the definition or
construction of a “good” visualization. Tufte (1983), who is known as a pioneer
in the field of data visualization, described graphs as “instruments for reasoning
about quantitative information” (p. 10) and argued that graphs are usually the
most effective method to describe, examine, and summarize even the largest data-
sets, and if properly designed, displays of statistical information can communicate
clearly even complex information and are one of the most powerful data analytic
tools. In his classic book, “The Visual Display of Quantitative Information,” Tufte
provided principles for graphical excellence.
Building on Tufte’s standards, Wainer (1984) published his principles for displaying data badly, which he illustrated clearly and humorously with real-life graphs and tables. He stated that “the aim of good data graphics is to display data accurately
and clearly” (p. 137) and further breaks this definition into three parts: showing
data, showing data accurately, and showing data clearly, as presented in Table 9.1.
His principles provide some guidance for the design of graphs in the form of a
checklist of mistakes to be avoided. It is noteworthy that his principles address the
question of how to construct a graph, but do not address the question of why things
should be done this way.
Cleveland (1985) provides a more scientific treatment of the display of scientific
data that summarizes much of the research on the visual perception of graphic rep-
resentations. Unlike other scholars who took a more theorist’s intuitive approach,
Cleveland has actually tested some of his rules for making “good” graphs. He
and his colleagues have conducted many psychophysical investigations of the
elementary constituents of graphical displays by abstracting and then ordering a
list of physical dimensions. The results of those empirical works have influenced
Cleveland’s (1985) listing of the principles of graph construction. He has organized
his principles under headings that give the reader clues about how he thinks each of
his principles contributes to better data displays: clear vision, clear understanding,
scales, and general strategy, as shown in Table 9.2.
Many other authors have also reported studies attempting to figure out the best
methods of displaying data (Day, 1994; Wilkinson, 1999). Such works have been
Table 9.1 Wainer’s Twelve Principles for Displaying Data Badly
• Make the graph worse by using different baseline scales for variables on the same graph.
• Austria First! Order graphs and tables alphabetically to obscure structure in the data that would have been obvious had the data been ordered by some aspect of the data.
• Label: (a) illegibly, (b) incompletely, (c) incorrectly, and (d) ambiguously.
• More is murkier: (a) more decimal places and (b) more dimensions.
• If it has been done well in the past, think of another way to do it.
Table 9.2 Cleveland’s Listing of the Principles of Graph Construction

Clear understanding
• Put major conclusions into graphical form. Make legends comprehensive and informative.
• Error bars should be clearly explained.
• When logarithms of a variable are graphed, the scale label should correspond to the tick mark labels.
• Proofread graphs.
• Strive for clarity.

Scales
• Choose the range of the tick marks to include or nearly include the range of the data.
• Subject to the constraints that scales have, choose the scales so that the data fill up as much of the data region as possible.
• It is sometimes helpful to use the pair of scale lines for a variable to show two different scales.
• Choose appropriate scales when graphs are compared.
• Do not insist that zero always be included on a scale showing magnitude.
• Use a logarithmic scale when it is important to understand percent change or multiplicative factors.
• Showing data on a logarithmic scale can improve resolution.
• Use a scale break only when necessary. If a break cannot be avoided, use a full-scale break. Do not connect numerical values on two sides of a break.

General strategy
• A large amount of quantitative information can be packed into a small region.
• Graphing data should be an interactive, experimental process.
• Graph data two or more times when it is needed.
• Many useful graphs require careful, detailed study.
you should limit the number of colors shown in a visualization because use of mul-
tiple colors can be distracting. For nominal and ordinal data types, discrete color
palettes must be used. For the bar chart, for instance, research suggests the best
option would be choosing the same color for all bars except for the bar that needs the most attention, such as those falling below the cut score or those representing your particular program (Evergreen & Merzner, 2013). For interval and ratio data,
one can use both discrete and continuous color palettes, according to the data
structure and to the visualization aims.
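One simple way to apply the highlight-one-bar advice in R (a sketch with invented values, not a prescribed palette):

```r
# Muted grey for all bars, one accent color for the bar needing attention
sold <- c(Action = 8, Clown = 5, Robot = 7, Hamster = 17, Racer = 19)
cols <- rep("grey70", length(sold))
cols[which.max(sold)] <- "steelblue"   # highlight only the key category
barplot(sold, col = cols, ylab = "Units sold (in thousands)")
```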
Legibility is another very important criterion that needs to be considered. The
best practice is to use solid blocks of color and avoid fill patterns, which can create
disorienting visual effects. Further, background colors should generally be white or
have very reduced colors (Tufte, 1990) but graph text should be black or dark gray
(Wheildon, 2005). Color also needs to be legible for people with color-blindness;
red–green and yellow–blue combinations need to be avoided as those colors touch
one another on the color wheel.
It is also important to note that many colors depend on, and are constrained
by, the cultural traditions or technical experience of the user (Bianco, Gasparini, &
Schettini, 2014); in fact, there are many norms. Some are more universal. For
instance, most people associate green with positive or above-goal measurements,
while judicious use of red generally indicates peril or numbers that need improvement. Some norms are political or cultural, such as red state versus blue state. Some depend on specific domain expertise, such as “in the black” versus “in the red” (finance).
both sides of the visualization. When it corresponds to the x-axis, it can be placed
on the top, bottom, or both, but it is usually sufficient to place the scale in one place
(Few, 2012). Additionally, the numbers on the axis must follow a proper interval
in the same unit and be evenly spaced; you should not skip values when you have
numerical data.
Graphical objects inside the visualization must be sized to present ratios accurately, as a visualization that displays data as objects with disproportionate sizes can be misleading. This is true for the map visualization, where a proportional symbol
is used to display absolute values. It should be noted that it may be difficult to iden-
tify the unit that the symbol refers to when the size of the symbol is bigger than
the size of the corresponding spatial unit. In a plot, both length and position are better
quantitatively perceived than size, line width, or color hue, meaning that the data
values that they represent and how those values compare to other values are easily
determined (Cleveland & McGill, 1984). The order of graphic objects is important,
too. It is generally acknowledged that comparing numbers ordered alphabetically,
sequentially, or by value is best; more importantly, data should be displayed in an
order that makes logical sense to the viewer. When using bar or pie charts, you
should sort data from smallest to largest values, so they are easier to compare.
Consistent text layout and designs are also recommended to avoid distortion.
It is recommended that you avoid steep diagonal or vertical text type as it can be
difficult to read; lay out text horizontally. This includes titles, subtitles, annotations,
and data labels. Line labels and axis labels can deviate from this rule. Consistency
in the use of other elements, including line widths and box sizes, is also critical
unless those elements enable us to isolate all marks belonging to the same category.
Note
All the visualizations presented in this chapter used fictitious data. They were simu-
lated and created using R and Tableau.
References
Ajibade, S. S., & Adediran, A. (2016). An overview of big data visualization techniques in
data mining. International Journal of Computer Science and Information Technology
Research, 4(3), 105–113.
Bansal, K. L., & Sood, S. (2011). Data visualization: A tool of data mining. International
Journal of Computer Science and Technology, 2(3). Retrieved from http://www.ijcst.
com/vol23/1/kishorilal.pdf
Bendoly, E. (2016). Fit, bias, and enacted sense making in data visualization: Frameworks
for continuous development in operations and supply chain management analytics.
Journal of Business Logistics, 37(1), 6–17. doi:10.1111/jbl.12113.
Bianco, S., Gasparini, F., & Schettini, R. (2014). Color coding for data visualization. In
M. Khosrow-Pour (Ed.), Encyclopedia of information science and technology. Hershey,
PA: IGI Global.
Bickel, R. (2007). Multilevel analysis for applied research: It’s just regression. New York, NY:
Guilford Press.
Campbell, C. S., & Maglio, P. P. (1999). Facilitating navigation in information spaces:
Road-signs on the World Wide Web. International Journal of Human-Computer
Studies, 50(4), 309–327.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth Advanced
Books and Software.
Cleveland, W. S. (1987). Research in statistical graphics. Journal of the American Statistical Association, 82(398), 419–423.
Cleveland, W. S. (1994). The elements of graphing data (2nd ed.). Summit, NJ: Hobart Press.
Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554.
Day, R. A. (1994). How to write and publish a scientific paper. Phoenix, AZ: Oryx Press.
Dibble, E. (1997). The interpretation of tables and graphs. Seattle, WA: University of
Washington.
Evergreen, S. (2014). Presenting data effectively: Communicating your findings for maximum
impact. Thousand Oaks, CA: Sage Publications.
Evergreen, S., & Merzner, C. (2013). Design principles for data visualization in evaluation.
In T. Azzam & S. Evergreen (Eds.), Data visualization, Part 2. New directions for
evaluation (pp. 5–20). New York, NY: John Wiley & Sons.
Few, S. (2012). Show me the numbers (2nd ed.). Oakland, CA: Analytics Press.
Hintze, J. L., & Nelson, R. D. (1998). Violin plots: A box plot-density trace synergism. The
American Statistician, 52(2), 181–184. doi:10.1080/00031305.1998.10480559.
Howell, D. C. (2013). Statistical methods for psychology (8th ed.). Belmont, CA: Wadsworth
Cengage Learning.
Inselberg, A., & Dimsdale, B. (1990). Parallel coordinates: A tool for visualizing
multi-dimensional geometry. In Proceedings of the 1st conference on visualization’90
(pp. 361–378). San Francisco, CA: IEEE Computer Society.
Johansson, J., Forsell, C., Lind, M., & Cooper, M. (2008). Perceiving patterns in parallel
coordinates: Determining thresholds for identification of relationship. Information
Visualization, 7(2), 152–162.
Keim, D., Andrienko, G., Fekete, J-D., Görg, C., Kohlhammer, J., & Melançon, G. (2008).
Visual analytics: Definition, process, and challenges. In A. Kerren, J. Stasko, J-D. Fekete, &
C. North (Eds.), Information visualization (pp. 154–175). Berlin, Germany: Springer.
Khan, M., & Khan, S. S. (2011). Data and information visualization methods and interac-
tive mechanisms: A survey. International Journal of Computer Applications, 34(1), 1–14.
Kosslyn, S., Pinker, S., Simcox, W., & Parkin, L. (1983). Understanding charts and graphs:
A project in applied cognitive science (NIE-40079-0066). Washington, DC: National
Institute of Education.
Lewandowsky, S., & Spence, I. (1989). The perception of statistical graphs. Sociological
Methods and Research, 18, 200–242.
Morabito, V. (2016). The future of digital business innovation: Trends and practices. Cham, Switzerland: Springer International Publishing.
Robbins, N. (2005). Creating more effective graphs. Hoboken, NJ: Wiley-Interscience.
Sanders, M. S., & McCormick, E. J. (1987). Human factors in engineering and design
(6th ed.). New York, NY: McGraw-Hill.
Soukup, T., & Davidson, I. (2002). Visual data mining: Techniques and tools for data visual-
ization and mining. New York, NY: John Wiley & Sons.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). New York, NY: Pearson Education.
Tory, M., & Möller, T. (2004). Rethinking visualization: A high-level taxonomy.
In INFOVIS ‘04 proceedings of the IEEE symposium on information visualization
(pp. 151–158). Washington, DC: IEEE Computer Society.
Towler, W. (2015, January/February). Data visualization: The future of data visualization.
Analytics Magazine, pp. 44–51.
Tufte, E. (1983). The visual display of quantitative information (1st ed.). Cheshire, CT:
Graphic Press.
Tufte, E. (1990). Envisioning information. Cheshire, CT: Graphics Press.
Wainer, H. (1984). How to display data badly. The American Statistician, 38(2), 137–147.
Wheildon, C. (2005). Type and layout: Are you communicating or just making pretty shapes?
Mentone, Australia: The Worsley Press.
Wilkinson, L. (1999). The grammar of graphics. New York, NY: Springer-Verlag.
Yau, N. (2009). Seeing your life in data. In T. Segaran & J. Hammerbacher (Eds.), Beautiful data. Sebastopol, CA: O’Reilly.
Yeh, R. K. (2010). Visualization techniques for data mining in business context: A compara-
tive analysis. Retrieved from http://www.swdsi.org/swdsi06/proceedings06/papers/
kms04.pdf
Zoss, A. M. (2016). Designing public visualizations of library data. In L. Magnuson (Ed.),
Data visualization: A guide to visual storytelling for libraries (pp. 19–44). Lanham,
MD: Rowan & Littlefield.
Chapter 10
Analytics Using
Machine Learning-
Guided Simulations
with Application to
Healthcare Scenarios
Mahmoud Elbattah and Owen Molloy
Contents
Introduction ......................................................................................................279
Motivation ....................................................................................................... 280
Simulation Modeling and Machine Learning: Toward More Integration ...... 280
The Prospective Role of Machine Learning in Simulation Modeling .............281
Related Work ....................................................................................................282
Hybrid Simulations ......................................................................................282
Artificial Intelligence-Assisted Simulations ....................................................283
Simulation-Based Healthcare Planning .........................................................283
Background: Big Data, Analytics, and Simulation Modeling ............................ 284
Definitions of Big Data................................................................................ 284
Characteristics of Big Data........................................................................... 284
Analytics .......................................................................................................287
Simulation Modeling ....................................................................................288
What Can Big Data Add to Simulation Modeling? .......................................290
Introduction
Simulation modeling (SM) has traditionally been considered a standalone discipline that encompasses designing a model of an actual or theoretical system, executing the model on a digital computer, and analyzing the execution output (Fishwick, 1995). Based on
a virtual environment, simulation models provide extended capabilities to model
real systems in a flexible build-and-test manner. In this respect, Newell and Simon
(1959) asserted that the real power of the simulation approach is that it provides
not only a means for stating a theory, but also a very sharp criterion for testing
whether the statement is adequate. Similarly, Forrester (1968) emphasized in one of
his principles of systems that simulation-based solutions present the only feasible approach for representing the interdependence and nonlinearity of complex systems, where analytical solutions can be impossible.
The complexity of systems can be interpreted in terms of several dimensions.
One possible dimension can be attributed to the data and metadata that represent
the system knowledge. For instance, the data complexity can correspond to one or more of the four aspects that characterize the notion of Big Data: volume, velocity, variety, or veracity. In such scenarios, further burdens can be
unavoidably placed on the modeling process, which go beyond human capabilities.
Recently, the community of systems modeling and simulation has started
to consider the potential opportunities and challenges facing the development
of simulation models in an age marked by data-driven learning. For instance, a
study (Taylor et al., 2013) introduced the term “big simulation” to describe one
of the grand challenges for the simulation research community. Big simulation
is intended to address issues of scale for big data input, very large sets of coupled
simulation models, and the analysis of big data output from these simulations, all
running on a highly-distributed computing platform. Another more recent posi-
tion paper (Tolk, 2015) envisioned that the next generation of simulation models
will be integrated with machine learning (ML), and deep learning in particular.
The study argued that bringing modeling, simulation, Big Data, and deep learn-
ing all together can create a synergy delivering significantly improved services to
other sciences.
In line with that direction, the chapter endeavored to spur a discussion on how
the practice of modeling and simulation can be assisted by ML techniques. The
initial discussion focused on why and how Big Data and ML can provide further
support to simulations. Subsequently, a practical scenario is presented in relation
to healthcare planning in Ireland to demonstrate the applicability of our ideas.
First, unsupervised ML was utilized in a bid to discover potential patterns. The
knowledge learned by ML was then used to build simulation models with a higher
level of confidence. Second, simulation experiments were conducted with the guid-
ance of ML models trained to make predictions on the system’s behavior. The key
idea was to realize ML-guided simulations during the phases of model design or
experimentation.
This chapter is structured into two main parts. The first part initiates a discussion regarding the prospective integration of simulation models and ML in a broader sense. That discussion is intended to serve as an opening to the rationale underlying our approach. Starting in Section 5, the second part
provides a more practical standpoint for integrating SM and ML through a realistic
use case.
Motivation
Simulation Modeling and Machine
Learning: Toward More Integration
The fields of SM and ML are long-established in the world of computing. However,
both of them tended to be employed in separate territories with limited, if any, inte-
gration. This lack of integration might be attributed to a couple of issues.
First, the development of ML models is highly data-driven compared to SM. Simulation models have largely been developed with the aid of domain experts. Hence, subjective expert-driven knowledge can determine to a great extent the behavior of a simulation model in terms of structure, assumptions, and parameters. On the other hand, ML models can be developed with little, if any, involvement of experts.
Second, SM and ML were often considered to be addressing different types
of analytical questions. From the perspective of data analytics, ML is largely con-
cerned with predicting what is likely to happen. However, SM goes beyond that
question and addresses further questions for more complex scenarios (e.g., “What
if?” or “How to?”). Figure 10.1 illustrates the position of SM and ML within the
Figure 10.1 The spectrum of data analytics. (Adapted from Barga, R. et al.,
Predictive Analytics with Microsoft Azure Machine Learning, Apress, 2015;
Maoz, M., How IT should deepen big data analysis to support customer-centricity,
Gartner G00248980, 2013.)
landscape of data analytics. The figure classifies data analytics into four categories
with an increasing level of analytical sophistication.
[Figure image: diagram relating the steps collect, update knowledge, predict behavior, decision, and action.]
the new system states can be “learned” aided by ML models, which can in turn help
simulation models become dynamic and more realistic.
From a more practical standpoint, ML can be utilized to predict the behavior
of system variables that may not be feasible to express analytically. For example,
Zhong et al. (2016) trained ML models within a use case for crowd modeling and
simulation. The ML models were used to learn and predict the flow of crowds.
Likewise, unsupervised ML techniques (e.g., clustering) can be used to learn about
key structural characteristics of systems, especially if tackling big data scenarios.
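As a rough sketch of this idea, the snippet below clusters fictitious hospital activity data with base R's kmeans(); the variables and settings are our own assumptions, not the chapter's Irish healthcare data.

```r
# Clustering fictitious hospital activity data to learn structural
# groupings that could parameterize a simulation model
set.seed(5)
hospitals <- data.frame(
  admissions     = rnorm(100, mean = 500, sd = 150),  # invented variables
  mean_stay_days = rnorm(100, mean = 6, sd = 2)
)
clusters <- kmeans(scale(hospitals), centers = 3, nstart = 25)
table(clusters$cluster)   # sizes of the discovered structural groups
# The resulting groups (or centroids) could then seed distinct entity
# classes or parameter sets in a simulation model.
```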
In a broader sense, ML can be employed as an assistive tool to help reduce
the epistemic uncertainty (Oberkampf et al., 2002) underlying simulation models.
This kind of uncertainty can be attributed to the subjective interpretation of system
knowledge by modelers, simulationists, or subject matter experts.
Related Work
Owing to the multifaceted nature of the presented work, we believe that the study
can be viewed from different perspectives. Therefore, we reviewed studies with rel-
evance to the following: hybrid simulations, artificial intelligence (AI)-assisted
simulations, and simulation-based healthcare planning.
Hybrid Simulations
This study can be viewed from the perspective of developing hybrid simulations. As
suggested by Powell and Mustafee (2014), a hybrid modeling and simulation study
refers to the application of methods and techniques from disciplines like operations
research, systems engineering, or computer science to one or more stages of a simulation
study. Likewise, the study here attempted to integrate simulation models with a method
from the computer science discipline (i.e., ML). Viewed this way, we aimed to review
examples of hybrid studies that incorporated simulation methods with ML techniques.
To focus our search, two main sources were selected for review over the past 10 years
(i.e., 2007–2016): the Winter Simulation Conference and the ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS). It is acknowledged
that other relevant studies could have been published in other conferences or journals,
but we believe that the selected venues provided excellent, if not the best, sources of
representative studies in accordance with the target context.
One good example is Rabelo et al. (2014) who applied hybrid modeling, where
SM and ML were used altogether in a use case related to the Panama Canal opera-
tions. A set of simulation models was developed to make predictions about the
future expansion of the canal. This information was further used to develop ML
models (e.g., neural networks and support vector machines) to help with the analy-
sis of the simulation output. With a comparable hybrid approach, Elbattah and
Molloy (2016) embraced an approach that integrated SM with ML with application
to a healthcare case. The study claimed that the use of ML improved the predictive
power of the simulation model. Another example is Zhong et al. (2016) who uti-
lized ML to assist with crowd modeling and simulation. The ML models were used
to learn and predict the flow of crowds.
Rashwan et al. (2013) developed a system dynamics model that captured the dynamic flow of elderly patients in the Irish healthcare system. The model was
claimed to be useful for inspecting the outcomes of proposed policies to over-
come the delayed discharge of elderly patients. However, the literature generally
laid little emphasis on endeavors toward incorporating simulation methods and
ML techniques.
Big Data is the derivation of value from traditional relational database-driven business decision-making, augmented with new sources of unstructured data. (Dijcks, 2012)

Big Data is the term increasingly used to describe the process of applying serious computing power (the latest in ML and AI) to seriously massive and highly complex sets of information. (Microsoft, 2013)
Gartner’s 3Vs: In a white paper (Laney, 2001), Big Data was characterized as hav-
ing three main attributes (i.e., 3Vs). Regarded as the basic dimensions of Big
Data, the 3Vs can be explained as follows:
1. Volume: Most organizations are currently struggling with the increasing volumes of their data. According to an estimate by Fortune magazine (Fortune, 2012), about five exabytes of digital data were created up to 2003, whereas by 2011 the same amount of data could be created in just two days.
2. Velocity: Data velocity describes the speed at which data is created, accumu-
lated, and processed. The rapidly increasing pace of the world has placed
further demands on businesses to process information in real time or near
real time. This may mean that data should be processed on the fly or in a
streaming-based fashion to make quicker decisions (Minelli et al., 2012).
3. Variety: The variety of data represents a critical factor of the data complexity. Over the past couple of decades, data have become
increasingly unstructured as the sources of data have varied beyond the
traditional operational applications. Therefore, large-scale datasets may
likely exist in different structured, semistructured, or unstructured forms,
which can escalate the difficulty of processing tasks to a greater extent.
IBM’s 4Vs: IBM added another dimension, “veracity”, to Gartner’s 3Vs. The additional dimension was justified by the fact that IBM’s clients had started to face data-quality issues while dealing with big data problems (Zikopoulos, 2013). Hence, IBM (2017) defined the big data dimensions as volume, velocity, variety, and veracity. Further studies (Demchenko et al., 2014) added the “value” dimension to IBM’s 4Vs.
Microsoft’s 6Vs: For the purpose of maximizing business value, Microsoft
extended the big data dimensions into 6Vs (Wu et al., 2016). The 6Vs included
additional dimensions for variability, veracity, and visibility. In comparison
with variety, variability refers to the complexity of data (e.g., the number of
variables in a dataset), while visibility emphasizes the need to have a full picture of data to make informed decisions. Figure 10.4 below sum-
marizes the common dimensions of Big Data as explained before.
Figure 10.4 The common dimensions of Big Data: Gartner’s 3Vs, IBM’s 4Vs, Demchenko’s 5Vs, and Microsoft’s 6Vs.
Analytics
The opportunities enabled by Big Data led to a significant interest in the practice
of data analytics. Thus, data analytics has evolved into a vibrant and broad domain
that incorporates a diversity of techniques, technologies, systems, practices, meth-
odologies, and applications.
Similar to Big Data, various definitions were developed to describe analyt-
ics. Table 10.1 presents some common definitions used to describe analytics. In
the same context, Figure 10.5 portrays the interdisciplinarity involved within analytics.
Table 10.1 Common Definitions of Analytics

“Delivering the right decision support to the right people at the right time.” (Laursen and Thorlund, 2016)

“The scientific process of transforming data into insight for making better decisions.” (Liberatore and Luo, 2011)
Figure 10.5 The interdisciplinarity of analytics, spanning artificial intelligence (machine learning), operational research, and quantitative methods (mathematics, statistics, and econometrics).
Simulation Modeling
“When experimentation in the real system is infeasible, simulation
becomes the main, and perhaps the only, way to discover how complex
systems work.” (Sterman, 1994)
Shannon (1975) described simulation as the process of designing a model of
a real system and conducting experiments with this model for the purpose of
understanding the behavior of the system and/or evaluating various strategies for
the operation of the system. As a virtual build-and-test environment, simulation makes it feasible to model real systems that exhibit adaptive, dynamic, goal-
seeking, self-preserving, and sometimes evolutionary behavior (Meadows, 2008).
In view of the analytics spectrum as presented earlier in Figure 10.1, simula-
tion models endeavor to answer more complex questions that fall under the cat-
egory of prescriptive analytics. Therefore, SM can serve as a vital component of
data analytics. This section provides a brief background of the common simulation
approaches, and the main distinctions between them.
A simulation model is developed largely based on the “world view” of a modeler.
A world view reflects how a real world system is mapped to a simulation model. In
this respect, there are three primary approaches including: system dynamics (SD),
discrete event simulation (DES), and agent-based modeling.
The SD approach assumes a very high degree of abstraction, which can be
considered adequate for strategic modeling. On the other hand, discrete-event models maintain medium and medium–low abstraction, where a model com-
prises a set of individual entities that have particular characteristics in common.
Agent-based models are positioned in the middle, and can vary from very
fine-grained agents to highly abstract models. Figure 10.6 portrays the three
approaches with respect to the level of abstraction. Further, Table 10.2 makes a
more detailed comparison based on Brailsford and Hilton (2001), Lane (2000),
and Sweetser (1999).
Figure 10.6 Simulation modeling approaches. (From Borshchev, A., The Big
Book of Simulation Modeling: Multimethod Modeling with AnyLogic 6, AnyLogic
North America, Chicago, IL, 2013.)
In the era of Big Data, it should be considered that system knowledge will increas-
ingly become based on empirical data accumulated or generated autonomously.
Specifically, more data will be increasingly utilized to learn about systems. Therefore,
systems that deal with big data scenarios will inevitably place further burdens on the
modeling process, which can be beyond the capabilities of humans in many situa-
tions. For instance, the knowledge of a system can underlie huge amounts of data, or be accumulated at high velocity. In this regard, insightful studies such as Tolk (2015) and Tolk et al. (2015) emphasized the need for integrating SM with
Big Data. In particular, it was stressed that big data techniques and technologies
should be considered to avail of rapidly accumulating data that may be structured
or unstructured.
Case Description
Approaching any analytics problem needs an in-depth understanding of the
domain under study. Therefore, this section serves as a necessary background prior
to building the simulation or ML models.
In Ireland, the population has been experiencing a pronounced demographic
transition. The Health Service Executive (HSE) of Ireland reported in 2014 that
the increase in the number of people over 65 is approaching 20,000 per year (HSE,
2014b). As a result, population aging is expected to have significant impacts on a
broad range of economic and social areas, and on the demand for healthcare services.
Within the context of elderly care, the focus of the case centered on the care scheme of hip fracture. Hip fractures were considered as having a two-fold significance. On one hand, hip fractures represent an appropriate exam-
ple of elderly care schemes. An ample number of studies (Cooper et al., 1992;
Melton, 1996) recognized that hip fractures increase exponentially with aging,
though rates may vary from one country to another. Around 3,000 patients
per year are estimated to sustain hip fractures in Ireland (Ellanti et al., 2014).
Further, that figure may unavoidably increase due to the continuously aging
population.
From an economic perspective, hip fractures can represent a major burden on
the Irish healthcare system. According to the HSE, hip fractures were identified as
one of the most serious injuries resulting in lengthy hospital admissions and high
costs (HSE, 2014a). The median length of stay (LOS) was recorded as 13 days, and
more than two-thirds of patients are discharged to long-stay residential care after
surgery (NOCA, 2014). The cost of treating a typical hip fracture was estimated
around €12,600 (HSE, 2014a), while a different study reported a higher cost of
€14,300.
Questions of Interest
In relation to the elderly care of hip fracture, two complementary categories of
questions were addressed by the use case including: population-level questions and
patient-level questions. Table 10.3 poses the questions in detail.
Data Description
The main source of data used by the study is the Irish hip fracture database (IHFD)
(NOCA, 2017). The IHFD repository is the national clinical audit developed to cap-
ture care standards and outcomes for hip fracture patients in Ireland. Decisions to grant
access to the IHFD data are made by the National Office of Clinical Audit (NOCA).
The IHFD contains ample information about the patient’s journey from admis-
sion to discharge. Specifically, a typical patient record included 38 data fields such
as gender, age, type of fracture, date of admission, and LOS. A thorough explana-
tion of the data fields was available via the official data dictionary (HIPE, 2015).
[Figure: overview of the methodology. At the population scope, unsupervised machine learning clusters the patients, and discrete-event simulation modeling captures the clustered flows of patients; at the patient scope, supervised machine learning prediction models predict care outcomes (the LOS and discharge destination of each elderly patient). Data preprocessing, exploratory analysis, and visualization relied on ggplot2.]
The dataset included records about elderly patients aged 60 and over in particular.
The data comprised about 8,000 records over three years, from January 2013 to
December 2015. It is worth mentioning that one patient may be related to more
than one record, in cases of recurrent fractures. However, we were unfortunately
unable to determine the proportion of recurrent cases as patients had no unique
identifiers, and records were completely anonymized for the purpose of privacy.
Figure 10.10 plots a histogram of the age distribution within the dataset, while
Figure 10.11 shows the percentages of male and female patients.
Figure 10.10 Histogram of the age distribution within the dataset.

Figure 10.11 The distribution of female and male patients in the dataset.
Figure 10.12 The projected total and elderly (aged 60 and over) populations over the years 2017–2026.
Outliers Removal
As reported by NOCA (2014), the mean and median LOS for hip fracture
patients were recorded as 19 and 12.5 days, respectively. Therefore, we only considered the patients whose LOS was no longer than 60 days to avoid the undue influence of outliers. The excluded outliers represented approximately 5% of the
overall dataset. Figure 10.13 plots a histogram of the LOS used to identify the
outliers.
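In R (the language used for the study’s implementation), this filtering step could be as simple as the following sketch; the data frame and column names are illustrative assumptions rather than the study’s actual code.

```r
# Exclude outliers: keep only patients whose LOS is no longer than 60 days.
# `patients` and `LOS` are assumed names for the IHFD-based data frame.
patients <- subset(patients, LOS <= 60)
```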
Figure 10.13 Histogram and probability density of the LOS variable. The density
is visually expressed as a gradient ranging from green (low) to red (high). The
outliers can be observed for LOS longer than 60 days.
Feature Scaling
Feature scaling is a necessary preprocessing step in ML in cases where the range of feature values varies widely. Several studies, such as Visalakshi and Thangavel
(2009), and Patel and Mehta (2011) argued that large variations within the range
of feature values can affect the quality of computed clusters. Therefore, the feature
values were rescaled to a standard range.
The min-max normalization method was used, where every feature was linearly
rescaled to the [0, 1] interval. The values were transformed using the formula below
in Equation 10.1:

\[
z = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{10.1}
\]
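As a minimal R sketch of Equation 10.1 (assuming a data frame `patients` holding the numeric features), the rescaling might look as follows; this is illustrative, not the study’s published code.

```r
# Min-max normalization (Equation 10.1): linearly rescale to [0, 1].
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Rescale the numeric features later used for clustering (assumed columns).
patients[c("LOS", "age", "TTS")] <- lapply(patients[c("LOS", "age", "TTS")],
                                           min_max)
```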
Feature Extraction
In a report prepared by the British Orthopaedic Association (2007), six quality standards for hip fracture care were emphasized. Those standards generally reflect good practice at key stages of hip fracture care. The raw data did not include fields that explicitly captured such standards. However, they can be derived based on the date and time values of patient arrival, admission, and surgery. In this way, two new features were added, named time to admission (TTA) and time to surgery (TTS). Eventually, only TTS was included because TTA contained a significant amount of missing values.
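Deriving such features from timestamps is straightforward; the sketch below assumes hypothetical column names (`arrival_date`, `admission_date`, `surgery_date`) rather than the IHFD’s actual field names.

```r
# Derive the two care-related features from date/time fields (assumed names).
patients$TTA <- as.numeric(difftime(patients$admission_date,
                                    patients$arrival_date, units = "days"))
patients$TTS <- as.numeric(difftime(patients$surgery_date,
                                    patients$admission_date, units = "days"))
```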
Clustering Approach
The study embraced the partitional clustering approach using the k-means algorithm, one of the most widely used clustering algorithms. The k-means clustering uses a simple iterative technique to group points in a dataset into clusters that contain similar characteristics. Initially, a number (k) of centroids (cluster centers) is decided. The algorithm then iteratively places data points into clusters by minimizing the within-cluster sum of squares (Jain, 2010). The algorithm converges on a solution when one or more of these conditions is met: the cluster assignments no longer change, or the specified number of iterations is completed. The objective is given by Equation 10.2:
\[
J(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \tag{10.2}
\]

where \( \mu_k \) is the mean of cluster \( C_k \), and \( J(C_k) \) is the squared error between \( \mu_k \) and the points in cluster \( C_k \).
Selected Features
The k-means algorithm is applicable to numeric features only, where a distance metric (e.g., Euclidean distance) can be used for measuring similarity between data points. Therefore, we considered the numeric features only. Specifically, the model was trained using the following features: LOS, age, and TTS. However, it is worth mentioning that there are extensions of the k-means algorithm that attempt to incorporate categorical features, such as the k-modes algorithm (Huang, 1998).
Clustering Experiments
As usual, the unavoidable question when approaching a clustering task is: how many clusters (k) exist? In our case, the quality of clusters was examined using values of k ranging from 2 to 7. Table 10.4 presents the parameters used within the clustering experiments. We used the Azure ML Studio to train the clustering model.
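Although the study trained the clustering model in Azure ML Studio, the same experiments can be sketched in base R; the snippet below (under the assumptions of the earlier preprocessing steps) reproduces the WSS analysis plotted in Figure 10.14 and the PCA projection of Figure 10.15.

```r
set.seed(42)
features <- patients[c("LOS", "age", "TTS")]  # min-max-scaled features

# Train k-means for k = 2..7 and record the within-cluster sum of squares.
wss <- sapply(2:7, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})
plot(2:7, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Within clusters sum of squares")

# Project the data onto the first two principal components to inspect
# how well the suggested clusters separate.
pca <- prcomp(features)
clusters <- kmeans(features, centers = 3, nstart = 25)$cluster
plot(pca$x[, 1:2], col = clusters)
```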
Figure 10.14 The within-cluster sum of squares (WSS) against the number of clusters (k).
Initially, the quality of clusters was examined based on the within-cluster sum of squares (WSS), as plotted in Figure 10.14. In view of that, it turned out that there may be three or four well-detached clusters of patients that can best separate the dataset. Furthermore, the suggested clusters were projected into two dimensions based on principal component analysis (PCA) to determine the appropriate number of clusters, as in Figure 10.15. Each subfigure in Figure 10.15 represents the output of a single clustering experiment using a different number of clusters (k). Initially, with k = 2, the output indicated a promising tendency of clusters, where the data space is obviously separated into two big clusters.
Figure 10.15 The clusters projected onto the first two principal components, for k = 2 through k = 7 (subfigures a–f).
Similarly, for k = 3, the clusters are still well-separated. However, the quality of the clusters started to decline from k = 4 onwards. Thus, it eventually turned out that three clusters divided the dataset into coherent cohorts of patients.
Exploring Clusters
In this section, we aim to explore the discovered clusters in a visual manner that
can reveal potential correlations or insights, which can in turn assist with simulation
model design. The clusters were particularly examined with respect to patient char-
acteristics (e.g., age), care-related factors (e.g., TTS), and outcomes (e.g., discharge
destination).
In Figure 10.16a, the inpatient LOS is plotted with respect to the three patient
clusters. At first glance, it was obvious that the patients of cluster 3 experienced
remarkably longer LOS periods compared to cluster 1 or cluster 2. In addition,
cluster 1 and cluster 2 shared a very similar distribution of the LOS variable, apart
from a few outliers in cluster 2.
Figure 10.16 The variation of the LOS, TTS, and age variables within the three
patient clusters.
Second, we examined the clusters with respect to the elapsed TTS. As mentioned in Section 8.3, the TTS has a particular significance for being one of the quality standards for hip fracture care. Once more, the patients of cluster 3 were observed to have a relatively longer TTS than the patients of cluster 1 and cluster 2. As with the LOS, cluster 1 and cluster 2 had a very similar distribution of the TTS. Figure 10.16b plots the TTS variable against the three clusters of patients.
The patient age has a considerable relevance in elderly care schemes. In our
context, the possibility of sustaining hip fractures can increase significantly with
age. It turned out that cluster 2 and cluster 3 tended to have relatively older
patients than cluster 1. Figure 10.16c plots the age distribution within the
three clusters.
Furthermore, the clusters were inspected for possible gender-related patterns.
Figure 10.17 shows the proportions of male and female patients within clusters.
It can be clearly noticed that the number of female patients consistently exceeded
that of males in all clusters.
Figure 10.17 Distribution of male and female patients within the three clusters.
Figure 10.18 The stock-and-flow structure of the SD model: new male and female cases arise from the potential patient populations at gender-specific hip fracture rates, flow through in-hospital care, and are discharged home or to long-stay care, with a recurrence fraction feeding return patients back into the flow.
Assumption: The rate of hip fracture in the total population aged 60 and over was set as 407 for females and 140 for males per 100,000. Rationale: The rate was defined by Dodds et al. (2009).

Assumption: The model did not consider the scenario of patient transfer from one acute hospital to another during the treatment course. Rationale: A simplification only, assuming that the treatment course was bounded within a single acute hospital.

Assumption: The model used the same age distribution for both male and female elderly patients. Rationale: A simplification, since both distributions were only slightly different.
Hip fracture rate for elderly males: the rate of hip fracture in the total elderly male population (140 cases per 100,000).

Hip fracture rate for elderly females: the rate of hip fracture in the total elderly female population aged 60 and over (407 cases per 100,000).
1. Hip fracture rate for elderly males = 140 cases per 100,000 (Auxiliary)
2. Hip fracture rate for elderly females = 407 cases per 100,000 (Auxiliary)
3. New male cases = Hip fracture rate for elderly males × Potential male patients (Inflow)
4. New female cases = Hip fracture rate for elderly females × Potential female patients (Inflow)
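Stock-and-flow equations of this kind map naturally onto the deSolve package (Soetaert et al., 2010). The sketch below is a simplified illustration under assumed initial stocks, not the study’s actual model.

```r
library(deSolve)

parameters <- c(rate_male   = 140 / 100000,  # Auxiliary (Equation 1)
                rate_female = 407 / 100000)  # Auxiliary (Equation 2)

state <- c(potential_male   = 300000,  # assumed elderly male population
           potential_female = 350000,  # assumed elderly female population
           in_hospital      = 0)

flows <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    new_male_cases   <- rate_male * potential_male      # Inflow (Equation 3)
    new_female_cases <- rate_female * potential_female  # Inflow (Equation 4)
    list(c(-new_male_cases,
           -new_female_cases,
           new_male_cases + new_female_cases))
  })
}

# Simulate yearly steps over a ten-year horizon (2017-2026).
out <- ode(y = state, times = 0:10, func = flows, parms = parameters)
```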
The cluster visualizations presented earlier could support the rationale behind the SD model design in terms of structure and behavior as well.
Figure 10.19 The implemented SD model structure for patient cluster 2: male and female populations flow through admission, readiness for surgery, in-ward care, and discharge, governed by time-to-surgery and time-to-discharge parameters (analogous structures were built for the other clusters).
Figure 10.20 The modeled care pathway of a hip fracture patient: admission to the emergency department (ED) and then to an orthopedic ward (subject to the time to admission and time to surgery), primary surgery, geriatric and multidisciplinary assessment (including a falls assessment for patients with a fragility history), and discharge home, to a nursing home, to a non-acute hospital, or to long-stay care.
Generation of Patients
The DES model made use of the projections produced by the SD model to gener-
ate individual patient entities. The generation process was implemented using the
R language. The total number of generated patients reached around 30,000 for a
simulated period of 10 years (i.e., 2017–2026). Table 10.8 presents the counts of
elderly patients generated for every cluster.
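A hypothetical sketch of this generation step is shown below: entities are sampled per cluster according to the Table 10.8 counts, with attribute distributions that are purely illustrative (the study drew them from the IHFD data).

```r
set.seed(1)
cluster_counts <- c(cluster1 = 13438, cluster2 = 10782, cluster3 = 5272)

generate_patients <- function(cluster, n) {
  data.frame(
    cluster = cluster,
    age = round(rnorm(n, mean = 80, sd = 8)),   # illustrative distribution
    sex = sample(c("F", "M"), n, replace = TRUE,
                 prob = c(0.7, 0.3)),           # illustrative proportions
    TTS = rexp(n, rate = 0.5)                   # illustrative distribution
  )
}

generated <- do.call(rbind, Map(generate_patients,
                                names(cluster_counts), cluster_counts))
```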
Model Implementation
The DES model was mainly developed based on the empirical data acquired by
the study. For instance, the probability distributions of patient attributes were set
to mimic reality, as in the IHFD dataset. The simulation model was fully imple-
mented using the R language. The source code can be accessed via our GitHub
repository (Elbattah, 2017).
Table 10.8 Counts of Generated Patients per Cluster

Cluster 1: 13,438
Cluster 2: 10,782
Cluster 3: 5,272
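The study’s actual DES code is available in the cited repository; purely as an illustration of how such a pathway can be expressed in R, the sketch below uses the simmer package (which the study did not necessarily use) with assumed parameters.

```r
library(simmer)

# A simplified hip fracture pathway: seize a ward bed, wait for surgery,
# recover, and release the bed. All parameters are assumptions.
pathway <- trajectory("hip fracture care") %>%
  seize("ward_bed") %>%
  timeout(function() rexp(1, 1 / 2)) %>%   # time to surgery, mean 2 days
  timeout(function() rexp(1, 1 / 13)) %>%  # post-surgery stay, mean 13 days
  release("ward_bed")

env <- simmer("acute_hospital") %>%
  add_resource("ward_bed", capacity = 50) %>%   # assumed capacity
  add_generator("patient", pathway,
                function() rexp(1, 8)) %>%      # ~8 arrivals/day (~3,000/year)
  run(until = 365)
```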
Figure 10.21 The environment of simulation experiments: for each simulated patient, the DES model sends the patient’s characteristics to Azure predictive web services, where an LOS regression model and a discharge destination classifier return the predicted LOS and discharge destination.
The main entity of the simulation model represented the elderly patient. Each patient was assigned a set of attributes that characterized age, sex, area of residence, fracture type, fragility history, and diagnosis type. The patients’ char-
acteristics varied based on the cluster they were assigned to. Further care-related
factors (e.g., TTS) were considered on an individual basis as well.
The experimental environment consisted of two integrated parts. The DES
model served as the core component. In tandem with the simulation model, ML
models were then utilized to predict the inpatient LOS and discharge
destination for each elderly patient generated by the simulation model. The predic-
tions were obtained from the ML models via web services enabled by the Microsoft
Azure platform. Figure 10.21 illustrates the environment of simulation experiments
where the DES was integrated with predictions from the ML models. In this man-
ner, the ML models were employed to guide the simulation model with respect to
the LOS and discharge destination of simulation-generated elderly patients.
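Azure ML Studio models can be published as request-response web services and queried over HTTP. The sketch below shows the general shape of such a call from R using the httr package; the endpoint URL, API key, and payload structure are placeholders, not the study’s actual service definition.

```r
library(httr)

# Query a published predictive web service for one simulated patient.
predict_outcome <- function(patient, url, api_key) {
  resp <- POST(
    url,                                                 # placeholder endpoint
    add_headers(Authorization = paste("Bearer", api_key)),
    body = list(Inputs = list(input1 = list(patient))),  # assumed schema
    encode = "json"
  )
  content(resp)  # parsed response holding the prediction
}
```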
A simulation model needs to capture enough structure and behavior to depict the real system in an accurate manner. In this section, we go
through the development of supervised ML models (e.g., regression and classifica-
tion), which were utilized to guide the DES model.
Based on the IHFD patient records, ML was employed to make predictions
on important outcomes related to the patient’s journey. At the micro level, the
ML models were aimed at addressing patient entities generated by the simulation
model. The ML models included a regression model for predicting the LOS, and
a binary classifier for predicting discharge destinations. The predicted discharge
destination included either home or a long-stay care facility (e.g., nursing home).
The ML models were developed using the Azure ML Studio.
Training Data
As mentioned previously, the study acquired a dataset of the IHFD repository. The
IHFD dataset was used for training both the regression and classification models.
Initially, we explored the variables that can serve as features for training the ML
models. Based on our intuition, many irrelevant variables were simply excluded
(e.g., admission time and discharge time). Table 10.9 lists the variables initially
considered as candidate features.
Data Preprocessing
This section describes the data preprocessing phase conducted prior to training
the ML models. The preprocessing procedures included: removing outliers, scaling
features, tackling data imbalances, and extracting features. Removing outliers and
scaling features were already conducted before building the clustering model, as
elaborated in Sections 8.1 and 8.2. Therefore, this section only explains the proce-
dures performed for tackling imbalances and extracting features.
Feature Extraction
As mentioned in Section 8.3, the TTA and TTS represent important quality care-
related factors for the hip fracture care scheme. The ML models utilized the TTA and
TTS features, which were extracted earlier during the clustering model development.
[Figures: histograms of the LOS variable and of the discharge destination classes (home versus nursing home) in the training data, illustrating the class imbalance addressed during preprocessing.]
Feature Selection
The dataset initially contained 38 features; however, not all of them were rele-
vant. Intuitively irrelevant features were excluded. In addition, the most important
features were decided based on the technique of permutation feature importance
(Altmann et al., 2010). Table 10.10 presents the set of features used by both models.
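The permutation idea can be sketched generically: a feature’s importance is measured by how much model performance drops when that feature’s values are randomly shuffled. The function below is an illustrative sketch, not Azure ML Studio’s implementation.

```r
# `model` must support predict(model, newdata); `metric(y, y_hat)` returns
# a performance score (higher is better).
permutation_importance <- function(model, X, y, metric) {
  baseline <- metric(y, predict(model, X))
  sapply(names(X), function(f) {
    X_perm <- X
    X_perm[[f]] <- sample(X_perm[[f]])  # break the feature-target link
    baseline - metric(y, predict(model, X_perm))
  })
}
```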
The classification model combined an ensemble of decision trees; the final class posterior can be formed by averaging the posteriors of the individual trees:

\[
p(c \mid v) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid v)
\]

where \( p_t(c \mid v) \) denotes the posterior distribution obtained by the t-th tree.
Predictors Evaluation
The predictive models were tested using a subset from the dataset described in
Section 10.7. The randomly sampled test data represented approximately 40% of
the dataset. The prediction error of each model was estimated by applying 10-fold
cross-validation. Tables 10.12 and 10.13 present evaluation metrics of the regression
and classifier models respectively. Further, Figure 10.25 shows the Area Under the
Curve (AUC) of the classifier model.
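A minimal sketch of the k-fold procedure is given below; `fit_model` and `error_of` stand in for the study’s actual training and evaluation routines.

```r
cross_validate <- function(data, fit_model, error_of, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
  errs <- sapply(1:k, function(i) {
    fit <- fit_model(data[folds != i, ])  # train on k-1 folds
    error_of(fit, data[folds == i, ])     # test on the held-out fold
  })
  mean(errs)                              # cross-validated error estimate
}
```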
[Figure: the ensemble scheme of the classifier, where trees trained on samples of the full dataset are combined by majority voting into the final prediction.]
Table 10.13 Evaluation Metrics of the Classifier Model

Accuracy: ≈80%
Precision: ≈81%
Recall: ≈79%
F1 score: ≈80%
Figure 10.25 The ROC curve of the classifier model (true positive rate against false positive rate).
Figure 10.26 Yearly counts of simulated elderly patients within each of the three patient clusters.
Figure 10.27 Histograms and probability densities of the simulated LOS: (a) the overall LOS, and (b–d) the LOS within each of the three patient clusters.
As shown in Figure 10.27a, the majority of patients experienced an LOS in the range of 10–30 days. In this regard, the model
seemed to largely mimic reality, especially after excluding outliers. Moreover,
Figures 10.27b, c, and d show the LOS with respect to every cluster individually.
It turned out that cluster 1 and cluster 3 shared a similar distribution of the inpatient LOS, which tended to be relatively longer compared to cluster 2 patients.
Figure 10.28 Counts of simulated patients discharged home versus to long-stay care over the simulated period.
The similar LOS distribution might be due to both clusters including the same age group (i.e., aged 80–100). This underlines the importance of considering early intervention schemes for the category of more elderly patients.
Furthermore, the simulation output anticipated a significant demand for long-
stay care facilities (e.g. nursing homes) over the simulated period. As shown in
Figure 10.28, the overall number of patients discharged to long-stay care is around
22,000 compared to only 7,000 of home-discharged patients. This pronounced
difference raises an inevitable need for planning the capacity of nursing homes or
similar residential care facilities.
◾ Structure-verification test: The model structure was checked against the actual
system. Specifically, it was verified that the model structure was a reasonable
representation of reality in terms of the underlying patient clusters, and associ-
ated elderly populations.
◾ Extreme conditions test: The equations of the simulation model were tested
in extreme conditions. For example, the flows of elderly patients were set to
exceptional cases (e.g., no elderly population aged 60 or over).
◾ Parameter-verification test: The model parameters and their numerical values
were inspected to check whether they largely corresponded to reality. Specifically, the prob-
ability distributions of patient attributes (e.g. age, sex, and fracture types)
were compared against those derived from the IHFD dataset.
Model Validation
According to Law (2008), the most definitive test of a simulation model’s validity
is comparing its outputs to the actual system. Similarly, we used the distribution of
discharge destinations as a measure of the approximation between the simulation
model and the actual healthcare system.
On one hand, Figure 10.29 provides a histogram-based comparison between
the actual system and the simulation model regarding the discharge destination.
The comparison showed that the distributions of the actual system and simula-
tion output were largely similar. However, the comparison revealed that the model
slightly underestimated and overestimated the proportion of patients discharged to
homes and long-stay care facilities respectively.
Future Directions
A further direction is to consider more sophisticated ML techniques to extract the
knowledge underlying more complex systems or scenarios. For example, it would be
interesting to investigate how simulation models can be integrated with deep learning
(DL). DL (LeCun et al., 2015) significantly helped approach hard ML problems such
Figure 10.29 Histograms of the discharge destination output from the actual
system and simulation model.
Figure 10.30 Adapting to the new system states via predictions from a trained
DNN.
as speech recognition and visual object recognition. The capabilities of
DL allow computational models that are composed of multiple processing layers to
learn representations of data with multiple levels of abstraction. The multiple process-
ing layers can effectively represent linear and nonlinear transformations.
We conceive that simulation models can be assisted by predictions from deep
neural networks (DNN) trained to capture the system knowledge in a mostly auto-
mated manner. For instance, Figure 10.30 illustrates a simulation model where a
variable (i.e., predicted variable) is input to the model. That variable can be pre-
dicted using a DNN, which was trained to capture the new system state. In this
way, the DNN can be continuously retrained as new data arrive that echo new system states or conditions.
Study Limitations
A set of limitations should be acknowledged as follows:
◾ More big data-oriented scenarios could better demonstrate the potential of integrating SM and ML.
◾ The clustering of patients was based on a purely data-driven perspective. Adding a clinical viewpoint (e.g., diagnosis, procedures) may group patients differently.
Conclusions
The integration of simulation modeling and ML can help address more com-
plex questions and scenarios of analytics. We believe that the present work con-
tributes in this direction. The study can serve as a useful example of how ML
can assist the practice of modeling and simulation at different stages of model
development.
From a practical standpoint, the study also attempted to deliver useful insights in
relation to healthcare planning in Ireland, with a particular focus on hip fracture
care. The insights were provided based on a set of simulation models along with
ML predictions. At the population level, simulation models were used to mimic the
flow of patients, and the care journey, while ML provided accurate predictions of
care outcomes at the patient level.
References
Altmann, A., Toloşi, L., Sander, O., and Lengauer, T. (2010). Permutation importance:
A corrected feature importance measure. Bioinformatics, 26(10), 1340–1347.
Barga, R., Fontama, V., Tok, W. H., and Cabrera-Cordon, L. (2015). Predictive Analytics
with Microsoft Azure Machine Learning. New York: Apress.
Beyer, M. A., and Laney, D. (2012). The Importance of “Big Data”: A Definition. Stamford,
CT: Gartner, pp. 2014–2018.
Borshchev, A. (2013). The Big Book of Simulation Modeling: Multimethod Modeling with
AnyLogic 6. Chicago, IL: AnyLogic North America.
Brailsford, S. C., and Hilton, N. A. (2001). A comparison of discrete event simulation
and system dynamics for modelling health care systems. In Proceedings of ORAHS,
Glasgow, Scotland, pp. 18–39.
Brasel, K. J., Lim, H. J., Nirula, R., and Weigelt, J. A. (2007). Length of stay: An appropri-
ate quality measure? Archives of Surgery, 142(5), 461–466.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
British Orthopaedic Association. (2007). The Care of Patients with Fragility Fracture.
London, UK: British Orthopaedic Association, pp. 8–11.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. (2012). A review
on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based
approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications
and Reviews), 42(4), 463–484.
Gandomi, A., and Haider, M. (2015). Beyond the hype: Big data concepts, methods, and
analytics. International Journal of Information Management, 35(2), 137–144.
Harper, P. R., and Shahani, A. K. (2002). Modelling for the planning and management of
bed capacities in hospitals. Journal of the Operational Research Society, 53(1), 11–18.
HIPE. (2015). Retrieved from http://www.hpo.ie/hipe/hipe_data_dictionary/HIPE_
Data_Dictionary_2015_V7.0.pdf.
HSE. (2014a). Retrieved from http://www.hse.ie/eng/services/publications/olderpeople/
Executive_Summary_-_Strategy_to_Prevent_Falls_and_Fractures_in_Ireland%E2%
80%99s_Ageing_Population.pdf.
HSE. (2014b). Annual Report and Financial Statements 2014, Dublin, Ireland: Health
Service Executive (HSE).
Huang, Y. (2013). Automated simulation model generation. Doctoral dissertation, TU
Delft, Delft University of Technology.
Huang, Z. (1998). Extensions to the K-means algorithm for clustering large data sets with
categorical values. Data Mining and Knowledge Discovery, 2(3), 283–304.
IBM. (2017). Retrieved January 15, 2017, from http://www.ibmbigdatahub.com/
infographic/four-vs-big-data.
Intel. (2013). Peer Research report: Big data analytics. Retrieved February 10, 2017, from
http://www.intel.com/content/www/us/en/big.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters,
31(8), 651–666.
Japkowicz, N., and Stephen, S. (2002). The class imbalance problem: A systematic study.
Intelligent Data Analysis, 6(5), 429–449.
Johansen, A., Wakeman, R., Boulton, C., Plant, F., Roberts, J., and Williams, A. (2013).
National Hip Fracture Database: National Report 2013. London, UK: Royal College
of Physicians.
Kuipers, B. (1986). Qualitative simulation. Artificial Intelligence, 29(3), 289–338.
Lane, D. C. (2000). You Just Don’t Understand Me: Modes of Failure and Success in the
Discourse Between System Dynamics and Discrete Event Simulation. London, UK: LSE
OR Department Working Paper LSEOR 00–34, London School of Economics and
Political Science.
Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity and Variety.
Stamford, CT: META Group Research Note, 6, 70.
Lattner, A. D., Bogon, T., Lorion, Y., and Timm, I. J. (2010). A knowledge-based approach to
automated simulation model adaptation. In Proceedings of the 2010 Spring Simulation
Multiconference. Orlando, FL: Society for Computer Simulation International, p. 153.
Laursen, G. H., and Thorlund, J. (2016). Business Analytics for Managers: Taking Business
Intelligence beyond Reporting. Hoboken, NJ: John Wiley & Sons.
Law, A. M. (2008). How to build valid and credible simulation models. In Proceedings of the 40th
Conference on Winter Simulation. Miami, FL: Winter Simulation Conference, pp. 39–47.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553),
436–444.
Liberatore, M., and Luo, W. (2011). INFORMS and the analytics movement: The view of
the membership. Interfaces, 41(6), 578–589.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. H.
(2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.
Washington, DC: McKinsey Global Institute.
Maoz, M. (2013). How IT should deepen big data analysis to support customer-centricity.
Gartner G00248980.
Marshall, A. H., and McClean, S. I. (2003). Conditional phase-type distributions for
modelling patient length of stay in hospital. International Transactions in Operational
Research, 10(6), 565–576.
Martis, M. S. (2006). Validation of simulation based models: A theoretical outlook.
The Electronic Journal of Business Research Methods, 4(1), 39–46.
Mashey, J. R. (1997). Big data and the next wave of InfraStress. In Computer Science
Division Seminar. Berkeley, CA: University of California.
Meadows, D. H. (2008). Thinking in Systems: A Primer. White River Junction, VT: Chelsea
Green Publishing.
Melton, L. J. (1996). Epidemiology of hip fractures: Implications of the exponential increase
with age. Bone, 18(3), S121–S125.
Microsoft. (2013). The big bang: How the big data explosion is changing the world—
Microsoft UK enterprise insights blog-site home-MSDN blogs. Retrieved February 2,
2017, from http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/04/15/
big-bang-how-the-big-data-explosion-is-changing-theworld.aspx.
Minelli, M., Chambers, M., and Dhiraj, A. (2012). Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today’s Businesses. New York: John Wiley & Sons.
Mortenson, M. J., Doherty, N. F., and Robinson, S. (2015). Operational research from
taylorism to terabytes: A research agenda for the analytics age. European Journal of
Operational Research, 241(3), 583–595.
Newell, A., and Simon, H. A. (1959). The Simulation of Human Thought. Current Trends in
Psychological Theory. Pittsburgh, PA: University of Pittsburgh Press.
NIST. (2015). NIST Special Publication 1500-1, NIST Big Data Interoperability Framework (NBDIF): Volume 1: Definitions. Gaithersburg, MD: National Institute of Standards and Technology.
NOCA. (2014). Irish hip fracture database national report 2014. National Office of Clinical
Audit (NOCA), Dublin, Ireland.
NOCA. (2017). Irish hip fracture database. Retrieved from https://www.noca.ie/
irish-hip-fracture-database.
O’Keefe, G. E., Jurkovich, G. J., and Maier, R. V. (1999). Defining excess resource utiliza-
tion and identifying associated factors for trauma victims. Journal of Trauma and
Acute Care Surgery, 46(3), 473–478.
Oberkampf, W. L., DeLand, S. M., Rutherford, B. M., Diegert, K. V., and Alvin, K. F.
(2002). Error and uncertainty in modeling and simulation. Reliability Engineering &
System Safety, 75(3), 333–357.
Oxford Dictionary. (2017). Retrieved from https://en.oxforddictionaries.com/definition/
big_data.
Patel, V. R., and Mehta, R. G. (2011). Impact of outlier removal and normalization approach
in modified K-Means clustering algorithm. IJCSI International Journal of Computer
Science Issues, 8(5), 331–336.
Powell, J., and Mustafee, N. (2014). Soft OR Approaches in problem formulation stage of
a hybrid M&S study. In Proceedings of the 2014 Winter Simulation Conference. IEEE
Press, pp. 1664–1675.
Rabelo, L., Cruz, L., Bhide, S., Joledo, O., Pastrana, J., and Xanthopoulos, P. (2014).
Analysis of the expansion of the panama canal using simulation modeling and artifi-
cial intelligence. In Proceedings of the 2014 Winter Simulation Conference. IEEE Press,
pp. 910–921.
Rashwan, W., Ragab, M., Abo-Hamad, W., and Arisha, A. (2013). Evaluating policy inter-
ventions for delayed discharge: A system dynamics approach. In Proceedings of the
2013 Winter Simulation Conference. IEEE Press, pp. 2463–2474.
Rindler, A., McLowry, S., and Hillard, R. (2013). Big Data Definition. MIKE2.0, The Open Source Methodology for Information Development. Retrieved from http://mike2.openmethodology.org/wiki/Big_Data_Definition.
Rosenberg, M., and Everitt, J. (2001). Planning for aging populations: Inside or outside the
walls. Progress in Planning, 56(3), 119–168.
Shannon, R. E. (1975). Systems Simulation: The Art and Science. Englewood Cliffs, NJ:
Prentice-Hall.
Shreckengost, R. C. (1985). Dynamic simulation models: How valid are they? In Beatrice,
A. R., Nicholas, J. K., and Louise, G. R. (Eds.), Self-Report Methods of Estimating
Drug Use: Current Challenges to Validity. Rockville, MD: National Institute on Drug
Abuse Research Monograph, 57, pp. 63–70.
Soetaert, K. E. R., Petzoldt, T., and Setzer, R. W. (2010). Solving differential equations in R: Package deSolve. Journal of Statistical Software, 33(9), 1–25.
Stanislaw, H. (1986). Tests of computer simulation validity: What do they measure?
Simulation & Games, 17(2), 173–191.
Sterman, J. D. (1994). Learning in and about Complex Systems. System Dynamics Review,
10(2–3), 291–330.
Sun, Y., Wong, A. K., and Kamel, M. S. (2009). Classification of imbalanced data: A review.
International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687–719.
Sweetser, A. (1999). A comparison of system dynamics (SD) and discrete event simulation (DES). In Proceedings of the 17th International Conference of the System Dynamics Society, Wellington, New Zealand, pp. 20–23.
Taylor, S. J., Khan, A., Morse, K. L., Tolk, A., Yilmaz, L., and Zander, J. (2013). Grand
challenges on the theory of modeling and simulation. In Proceedings of the Symposium
on Theory of Modeling & Simulation-DEVS Integrative M&S Symposium. San Diego,
CA: International Society for Computer Simulation.
Tolk, A. (2015). The next generation of modeling & simulation: Integrating big data and
deep learning. In Proceedings of the Conference on Summer Computer Simulation.
Chicago, IL: International Society for Computer Simulation.
Tolk, A., Balci, O., Combs, C. D., Fujimoto, R., Macal, C. M., Nelson, B. L., and Zimmerman,
P. (2015). Do we need a national research agenda for modeling and simulation? In
Proceedings of the 2015 Winter Simulation Conference. IEEE Press, pp. 2571–2585.
UK Government. (2014). Retrieved January 4, 2017, from https://www.gov.uk/govern-
ment/uploads/system/uploads/attachment_data/file/389095/Horizon_Scanning_-_
Emerging_Technologies_Big_Data_report_1.pdf.
Visalakshi, N. K., and Thangavel, K. (2009). Impact of normalization in distributed
K-means clustering. International Journal of Soft Computing, 4(4), 168–172.
Ward, J. S., and Barker, A. (2013). Undefined by data: A survey of big data definitions.
arXiv preprint arXiv:1309.5821.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. New York: Springer
Science & Business Media.
Wu, C., Buyya, R., and Ramamohanarao, K. (2016). Big data analytics = machine learning + cloud computing. arXiv preprint arXiv:1601.03115.
Yang, Q., and Wu, X. (2006). 10 Challenging problems in data mining research.
International Journal of Information Technology & Decision Making, 5(4), 597–604.
Zhong, J., Cai, W., Luo, L., and Zhao, M. (2016). Learning behavior patterns from video
for agent-based crowd modeling and simulation. Autonomous Agents and Multi-Agent
Systems, 30(5), 990–1019.
Zikopoulos, P. C., deRoos, D., Parasuraman, K., Deutsch, T., Corrigan, D., and Giles, J. (2013). Harness the Power of Big Data: The IBM Big Data Platform. New York: McGraw-Hill.
Chapter 11
Intangible Dynamics:
Knowledge Assets in the
Context of Big Data and
Business Intelligence
G. Scott Erickson and Helen N. Rothberg
Contents
Introduction ......................................................................................................326
A Wider View of Intangibles .............................................................................326
Big Data and Business Analytics/Intelligence .....................................................328
Reimagining the Intangibles Hierarchy .............................................................329
Assessment of the Intelligence Hierarchy in Organizations ................................333
Measuring Intangible Asset Scenarios ................................................................336
Intangible Assets and Metrics: Illustrative Applications..................................... 340
Healthcare ............................................................................................... 340
Financial Services..................................................................................... 344
Automated Driving ..................................................................................347
Conclusions ...................................................................................................... 351
References .........................................................................................................352
Introduction
The explosive growth of interest in Big Data and analytics has caught the attention
of knowledge management (KM) researchers and practitioners. While some over-
laps are clear between the fields, previous interest in data or information from the
KM field was rather limited. Over the development of the KM discipline, highly
valued knowledge was explicitly differentiated from purportedly uninteresting
data. That perspective is changing.
This chapter will take an updated look at the nature of intangible assets, not
just knowledge but related intangibles including data, information, and wisdom or
intelligence. In clarifying the similarities and differences, we believe both practitio-
ners and scholars can more effectively understand these potentially valuable assets,
providing a greater opportunity to exploit them for competitive advantage. A struc-
tured understanding of all intangible assets, and their inter-relationships, can lead
to better applications and results.
After a review of existing theory concerning types of intangibles and approaches
for exploiting or managing them, we will turn to some data-driven examples of
how a fuller understanding of the combined disciplines can help with strategic
decision-making. In particular, we will consider how a firm’s intangibles compe-
tencies translate into success in its industry sector as well as other directions for
potential growth.
In both fields, knowledge has been treated as the only really, truly unique resource to be found given the ubiquity of basic labor, capital, information technology (IT), and so on, in today’s world. The only unique thing most firms have going for them is to be found in the heads of their key employees.
Consequently, the fields of KM and IC are fairly established in terms of defin-
ing the nature of knowledge assets, how they vary, and what to do with them. But
over the past few years, the rapid growth of interest in big data applications has led
to questions of how such matters relate to knowledge, particularly for KM, which has been dismissive of the value of “less-developed” data and information. Is there a
means to reconcile the disciplines and include the full range of intangible assets in
our discussions?
Kurtz and Snowden divide the world into sectors based on the dimensions of
centralization and hierarchy versus connections and networks (“meshwork” in one
extension). In this conceptualization, do intangibles flow into the center (poten-
tially redistributable), across individuals without going to the center, both, or nei-
ther? The different scenarios both help to explain how intangibles might be best
understood and used most effectively. The basic ideas, including their potential
contributions and applications, can be explained as follows:
Strategic, tactical, or innovative insights are more common. This environment not only
fits C-suite executives but also the big data analysts discovering unexpected
insights buried in the data lakes available to data-heavy organizations. Much
like R&D innovators or charismatic leaders, what they do may be difficult
to understand, let alone teach someone else to do. The results of their efforts
might be perfectly understandable but how they get to that is opaque and
potentially unlearnable. This domain characterizes intelligence, the ability to
draw new insights by analyzing a variety of intangible assets.
This domain comes full circle to a non-bureaucratic approach to learn-
ing and creating intangibles of value to the organization. Kurtz now sug-
gests intricate patterns (still liable to unexpectedly change) based on simple
operations. She uses the example of a government project looking to identify
weak signals in public information and news reports to predict a future envi-
ronment, specifically noting how hard it was to teach participants to think
outside their normal frames of reference. In chaos, it is the unexpected insight
or spark of creativity that is critical, the different perspective, and while that
solution might be communicable or teachable, the process to get to it might
not be. Verdun uses the term improvisation and does not really recommend
any structure, suggesting any and all connections may be useful. Kurtz would
note the lack of both central connections and constituent connections. As
just noted, this is where Snowden’s emphasis on stories instead of rational lin-
guistic explanation make sense as the learning, if possible, might take place
subconsciously rather than consciously. And, once again, Brown and Boudes’
emphasis on diverse perspectives, outside-the-network individual thinking,
and non-conformity would resonate for organizations trying to operate and
create or manage intangibles in such environments.
Following the logic, a new hierarchy evolves (Rothberg and Erickson, 2017; Simard,
2014). This time, the flow starts with data and information, moves through explicit
knowledge, tacit knowledge, and on to insight and intelligence. Per Kurtz’ (2009)
own commentary on the framework, this hierarchy does not imply differences in
value; the higher levels of the hierarchy are not necessarily better or worse than
lower levels. But there are differences in environments, and the hierarchy helps
with understanding how to manage under the different circumstances. Given the
new circumstances engendered by Big Data and business analytics and intelligence,
included now at opposite ends of this intangible asset hierarchy, we believe this
structure helps to both explain the nature of the different environments with differ-
ent intangibles present while also suggesting guidelines for handling them.
If the system is only designed to manage Big Data (“known” scenario), IT han-
dling the exchange of data and information is enough. As noted, this may include
dashboards to track metrics of high interest or established algorithms to react to
results not at expected levels. But other than the way they are organized, these
intangibles do not really require further analysis or more learning to have an impact.
◾ High levels of Big Data, knowledge, and intelligence suggest that all of the
intangibles are present and important. Industries with these sorts of metrics
include pharmaceuticals, software, and similar environments that not only
require substantial human capital but also efficient and high-quality pro-
cesses (structural capital) and effective, data- and relationship-driven mar-
keting (relational capital). Pharmaceuticals, for example, employs Big Data
in its operations, distribution, and marketing and sales; explicit knowledge in
its operational processes and sales; but more tacit knowledge and intelligence
in its labs and in analytics explorations to improve more strategic and tactical
approaches throughout the firm (R&D, production, distribution channels,
marketing, and consumer experience).
◾ High levels of Big Data and evidence of intelligence but low knowledge
metrics suggest that Big Data and analytics and insight (perhaps some tacit
knowledge bleeding in as well) are important, but explicit knowledge is not.
Big Data is available and value can come from conducting deeper analysis (data
mining, predictive analytics) on it. But the ability to discover such insights is
rare and difficult to share with others. Operations, transactions, and market-
ing are routine (structural capital and relational capital) and human capital is
found in the rare analyst or team who can find something new in the data.
Financial services are a typical industry, awash in data and aggressive competi-
tive intelligence but very low knowledge metrics compared to other industries.
◾ Moderate to high levels of big data and high knowledge metrics but little
intelligence activity suggest that data and knowledge are involved in scale and
efficiency, making processes of all kinds run better but that there is little in
the way of new insights or innovation. Particularly notable is that we believe
these types of industries have considerable explicit knowledge but probably
less tacit. What is known can be and is widely shared. Human capital (supply
chain, operations, and transactions), structural capital (in the same areas),
and relational capital (customer relationships) are all important but gener-
ally operational rather than strategic. Characteristic industries are consumer
goods and retail. Big brands (high relational capital) and operational size and
efficiency (human and structural capital) are key competitive elements.
◾ Moderate to low levels of big data, low knowledge, and low intelligence
(very little of anything of value). What data and knowledge (explicit) there
is should be supported, but operational, transactional, and marketing activi-
ties are well-understood and have little to be discovered that might improve
them. Much of the IC is probably structural or relational, investing in high
human capital employees never hurts but does not necessarily generate a great
return either, so cost and benefit should be considered. These are usually
mature manufacturing industries or regulated industries like utilities.
Again, the main idea in this chapter is not to recreate the existing logic and evi-
dence behind these distinctions but to introduce them as evidence of the wide range
of circumstances facing decision makers thinking about how to manage all these
intangibles. Intangible dynamics are situational and call for different approaches.
Circumstances may even call for different metrics and some of the all-encompassing
measurement systems discussed in the literature may need to be broken back down
to lower levels to really help. But, as a first pass, we already have some established
metrics that allow an evaluation of different industries and individual firms’ com-
petitiveness within them.
Big Data can be measured directly by looking at relative data holdings in dif-
ferent industries. One often-cited report was done by McKinsey Global Services
(Manyika et al., 2011). The industry categories are fairly broad but the data provide
a good idea of where Big Data is prevalent (financial services, communications and
media, discrete and process manufacturing, utilities) and where it is not (construc-
tion, consumer and recreation services, professional services, healthcare providers).
Moreover, the study adds some details about the nature of the data and how it
might be used in these industries.
KM metrics, of course, remain an area of considerable discourse and lack of
consensus (Firer and Williams, 2003; Tan et al., 2007). Sveiby (2010) noted over
40 different approaches that have been applied by researchers or practitioners, spe-
cifically categorizing them by level of measurement (entire organization or indi-
viduals) and method (financial statements to survey instruments). In this chapter,
as we try to evaluate multiple firms in multiple industries; the low-level, individual
employee approaches do not really work. What has been used and what works
in cross-industry comparisons are objective financial results, even if there can be
issues with trying to apply them too precisely. Specifically, a modified Tobin’s q can
be used (market capitalization to book value or market capitalization to assets) to
make broad distinctions between industries and even individual firms (Erickson
and Rothberg, 2012; Tobin and Brainard, 1977). As noted above, however, we’ve
come to the conclusion that a high Tobin’s q, all by itself, probably only identifies
explicit knowledge. Tacit knowledge does not scale enough to show up reliably in
such metrics though it may also be present. But we can employ another metric that
might flag it when used in combination with Tobin’s q.
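To make the metric concrete, here is a minimal sketch of how the two modified Tobin's q variants might be computed; the firms and figures below are entirely hypothetical, and the calculation is an illustration rather than the published procedure.

```python
# A minimal sketch of the two modified Tobin's q variants: market capitalization
# to book value and market capitalization to total assets. Figures are invented.

def tobins_q(market_cap, book_value, total_assets):
    """Return the (cap/book, cap/assets) pair for one firm."""
    return market_cap / book_value, market_cap / total_assets

# Hypothetical firms, values in $ billions.
firms = {
    "software_firm": (300.0, 60.0, 100.0),  # intangible-heavy balance sheet
    "manufacturer": (40.0, 35.0, 80.0),     # tangible-heavy balance sheet
}

for name, (cap, book, assets) in firms.items():
    cap_book, cap_assets = tobins_q(cap, book, assets)
    print(f"{name}: cap/book = {cap_book:.2f}, cap/assets = {cap_assets:.2f}")
```

On the reading developed in this chapter, a ratio well above sector norms flags (mainly explicit) knowledge assets; the intelligence metric discussed next is needed before inferring tacit knowledge or insight capability.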
Intelligence, as defined in this chapter, signifies organizations seeking unique,
creative insights from analytical techniques. Again, this may be highly individual
and impossible to teach. Or it may be from a form of intelligent learning organiza-
tion established to funnel inputs and provide room for learning and interchange
between dedicated analysts with different perspectives. This capability can be
particularly hard to identify, but we’ve found that organizations with an intelli-
gence capacity in one area (e.g., R&D laboratories, competitive intelligence) know
something about how to do it and so can extend the capability to other areas.
So, again, an established metric exists that identifies competitive intelligence (CI) operations in organizations as a proxy for the ability to establish an analytics capability, drawing new insights or creativity from intangible inputs (Erickson
and Rothberg, 2012).
This particular dataset is based on two pieces of data used to construct an index.
The number of participants from firms within an industry sector is one input. An
industry sector with eighteen survey respondents is a quite different environment
than one with one or none. Further, each survey asks respondents to report on the
professionalism of their CI operation, from just starting to highly proficient. Again,
this indicator shows considerable differences between inexperienced, perhaps one-
person initiatives versus large, seasoned teams, even if only one respondent was
included from each firm. Combined, the index presented gives a sense of both
quantity and quality of CI efforts in a given industry sector. As discussed earlier, if
a firm is capable of intelligence activity in one area (competitive intelligence), it has
the ability to practice it in others (marketing intelligence and analytics, business
intelligence and analytics).
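The published construction of the index is not reproduced here, but the following sketch shows one plausible way the two inputs (respondent counts per sector and self-reported professionalism ratings) could be combined; the normalization and the equal weighting are illustrative assumptions, not the authors' method.

```python
# A hedged sketch of a quantity-plus-quality CI index. Professionalism is
# assumed to be rated 1 (just starting) to 5 (highly proficient); the equal
# weights below are an assumption made for illustration.

def ci_index(n_respondents, professionalism, max_respondents, max_score=5):
    quantity = n_respondents / max_respondents                  # scaled 0..1
    quality = (sum(professionalism) / len(professionalism)) / max_score
    return 0.5 * quantity + 0.5 * quality                       # assumed weights

sectors = {
    "financial_services": (18, [4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4, 4, 3, 5, 4, 4, 5]),
    "utilities": (2, [2, 1]),
}
max_n = max(n for n, _ in sectors.values())

for sector, (n, scores) in sectors.items():
    print(f"{sector}: CI index = {ci_index(n, scores, max_n):.2f}")
```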
Interpretation of the indicator, as with others, depends on circumstances and
what other intangibles seem to be present. Just the intelligence metric, with no
other intangibles (very rarely seen) or with Big Data (common in identifiable sec-
tors), indicates an insight or analytic capability. Firms in these sectors are able to
create new insights, particularly when analyzing big data inputs. Such capabilities
are often unique to a particularly gifted individual or team within the firm and,
unlike knowledge assets, can be extremely difficult if not impossible to teach to
others. The results of the insight are replicable but the process to get there is associ-
ated only with the individual or team generating it. Such circumstances are seen
when firms have gifted leaders or creative talent, the situations when “stars” are
uniquely valuable and need to be managed as a rare asset in and of themselves.
Alternatively, when the intelligence indicator is present with a high Tobin’s q
or knowledge metric, tacit knowledge is also present. Again, unique insight comes
from individuals, as indicated by the intelligence result. But at least some of these learnings occur more often and in circumstances that make them scalable. That doesn’t mean intelligence itself isn’t available as well,
just that tacit and explicit knowledge are apparent, too. Pharmaceutical companies,
for example, typically have all the intangibles: Big Data, both types of knowledge,
and intelligence. They have huge amounts of data from operations, their distribution
channels, and their customers (providers, retailers, insurers, and end consumers).
They also have explicit and tacit knowledge on how to make their processes better,
their marketing relationships better, and their labs run better. But they still need
individual insight as well, in the creativity and brilliance of their researchers as well
as for occasional strategic concepts or new marketing directions.
In the end, we can identify the different circumstances where different intan-
gibles are more or less prominent in an industry sector. We can spot industries
using only Big Data, transferring and monitoring it to support operational and
marketing efficiency. We can spot industries where explicit knowledge is critical,
where employees learn to make improvements in logistics, operations, marketing,
and any number of other functions of the firm. These learnings can then be scaled
up for greater impact through appropriate KM systems. We can spot industries
where tacit knowledge and intelligence are also present, either by themselves (more
intelligence-oriented) or with explicit knowledge (indicating both tacit knowledge
and intelligence).
As a consequence, decision makers gain a better appreciation of what intangi-
bles contribute to competitive advantage in such sectors and how they might stack
up against rival firms. The indicators also provide guidance as to where to focus
attention in order to take a deeper look. So explicit knowledge might be indicated,
for example, but what is the nature of it? Is it in manufacturing goods or service
delivery? Is it in distribution efficiency? Is it in close customer relationships or deep
customer knowledge? By knowing where to look and having a rough idea what to
look for, strategists stand a better chance of truly understanding what intangibles
are necessary and how they need to be managed.
Healthcare
Previous work drew on the metrics of the entire healthcare sector. As just noted,
a report from Bain (Eliades et al., 2012) already defined sectors that might be
included in a healthcare profit pool. And an earlier study (Erickson and Rothberg,
2013a) already gathered and presented some of the intangibles metrics we have
been discussing. These are shown in Table 11.2, along with updated knowledge
data from a new database.
One of the interesting things about healthcare is that it includes just about all the different combinations of intangible assets and the domains they represent,
starkly showing how this methodology is capable of distinguishing differences
between the sectors. Further, because the sectors are so different in what they do,
the activities behind the metrics and the competitive considerations they pose are
fairly clear, even at the very general levels we’ll be discussing in this chapter. Finally,
insurance has such unique readings compared to other sectors that its place in the
industry framework is also a straightforward matter to discuss.
As noted in the table, several industry sectors have relatively high levels of all
the intangibles. These include pharmaceuticals, instruments, and diagnostics. They
have highly complex, even chaotic environments demanding accumulation of big
data on research processes, including clinical trials, on highly scrutinized opera-
tional processes, and on marketing relationships (complicated distribution chains, third-party payors, and end consumers). Their processes and marketing need to be highly efficient and often meet rigorous quality standards, explaining the
presence of substantial explicit knowledge as well as some tacit knowledge related
to more dramatic improvements. And they still require a steady stream of creative
insights, especially in terms of new products but also including new strategic direc-
tions concerning targeted segments (organizational and consumer), distribution,
marketing, and responses to competitor initiatives. Consequently, ample evidence
exists of tacit knowledge and intelligence.
Alternatively, some sectors show very little of anything at all. Hospitals and
other providers have low levels of big data holdings (though these will likely rise as
the mandate to digitize records takes hold) and there is little evidence of their doing
anything with them. Tobin’s q levels are well under average, so not much is going
on with knowledge development, particularly explicit, and intelligence activity
appears to be virtually nonexistent as well. As pressure builds for U.S. providers to
become more efficient, the data may feed into improvements in explicit knowledge
levels and efficiency, but that is also hard to do when operations are not necessarily
repeatable or systematized (each patient may have their own unique situation and
personal attention is hard to develop knowledge around). Tacit knowledge is likely
present with so many highly skilled workers but, again, it may be hard to share or to exploit, at least to any degree that would show up in the metrics.
Some sectors show considerable explicit knowledge (and Big Data) but no real
evidence of intelligence activity. As discussed, that likely also means limited tacit
knowledge in spite of the high knowledge metric. The prominent example of this
is the retailers. Along with wholesalers, they generate considerable amounts of data
from their supply chains, their in-store transactions, and their marketing relation-
ships, especially those through loyalty programs and close connections to consum-
ers. Further, those that have added pharmacy benefits management (PBM) to their
activities are developing considerable databases that can draw consumers even
closer. Much of their attention is on continuous improvements in efficiency, run-
ning ever tighter supply chains and retail operations and ever better relationships
with consumers, physicians, and insurers paying the bills (explicit knowledge).
While everyone likes a new, insightful idea, their businesses aren’t predicated on
creativity or the types of big steps forward promised by intelligence or even tacit
knowledge.
And, finally, we come to the insurers. Their metrics are indicative of sectors
with a lot of data, evidence of considerable intelligence work, but very low knowl-
edge metrics. This is one of those applications where it is useful to apply both ver-
sions of Tobin’s q as the financial sector has considerable tangible assets, far beyond
what might be seen in other sectors, and so might show artificial differences in the
metric not reflected in actual practice. That would be true of the cap/asset metric, but the cap/book metric takes into account the liabilities associated with the massive tangible assets, much of which are borrowed in one form or another. So the fact that both
metrics show very low KM activity or success suggests that the sector really is low
when compared to other sectors.
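A small worked illustration, with invented figures, shows why checking both versions matters for an asset-heavy insurer:

```python
# Illustrative only: an insurer whose large client-related financial holdings
# are largely offset by matching liabilities. Figures in $ billions, invented.
market_cap, total_assets, liabilities = 30.0, 400.0, 370.0
book_value = total_assets - liabilities               # 30.0

print(f"cap/assets = {market_cap / total_assets:.2f}")   # 0.08: deflated by the asset base
print(f"cap/book   = {market_cap / book_value:.2f}")     # 1.00: liabilities netted out
```

Here cap/assets alone would overstate how little intangible value exists; only when cap/book is also low relative to sector norms can one conclude, as above, that knowledge assets really are limited rather than merely masked by the balance sheet.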
The basic scenario with insurers, and it’s something we’ve seen in just about all
financial services firms, includes massive amounts of data on transactions, individ-
ual client conditions and activities, and marketing relationships. Explicit knowledge
is hard to develop as processes are well-understood and hard to improve while mar-
keting relationships are also well-developed but brand equity is troublesome (very
few consumers are enthusiastic about their insurance companies). Small improve-
ments in process or efficiency aren’t important to the sector. What is sought are big
new ideas, new insights for targeting consumer groups, understanding risk profiles
of different groups, identifying irregularities in the data (fraud, changes in preva-
lent symptoms, changes in prescribing patterns, etc.). Intelligence is pronounced,
including competitive intelligence—when new knowledge or intelligence is so rare,
competitors are going to be interested in uncovering it as quickly as possible.
The core question we’re going to be asking in this section is what opportunities
for growth might be open to insurers, in this case health insurers. The nature of
in-house intangibles indicates considerable holdings of big data of a very specific
nature. There are also indications of experience and competency in managing Big
Data as well as in analyzing it (the analytics and intelligence ratings). The insurance
firms show no particular holdings of or competency in knowledge assets, either
explicit or tacit. Consequently, there would likely be limited potential in venturing
into sectors requiring explicit knowledge or the related efficiencies regarding opera-
tions, supply chains, or external relationships (supply chains, providers, retailers, or
consumers). In this case, insurers would not have good prospects in taking on such
sectors themselves, whether in entry into new sectors, acquisition, or other partner-
ships based on such success requirements.
What healthcare insurers may have to offer are the skills developed in not only
managing Big Data but also in analyzing it for deeper insights (intelligence). This
could be of considerable value to many of the players in sectors with which the insur-
ers interact, including the manufacturers (pharmaceuticals, medical devices) and
the retailers/PBM. Although some data would be duplicative, the insurers have a
different focus on their data and on how they analyze it, potentially adding value as
a collaborator. Further, the insurers may be of particular help in increasing the capa-
bilities of partners in sectors not so good at managing the intangibles, particularly
Big Data. Although there are obvious relationship issues to overcome with hospitals
and with other providers (as insurers are the payors in most cases), sharing data,
data analysis capabilities and techniques, and acquired insights could benefit all
and seems likely to be one of the strategic avenues with the greatest potential payoff.
Financial Services
The financial services profit pool looks much different from that in healthcare.
There is no dedicated supply chain or manufacturing, nor is there much in the
way of distribution or marketing channels not handled by the providers themselves. Consequently, the profit pool described by Bain consultants (Schwedel and
Rodrigues, 2009) is much more horizontal, encompassing banking, investment
houses, and insurers but little else. This description still includes a number of dif-
ferent industry sectors, as illustrated in Table 11.3 but not much of an industry
value chain. A few distinctions could be made (e.g., insurance providers versus
insurance brokers) but the effective point is made of how different this profit pool
looks compared to the healthcare industry and how that impacts strategic choices
facing participants, especially once one looks more deeply at the intangible assets.
As noted, the results show a very different pattern from what we saw in health-
care industry sectors. The two highlighted sectors are outliers but also somewhat
different in what services they provide. The rest are all mainstream financial ser-
vices, either providing retail banking (commercial banks and savings institutions),
investment services (security brokers and Real Estate Investment Trusts (REITs)), or insurance (life, health, property). All of these involve very large data-
bases. These databases include customer knowledge such as personal characteristics,
transaction records, and relationship tracking (including all communications). All
also track data on economic indicators, market financial results, and other world,
country, or industry-wide information. In the McKinsey report, financial services
take up most of the top spots in big data holdings and by a wide margin.
The knowledge data are a different story. Again ignoring the shaded rows for now, the rest of the sectors are uniformly well under the average cap/book and cap/assets ratios seen across the databases. Here, cap/
assets may be artificially low just because of the huge level of tangible financial assets
held by these types of institutions—inflating the denominator of the ratio. But the
same pattern is also seen in the cap/book ratio which would level out the bias by tak-
ing into account any financial assets borrowed or employed on behalf of a customer
who actually owns them. If there is some artificial deflation of the results, it is mini-
mal and certainly doesn’t change the overall conclusions. In financial services sectors,
there is very little evidence of knowledge, particularly explicit knowledge.
There is, however, considerable evidence of intelligence. Given the low knowl-
edge metric, one would conclude that much of this rating is attributable to insight
or analytical abilities present in these sectors but not tacit knowledge. Assuredly,
new tacit knowledge or intelligence insights are rare, valuable, and come by means
of a process difficult or impossible to teach others. Organizations may set up struc-
tures to feed intangible inputs, especially data, to analysts and encourage insight
and creativity, but success depends on hiring and retaining the right people more
than on how they are taught their responsibilities.
This squares with what we know about these industry sectors. As noted, they
manage huge amounts of tangible financial assets. Doing so requires little knowl-
edge about efficiency of supply chains, operations, or customer relationships
(indeed, consumers often actively dislike their banks and insurance companies,
so brand equity is minimal). They do know how to execute transactions. They do
know how to report to customers, regulators, and shareholders. But much of that is established and well known; there is little new to discover in the way of explicit (or tacit) knowledge that would improve activities or relationships.
What can be new are creative insights, crafting new investment strategies or
new insurance risk profiles and products. When new knowledge and insights are
so rare, they take on additional value, so the rewards for successful analytics are
considerable. This is seen in the level of the intelligence metric. A new, unique strategy
is highly valued so other firms are extremely interested in uncovering competitor
discoveries as quickly as possible. The competitive intelligence levels are quite high,
reflecting this reality as well as the considerable weight all of these financial services
sectors place on new solutions discovered through an effective analytics or intel-
ligence process.
The exceptions validate the rule. The shaded rows have both higher knowledge
and lower intelligence metrics. Both of these sectors, investment advisor and insur-
ance brokers, depend more on individuals with knowledge that can be applied to
variable conditions (advising consumers according to their specific situations and
needs). For insurance brokers, this is fairly obvious as agents actually selling the
insurance tend to be smaller concerns with close, personal client relationships and
a need to match the client with the right insurance. They learn what works for which clients in which circumstances. They may have access to Big Data but may not know what to do with it, and they rely more on their own assessments and conclusions than on repeatable patterns of action.
The investment advisors are more complex and, in some ways, more interest-
ing. The firms making up this sector are quite a mixed bag. They range from
high-powered investment firms (Blackstone, Carlyle, Invesco) to retail brokerages
(Charles Schwab, T. Rowe Price). Generally, the former group has low knowledge
metrics, similar to those seen with security brokers. The emphasis is again on Big
Data and unique insights more than repeatable explicit or tacit knowledge. The
latter group has the higher knowledge metrics as they do provide trade executions
to retail buyers, a repeatable pattern that can yield knowledge based on segmented
groups of customers. These firms also depend on close relationships with clients,
something that we know requires good levels of customer knowledge and is again
usually apparent in high explicit knowledge results.
To return to the overall point of this section, what do these results tell us about
the strategic potential of insurers in this industry? Unlike the healthcare industries,
in which insurers had little in common with any of the other industry sectors, the
results are quite similar here across the categories. One could see substantial move-
ment across sectors here as the different players have similar intangibles competen-
cies. All have some substantial level of Big Data and apparently know what to do
with it. All also have some substantial level of analytical or intelligence ability and,
again, apparently know what to do with it. While partnerships to share data and
insights are possible, as in healthcare, vertical movements across sectors in a more
aggressive manner are also possible.
One indicator of this potential comes from recent news reports about insur-
ance firms moving into pension management. Based on the intangibles, this makes
a great deal of sense. As repeatedly noted, Big Data is present and important
throughout all these sectors, so insurers’ capabilities in managing large databases
concerning client descriptions, activities, and more macro trends would be applica-
ble across sectors. Perhaps even more importantly, insurers deal routinely with both
individual retirement plans and actuarial data. Their experience with such data and
their ability to find new insights in it after conducting deep analysis would be well-
suited to pension management. Most also have experience with finding suitable
investment opportunities for held funds as well as efficiency in client relationships
(even if not overly warm and fuzzy). The intangibles results show a strong fit for
this sort of move.
Automated Driving
Although a profit pool study for this burgeoning industry hasn’t yet been compiled,
work has been done on active and announced participants so we do have some
sense of who might be involved. Further, Navigant Research (Abuelsamid et al.,
2017) has released a report assessing the prospects of known aspirants according to
a checklist of success criteria. Other observers might adjust the criteria or provide
different ratings for individual firms, but the structure is there for us to consider the
field with respect to intangibles.
The Navigant report, based on public announcements of the represented firms,
includes the industry sectors listed in Table 11.4. To these, we have added both
Apple (no announcement but substantial evidence of intentions) and Intel, which
recently announced its own interest and a partnership with Mobileye. Apple’s finan-
cial filings listed it as Computers (SIC 357) during the earlier reporting period but
switched to Phones (SIC 3663) during the latter. The company, of course, competes
in both sectors though its emphasis and revenue and profit streams have changed.
Both sectors are listed here.
For the record, the Navigant “leaders,” those best placed to compete in the
new area, are mainly traditional auto manufacturers (Ford, GM, Toyota, BMW,
Tesla, etc.). The reasons behind their conclusions are probably best seen in the
criteria used for assessment, including both strategy variables (vision, go-to-
market strategy, partners, production strategy, and technology) and execution
variables (sales, marketing, and distribution; product capability; product quality
and reliability; product portfolio; and staying power). The details don’t match
up exactly, but one could easily associate much of the strategy component with
tacit knowledge and intelligence. Alternatively, execution will often have more
to do with explicit (and some tacit) knowledge.
What do the data tell us? The automakers have relatively low knowledge lev-
els across the board, except for the very high 2010–2014 cap/book ratio. In looking
into the data, this result is chiefly due to the presence of Tesla and is not mirrored
in the cap/assets ratio. This is a case where using both metrics is useful. Tesla is
highly leveraged, so the book value in the first calculation is artificially low. Cap/
asset is probably the better and seemingly more consistent metric to pay attention
to in this case though Tesla’s considerable store of intangible assets should also be
kept in mind. The auto industry has improved markedly from the earlier to the later time period in the knowledge metrics; recall that the earlier period included the 2008 financial downturn. The intelligence variable is average (anything in
double digits is getting into more aggressive intelligence activity). Big Data is
present, in substantial quantities but at a level that is also about average across
all industries. All in all, the auto manufacturers have some capability in explicit
knowledge (improving but still below average), evidence of some tacit knowledge
and intelligence, and some Big Data. But nothing is outstanding, especially rela-
tive to the other industries. The manufacturers should be competent at supply
chains, manufacturing, distribution, and customer relationships. They should also
have some abilities in more creative insights such as new product development.
But there is also nothing here that would scare anyone else off, perhaps why the
field is so full of potential competitors from different sectors and interesting com-
binations of old-line manufacturers and new-line software and other firms.
Tesla is a bit of a special case. As this was being written, the electric vehicle
manufacturer’s market cap passed GM’s for the first time, making it the most valu-
able U.S. auto company. As noted, the firm’s knowledge metric is considerably
higher than other competitors, particularly in the cap/book version but also in
cap/asset. Even though we can’t measure it directly, the firm likely also has higher
tacit knowledge and, perhaps, intelligence levels given its formidable research and
development efforts and presence of key players such as the founder. Moreover, the
firm has specialist knowledge in certain areas (batteries) that may or may not be a
feature of automated driving. This methodology gives a nice snapshot of the full
industry sector but does require a deeper dive into the individual characteristics of
firms to fully understand what might be happening, and Tesla might be explored
in more detail.
The next four sectors, highlighted in the table, are a mix of manufacturing and
services (software). All have relatively high knowledge metrics though the phone
sector is showing signs of decline. Knowledge in the efficiency of design, execution,
and delivery of computers, phones, semiconductors, and software is important in
all these fields, even when some parts of that chain are outsourced (e.g., Apple). A
number of high-powered brands are also present in these sectors (Apple, Google,
Intel, Microsoft), and so relational capital is also good and likely pushes up the
knowledge metric. Intelligence scores are also high in all the sectors and big data
is present. Essentially, firms in these sectors possess relatively high levels of all the
intangible assets, potentially adaptable to different circumstances and making dan-
gerous competitors of key players in each.
Uber, slotted in the auto rental sector, is another special case. To some extent,
it is a special player in the rental sector with a different approach than traditional
agencies (who usually rent the entire car rather than just the single ride). Here, its
financials are not included in the results of the full sector as it was not publicly
traded when the data was reported. On the other hand, most of the major players
in the sector are moving to new service models, whether car sharing, ride sharing,
shorter-term rentals (e.g., hourly), or some other variation. So Uber’s (and Lyft’s) impact on the sector should start to be visible.
What the data in auto rentals show are low but increasing knowledge levels.
Once again, given the large fleet of tangible assets that may be debt-financed, the
cap/asset ratio may be the more accurate metric though both cap/asset and cap/
book generally agree. Intelligence activity is very limited, and big data is not overly
pronounced either. One area where Uber may have an impact is in Big Data as its
network thrives on collecting rider and driver data, creating greater efficiency by
matching supply and demand. And this may drive up the knowledge metrics over
time as managers get a better handle on logistics and competitors are forced to catch
up. But, overall, the intangibles capabilities of this sector just don’t suggest a serious
player. Ride-sharing firms may have interest in autonomous vehicles but seem more
likely to be customers or partners for the technology rather than developers.
Which brings us to the insurance sector once again. Here we have included the
Fire, Marine, Casualty sector that would include auto insurers but, again, recall
that just about all insurers have similar intangible profiles. Once again, lots of data.
Huge amounts of data compared to some other prospective participants in the
field. And, again, not much in the way of explicit or tacit knowledge, according to
the knowledge metric. But what is there is some intelligence ability combined with
the Big Data, suggesting a competency in accumulating, processing, and finding
insights in databases.
There is no indication of the interest of any insurance firm in direct participa-
tion in the autonomous driving field. That might be a good thing. There is also
no indication in the metrics of any capability to participate in development of key
pieces of the technology such as sensors, software, or vehicle manufacturing. The car insurance firms, however, do have a vested interest in the progress of
the technology, especially in terms of how it impacts accident rates, their severity,
and where liability might be placed. As such, the auto insurance firms might be
very valuable collaborators in this field, contributing their big databases on driv-
ers, driving activity, and, particularly, accidents. The software and artificial intel-
ligence systems that will guide autonomous driving cars are being developed in a
trial-and-error manner by putting the vehicles on the road and teaching them to
recognize circumstances and make the right decisions. While this makes sense, the process could likely be enhanced by incorporating the data the insurers already hold.
Conclusions
Even though they are relatively young disciplines, KM and IC have already faced a couple of
waves of growth and decline in interest over the past few decades. The advent of inter-
est in Big Data and business analytics and intelligence presents another possible threat
or opportunity for the field. In some ways, the burgeoning investment in big data
systems, including the cloud, and associated business analytics efforts have left KM
installations behind. The possibilities of gathering massive amounts of data and apply-
ing it to make better and more informed decisions at all levels are intriguing. And the
promise may be more compelling than some of the results seen from KM installations.
But opportunities are also present. The central question in KM and IC has
always been about the systems used to identify, measure, and manage intangible
assets. With knowledge, those assets originated in individuals and the systems
had to encourage the human resources of the organization to contribute their knowl-
edge while being open to employing that gained from others. The complexities of
humans interacting with IT systems or other tools have been at the core of much of
the discussion over the past 30 years.
From that perspective, both big data systems and business analytics efforts can
be seen as extensions, with different kinds of intangibles. As such, KM learnings
may have something important to contribute, not only in what the systems look like for managing the intangibles but also in how to execute the human element that
makes the systems actually work. Further, KM and IC have been employed to help
understand competitive environments, how intangibles are assessed and exploited
in different circumstances in different industries. These sorts of applications can
also be extended to the wider range of intangibles, not just knowledge varieties but
also data, or information, and intelligence.
This chapter has brought together work from a variety of disciplines to explore
this strategic perspective in more detail. In particular, we have presented an updated
conceptualization of the intangibles hierarchy, a step that helps to explain the simi-
larities and differences between data and information, explicit knowledge, tacit
knowledge, and insight and intelligence. Traditional KM and IC cover explicit and tacit knowledge, but the extensions to data and information, with connections to Big Data, and to insight and intelligence, with its connections to business analytics, are useful in framing the full picture of contemporary intangible asset assessment
and usage. This perspective adds even more complexity to our understanding of
competitive environments and what intangibles are most effective in them but it
also brings us closer to the reality of today’s business battlegrounds.
With a firm grasp of the full range of intangibles, we can also look to accurately
measure them. In this chapter, we have demonstrated how to do so in specific
industry sectors and, in some cases, by individual firms. One can get a sense of the
competitive capabilities of those individual firms but here we have looked primarily
at what it takes in terms of intangible assets to be effective in the different sectors.
From a strategic planning perspective, this approach provides guidance as to what
a firm needs in its own sectors as well as what it might add in others, including
whether it possesses the intangibles levels and management skills to be competitive
on its own in a new environment.
With that in mind, we have provided analysis of three industries, guided by
profit pool structures identifying the key sectors in two of those industries. Given
the presence or potential presence of a single industry sector, insurance, we can
more easily see how the methodology provides guidance to decision makers. In
healthcare and automated driving, the insurers do not have the intangibles to be
competitive players but might have a role to play in providing Big Data and busi-
ness analytics capabilities to other industry challengers from other sectors (who
may not have such intangibles or skill in managing them). In financial services, the
intangibles capabilities of all types of insurers look similar and are also very much
like those in other sectors (banking, investment services). In such a case, insurers
may have the potential to be serious competitors themselves in tangential sectors.
Such analysis, in a short chapter like this, is necessarily only at the surface.
But the approach illustrates what decision makers in these industry sectors, with
considerably more knowledge of the competitive details behind the metrics, could
do. The intangibles metrics provide guidance as to the state of affairs in the sector
and alert such decision makers where to focus their efforts. Their own, more spe-
cific knowledge can then lead to the deeper insights that help make better strategic
choices.
References
Abuelsamid, S., Alexander, D., and Jerram, L. (2017). Navigant research leadership report:
Automated driving (executive summary). Available at https://www.navigantresearch.
com/wp-assets/brochures/LB-AV-17-Executive-Summary.pdf.
Ackoff, R. (1989). From data to wisdom, Journal of Applied Systems Analysis, 16, 3–9.
Argyris, C. (1977). Double loop learning in organizations, Harvard Business Review,
September/October, pp. 115–125.
Argyris, C. (1992). On Organizational Learning, Blackwell, Cambridge, MA.
Argyris, C. (1993). Knowledge for Action, Jossey-Bass, San Francisco, CA.
McAfee, A., and Brynjolfsson, E. (2012). Big data: The management revolution, Harvard
Business Review, 90(10), 60–66.
McEvily, S., and Chakravarthy, B. (2002). The persistence of knowledge-based advantage:
An empirical test for product performance and technological knowledge, Strategic
Management Journal, 23(4), 285–305.
Nahapiet, J., and Ghoshal, S. (1998). Social capital, intellectual capital, and the organiza-
tional advantage, Academy of Management Review, 23(2), 242–266.
Nonaka, I., and Takeuchi, H. (1995). The Knowledge-Creating Company: How Japanese
Companies Create the Dynamics of Innovation, Oxford University Press, New York.
Polanyi, M. (1967). The Tacit Dimension, Doubleday, New York.
Rothberg, H.N., and Erickson, G.S. (2005). From Knowledge to Intelligence: Creating
Competitive Advantage in the Next Economy, Elsevier Butterworth-Heinemann,
Woburn, MA.
Rothberg, H.N., and Erickson, G.S. (2017). Big data systems: Knowledge transfer or intel-
ligence insights, Journal of Knowledge Management, 21(1), 92–112.
Schwedel, A., and Rodrigues, A. (2009). Financial services’ shifting profit pool. Available
at http://www.bain.com/Images/BB_Financial_services_shifting_profit_pools.pdf.
Senge, P. (1990). The Fifth Discipline, Doubleday, New York.
Simard, A. (2014). Analytics in context: Modeling in a regulatory environment, in
Rodriguez, E. and Richards, G. (Eds.), Proceedings of the International Conference
on Analytics Driven Solutions 2014. Harvard University Press, Cambridge, MA,
pp. 82–92.
Stewart, T.A. (1997). Intellectual Capital: The New Wealth of Organizations, Doubleday,
New York.
Sveiby, K.E. (2010). Methods for measuring intangible assets. Available at http://www.
sveiby.com/articles/IntangibleMethods.htm.
Tan, H.P., Plowman, D., and Hancock, P. (2007). Intellectual capital and the financial
returns of companies, Journal of Intellectual Capital, 9(1), 76–95.
Teece, D.J. (1998). Capturing value from knowledge assets: The new economy, markets for
know-how, and intangible assets, California Management Review, 40(3), 55–79.
Thomas, J.C., Kellogg, W.A., and Erickson, T. (2001). The knowledge management puzzle:
Human and social factors in knowledge management, IBM Systems Journal, 40(4),
863–884.
Tobin, J., and Brainard, W. (1977). Asset markets and the cost of capital, in Nelson, R. and
Balassa, B. (Eds.), Economic Progress, Private Values, and Public Policy: Essays in Honor
of William Fellner. North Holland, Amsterdam, the Netherlands, pp. 235–262.
Verdun, J. (2008). The last mile of the market: How network technologies, architectures
of participation and peer production transform the design of work and labour, The
Innovation Journal: The Public Sector Innovation Journal, 13(3). Available at https://
www.innovation.cc/.
Wernerfelt, B. (1984). The resource-based view of the firm, Strategic Management Journal,
5(2), 171–180.
Zack, M.H. (1999). Developing a knowledge strategy, California Management Review,
41(3), 125–145.
Zander, U., and Kogut, B. (1995). Knowledge and the speed of transfer and imitation of
organizational capabilities: An empirical test, Organization Science, 6(1), 76–92.
Chapter 12
Analyzing Data and Words—Guiding Principles
Contents
Introduction ......................................................................................................356
Conceptual Framework .....................................................................................357
Research and Business Goals (Why?) .................................................................359
What Are You Trying to Achieve? What Is Your Research or Business Goal?....359
What Good Practices and Good Practice Models Exist? ........................... 360
How Will You Measure the Results of Your Analysis?............................... 360
What Level of Risk Are You Willing to Assume? .......................................361
What Level of Investment Are You Willing to Make?................................361
Is This a Project or Enterprise-Level Goal? ................................................362
Understanding Why in Context—Use Case Scenarios ..............................362
How We Use the Tools—Analysis as a Process ...................................................370
What Analytical Method Is Best Suited to the Goals? ...............................371
Quantitative Analysis and Data Analytics ......................................371
Qualitative Analysis and Language Based Analytics .......................372
Mixed Methods Analysis and Variant Sources................................374
When Is a Quantitative Analysis Approach Warranted? ............................375
When Is a Qualitative Analysis Approach Warranted?...............................375
Introduction
Business analytics and text analytics are not new. Work on the core analytics capa-
bilities dates back to the 1950s and 1960s. What has changed in 2017 is the com-
puting capacity we have—in the business and the research environment—to apply
analytics. The increased capacity has created an expanded set of expectations for
what is possible. It has also created opportunities to explore business and research
questions we previously thought impossible due to scale, scope, cost, or reliability.
The expanded capacity has allowed us to explore the ways in which humans lever-
age language to create and represent knowledge. Knowledge management meth-
ods and core concepts are critical to the intelligent use of analytics. The expanded
foundation makes it possible to do much more than detect clusters and patterns in
text. This new capacity holds great promise for advancing the discipline of knowledge management, which has traditionally relied on qualitative methods grounded in manual human interpretation and analysis; those methods can now be enhanced and expanded through the intelligent use of analytics. These
expectations and opportunities are achievable, but only if we approach analytics
intelligently. New linguistic and knowledge-based tools allow us to use machines
to begin to truly understand text. Understanding, though, requires thoughtful
design, investments, error and risk management, a fundamental understanding
of qualitative methods and their machine-based transformation, a fundamental
understanding of language and linguistics, and a willingness to navigate today’s
volatile “analytics” market. This chapter considers how we can leverage both of
these disciplines to meet expectations and new demands. This chapter offers guid-
ance for navigating these issues in the form of key questions and lessons learned.
Conceptual Framework
Over the past 35 years, four focus points have helped the author navigate the choice and use of analytical methods and tools across a range of projects. The focus points
include: intent and focus of the project (why); the nature of analysis (how); sources
we analyze and use for analysis (what); and tools and methods that support the
kind of analysis we’re doing. The remaining sections of the chapter walk through
the key questions for each dimension. Each focal point is supported by a set of criti-
cal thinking questions (Table 12.1). The 27 questions are presented in sequential
order. Lessons learned suggest that addressing each of these questions individually
increases the probability of a successful effort.
In the sections that follow, we explore each of these key questions in the context
of seven real world use cases. These use cases have all ended in success but could
have resulted in significant failures had we not considered the key questions. The
use cases are identified below.
◾ Use Case 1: Causal Relationships between Arab Spring and Release of Wiki-
leaked Cables (manuscript in process)
◾ Use Case 2: Emotional and Social Tone of the Discourse of the 2012 U.S.
Presidential Campaign (Bedford, 2012b)
◾ Use Case 3: Language and Knowledge Structures of Mathematical Learning
(Bedford & Platt, 2013)
◾ Use Case 4: Knowledge Transfer Practices among Beekeepers (Bedford &
Neville, 2016)
◾ Use Case 5: Precision of Automated Geographical Categorization (Bedford, 2012)
◾ Use Case 6: Analysis of Physician-Patient Communication (Bedford, Turner,
Norton, Sabatiuk, & Nassery, 2017)
◾ Use Case 7: Analysis and Detection of Trafficking in Persons (Bedford,
Bekbalaeva, & Ballard, 2017)
Business and research goals define the purpose and reason a company, organization, or institution exists.
Measurement is an important means of ensuring that we achieve our business and research goals. Errors occur in degrees. Defining our tolerance for errors is an important part of measurement. Two considerations that help us to manage errors are reliability and validity. Both reliability and validity errors are possible with the new analytical tools. In fact, both may increase if we choose tools and sources incorrectly.
Reliability speaks to the consistency in measure and considers whether you
would obtain the same result if you repeated the analysis with all factors remaining
constant (Babbie, 1989; Carmines & Zeller, 1979; Kirk & Miller, 1986). A simple
example of a reliability error occurs when we apply quantitative analytical tools
to dynamic text or language. The reliability and generalizability of results are compromised each time the corpus of text changes. Validity is understood as truth
in measurement and tells us whether we measured what we intended to measure
(Adcock, 2001; Guion, 1980; King, Keohane, & Verba, 1994). Validity is easier to
measure in quantitative analyses than it is in qualitative analyses (Atkinson, 2005;
Reason and Bradbury, 2001; Eriksson & Kovalainen, 2015; Seale, 1999). A simple
example of a validity error occurs when we apply quantitative analytical tools to
text or language without first developing a deep understanding of the linguistics
or the knowledge structures inherent to the text. The simplest examples of errors,
though, are type 1 and type 2 errors.
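As a minimal illustration, the sketch below computes type 1 (false positive) and type 2 (false negative) rates for a hypothetical machine categorization run scored against a human-coded gold standard; the labels are invented.

```python
# Minimal sketch: type 1 and type 2 error rates for a categorization task.
gold      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 1 = document truly about the topic
predicted = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]  # machine-assigned labels

false_pos = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
false_neg = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

print(f"Type 1 (false positive) rate: {false_pos / gold.count(0):.2f}")
print(f"Type 2 (false negative) rate: {false_neg / gold.count(1):.2f}")

# Rerunning the same pipeline after the corpus changes and comparing these
# rates is one simple check on the reliability concerns discussed above.
```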
One reason analysts and researchers often choose these tools is the belief that they will save time (e.g., just apply the tools and the solution appears) and require fewer resources (e.g., the tools will solve all the complex problems). In fact, the tools can require more computing resources and an extended set of competencies. The tools that have the greatest value and the most analytical power also come with a high price tag and maintenance costs. Quality results do not come from lower-cost solutions or lower-level resources.
Sources of evidence: Two primary sources: (1) media reports describing the Arab uprising and (2) the leaked diplomatic cables. Cables for two countries were collected—Tunisia and Egypt—for two years.
Level of acceptable risk: The initial analysis was research-based and carried little risk to either the Department of State or the media. The researcher carried the majority of the risk in the event that the analysis provided no clear results or produced results that were not reproducible or which could not be validated by subject matter experts.
Reliability and validity issues: Because the conceptual model was designed around a full view of diplomatic intelligence, it can be reliably applied to any other country for analysis. Because a conceptual model was available for control purposes, internal validity was also strengthened.
Follow-on goals: The initial goal was achieved—the research team was able to demonstrate that a cause-effect relationship between the leaked diplomatic cables and the uprising did not exist. Rather, the deep analysis of the language and content of the diplomatic cables suggested that the amount of unique or novel information in the cables was low when compared to media reports and social media communication in the country. This led to a second research question comparing the content of diplomatic cables with in-country media reports.
Table 12.3 Use Case 2: Emotional and Social Tone of the Discourse of the 2012 U.S. Presidential Campaign
Follow-on goals: The success of the initial semantic analysis of the political candidates prompted the research team to conduct the same analysis on media persona from three major networks. The media texts were contemporaneous to the candidates’ texts. The follow-on research goal tested the same factors for media persona and found that in most cases, the media persona were more extreme than the political candidates.
Reliability and validity issues: Reliability was high for the characterization of language and knowledge structures because each text is grounded in basic principles, assumptions, functions, and symbols. How language is used to explain concepts or functions may vary, but the nature and frequency of use of that language is consistent across texts.
Reliability and validity issues: While the source sample was statistically reliable, the conceptual model was applied to only three communities of beekeepers. The reliability of the research can only be established for the beekeeping communities. Reliability of the semantic model of knowledge sharing can be tested in other communities. Validity of the results is high because it is grounded on natural language processing and rule-based linguistic characterizations.
Probability of errors: Both type 1 and type 2 errors were noted in the results. However, these errors are not errors of research but errors of machine-based processes. The fact that the errors were identifiable and explainable was a significant research result.
Reliability and validity issues: Internal validity of the research was high because of the availability and use of the control set of documents. Reliability may vary, though, depending on the nature of the documents analyzed and the regional focus of the documents. The geographical descriptions of some countries were found to generate higher error rates than those of other countries.
Follow-on research: Each facet of the profile can be broken off and used independently. We would expect further refinement of pieces of the profile.
Data analytics draws its foundation from mathematics and statistics (Han, Pei, & Kamber, 2011; Hand, Mannila, & Smyth, 2001; Witten, Frank, Hall, & Pal, 2016). There is a robust
market of tools—in the hundreds—to support data analysis, and this market has
been stable for close to two decades. The open source and the commercial markets
are mature so there are options available for organizations of all sizes—from the
simple functions embedded into Excel to the more sophisticated capabilities of
Informatica and SAS. These tools have been used by organizations for decades. The
change in use is in the scale and scope of the evidence they can process and the fact
that they can process live streams of transactional data.
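As a minimal illustration of the routine quantitative analysis these tools automate, the sketch below aggregates a toy set of transactional records with the pandas library (assumed installed); the data and column names are invented.

```python
# Descriptive aggregation over transactional records, the bread and butter
# of quantitative data analysis tools. Toy data for illustration only.
import pandas as pd

transactions = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "amount": [12.50, 8.00, 23.75, 5.25, 14.00],
})

# Count, total, and average transaction value per store.
summary = transactions.groupby("store")["amount"].agg(["count", "sum", "mean"])
print(summary)
```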
Another type of quantitative analysis that is relevant to this chapter is text ana-
lytics. Text analytics is sometimes also referred to as text mining, text analysis, text processing, or text data mining. This type of analysis involves the derivation of information from text—it attempts to infer things from text based on pat-
terns. This approach dates back to the late 1950s and early 1960s and is grounded in
the application of quantitative methods to text (Abbasi & Younis, 2007; Abilhoa &
De Castro, 2014; Arya, Mount, Netanyahu, Silverman, & Wu, 1998; Bejou,
Wrap, & Ingram, 1996; Bengio, 2009; Carletta, 1996; Debortoli, Muller, Junglas, &
vom Brooke, 2016; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990;
Fernandez et al., 2015; Hofmann, 1999; Lehnert, Soderland, Aronow, Feng, &
Shmueli, 1995; Luhn, 1958a, 1958b; Maarek, Berry, & Kaiser, 1991; Miner, 2012;
Pereira, Tishby, & Lee, 1993; Salton, 1970, 1986; Shmueli-Scheuer, Roitman,
Carmel, Mass, & Konopnicki, 2010; Srinivas & Patnaik, 1994; Teufel & Moens,
2002; Tischer, 2009). Organizations often apply text analytics to a large corpus of
text in an attempt to identify new or interesting patterns or areas of convergence and
similarity. While these methods have continued to evolve over the past 50 years, the
growth has largely been focused on improving the relevance and precision of the
categories or the coverage and recall of targeted entities extracted. In general, these
methods have evolved parallel to, but not with, the evolution of language-based
methods. Text is processed to remove formatting, structure, and language variations
prior to applying quantitative algorithms. Some text analytics tools have natural
language processing (NLP) components that are used to reduce language variations
and surface lemmas and roots. Natural language components are not a function that
is exposed to decision makers or researchers for direct use. The majority of the tools
available on the market today support text analytics solutions.
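A hedged sketch of this pattern-oriented approach appears below: text is normalized, vectorized, and clustered to surface areas of convergence and similarity. It assumes scikit-learn is installed, and the toy corpus and cluster count are invented for illustration.

```python
# Minimal pattern-finding text analytics: TF-IDF vectorization plus clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "interest rates and bank lending policy",
    "central bank raises interest rates again",
    "new vaccine trial shows strong results",
    "clinical trial data on the vaccine released",
]

# Lowercasing and stop-word removal stand in for the preprocessing step that
# strips formatting, structure, and language variations before analysis.
X = TfidfVectorizer(stop_words="english", lowercase=True).fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(docs, labels):
    print(label, doc)
```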
Tesch groups qualitative analysis around three major substantive questions: (1) What are the characteristics of the language itself?
(2) Can we discover regularities in human experience? (3) Can we comprehend the
meaning of a text or action? Tesch correctly observes that most qualitative analysis
is done with words and focuses on language. Words and language are a very important starting point for selecting an analytical method and analytical tools. Typically,
qualitative research tools are designed to facilitate working with information, such as
the conversion or analysis of interviews; the analysis of online surveys; the interpreta-
tion and analysis of focus groups, video recordings, audio records; or the organization
of images. These tools still require human intelligence and effort to support qualitative
analysis. They have little embedded human intelligence. The challenge is that until
recently there were no robust or affordable machine-based tools that automated the
process of analyzing the language of qualitative sources.
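One simple, language-focused operation that such tools do support is the keyword-in-context (concordance) view, which gathers every occurrence of a term with its surrounding words for human interpretation. The sketch below is a minimal, hypothetical implementation; the transcript text is invented.

```python
# A minimal keyword-in-context (KWIC) concordance for qualitative review.
def kwic(text, keyword, window=3):
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip(".,;") == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            yield f"...{left} [{word}] {right}..."

transcript = ("The patient said the treatment helped but the side effects "
              "of the treatment were difficult to manage.")

for line in kwic(transcript, "treatment"):
    print(line)
```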
One thing we have learned over the years is that a machine can only do high
performance analysis when it has a human-equivalent level of understanding of that
language. Since the 1960s, we’ve devoted considerable research to understanding
how to embed linguistic competence into machine-based applications. Since the
1970s, considerable work has been devoted to building a machine-based under-
standing of language. Computational linguistics is focused on the use of machines
and technologies to support deeper and broader understanding of linguistics and
language. It is a young and interdisciplinary field concerned with the statisti-
cal or rule-based modeling of natural language from a computational perspective
(Anwar, Wang, & Wang, 2006; Berman, 2014; Brandow, Mitze, & Rau, 1995;
Church & Hanks, 1990; Church & Mercer, 1993; Croft, Coupland, Shell, &
Brown, 2013; Eggins, 2004; Hatzivassiloglou & McKeown, 1997; Hearst, 1992;
Hindle & Rooth, 1993; Kaplan & Berman, 2015; Marcus, Marcinkiewicz, &
Santonini, 1993; Mitkov, 2005; Nir & Berman, 2010; Pustejovsky, 1991; Shieber,
Schabes, & Pereira, 1995; Sproat et al., 2001; Van Gijsel, Geeraerts, & Speelman,
2004; Yarowsky, 1995). The theoretical foundations of computational linguistics are
in theoretical linguistics and cognitive science. Applied computational linguistics
focuses on modeling human language use. Since the 1980s, computational linguis-
tics has been a part of the field of artificial intelligence. Its initial development was in
the field of machine translation. The simple automated approach to machine translation was revisited in favor of more complex development of language- and domain-specific vocabularies, extensive morphological rules engines, and patterns that reflect
the actual use of language in different contexts. Today, computational linguistics
focuses on the modeling of discourse and language patterns. Traditionally, com-
putational linguistics was performed by computer scientists focused on applying
stochastic methods to natural language. In the twenty-first century, though, com-
puter programmers and linguists work together as interdisciplinary teams. Linguists
provide critical knowledge of language structure, meaning, and use that guide the
use of analytical tools. Because computational linguistic tools have an academic or research focus, they are often open source or laboratory-based. By and large, they are not
available on the commercial market.
Other key tasks involved in qualitative analysis, such as knowledge elicitation, knowledge modeling, and knowledge representation, are inherently human.
Table 12.10 Use Case 1: Exploring the Causal Relationship between Arab Spring Revolutions and Wiki-leaked Cables
Table 12.11 Use Case 2: Emotional and Social Tone of the Discourse of the 2012 U.S. Presidential Campaign
Design choices: Two strategies were built into the design, one focusing on categorization profiles and the other focusing on text analytics. The two parallel designs were run simultaneously and then quantitatively tested in the final stage.
Level 1: Phonetics
Level 2: Phonology
Level 3: Morphology
Level 4: Syntax
Level 5: Semantics
Return on investment: Very high in the longer term if the profile can be used to model and assess the intelligence information that is being sent back to headquarters. If the incoming intelligence is equivalent to media reports from the country, there is a need to improve foreign intelligence gathering work.
Table 12.18 Use Case 2: Emotional and Social Tone of the Discourse of the 2012 U.S. Presidential Campaign
Qualitative risk reduction: The risk is of type 1 and type 2 errors, but is not associated with subjective human interpretation of source materials.
Qualitative risk reduction: Subject risk for this area is very high depending on one's role and perspective. An objective analysis of the discourse can identify ways that communication can be improved. There is a potential to impact health outcomes by improving communication and trust between doctor and patient.
integrated into another product. The volatility is more often found among language- and linguistics-focused tools because they require a significant investment to create and to sustain. These are also the tools that have the greatest value for enterprise applications, so they are often purchased and integrated as a component of another transactional or content management system. As consumers and producers of these tools, we need to lobby for a more accurate portrayal of their functionality.
Success factor for use case 7: The size of the authoritative reports and the need to break them into facet-focused chunks prior to doing any linguistic analysis or extracting any knowledge structures.
Lesson learned 1: Knowledge organizations have much to gain from wise and
considered use of the full range of analytical tools.
Lesson learned 2: Using analytics wisely means understanding the difference
between data and text analytics, good practice business intelligence models
and methods, and the nature and representation of information and knowl-
edge sources.
Lesson learned 12: Text analytics and semantic analysis methods are not silver bullets. They require investment to set up, configure, and sustain. Their costs are significant, so most businesses get only one chance to make the right choice.
Lesson learned 13: Research papers should describe in detail the tools used and
their capabilities. Peer reviewers should be willing to critique tool choices and
to question their use.
Lesson learned 14: The two most significant challenges for selecting a tool are
understanding the nature of the products that are available on the market
today, and working towards a stable and growing market for tools.
Lesson learned 15: The volatility of the market is related to the level of vendor
investment required to sustain these tools and the level of organizational invest-
ment required to maintain and operationalize them. As a result, the “quick-fix”
tools tend to dominate the market. A greater presence in the market, though,
does not signal a long product life span—the text analytics tools disappear from
the commercial market at the same rate as the language-based applications.
Language-based products may have a longer life span where they are integrated
into a larger enterprise application (e.g., SAP, SAS). The effect for the consumer
is the same, though—they disappear from the open market. Tools that support
academic research—computational linguistic tools—tend to have a longer life
span. However, these tools are generally not business or designer-friendly.
Conclusions
We know from experience that we can improve the effectiveness of data and text
analytics by leveraging knowledge management methods such as knowledge model-
ing, knowledge representation, and knowledge engineering. Applying analytics to knowledge management challenges holds equally great rewards. The opportunities for knowledge management professionals lie primarily in
the transformation of qualitative methods to machine-based methods. The challenge,
though, is the amount of design, thinking, and investment required to achieve those
results. What we have learned over the past 45 years is that this is possible if we have
dedicated academic and development resources to support the transformation. The
research questions we can pose and explore have increased exponentially. These ques-
tions would have been too manually intensive or too time consuming to investigate
in the past. Business decisions that have had to rely on experience and “gut instincts”
can now be tested with machine-based qualitative solutions.
We need to learn to think more expansively and creatively in defining our
research agendas and setting our business goals. Creativity and discovery begin with the first strategic dimension: setting expectations and the framework for analysis and solution. It means modeling knowledge and thought processes, applying an appreciative inquiry process at every step of the way, and expecting to do discovery at a micro level with macro-level impacts. It means having the opportunity to
References
Abbasi, A. A., & Younis, M. (2007). A survey on clustering algorithms for wireless sensor
networks. Computer Communications, 30(14), 2826–2841.
Abilhoa, W. D., & De Castro, L. N. (2014). A keyword extraction method from Twitter
messages represented as graphs. Applied Mathematics and Computation, 240, 308–325.
Adcock, R. (2001). Measurement validity: A shared standard for qualitative and quantita-
tive research. American Political Science Review, 95(3), 529–546.
Anwar, W., Wang, X., & Wang, X. L. (2006). A survey of automatic Urdu language processing. In
Machine learning and cybernetics, 2006 international conference on (pp. 4489–4494). IEEE.
Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., & Wu, A. Y. (1998). An optimal
algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the
ACM (JACM), 45(6), 891–923.
Atkinson, P. (2005). Qualitative research—Unity and diversity. In Forum qualitative sozial-
forschung/forum: Qualitative social research (Vol. 6, No. 3). London: Routledge.
Babbie, E. R. (1989). The practice of social research. Belmont, CA: Wadsworth Publishing Company.
Ballard, B. W., & Biermann, A. W. (1979, January). Programming in natural language: “NLC” as a
prototype. In Proceedings of the 1979 annual conference (pp. 228–237). New York, NY: ACM.
Bar-Hillel, Y. (1966). Language and information; selected essays on their theory and applica-
tion. Reading, MA: Addison-Wesley.
Barton, D., & Hamilton, M. (2012). Local literacies: Reading and writing in one community.
London: Routledge.
Bedford, D., & Platt, C. (2013, September 16). Mathematical languages—Barriers to knowl-
edge transfer and consumption. Scientific Information Policies in the Digital Age:
Enabling Factors and Barriers to Knowledge Sharing and Transfer, Aula Marconi,
Consiglio Nazionale delle Ricerche, Rome, Italy.
Bedford, D. A. D., Bekbalaeva, J., & Ballard, K. (2017, November 27–31). Global human
trafficking seen through the lens of semantics and text analytics. American Society for
Information Science and Technology Annual Conference, Crystal City, VA.
Bedford, D. A. D., Turner, J., Norton, T., Sabatiuk, L., & Nassery, H. (2017, November
27–31). Knowledge translation in health sciences. American Society for Information
Science and Technology Annual Conference, Crystal City, VA.
Bedford, D. A. D. (2012a). Enhancing the precision of geographical tagging—Embedding gazetteers
in semantic analysis technologies. Poster session presented at the TKE 2012, Madrid, Spain.
Bedford, D. A. D. (2012b). Semantic analysis of the political discourse in the presidential
and congressional campaigns of 2012. Paper presented at the Text Analytics World
Conference, San Francisco, CA.
Bedford, D. A. D., & Neville, L. (2016, September). Knowledge sharing and valuation in
beekeeping communities. In Proceedings of the international conference on intellectual
capital knowledge management and organizational learning. Ithaca, NY: Ithaca College.
Bejou, D., Wray, B., & Ingram, T. N. (1996). Determinants of relationship quality: An artificial neural network analysis. Journal of Business Research, 36(2), 137–143.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine
Learning, 2(1), 1–127.
Berman, R. A. (2014). Linguistic perspectives on writing development. In B. Arfe,
J. Dockrell, & V. Berninger (Eds.), Writing development in children with hearing
loss, dyslexia or oral language problems: Implications for assessment and instruction
(pp. 176–186). Oxford: Oxford University Press.
Berman, R., & Nir, B. (2009). Cognitive and linguistic factors in evaluating text quality:
Global versus local. In V. Evans & S. Pourcel (Eds.), New directions in cognitive linguistics
(pp. 421–440). Amsterdam, the Netherlands: John Benjamins.
Berman, R., & Verhoeven, L. (2002). Cross-linguistic perspectives on the development of
text-production abilities: Speech and writing. Written Language and Literacy, 5(1), 1–43.
Biermann, A. W. (1981). Natural language programming. In A. Biermann & G. Guiho
(Eds.), Computer program synthesis methodologies (pp. 335–368). Amsterdam, the
Netherlands: Springer.
Biermann, A. W., Ballard, B. W., & Sigmon, A. H. (1983). An experimental study of natural
language programming. International Journal of Man-machine Studies, 18(1), 71–87.
Bollinger, A. S., & Smith, R. D. (2001). Managing organizational knowledge as a strategic
asset. Journal of Knowledge Management, 5(1), 8–18.
Brandow, R., Mitze, K., & Rau, L. F. (1995). Automatic condensation of electronic publica-
tions by sentence selection. Information Processing and Management, 31(5), 675–685.
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic.
Computational Linguistics, 22(2), 249–254.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment (Vol. 17).
Newbury Park, CA: Sage Publications.
Chilton, P. (2004). Analysing political discourse: Theory and practice. London: Routledge.
Chowdhury, G. G. (2003). Natural language processing. Annual Review of Information
Science and Technology, 37(1), 51–89.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and
lexicography. Computational Linguistics, 16(1), 22–29.
Church, K. W., & Mercer, R. L. (1993). Introduction to the special issue on computational
linguistics using large corpora. Computational Linguistics, 19(1), 1–24.
Cope, B., & Kalantzis, M. (Eds.). (2000). Multiliteracies: Literacy learning and the design of
social futures. London: Psychology Press.
Croft, D., Coupland, S., Shell, J., & Brown, S. (2013). A fast and efficient semantic short
text similarity metric. Computational Intelligence (UKCI), 2013 13th UK Workshop on
(pp. 221–227). IEEE.
Dahl, V., & Saint-Dizier, P. (1985). Natural language understanding and logic programming.
New York, NY: Elsevier.
Debortoli, S., Müller, O., Junglas, I. A., & vom Brocke, J. (2016). Text mining for informa-
tion systems researchers: An annotated topic modeling tutorial. CAIS, 39, 7.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing
by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391.
Dijkstra, E. W. (1979). On the foolishness of “natural language programming”.
In F. L. Bauer, E. W. Dijkstra, S. L. Gerhart, & D. Gries (Eds.), Program construction
(pp. 51–53). Berlin: Springer.
Kirk, J., & Miller, M. L. (1986). Reliability and validity in qualitative research. Beverly Hills,
CA: Sage Publications.
Kress, G. (2009). Multimodality: A social semiotic approach to contemporary communication.
London: Routledge.
Kruger, D. J. (2003). Integrating quantitative and qualitative methods in community
research. The Community Psychologist, 36(2), 18–19.
Lehnert, W., Soderland, S., Aronow, D., Feng, F., & Shmueli, A. (1995). Inductive text clas-
sification for medical applications. Journal of Experimental and Theoretical Artificial
Intelligence, 7(1), 49–80.
Liddy, E. D. (2001). Natural language processing. Syracuse, NY: Syracuse University.
Loper, E., & Bird, S. (2002, July). NLTK: The natural language toolkit. In Proceedings
of the ACL-02 workshop on effective tools and methodologies for teaching natural lan-
guage processing and computational linguistics (pp. 63–70, Vol. 1). Philadelphia, PA:
Association for Computational Linguistics.
Luhn, H. P. (1958a). A business intelligence system. IBM Journal of Research and
Development, 2(4), 314–319.
Luhn, H. P. (1958b). The automatic creation of literature abstracts. IBM Journal of Research
and Development, 2(2), 159–165.
Maarek, Y. S., Berry, D. M., & Kaiser, G. E. (1991). An information retrieval approach
for automatically constructing software libraries. IEEE Transactions on Software
Engineering, 17(8), 800–813.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing
(Vol. 999). Cambridge, MA: MIT Press.
Marcu, D. (1997, July). The rhetorical parsing of natural language texts. In Proceedings of
the 35th annual meeting of the association for computational linguistics and eighth confer-
ence of the European chapter of the association for computational linguistics (pp. 96–103).
Madrid: Association for Computational Linguistics.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Mihalcea, R., Liu, H., & Lieberman, H. (2006, January). NLP (natural language process-
ing) for NLP (natural language programming). In A. Belbukh (Ed.), International
Conference on intelligent text processing and computational linguistics (pp. 319–330).
Berlin, Heidelberg: Springer.
Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data
applications. Amsterdam, the Netherlands: Academic Press.
Mitkov, R. (Ed.). (2005). The Oxford handbook of computational linguistics. Oxford: Oxford
University Press.
Neumann, S. (2014). Cross-linguistic register studies: Theoretical and methodological con-
siderations. Languages in Contrast, 14(1), 35–57.
Nir, B., & Berman, R. (2010). Parts of speech as constructions: The case of Hebrew
“adverbs”. Constructions and Frames, 2(2), 242–274.
Pereira, F., Tishby, N., & Lee, L. (1993, June). Distributional clustering of English words.
In Proceedings of the 31st annual meeting on association for computational linguistics
(pp. 183–190). Montreal: Association for Computational Linguistics.
Pustejovsky, J. (1991). The generative lexicon. Computational Linguistics, 17(4), 409–441.
Ravid, D., & Berman, R. (2009). Developing linguistic register across text types: The case
of Modern Hebrew. Pragmatics and Cognition, 17(1), 108–145.
Ravid, D., & Berman, R. A. (2010). Developing noun phrase complexity at school age:
A text-embedded cross-linguistic analysis. First Language, 30(1), 3–26.
Reason, P., & Bradbury, H. (Eds.). (2001). Handbook of action research: Participative inquiry
and practice. London: Sage Publications.
Salton, G. (1970). Automatic text analysis. Science, 168(3929), 335–343.
Salton, G. (1986). Another look at automatic text-retrieval systems. Communications of the
ACM, 29(7), 648–656.
Schiffman, H. (1997). The study of language attitudes. Philadelphia, PA: University of
Pennsylvania.
Seale, C. (1999). Quality in qualitative research. Qualitative Inquiry, 5(4), 465–478.
Shieber, S. M., Schabes, Y., & Pereira, F. C. (1995). Principles and implementation of
deductive parsing. The Journal of Logic Programming, 24(1), 3–36.
Shmueli-Scheuer, M., Roitman, H., Carmel, D., Mass, Y., & Konopnicki, D. (2010).
Extracting user profiles from large scale data. In Proceedings of the 2010 workshop on
massive data analytics on the cloud (p. 4). New York, NY: ACM.
Spigelman, J. J. (1999). Statutory interpretation: Identifying the linguistic register.
Newcastle Law Review, 4, 1.
Sproat, R., Black, A. W., Chen, S., Kumar, S., Ostendorf, M., & Richards, C. (2001).
Normalization of non-standard words. Computer Speech and Language, 15(3),
287–333.
Spyns, P. (1996). Natural language processing. Methods of Information in Medicine, 35(4),
285–301.
Srinivas, M., & Patnaik, L. M. (1994). Genetic algorithms: A survey. Computer, 27(6),
17–26.
Tesch, R. (2013). Qualitative research: Analysis types and software. London: Routledge.
Teufel, S., & Moens, M. (2002). Summarizing scientific articles: Experiments with rel-
evance and rhetorical status. Computational Linguistics, 28(4), 409–445.
Tischer, S. (2009). U.S. Patent No. 7,483,832. Washington, DC: U.S. Patent and Trademark
Office.
Tomita, M. (2013). Efficient parsing for natural language: A fast algorithm for practical systems
(Vol. 8). New York, NY: Springer Science and Business Media.
Ure, J. (1982). Introduction: Approaches to the study of register range. International Journal
of the Sociology of Language, 1982(35), 5–24.
Van Gijsel, S., Geeraerts, D., & Speelman, D. (2004). A functional analysis of the linguis-
tic variation in Flemish spoken commercials. In G. Purnelle, C. Fairon, & A. Dister
(Eds.), Le poids des mots. Proceedings of the 7th international conference on the statistical
analyses of textual data (pp. 1136–1144). Louvain-la-neuve: Presses Universitaires de
Louvain.
Weizenbaum, J. (1966). ELIZA—A computer program for the study of natural language
communication between man and machine. Communications of the ACM, 9(1),
36–45.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine
learning tools and techniques. Burlington, VT: Morgan Kaufmann.
Yarowsky, D. (1995, June). Unsupervised word sense disambiguation rivaling supervised
methods. In Proceedings of the 33rd annual meeting on association for computa-
tional linguistics (pp. 189–196). Stroudsburg, PA: Association for Computational
Linguistics.
Chapter 13
Data Analytics for Cyber Threat Intelligence
Contents
Introduction ..................................................................................................... 408
Cyber Threat Intelligence ................................................................................. 409
Related Work in Cyber Threat Intelligence ...................................................410
Computational Methods in Cyber Threat Intelligence..................................... 411
Challenges in Cyber Threat Intelligence ........................................................413
Prevalence of Social Media Analysis in Cyber Threat Intelligence .................. 415
Benefits of Social Media Analysis in Cyber Threat Intelligence ......................416
Establishing Motivations in Cyber Threat Intelligence through Social
Media Analysis.............................................................................................. 417
Role of Behavioral and Predictive Analysis in Cyber Threat Intelligence ......418
Text Analysis Tools ........................................................................................... 420
Linguistic Inquiry Word Count ....................................................................421
Sentiment Analysis ...................................................................................... 423
SentiStrength ................................................................................................424
Case Study Using Linguistic Inquiry Word Count ........................................425
Conclusions ......................................................................................................427
References .........................................................................................................427
Introduction
Cybersecurity is now a national priority due to the widespread application of information technology and the growing incidence of cybercrime. Data science and Big Data analysis are gaining popularity due to their widespread utility in many subfields of cybersecurity, such as intrusion detection systems (IDSs), social networks, insider threats, and wireless body networks. The use of advanced analytic techniques, computational methods, and traditional components that have become representative of "data science" has been at the center of cybersecurity solutions focused on the identification and prevention of cyberattacks.
Most companies are capable of hardening their external defenses using various
information security methods. However, companies often fail to protect against attacks that originate from within. These insider threats are difficult to guard against because employees are granted a level of access to internal company networks that bypasses the external security measures put in place to protect the company. One way to identify insider threats is
through the use of predictive analysis monitoring social networks. Social network-
ing is a form of communication over virtual spaces (Castells, 2007). Monitoring
such networks requires tools that can readily analyze linguistic patterns.
Analyzing the huge amount of content in social networks requires linguistic tools such as Linguistic Inquiry and Word Count (LIWC), Stanford's CoreNLP suite, or SentiStrength. Natural language processing (NLP) is the ability of computers to understand what users write based on common human writing patterns. This
means the computer must be capable of identifying various dialects, slang, and
even properly identify homonyms to better interpret human language. Most NLP
algorithms are based on statistical machine learning techniques. Their inferences come from rules learned by analyzing large sets of documents, known as corpora, which contain correct examples of the patterns to be studied (Chi et al., 2016). An example of NLP at work is the auto-completion feature of the
Google search bar. As the user types a query into the search bar, Google’s search
algorithm considers the words that the user has already typed and compares them
to the words most often typed by other users. This allows Google to provide users
with options and the ability to auto-complete the remaining statement based on
these similar searches. Such tools could aid in detecting insider threats associated
through identifying conversations that show some agenda or malice.
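A minimal sketch of this statistical idea appears below, using a three-phrase toy corpus of our own invention; production systems such as Google's train on vastly larger corpora and richer models, so this is only meant to show how frequency counts over prior text drive a completion suggestion.

# Minimal sketch of corpus-driven completion: count which word most often
# follows each word, then suggest continuations by frequency. The corpus
# is an invented stand-in for a real large-scale query log.
from collections import Counter, defaultdict

corpus = [
    "insider threat detection",
    "insider threat monitoring",
    "insider trading laws",
]

next_word = defaultdict(Counter)
for phrase in corpus:
    tokens = phrase.split()
    for prev, cur in zip(tokens, tokens[1:]):
        next_word[prev][cur] += 1

def suggest(prev, k=2):
    # Return the k continuations seen most often after `prev`.
    return [w for w, _ in next_word[prev].most_common(k)]

print(suggest("insider"))  # ['threat', 'trading'], ranked by frequency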
In this chapter, we focus on the potential use of computational methods in
intelligence work, insider security threats, and mobile health. The standard bearer
for mobile computing is the smartphone, which connects people anytime and
anywhere. Smartphones also store large amounts of personal information and run
applications that may legitimately, inadvertently, or maliciously manipulate per-
sonal information. The relatively weak security models for smartphone applica-
tions, coupled with ineffective security verification and testing practices, have made
smartphones an ideal target for security attacks. Advancing the science, technology
and practices for securing mobile computing are essential to support the inevitable use of mobile computing in areas with ever greater requirements for privacy, security, and resistance to tampering (Figure 13.1).
Figure 13.1 Stages of cyber threat intelligence: collection (social media), processing (NLP), analysis (machine learning), decision support, and dissemination.
In 2011, Iran claimed to have hacked a U.S. drone (Rawnsley, 2011). Although the United States never publicly acknowledged the hack, President Obama did acknowledge the capture of the drone (CNN Wire Staff, 2011). In response to these events, the DOD Defense
Science Board (2013) released a report detailing their assessment of the cyberse-
curity threat to national security, recommending that the U.S. government take the cyber threat seriously and rethink the way it operates. Shortly
thereafter, in March of 2014, the Office of Personnel Management (OPM) systems
and two government contractors were hacked, affecting over 4.2 million employees (Koerner, 2016). The Senate, in response to this attack, passed the Federal
Information Security Management Act of 2014 which redirected the security over-
sight of the OPM systems to the Department of Homeland Security and established
a Federal Information Security Incident Center (U.S. Congress, S-2521, 2014).
Despite the government’s best efforts to develop and improve the nation’s cyber-
security posture, the implementation plan did not happen fast enough. In 2015, the
Pentagon’s joint staff unclassified email system was taken down for two weeks due
to an attack suspected to originate from Russia (Bennett, 2015). The same OPM
hacker, in a similar attack, stole personal information from 21.5 million people, all related to government service (Davis, 2015). Events such as these brought to the
forefront the scale of vulnerability within the intelligence community. In March of
2017, Wikileaks published a treasure trove of professed CIA hacking capabilities,
the implications of which have yet to be fully understood (Lapowsky and Newman,
2017). In April of 2017, an NSA hacking tool called EternalBlue was stolen and
released online (Larson, 2017). This tool was then used in the WannaCry hack in
May 2017 that reportedly affected 99 countries (BBC Europe, 2017). These events
showed that even the most secure organizations are vulnerable and the threat is
moving at a faster rate than CTI. The government quickly realized that advanced,
efficient, and rapid solutions are needed to secure intelligence information used to
safeguard the nation.
Since then, there have been several related works and attempts to categorize
indicators of intrusion like the Cyber Observable Expression (CybOX), the Open
Indicators of Compromise (OpenIOC), and the Incident Object Description
Exchange Format (IODEF) frameworks (Gragido, 2012). Around the same time,
the Cyber Intelligence Sharing and Protection Act was passed which allows for
government to share intelligence with private corporations for the purpose of secur-
ing networks and systems (112 Congress, 2012). SANS provides an in-depth white
paper on some major CTI works titled “Tools and Standards for Cyber Threat
Intelligence Projects” (Farham, 2017). Shortly after these developments, the cyber-
security and communications section under the Department of Homeland Security
led a collaborative environment to establish a structured language for CTI sharing
titled Structured Threat Information eXpression (STIX) which uses the Trusted
Automated eXchange of Indicator Information (TAXII™) to exchange CTI infor-
mation between the government and private sectors. The structure consists of eight
core constructs (Barnum, 2014). The eight core constructs of STIX are campaign,
course of action, exploit target, incident, indicator, observable, threat actor, and
tactics, techniques, and procedures (TTP). Many of the earlier frameworks were
incorporated into STIX and the effort is now being led by the OASIS Cyber Threat
Intelligence Technical Committee (OASIS, 2017).
(Thycotic, 2017). GIS capabilities help security analysts visualize attack patterns in
terms of geography. Geography plays a strong role in cyber activity as the actions of
a hacker correlate with the hacker’s environment even if the hacker operates in the
digital world (Bronk, 2016). An example of this is the MalwareTech botnet tracker
that uses GIS capabilities to track and visualize botnets throughout the globe
(MalwareInt, 2017). The Department of Defense realizes the relevancy of geogra-
phy to cyber activity demonstrated by the strategic objective to build and maintain
strong global alliances to combat cyber threats (The Department of Defense Cyber
Strategy, 2015). GIS capabilities can also assist with predicting future attack pat-
terns through the visualization of previous patterns. Data visualization tools help
conduct comparative or trend analysis such as heat signatures for identifying high
activity concentrations or graphs to identify escalation of events.
Regardless of the analytical capability, security analysts require tools that can communicate with each other for quick data sharing. Once a threat is identified, time is of the essence, which means the tool should not only facilitate thorough, in-depth analysis but also share information with other security analysts quickly. Thus, the data should flow fluidly from one system
to another to give others adequate time to respond and secure their systems. As
demonstrated in Table 13.1, CTI tools are developed from a variety of sources in
response to the competitive and lucrative field but are not necessarily able to cross-
talk with other intelligence tools. The more the tool can communicate with the
other tools out there, the more utility it has. The usability of the tool will also be a
factor, as the security analyst is not necessarily going to be a subject matter expert
in computer science even though the intelligence community is certainly trying to
change this. Until then, the tool should be relatively easy for a security analyst to
use, as the most common security analyst tasks include the ability to read, inter-
pret, and apply that data to world events. As of right now, the more user friendly
the CTI tool, the more it will be used (Tableau, 2017).
is fragmented and assembling the information is the real challenge (2014). CTI
currently relies on the collection of data to find signatures and anomalies to iden-
tify hackers but these techniques have overwhelmed security analysts. A research
study from Enterprise Strategy Group found that 74% of security operations teams
reported that security events and alerts are ignored due to volumes of data and not
enough staff (Davis, 2016). With so much data being collected, expiration dates on
the information become an issue as the data expires faster than the security analyst
can come to conclusions and act on those conclusions (Dickson, 2016). Expiration
dates are also an issue because, as one Big Data platform pointed out, threat analy-
sis is only released to the public after the fact to avoid tipping off the hacker that
someone is tracking the hack (SecureWorks, 2017).
One of the main attractions to cybercrime is the anonymity of the act. Experts
in the field are frustrated by the fact that origins of attack are difficult to trace
(J.G., 2014). Although the CTI professional can remove malware from a system,
the malware itself is not eradicated. Once the malware is identified and extracted,
a hacker can still change small portions of the code and then use the same malware
again. One piece of malware derived from one person can spread to thousands of
computers and then be manipulated in seconds to spread again (Koerner, 2016).
This highlights the futility of trying to track down a hacker from the trace, sig-
nature, design, or timestamp. One alternative response would be to identify the
source from its point of origin rather than from a trace. This is where monitoring
social media in conjunction with IT systems has the potential to lessen the time it
takes to identify a threat.
Identity management is another challenge in the intelligence field as the inter-
net offers the element of anonymity. For example, one report on social media sites
estimates that approximately 31 million Facebook user accounts are fake (Hayes,
2016). Regardless of how a tool tracks an identity, determining identities will
always be a challenge when dealing with the digital world as the slightest mistake
in reporting can change the identity of a person. Biometric tools have this chal-
lenge despite the capability of establishing unique identities, such as fingerprints.
The data that is assigned to the fingerprint still must be tagged. Should a finger-
print have similar biometric data to another individual, this information can eas-
ily become scrambled; many fingerprint readers use only selected points along the
fingerprint ridges for identity, which results in up to 3% misidentification or over-
lapping fingerprints between individuals. The intelligence world is thus turning to
machine learning and predictive analysis to remedy these issues.
The solution to these challenges is an obvious one: Track the hacker by develop-
ing easy to use tools that can collect data from a variety of independent sources in a
selective manner. These tools should be able to communicate with other tools in an
easy and timely fashion. Although the solution may be obvious, its implementation is not. Further development of machine learning programs is required to selectively pull relevant information. This chapter views cyber intelligence as the computational ability to identify what information to collect to successfully
identify the threat without collecting irrelevant data. With the advancement of
technology and the speed in which the intelligence world is moving towards cyber-
security, the ability to develop these types of computational methods is what will
define one Big Data analytic tool over another. This is also where algorithms of Big
Data have a place in cyber threat intelligence.
behind the latest national threat. Cybercrime is just now coming into mainstream
focus for national threats. HUMINT means deriving intelligence from people
rather than computers and an investigation usually begins with an event. It is the
event that drives the intelligence professional to discover the five Ws of who, what,
when, where, and why with the “why” usually being the least of concern due to
the fast-paced environment of threats. There is a relatively new field that focuses
more on the “why” and human behavior called sociocultural research. However,
the focus of this research, in general, is societies and the objective of the study is to
understand how and why society allows the threat to operate in that space. Social
media analysis could fall under the umbrella of open source intelligence (OSINT) but
it is still not designed specifically to look at those behind computer-related activity
and this type of analysis would be limited if not used in conjunction with CTI. In
summation, the study of social media to predict cyber threats by analyzing human
motivations is really a field or area of expertise that has yet to be established.
Figure 13.2 Motivations behind attacks (Hackmageddon, 2017): cybercrime 74.10%, cyber espionage 21.20%, hacktivism 3.50%, and cyber warfare 1.20%.
Analyzing the minds of the operators behind these attacks using Big Data analysis and text analysis is a logical
method of CTI collection. These operators, like all humans, have a basic human
desire for social interaction. Granted, there may be anomalies to this but it is safe
to assume that if an operator lives in the world of cybercrime, then the operator’s
social interaction would likely take place in the cyber world as well. Using text
analysis of social media could thus be an effective strategy for narrowing and inter-
cepting the threat.
Understanding the motivations behind the actions of an intruder also offers the
potential to provide alternative means for mitigating threat as each human moti-
vation is satisfied in different ways. There have been numerous studies regarding
the complexity of motivations (Marsden, 2015) and how varied human motiva-
tions can be. The Department of Defense is realizing the advantages of behavioral
analysis in detecting intrusion, which is why it is pushing for a new acquisition
model that would allow them to invest in private sector machine learning tech-
nologies (Owens, 2017). As strategic politics and war move into the digital era
and away from conventional techniques, the development of this type of intel-
ligence may become more and more relevant, especially as governments begin to
lose traditional forms of power and control with the proliferation of information
dissemination.
Scale level 1: Countries under a secure rule of law; people are not imprisoned for their views, and political murders are extremely rare.
are more familiar to humans but once displayed are often also ignored and usually
end in a bite. Canine behaviorists understand that if the problem can be identified
at its onset, mitigating action can take place to prevent the aggressive behavior.
Scales like the example in the canine ladder of aggression (Horwitz and Mills,
2009) can be used in a similar way for threat identification and prevention using
social media. Once the scale or trajectory of behavior is understood, the analyst
can data mine for these key words, assign weight to the words like scales and then
focus on the words that carry the most weight to prevent a hack. For example, if
the objective is to save lives by preventing dog bites, a canine analyst would moni-
tor behaviors in the middle of the aggression scale. Likewise, with human behavior,
if the goal is to diminish potential hacks, an analyst can search for words found
in the middle of the identified hacking “scale of aggression.” Both will narrow the
search for subjects that may commit the unwanted behaviors. The scope can be
narrowed even further by focusing on indicators that the behavior has yet to occur,
but the subject being reviewed has the potential or drive to commit the offense. In
the case of hackers, priority of intervention would then be placed on text indicating
the individual is in a current state of danger of performing a hack. Logically, this
method will be extended to CTI using the information available from social media.
The same technique can be applied to hackers using the language of a hacker.
Hypothetically, if each phase of the Cyber Kill Chain requires a separate set of
tasks, the words assigned to these tasks could potentially indicate where in the
Cyber Kill Chain the hacker is postured. The weight of the words used could pro-
vide warning or indication of movement to the next stage and thus narrow the
scope of intervention. The Office of the Director of National Intelligence developed
a common cyber threat framework that defines the actions of hackers in each stage
of the Cyber Kill Chain (Sweigert, 2017). What is missing and could be developed
is a hacker dictionary with the words associated to the most logical stage of the
framework. There is already a preliminary work titled “The Hacker’s Dictionary”
that tracks the lexicon of hackers (Raymond, 2017). After the words are assigned to
the stages, they could then be given weight and priority based on where the words
fall in the Cyber Kill Chain.
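A minimal sketch of how such stage-weighted keyword matching might look appears below. Since the proposed hacker dictionary does not yet exist, the stage names follow the Cyber Kill Chain (Hutchins, Cloppert, and Amin, 2011), while the keywords and weights are invented purely for illustration.

# Hypothetical sketch: weight social media text by Cyber Kill Chain stage.
# The keywords and weights below are invented for illustration; they are
# not an established hacker dictionary.

STAGE_WEIGHTS = {
    "reconnaissance": 1,   # early stages carry lower urgency
    "weaponization": 2,
    "delivery": 3,
    "exploitation": 4,     # later stages carry higher urgency
}

STAGE_KEYWORDS = {
    "reconnaissance": {"scan", "footprint", "recon"},
    "weaponization": {"payload", "exploit kit"},
    "delivery": {"phish", "dropper"},
    "exploitation": {"shell", "privilege"},
}

def threat_score(post):
    # Sum stage weights for every stage whose keywords appear in the post.
    text = post.lower()
    score = 0
    for stage, keywords in STAGE_KEYWORDS.items():
        if any(k in text for k in keywords):
            score += STAGE_WEIGHTS[stage]
    return score

print(threat_score("built the payload, phish goes out tonight"))  # 2 + 3 = 5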
Since LIWC is a major tool in the text analysis field, there have been experiments to determine how effective the program is. At the University of Illinois (Kahn et al., 2007), three experiments were done to test the effectiveness of LIWC. In the first experiment, users were asked to write an amusing essay and an emotional essay. LIWC was then used to analyze these essays and determine the
emotions the author was attempting to convey in them. LIWC was successful at
this and was able to identify the positive word usage in the amusing essay as well
as the negative emotions being conveyed in the emotional essay. During the second
experiment, users’ emotions were manipulated using film clips. After the users were
shown these clips, they were asked to give an oral presentation about how they felt.
This oral presentation was transcribed into text and analyzed using LIWC. The
results were accurate and they demonstrated that LIWC could correctly pick up
various traits from one text. The third experiment was similar to the second experi-
ment only this time the users were asked in-depth questions that helped determine
more about their actual personality. They were then once again shown clips similar
to those in the second experiment. The purpose of this experiment was to deter-
mine if LIWC could actually tell the difference in users’ personalities based on the
words they used. The results showed that LIWC could not tell the difference, reflecting this more challenging scenario.
There have been several ways in which people have sought to use sentiment
analysis. One of the most popular ways has been the prediction of elections. Twitter is a medium allowing for microblogs up to 140 characters in length, known as "Tweets," to be posted by users. Other users on the site then read these Tweets and have the option of retweeting them. Tumasjan et al. (2010) analyzed over 100,000 messages that contained references to a political party or politician during the German federal election. The purpose of this analysis was to determine whether one could use
Twitter as a valid medium for determining the offline political landscape in regards
to the public’s opinion.
This study was based on the fact that many believe that the current United
States president’s victory was due to his ability to effectively use social networks
in his campaigning. The president not only used social networking but he even
created his own website to increase his online presence. This led the researchers to
wonder to what extent social networks play roles in politics and whether it was an
effective place for gathering information on what the public currently felt about the
political parties and the politicians in those parties. The researchers also wanted to
determine whether Twitter was a platform for political deliberation, whether it was
an accurate reflection of political sentiment, and whether it could accurately predict
election results.
After examining 104,003 political Tweets regarding the 6 political parties that
were part of the election or Tweets that mention prominent politicians in those par-
ties, the researchers were able to answer these questions. Regarding whether Twitter
is a platform for political deliberation, it was found that there was a high amount
of political debate on Twitter; however, it was coming from a small group of heavy
users, which may lead to an unequal representation overall. Regarding the second
question, it was found that the examination of users’ Tweets about the political par-
ties and politicians did show that users could accurately identify certain traits about
the moderate politicians and parties unless they veered completely toward one side of the political spectrum. Regarding the last question, a correlation was found between the number of Tweets in which a party was mentioned and the winner of the election.
Sentiment Analysis
Sentiment analysis is the use of NLP and text analysis to determine the attitude of the speaker toward a topic. The primary purpose of sentiment analysis is to determine
if a sentence or even a document is positive or negative; this is called polarity senti-
ment analysis. A common use for polarity sentiment analysis is on product reviews.
By analyzing the text of the review, it can be determined whether the review is
positive or negative.
There are numerous ways that have been proposed to conduct sentiment analy-
sis. Pang and Lee (2004) proposed a machine learning method that categorized the
subjective portions of the document. This method differs from previous approaches to classifying documents, which were based on selecting lexical features of the document, such as the presence of the word "good." Instead, the authors proposed using only the subjective sentences and discarding the objective sentences, which can be misleading.
Web 2.0 gives businesses the chance to use sentiment analysis in more efficient ways. Web 2.0 is characterized by its increase in interactions between users and web pages. Facebook posts, Twitter posts, and even product reviews allow for analysis of large datasets, which can be used to create prediction models. These prediction models can help companies determine how successful their next product will be or what features they should include in future products. Another use for this analysis is to predict the outcomes of governmental elections.
When attempting to determine the sentiment of writing one of the issues that
may cause skewed results is the misreading of the text into the analysis software.
One of the foundations of sentiment analysis is polarity. The polarity is either nega-
tive or positive. An example of this can be seen in application reviews, which can
either have a positive review (thumbs up) or a negative review (thumbs down).
Many approaches have been proposed for determining the polarity of a document. One of the most common methods is to select lexical features, like indicative words such as "good." Upon seeing the word "good" in the
text of a review of a novel, many analysis tools may identify that as a positive polar-
ity, which may skew the overall results even though the sentence in which it was used had nothing to do with the reviewer's feelings about the actual novel. For example,
the reviewer may write “The lead character attempted to protect his good name.”
It is for this purpose that Pang and Lee proposed the use of a minimum cuts method, which separates the subjective and objective sentences in a text and considers only the subjective sentences in the analysis phase. They propose a document-level subjectivity detector that allows for subjectivity detection at the sentence level. Only sentences that pass through the subjectivity detector are allowed into the polarity classifier for document analysis.
Applying this subjectivity detector yielded a statistically significant improvement of 4% in the accuracy of the sentiment analysis of the given documents.
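The sketch below shows only the two-stage structure of this approach, substituting a simple keyword heuristic for both the minimum-cut subjectivity detector and the trained polarity classifier; the cue lists are invented and far cruder than the published method.

# Simplified two-stage sketch of the Pang and Lee (2004) idea: filter out
# objective sentences first, then classify polarity on what remains. The
# real method uses minimum cuts over a sentence graph; this stand-in uses
# naive keyword lists purely to show the pipeline structure.
import re

SUBJECTIVE_CUES = {"i", "felt", "loved", "hated", "boring", "wonderful"}
POSITIVE = {"loved", "wonderful"}
NEGATIVE = {"hated", "boring"}

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def is_subjective(sentence):
    return any(w in SUBJECTIVE_CUES for w in tokens(sentence))

def polarity(sentences):
    kept = [s for s in sentences if is_subjective(s)]  # drop objective noise
    words = [w for s in kept for w in tokens(s)]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"

review = [
    "The lead character attempted to protect his good name.",  # objective
    "I felt the plot was boring and hated the pacing.",        # subjective
]
print(polarity(review))  # negative; the objective "good" never votes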
The reason that Twitter is such a popular place for sentiment analysis is the high volume of text present there, how easy that text is to gather, the wide variety of users the site has, and the possibility of collecting data from other cultures. This offers companies a plethora of data, such as the best demographic for their product and how the public reacts to certain forms of marketing.
To demonstrate how effective Twitter can be for a corpus of text for sentiment
analysis, Pak and Paroubek (2010) collected 300,000 texts from Twitter, which
they then separated into three groups evenly. The categories were: text that con-
veyed positive emotions, text that conveyed negative emotions, and objective text.
Linguistic analyses were then performed on this corpus, and a sentiment classifier was then developed using the corpus as training data.
The corpus was collected using the Twitter API, which allows a collection of Tweets to be queried from the website. The queries searched for happy emoticons and sad emoticons. Tweets containing sad emoticons were considered negative text, and Tweets using happy emoticons were considered positive text. Objective text was pulled from the Tweets of magazines and newspapers such as The New York Times. The classifier was built using a multinomial naive Bayes classifier with binary n-grams from the text as the input.
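A minimal sketch of that setup follows, assuming the scikit-learn library is available; the six hand-labeled tweets are an invented stand-in for the 300,000-tweet emoticon-labeled corpus.

# Minimal sketch of the Pak and Paroubek (2010) setup: a multinomial
# naive Bayes classifier over binary unigram/bigram features. The toy
# training tweets below are invented stand-ins for the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = [
    "so happy with the new release",
    "what a great day",
    "love this update",
    "worst outage ever",
    "this update broke everything",
    "so sad and frustrated today",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # binary uni/bigrams
    MultinomialNB(),
)
model.fit(tweets, labels)
print(model.predict(["happy with this great update"]))  # ['pos'] expected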
SentiStrength
SentiStrength is a tool for sentiment analysis. Textual sentiment analysis tools focus largely on determining positive, negative, or neutral sentiment within
the given text. To accomplish this, tools either use a machine-learning or lexical
approach to analyzing the text. The machine-learning method works by using
n-grams. The software is trained to mark certain n-grams as positive and others
as negative. The lexical approach is similar to that of LIWC where the software
is given a dictionary file that identifies certain words as negative or positive, and
then assigns a value during the analysis of the text. The lexical approach also
uses pseudo n-gram styles such as the word “not”; it is read as a negation of the
following word, so the words “not happy” would be read as a negative statement
(Thelwall, 2013).
SentiStrength is a sentiment analysis tool that uses the lexical approach for
analysis. There are three versions of SentiStrength: the online version, the free Java
version which is made available for research and educational purposes, and the
paid version which is also written in Java, but made for commercial use. All the
versions contain similar functions but the commercial version offers the ability to
scan and organize much larger datasets.
Unlike LIWC, which is popular due to its dictionary, the main draw to
SentiStrength is the algorithm it uses to determine the sentiment in a document.
As mentioned earlier, SentiStrength is a lexical sentiment analysis tool; this means
that it uses a dictionary to determine the sentiment of words in the given text. The dictionary file builds on the LIWC dictionary and has been updated periodically through testing. In the dictionary file for SentiStrength, each word is given a positive sentiment score of 1 to 5 and a negative sentiment score of −1 to −5. A score of 1 means no positive emotion was found, and a score of −1 means no negative emotion was found. The overall sentence is then given the scores of the strongest positive and strongest negative sentiment found in that statement.
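The sketch below illustrates this dual-score scheme with a tiny invented lexicon and the simple "not" negation rule described earlier; the real SentiStrength dictionary and rules are far more extensive.

# Minimal sketch of SentiStrength-style dual scoring: each word carries a
# positive strength (1 to 5) and a negative strength (-1 to -5); the
# sentence receives the strongest score found on each side. The lexicon
# and the crude "not" negation rule are illustrative assumptions.
LEXICON = {
    # word: (positive strength, negative strength)
    "happy": (4, -1),
    "love": (5, -1),
    "awful": (1, -5),
    "annoying": (1, -3),
}

def sentence_scores(sentence):
    words = sentence.lower().split()
    pos, neg = 1, -1  # defaults: no positive / no negative emotion found
    for i, word in enumerate(words):
        p, n = LEXICON.get(word, (1, -1))
        if i > 0 and words[i - 1] == "not":  # negation flips the word's polarity
            p, n = -n, -p
        pos, neg = max(pos, p), min(neg, n)
    return pos, neg

print(sentence_scores("the queue was awful but i love the staff"))  # (5, -5)
print(sentence_scores("i am not happy about this"))                 # (1, -4)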
Textual analysis is the analysis of text to gain useful information from it. Computer-assisted textual analysis is done by computer programs using natural language processing, which allows computers to convert high-level human language into a computer-readable format. Many NLP programs also use statistical analysis within their algorithms to predict the text being read in.
The difference between LIWC and sentiment analysis is that linguistic analy-
sis is a form of textual analysis that is done by using certain language structures
found in most languages to get an understanding of the text. There are a variety of linguistic analysis tools available, but one of the most widely used is LIWC. The two main parts of LIWC are the processing component and the dictionary. The processing component reads in the text file; each word is then compared against the dictionary component and sorted based on its dictionary category. By doing this, LIWC can find patterns within those categories that
may give information about the meaning behind the writing, such as whether it is
joyous or sad. Sentiment analysis is similar to linguistic analysis in what it seeks
to accomplish. The notable difference between the two is that linguistic analysis is
attempting to understand the language that is being used in the writing whereas
sentiment analysis is attempting to determine the overall polarity of the document.
With the rise of social media, sentiment analysis is being used to analyze large sets
of text to determine certain trends found in it.
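A minimal sketch of these two components appears below; the miniature category dictionary is invented, whereas the real LIWC dictionary holds thousands of words across dozens of validated categories.

# Minimal sketch of the two LIWC components described above: a category
# dictionary and a processing component that tallies matches per category.
# The miniature dictionary is an invented illustration.
import re
from collections import Counter

DICTIONARY = {
    "happy": "positive emotion",
    "joy": "positive emotion",
    "sad": "negative emotion",
    "angry": "negative emotion",
    "we": "social",
    "friend": "social",
}

def liwc_profile(text):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(DICTIONARY[w] for w in words if w in DICTIONARY)
    # LIWC reports category usage as a percentage of total words.
    return {cat: 100 * n / len(words) for cat, n in counts.items()}

print(liwc_profile("We were sad, then angry, but a friend brought joy."))
# {'social': 20.0, 'negative emotion': 20.0, 'positive emotion': 10.0}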
Text data is analyzed using LIWC to determine if a given actor is a threat. The
difference from past methods of determining an insider threat is that no previous study has used linguistic analysis to bridge the gap between an employee's text and evidence that the employee could possess the characteristics
of the dark triad model. The data is tested for agreeableness, neuroticism, conscien-
tiousness, and extraversion. Since both agreeableness and conscientiousness are traits
consistent with all three of the dark triad traits, this is the first pair that is looked
for in the actor. The LIWC scores in the required categories are tested against the
average of known insider threat cases to determine if the actor possesses these traits.
If these traits are identified, the actor’s text is tested against the LIWC catego-
ries for neuroticism and extroversion. Depending on the scores for these categories, it becomes evident whether or not the actor possesses these personality traits. If the actor does possess one of the dark triad personality traits, then they are considered a possible insider threat.
Figure 13.3 represents the method by which threat values will be assigned to users.
The underlying problem, when determining whether or not the psychology of
a person makes them more likely to pose a threat to the company, is if their actions
make them less trustworthy. Trust is a multidimensional concept that affects
relationships on both small and large scales. When an organization hires an individ-
ual, it is important that they trust that individual. It is for this reason that many com-
panies give new employees personality evaluations. These personality questions are
based on the traits identified in the five factor model (FFM) (Judge and Bono 2000).
FFM is a widely accepted model that is used to identify certain personality traits in
people based on certain mental cues. The problem with this personality test is that
many people lie to get hired by companies. Response distortion is common among
many applicants; therefore, many companies may not get an accurate understanding
of an employee’s true personality.
The outliers in social networking data after LIWC analysis can be identified using the median absolute deviation (MAD). The MAD (Leys et al., 2013) is calculated as shown below, where M is the median and x the values; the constant b = 1.4826 assumes the underlying data are normally distributed:

MAD = b × M(|x_i − M(x)|)

The decision criterion for the outliers is as shown below: a value x_i is flagged as an outlier when its absolute deviation from the median, divided by the MAD, exceeds the threshold m_i:

|x_i − M(x)| / MAD > m_i

We chose m_i = 2.5. A threshold of 3 is very conservative, 2.5 is moderately conservative, and 2 is poorly conservative.
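A minimal sketch of this MAD-based outlier test follows; the LIWC score values are invented for illustration.

# Minimal sketch of MAD-based outlier detection (Leys et al., 2013).
# The LIWC scores below are invented; b = 1.4826 assumes normality.
import statistics

def mad_outliers(values, m_i=2.5, b=1.4826):
    # Return the values whose MAD-standardized distance exceeds m_i.
    M = statistics.median(values)
    mad = b * statistics.median(abs(x - M) for x in values)
    return [x for x in values if abs(x - M) / mad > m_i]

liwc_scores = [2.1, 2.4, 2.2, 2.3, 2.0, 6.8]  # one user far above the rest
print(mad_outliers(liwc_scores))  # [6.8]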
Conclusions
This chapter gives an overview of the uses of data analysis in the field of CTI and insider threat. All data collection resources discussed are related to mobile devices. This chapter exam-
ined the process of collecting and organizing data, various tools for text analysis, and
several different analytic scenarios and techniques. Applying data analytics into the
field of CTI is still in its infancy. Meanwhile, incorporating social media into CTI
analysis is an even more untapped resource (Clark, 2016). In dealing with very large and diverse datasets drawn from social networks, data analytics provides enormous and novel approaches for CTI professionals. How to use Big Data analytic techniques to extract valuable information remains a significant challenge.
References
Barnum, S. (2014). Standardizing Cyber Threat Intelligence Information with the Structured
Threat Information eXpression (STIX™), Mitre. February 20. Version 1.1, Revision 1.
http://stixproject.github.io/about/STIX_Whitepaper_v1.1.pdf. Accessed June 26, 2017.
BBC Europe (2017). Cyber-attack: Europol says it was unprecedented in scale. BBC News,
May 13. http://www.bbc.com/news/world-europe-39907965. Accessed June 23, 2017.
Bech, P. (1996). The Bech, Hamilton and Zung Scales for Mood Disorders: Screening and Listening. Springer-Verlag: Berlin, Germany.
Bennett, C. (2015). Russian hackers crack Pentagon email system. The Hill, August 6.
http://thehill.com/policy/cybersecurity/250461-russian-hackers-crack-pentagon-
email-system. Accessed July 8, 2017.
Bronk, C. (2016). Cyber Threat: The Rise of Information Geopolitics in U.S. National Security.
Praeger: Santa Barbara, CA.
Hackmageddon (2017). Motivations behind attacks. Information Security Timeline and Statistics.
http://www.hackmageddon.com/2017/06/09/april-2017-cyber-attacks-statistics/.
Hayes, N. (2016). Why social media sites are the new cyber weapons of choice. Dark Reading,
September 6. http://www.darkreading.com/attacks-breaches/why-social-media-sites-
are-the-new-cyber-weapons-of-choice/a/d-id/1326802. Accessed July 9, 2017.
Horwitz, D., and Mills, D. (2009). BSAVA Manual of Canine and Feline Behavioral
Medicine. BSAVA: Gloucester, UK.
Hutchins, E., Cloppert, M., and Amin, R. (2011). Intelligence-Driven Computer Network
Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains.
Lockheed Martin Corporation. http://www.lockheedmartin.com/content/dam/
lockheed/data/corporate/documents/LM-White-Paper-Intel-Driven-Defense.pdf.
Accessed July 8, 2017.
Internet World Stats (2017). http://www.internetworldstats.com/stats.htm. Accessed July 24, 2017.
Judge, T.A., and Bono, J.E. (2000). Five-factor model of personality and transformational leadership. Journal of Applied Psychology 85(5): 751.
Kahn, J.H., Tobin, R.M., Massey, A.E., and Anderson, J.A. (2007). Measuring emotional
expression with the Linguistic Inquiry and Word Count. The American Journal of
Psychology 120: 263–286.
Koerner, B. (2016). Inside the cyberattack that shocked the US government. Wired,
October 23. Condé Nast: New York, NY. https://www.wired.com/2016/10/inside-
cyberattack-shocked-us-government/. Accessed June 22, 2017.
Kumar, S., and Carley, K. (2016). Approaches to Understanding the Motivations Behind Cyber
Attacks. Carnegie Mellon University. http://www.casos.cs.cmu.edu/publications/
papers/2016ApproachestoUnderstanding.pdf. Accessed July 6, 2016.
Lapowsky, I., and Newman, L.H. (2017). Wikileaks CIA dump gives Russian hacking
deniers the perfect ammo. https://www.wired.com/2017/03/wikileaks-cia-dump-
gives-russian-hacking-deniers-perfect-ammo/. Accessed July 24, 2017.
Larson, S. (2017). NSA’s powerful Windows hacking tools leaked online. CNN Technology,
April 15. http://money.cnn.com/2017/04/14/technology/windows-exploits-shadow-
brokers/index.html. Accessed June 23, 2017.
Lee, R. (2014). Cyber threat intelligence. Tripwire, October 2. https://www.tripwire.com/state-of-security/security-data-protection/cyber-threat-intelligence/. Accessed July 7, 2017.
Leys, C., Ley, C., Klein, O., Bernard, P., and Licata, L. (2013). Detecting outliers: Do not
use standard deviation around the mean, use absolute deviation around the median.
Journal of Experimental Social Psychology 49(4): 764–766.
MalwareInt (2017). Malware Tech Botnet tracker. https://intel.malwaretech.com/. Accessed
July 11, 2017.
Marsden, P. (2015). The science of why. Brand Genetics, June 22. http://brandgenetics.com/
the-science-of-why-speed-summary/. Accessed July 12, 2017.
Nakashima, E., and Krebs, B. (2007). Contractor blamed in DHS data breaches. The
Washington Post, September 24. http://www.washingtonpost.com/wp-dyn/content/
article/2007/09/23/AR2007092301471.html?hpid=sec-tech. Accessed June 22, 2017.
National Center for Education Statistics (2000). Technology in schools. Chapter 5:
Maintenance and Support, Technology in Schools: Suggestions, Tools, and Guidelines
for Assessing Technology in Elementary and Secondary Education. https://nces.ed.gov/
pubs2003/tech_schools/chapter5.asp. Accessed July 11, 2017.
NationMaster (2002). http://www.nationmaster.com/countryinfo/compare/Ireland/United-
States/Crime. Accessed July 25, 2017.
Nelson, M. (2016). Threat Intelligence Capability. North Dakota State University. https://
www.ndsu.edu/fileadmin/conferences/cybersecurity/Slides/Nelson-Matt-Threat_
Intel_Capability_Kick_Start_.pptx. Accessed July 14, 2017.
OASIS (2017). OASIS cyber threat intelligence (CTI) TC. https://www.oasis-open.org/
committees/tc_home.php?wg_abbrev=cti. Accessed June 26, 2017.
Owens, K. (2017). Army uses behavioral analytics to detect cyberspace invaders. https://
defensesystems.com/articles/2017/06/29/army-cyber.aspx. Accessed June 29, 2017.
Pak, A., and Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion mining. In LREC (Vol. 10). Université de Paris-Sud: France.
Pang, B., and Lee, L. (2004, July). A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of the 42nd
annual meeting on Association for Computational Linguistics (p. 271). Association for
Computational Linguistics: Barcelona, Spain.
The Political Terror Scale (2017). Documentation: Coding rules. http://www.political
terrorscale.org/Data/Documentation.html#PTS-Levels. Accessed July 12, 2017.
Rawnsley, A. (2011). Iran’s alleged drone hack: Tough, but possible. Wired. https://www.
wired.com/2011/12/iran-drone-hack-gps/. Accessed June 23, 2017.
Raymond, E. (2017). The New Hacker’s Dictionary, 3rd ed. The MIT Press. https://mitpress.
mit.edu/books/new-hackers-dictionary. Accessed July 11, 2017.
Rumberg, J. (2012). Metric of the month: Tickets per technician per month. MetricNet. http://
www.thinkhdi.com/~/media/HDICorp/Files/Library-Archive/Insider%20Articles/
tickets-per-technician.pdf. Accessed July 11, 2017.
SANS (2017). FOR578: Cyber threat intelligence. https://www.sans.org/course/cyber-
threat-intelligence. Accessed July 14, 2017.
SecureWorks (2017). Cyber threat basics, types of threats, intelligence & best practices.
https://www.secureworks.com/blog/cyber-threat-basics. Accessed July 9, 2017.
Sevastopulo, D. (2007). Chinese hacked into Pentagon. Financial Times, September 3.
https://www.ft.com/content/9dba9ba2-5a3b-11dc-9bcd-0000779fd2ac?mhq5j=e2.
Accessed June 23, 2017.
Shackleford, D. (2015). Who's Using Cyberthreat Intelligence and How? SANS Institute
InfoSec Reading Room. https://www.sans.org/reading-room/whitepapers/analyst/
cyberthreat-intelligence-how-35767. Accessed June 21, 2017.
Sweigert, D. (2017). A Common Cyber Threat Framework: A Foundation for Communication. Cyber
Threat Intelligence Integration Center—ONDI. https://www.slideshare.net/dgsweigert/
cyber-threat-intelligence-integration-center-ondi?qid=4d6055a5-ead1-40d5-84b8-
60439c570852&v=&b=&from_search=6. Accessed July 14, 2017.
Tableau (2017). Top 10 big data trends for 2017. https://www.tableau.com/sites/default/
files/media/Whitepapers/whitepaper_top_10_big_data_trends_2017.pdf?ref=lp&sig
nin=66d590c2106b8d532405eb0294a4a9f1. Accessed July 6, 2017.
Thelwall, M. (2013). Heart and soul: Sentiment strength detection in the social web with
sentistrength. Proceedings of the CyberEmotions 5: 1–14.
Thomas, M. (2017). 2017 global threat intelligence report. Dimension Data, May 4. http://
blog.dimensiondata.com/2017/05/2017-global-threat-intelligence-report/. Accessed
July 9, 2017.
Thornburgh, N. (2005). Inside the Chinese hack attack. TIME Magazine, August 25. http://
content.time.com/time/nation/article/0,8599,1098371,00.html. Accessed June 23,
2017.
ThreatCloud (2017). Live Cyber Attack Threat Map. Check Point Software Technologies Inc.
https://threatmap.checkpoint.com/ThreatPortal/livemap.html. Accessed June 3, 2017.
ThreatRate Risk Management (2017). Types of kidnappings. http://www.threatrate.com/
pages/47-types-of-kidnappings. Accessed July 6, 2017.
Thycotic (2017). PBA access use case. https://vimeo.com/209209431. Accessed July 11, 2017.
Tumasjan, A., Sprenger, T.O., Sandner, P.G., and Welpe, I.M. (2010). Predicting elections with
Twitter: What 140 characters reveal about political sentiment. ICWSM 10(1): 178–185.
U.S. Cong. Senate—Homeland Security and Governmental Affairs 113 Cong. (2014). Federal Information Security Modernization Act of 2014. 113 Cong. S.2521. Washington, DC.
https://www.congress.gov/bill/113th-congress/senate-bill/2521. Accessed June 22, 2017.
U.S. Cong. House-Government Reform; Science. 107 Cong. (2002) Federal Information
Security Management Act of 2002. 107 Cong. 2nd Sess. H. R. 3844. Washington, DC.
https://www.congress.gov/bill/107th-congress/house-bill/3844. Accessed June 22, 2017.
Windrem, R. (2015). China read emails of top U.S. officials. NBC News, August 10. http://
www.nbcnews.com/news/us-news/china-read-emails-top-us-officials-n406046.
Index
Note: Page numbers followed by f and t refer to figures and tables respectively
    cluster-based flows of patients, 302–305, 303f, 303t–305t
    conclusions on, 319
    data description, 293–296, 294f–296f
    discovering patient clusters unsupervised machine learning, 296–301, 297f, 299f–302f, 299t
    discrete event simulation (DES) in, 288, 289f, 289t, 305–308, 306f–308f
    elderly discharge planning analytics use case, 291–292, 292t
    future directions, 317–318, 318f
    hybrid simulations, 282–283
    integration with machine learning, 280–282
    introduction to, 279–280
    model verification and validation, 316–317, 317f
    motivation, 280–282
    overview of analytics approach, 293
    patient's care journey, 305–308, 306f–308f
    prospective role of machine learning in, 281–282, 281f
    related work, 282–284
    results and discussion, 314–316, 315f–316f
    simulation-based healthcare planning, 283–284
    study limitations, 318–319
    supervised machine learning, 308–312, 310t, 311f, 313f–314f, 313t–314t
    system dynamics (SD) in, 288, 289f, 289t, 302, 303f
    unsupervised machine learning, 296–301, 297f, 299f–302f, 299t
Single-project analytics projects, 12–20
SixDegrees.com, 200
Six Sigma, 42–45, 44f
Skandinaviska Enskilda Banken (SEB), 11t, 19, 21–22
Sky blog, 200
Slade, D., 384
SM. See Simulation modeling (SM)
SMA. See Social media analytics (SMA)
Smart cities, 157
Smartphones, 145
Smith, T., 22
Snapchat, 79
Snowden, D. J., 329–333
Snow, J., 8, 12
SNS. See Social networking sites (SNSs)
Snyder, W., 70
Snyman, R., 132
Social media analytics (SMA), 75–81, 115–116, 196–199, 197f, 198t, 215–216. See also Twitter-based communities (TBCs)
    business applications, 212
    collective knowledge within communities of practice, 69–71
    conclusion and road ahead, 113–115
    in cyber threat intelligence, 415–418
    data visualization tools, 213–214
    defining, 203–204
    evolution of analytics and, 201–203
    evolution of analytics in knowledge management and, 71–75, 72f–73f
    interaction among users, 85–90, 91f, 91t
    introduction to, 68–69
    location dimension, 94–97
    network visualization tools, 212
    opinion dimension, 105–110
    opinion-location interaction, 110–113
    processes of, 204–207
    representative fields of, 214–215
    scientific programming tools, 211–212
    sentiment analysis, 208–209
    social media management tools, 214
    social media monitoring tools, 213
    stream processing, 211
    techniques, 207–211
    text analysis tools, 213
    time dimension, 92–93, 94f
    topic dimension, 97–101
    topic modeling, 209–211
    topic-time interaction, 102–104, 104f
    user dimension, 81–85. See also User feedback
    visual analytics, 210–211
Social media data, 169
Social media management tools, 214
Social media monitoring tools, 213
Social networking sites (SNSs), 73, 79. See also Social media analytics (SMA)
    analyzing user feedback on, 124–128
    growth of, 196–197
    historical perspective of, 199–201
    link analysis, 127–128
    opinion mining, 125–127
    platforms, 198–199
    polling and, 158–159
    user response in, 124
Social Pilot, 214
Sopra Steria, 2