Amit V. Deokar
Ashish Gupta
Lakshmi S. Iyer
Mary C. Jones Editors
Analytics and Data Science
Advances in Research and Pedagogy
Annals of Information Systems
Volume 21
Series Editors
Ramesh Sharda
Oklahoma State University
Stillwater, OK, USA
Stefan Voß
Universität Hamburg
Hamburg, Germany
awards for her research, including the Jane K. Fenyo Best Paper Award for Student
Research, the ACR/Sheth Foundation Dissertation Award, and the Best Paper in
Track Award at the American Marketing Association (AMA) Winter Conference.
She has presented her work in several forums, both nationally and internationally.
Her work has been accepted for publication in the Journal of Marketing, the
International Journal of Research in Marketing, Journal of Public Policy and
Marketing, Journal of Business Research, Journal of Macromarketing, and
Consumption, Markets and Culture. Dr. Cross received her Ph.D. in marketing from
the University of California, Irvine, her M.B.A. in international business from
DePaul University, and a B.Sc. in management studies from the University of the
West Indies.
Chapter 1
Exploring the Analytics Frontiers Through
Research and Pedagogy
Abstract The 2015 Business Analytics Congress (BAC) brought together academic
professionals and industry representatives who share a common passion for research
and education innovation in the field of analytics. This event was organized by the
Association for Information Systems' (AIS) Special Interest Group on Decision
Support and Analytics (SIGDSA) and Teradata University Network (TUN) and held
in conjunction with the International Conference on Information Systems (ICIS
2015) in Ft. Worth, Texas from December 12 to 16, 2015. The theme of BAC 2015
was Exploring the Analytics Frontier and was kept in alignment with the ICIS 2015
theme of Exploring the Information Frontier. In the spirit of open innovation, the
goal of BAC 2015 was for attendees to share their scientific and pedagogical contributions to the field of business analytics while brainstorming with key industry and academic leaders to understand the latest innovations in business analytics and to bridge the industry-academia gap. This volume in the Annals of
Information Systems reports the work originally reviewed for BAC 2015 and subse-
quently revised as chapters for this book.
It has been a tradition for the AIS Special Interest Group on Decision Support and
Analytics (SIGDSA) to organize the pre-International Conference on Information
Systems (pre-ICIS) analytics workshop with the title of “Congress” when the event
is held in the North American region. This “Congress” was the fourth in a series that began in 2009. Planning for the Business Analytics Congress held in December 2015 in Ft. Worth, Texas, began in Fall 2014. The theme of the Business Analytics Congress (BAC 2015), Exploring the Analytics Frontiers, was
kept in alignment with the ICIS 2015 theme of Exploring the Information Frontier.
A major purpose of the Congress was to bring together a core group of leading
researchers in the field to discuss the trends and future of business analytics in practice
and education. This included discussion of the role of academicians in investigating
and creating knowledge about applications of business analytics and its dissemina-
tion. This volume contributes to this purpose by striking a balance between investigat-
ing and disseminating what we know and helping to facilitate and catalyze movement
forward in the field. This volume in the Annals of Information Systems includes
papers that were originally reviewed for BAC 2015. These chapters were presented at
BAC 2015 and subsequently revised for inclusion as chapters for this book.
BAC 2015 was sponsored by both industry and academia. The two main industry
sponsors were Teradata University Network (TUN) and SAS, which in addition to
providing financial support for the Congress, helped with bringing in distinguished
speakers from industry. TUN also sponsored a reception for attendees the first eve-
ning of the event. Teradata University Network is a free, web-based portal that pro-
vides teaching and learning tools used by over 54,000 students and educators
world-wide. These include majors as diverse as information systems, management,
business analytics, data science, computer science, finance, accounting and market-
ing. The content provided by TUN supports instruction ranging from introductory
information systems courses at the undergraduate level to graduate and executive
level big data and business analytics classes. A key element of TUN success is that
it is “led by academics to ensure the content will meet the needs of today’s class-
rooms.” SAS is a corporate leader in the provision of statistical and analytical soft-
ware, services and support. SAS supports customers at over 80,000 sites around the
world and provides several resources (www.sas.com/academic) for academics in
support of their education needs. Academic sponsors included the University of
Arkansas, University of North Carolina at Greensboro, University of North Texas,
and University of Tennessee Chattanooga.
The day-and-a-half BAC 2015 event began on Saturday, December 12th, with sev-
eral workshops. The first workshop was sponsored by SAS and focused on SAS®
Visual Analytics and SAS® Visual Statistics. The workshop presented by Dr. Tom
Bohannon focused on the basics of how to explore data and build reports using SAS
Visual Analytics. It also covered topics on building predictive models in SAS Visual
Statistics, such as decision tree, regression and general linear models.
The next workshop was sponsored by TUN and illustrated the vast academic
resources available on TUN. It was presented by Drs. Barbara Wixom and Paul
Cronan. The presenters discussed the rich repertoire of resources for faculty and stu-
dents covering topics related to BI/data warehousing, databases, and analytics. Further, the session showcased software resources available from TUN and partnerships
with BI and Analytics companies such as MicroStrategy, SAS and Tableau that pro-
vide excellent resources to support analytics and visualization topics. The University
of Arkansas is also a TUN partner, and its resources were discussed as well.
A workshop organized by Prof. Ramesh Sharda included Prof. Daniel Asamoah,
Amir Hassan Zadeh, and Pankush Kalgotra and focused on pedagogical innovations
related to delivering a Big Data Analytics course for MIS Programs. This session
covered their experiences in offering a semester long course on Big Data technologies
and included some hands-on demonstrations that they have used in their courses.
Discussions also included the course outline and learning objectives followed by a
description of various teaching modules, case studies, and exercises that they have
developed or adapted.
The last session on Saturday was a panel on Innovations in Healthcare:
Actionable Insights from Analytics. It was moderated and organized by Dr. Ashish
Gupta. Panelists included Ms. Sherri Zink from BlueCross BlueShield of Tennessee,
Ramesh Sharda from Oklahoma State University, David Lary from University of
Texas Dallas and Ashish Gupta from Auburn University. This panel shared insights
that have been derived using big data approaches, and how they have led to transfor-
mations in areas related to health. Examples included analytics in insurance from the consumer's perspective, sports analytics, pollution and allergy management, and the use of disparate data through new data science paradigms such as deep learning frameworks and
other enabling technologies.
The Sunday session began with an industry keynote by Ms. Sherri Zink, Senior VP,
Chief Data and Engagement Officer, BlueCross BlueShield of Tennessee. The keynote
address provided detailed insight into applications of analytics for empowering con-
sumers, reducing redundant consumer touch points, developing optimal treatment plans based on information shared between provider and payer, and informing decision making. Her talk provided an overview of how analytics could be used to develop a 360-degree view of consumers through various approaches that foster data integration, transformation, and prediction, eventually leading to actionable insights. Key takeaways from
the keynote address included a description of how clinical, lifestyle, and psychographic data could help develop a better understanding of consumers for stratification pur-
poses using segmentation and clustering approaches. Such insights could help in devel-
oping better wellness programs and creating continuous feedback.
The keynote was followed by a panel entitled AACSB Resources for Building a
Business Analytics Program. The panel was moderated by Dr. David Douglas and
panelists included Drs. David Ahuja, Paul Cronan, Michael Goul, Eli Jones, Dan
LeClair and Tom McDonald. The panel discussed AACSB’s analytics initiative
designed to help schools develop programs by providing a mix of curriculum con-
tent, pedagogy, and structure resources for schools contemplating the development or enhancement of business analytics programs. Panelists who were members of the AACSB
Analytics Curriculum Advisory Group shared resources and encouraged interactive
attendee discussion. Consistent with AACSB’s goal of providing services to member
schools across the globe, they shared information on initial analytics curriculum
development seminars that are being offered in the three cities that house
AACSB’s regional offices: Tampa (USA), Singapore, and Amsterdam.
Biographies
Chapter 2
Introduction: Research and Research-in-Progress
Abstract Inspired by the theme “Exploring the Information Frontier” of the ICIS
2015 conference, the Pre-ICIS Business Analytics Congress workshop sought
forward-thinking research in the areas of data science, business intelligence, analyt-
ics, and decision support with a special focus on the state of business analytics from
the perspectives of organizations, faculty, and students. The research track aimed to
promote comprehensive research or research-in-progress on the role of business
intelligence and analytics in the creation, spread, and use of information. This work
has been summarized in this chapter.
2.1 Introduction
Business Intelligence and Analytics (BI&A) have become core to many businesses
as they try to derive value from data. Although addressed by research in the past few
years, these domains are still evolving. For instance, the explosive growth in big
data and social media analytics requires examination of the impact of these
A. Sidorova (*)
University of North Texas, 365D Business Leadership Building, 1307 West Highland Street,
Denton, TX 76201, USA
e-mail: [email protected]
B. Gupta
California State University Monterey Bay, Room 326, Gambord BIT Building,
100 Campus Center, Seaside, CA 93955, USA
e-mail: [email protected]
B. Dinter
Faculty of Economics and Business Administration, Chemnitz University of Technology,
Chemnitz, Germany
e-mail: [email protected]
2.2 Organizational Use and Impact of Business Intelligence and Analytics
Building on the IS success model, a paper titled Critical Value Factors in Business
Intelligence Systems Implementations (Dooley et al. 2018) proposes and empirically tests a theoretical model of business intelligence system success. The paper extends DeLone and McLean's model of IS success (DeLone and McLean 2003) by relating critical success factors identified in extant BI&A research to perceived information quality and perceived system quality. Through the use of survey methodol-
ogy, the study finds empirical support for the relationships among critical success
factors, perceived information quality, perceived system quality and user satisfac-
tion with the system and with the information provided by the system.
Song et al. (2018) present in their paper Business Intelligence Systems Use in
Chinese Organizations an international perspective on BI&A systems by investigat-
ing the impact of national culture, in particular of Guanxi, a universal and unique
Chinese cultural form. The authors have conducted a series of interviews in two
indigenous Chinese organizations (including Alibaba) in order to test previously
identified research constructs. Based on the results five propositions of BI systems
use in Chinese organizations have been formulated, introducing a Guanxi perspec-
tive in BI use theories. Their results confirm that national culture has a significant
impact on BI&A usage in China. Future research should be guided by these insights
given the high relevance and influence of Chinese firms worldwide.
Web 2.0 and social media facilitate the creation of vast amounts of digital content
that represents a valuable data source for researchers and companies alike. Social
media analytics relies on new and established statistical and machine learning tech-
niques to derive meaning from large amounts of textual and numeric data. In this
section we present several papers that seek to advance social media analytics meth-
ods and to demonstrate how social media analytics can be applied in a variety of
contexts to deliver useful insight.
The first paper in this category, titled The Impact of Customer Reviews on Product
Innovation: Empirical Evidence in Mobile Apps (Qiao et al. 2018) addresses a
research field with promising opportunities—analyzing Web 2.0 data to foster inno-
vation. The article examines the role played by customer reviews in influencing
product innovations in the context of mobile applications. In particular, the authors
verify the impact of online mobile app reviews on developers' product innovation
decisions and identify the characteristics of such reviews that increase the likeli-
hood of future app updates. The findings suggest that it is important to explore user
generated reviews in the context of customer-centered product innovation.
The paper Whispering on Social Media (Zhang 2018) examines the role of infor-
mation circulated on social media in influencing stock performance during the so-
called “quiet period” before an initial public offering (IPO). During such quiet
periods organizations are not allowed to disclose any information that might influ-
ence investors' decisions. Nevertheless, people discuss and comment about
upcoming IPOs in social media. The author finds in her research that the number of IPO-related tweets (and re-tweets) has a significant positive correlation with the IPO's first-day return, liquidity, and volatility.
The next contribution in this category presents another interesting use case for
social media analytics. The paper titled Does Social Media Reflect Metropolitan
Attractiveness? Behavioral Information from Twitter Activity in Urban Areas
(Bendler et al. 2018) describes how the analysis of social media activities can gener-
ate insights for urban planning. When tweets are combined with other data such as
the temporal information, spatial coordinates, appended images, videos, or linked
places, a variety of applications can be supported, for example city planning, city
safety, and investment decisions. For these purposes, the paper presents methods
and measures for identifying the places of interest.
The paper titled The Competitive Landscape of Mobile Communications Industry
in Canada—Predictive Analytic Modeling with Google Trends and Twitter (Szczech
and Turetken 2018) describes how social media and Google Trends can be analyzed
to predict competitive performance. Their predictive model builds on the previous
studies that use Google Trends for predicting economic and consumer behavior
trends in a particular business or industry. The authors improve these existing mod-
els by adding competition variables and incorporating Twitter sentiment scores into their models to discover whether Twitter sentiment scores explain some of the variance in the dependent variable that is not already explained by Google Trends data.
The research-in-progress paper titled Scale Development Using Twitter Data:
Applying Contemporary Natural Language Processing Methods in IS Research
(Agogo and Hess 2018) illustrates the use of Twitter data analytics for scale devel-
opment. With the rise in social media communication, these data are becoming an
important source to understand consumer behavior. However, challenges abound in
transitioning traditional measurement scales to social media data such as tweets. This paper uses natural language processing methods to develop measurement scales using big data such as tweets. The authors present a new scale, called the technology hassles and delights scale (THDS), to show how the content validity of the scale can be improved by using a syntax-aware filtering process that identifies relevant information from analyzing 146 million tweets.
The rise of big data and associated analytical techniques has important implications not only for organizations, but for society in general.
The research-in-progress paper titled Information Privacy on Online Social
Networks: Illusion-in-Progress in the Age of Big Data? (Sharma and Gupta 2018)
focuses on the issues of privacy and information disclosure on social media. They present a research model that draws together concepts from behavioral economic theory, including prospect theory, which is an extension of the expected utility hypothesis, and
the rational apathy theory, which is derived from the public choice theory in social
psychology. The research methodology investigates why people choose to disclose
vast amounts of personal information voluntarily on Online Social Networks (OSN).
The proposed research model considers the effect of situational factors such as information control, ownership of personal information, and users' apathy towards privacy concerns on OSNs. The article offers value to practitioners in several ways: OSN providers and third parties could better understand how consumers' information disclosure behavior works, and we could better understand why people tend to disclose so much of their personal information on OSNs.
The research article, titled Online Information Processing of Scent-Related Words
and Implications for Decision Making (Lin et al. 2018) takes a broader view of human
information processing by examining the role of olfactory information in decision
making. The authors propose a methodology to examine emotions triggered by olfac-
tory-related information and how these could be simulated using visual cues in the
context of consumer decision-making online. The methodology combines approaches
from neuroscience with behavioral experiments. Their work studies the effectiveness
of triggering olfactory emotions using sensory congruent brand names in online ads
and also examines the influence on the consumers’ attitudes and intentions towards
brand and purchases. Results show that individual differences in olfactory sensitivity
moderate the effects on cognitive and emotional processes. This work has implica-
tions for online advertising and marketing, as well as for decisions made by consumers.
2.5 Conclusion
The research work presented at the Special Interest Group on Decision Support and
Analytics (SIGDSA) Workshop, held on December 12, 2015, in Fort Worth, TX, addressed a considerable variety of issues facing researchers in the business intel-
ligence and analytics area. The research work included here represents some of the
innovations taking place in analytics, combining theories not only from information systems but also from diverse fields such as neuroscience, psychology, behavioral
economics, and social sciences. Future research promises to be exciting, with opportunities to extend the literature and methodologies presented here to further the field of
decision support systems in the context of business intelligence.
Biographies
References
Agogo D, Hess TJ (2018) Scale development using Twitter data: applying contemporary natural
language processing methods in IS research. In: Deokar A, Gupta A, Iyer L, Jones MC (eds)
Analytics and data science: advances in research and pedagogy. Springer annals of information
systems series. 163–178. http://www.springer.com/series/7573
Bedeley RT, Ghoshal T, Iyer LS, Bhadury J (2018) Business analytics capabilities and use: a value
chain perspective. In: Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data science:
advances in research and pedagogy. Springer annals of information systems series. 41–54.
http://www.springer.com/series/7573
Bendler J, Brandt T, Neumann D (2018) Does social media reflect metropolitan attractiveness?
Behavioral information from Twitter activity in urban areas. In: Deokar A, Gupta A, Iyer L,
Jones MC (eds) Analytics and data science: advances in research and pedagogy. Springer
annals of information systems series. 119–142. http://www.springer.com/series/7573
DeLone WH, McLean ER (2003) The DeLone and McLean model of information systems suc-
cess: a ten-year update. J Manag Inf Syst 19(4):9–30
Dooley PP, Levy Y, Hackney RA, Parrish JL (2018) Critical value factors in business intelligence
systems implementations. In: Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data
science: advances in research and pedagogy. Springer annals of information systems series.
55–78. http://www.springer.com/series/7573
Isik O (2018) Big data capabilities: an organizational information processing perspective. In:
Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data science: advances in research
and pedagogy. Springer annals of information systems series. 29–40. http://www.springer.com/
series/7573
Lin M-H, Cross SNN, Jones WJ, Childers TL (2018) Online information processing of scent-
related words and implications for decision making. In: Deokar A, Gupta A, Iyer L, Jones
MC (eds) Analytics and data science: advances in research and pedagogy. Springer annals of
information systems series. 197–216. http://www.springer.com/series/7573
Porter ME (2001) Strategy and the Internet. Harv Bus Rev 79(3):62–78
Qiao Z, Wang A, Zhou M, Fan W (2018) The impact of customer reviews on product innovation: empirical evidence in mobile apps. In: Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics
and data science: advances in research and pedagogy. Springer annals of information systems
series. 95–110. http://www.springer.com/series/7573
Ramakrishnan T, Khuntia J, Saldanha T, Kathuria A (2018) Business intelligence capabilities. In:
Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data science: advances in research
and pedagogy. Springer annals of information systems series. 15–27. http://www.springer.com/
series/7573
Sharma S, Gupta B (2018) Information privacy on online social networks: illusion-in-progress in
the age of big data? In: Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data science:
advances in research and pedagogy. Springer annals of information systems series. 179–196.
http://www.springer.com/series/7573
Song Y, Arnott D, Gao S (2018) Business intelligence system use in Chinese organizations. In:
Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data science: advances in research
and pedagogy. Springer annals of information systems series. 79–94. http://www.springer.com/
series/7573
Szczech M, Turetken O (2018) The competitive landscape of mobile communications industry in
Canada—predictive analytic modeling with Google Trends and Twitter. In: Deokar A, Gupta
A, Iyer L, Jones MC (eds) Analytics and data science: advances in research and pedagogy.
Springer annals of information systems series. 143–162. http://www.springer.com/series/7573
Zhang J (2018) Whispering on social media. In: Deokar A, Gupta A, Iyer L, Jones MC (eds)
Analytics and data science: advances in research and pedagogy. Springer annals of information
systems series. 111–118. http://www.springer.com/series/7573
Chapter 3
Business Intelligence Capabilities
T. Ramakrishnan (*)
College of Business, Prairie View A&M University, 805 A.G. Cleaver St.,
Agriculture/Business Multipurpose Building, Room 447, Prairie View, TX 77446, USA
e-mail: [email protected]
J. Khuntia
Business School, University of Colorado Denver,
1475 Lawrence Street, Denver, CO 80202, USA
e-mail: [email protected]
A. Kathuria
Faculty of Business & Economics, The University of Hong Kong, Pokfulam, Hong Kong
e-mail: [email protected]
T.J.V. Saldanha
Carson College of Business, Washington State University,
Todd Hall, Pullman, WA 99164, USA
e-mail: [email protected]
3.1 Introduction
BI process capability is the ability of BI to penetrate into a firm's business processes. This capability examines the functionalities of BI that can sustain both B2B-centric and customer-centric activities. We argue that BI helps organizations by sup-
porting the business processes that give a firm a competitive advantage. Business
processes in a firm help orient its activities towards value creation. To create value,
a firm needs to perform at least three activities: first, operations that convert goods into products or services (i.e., operations); second, maintaining relationships with other firms that supply materials and products to the firm (e.g., firms in the supply chain); and third, orienting its operations to deliver products and services to customers (i.e., customer-oriented activities). As noted previously in this paper, the operational BI
capabilities are embedded within infrastructural development related to BI, or, in
other words, the infrastructural BI development caters to the operations.
Prior studies recognize BI integration as critical for the successful
utilization of BI (Isik et al. 2013). Integration refers to combining different types of
explicit data and information into novel patterns and relations (Herschel and Jones
2005). Based on the existing literature, we posit that organizations need to develop ways
to acquire and convert business intelligence in order to improve organizational performance.
We argue that BI integration capability has two dimensions that contribute to organizational performance, albeit in an interconnected manner. First, BI
acquisition consists of gathering data from different types of sources across the
organization and beyond, in addition to data aggregation, rollup and partitioning.
Data extracted from operational systems need to be cleansed and transformed in order to make them suitable for error-free use (Ramakrishnan et al. 2012). Second,
the data need to be converted to usable patterns and schemas to help an organization
to glean more insights from the data. Thus, BI Integration consists of the acquisition
of data from various sources, followed by the conversion of data to the right format
and quality in order to be used effectively in the organization.
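To make these two interconnected steps concrete, the following minimal Python sketch (an illustration only, not drawn from the chapter) acquires data from two hypothetical sources and then cleanses, transforms, and rolls it up into an analysis-ready form; the file names, columns, and cleansing rules are all assumptions.

import pandas as pd

def acquire() -> pd.DataFrame:
    # Gather data from different (hypothetical) sources and combine them
    orders = pd.read_csv("operational_orders.csv")   # internal operational system
    clicks = pd.read_json("web_clickstream.json")    # external behavioral source
    return orders.merge(clicks, on="customer_id", how="left")

def convert(raw: pd.DataFrame) -> pd.DataFrame:
    # Cleanse and transform the acquired data into a usable, aggregated form
    clean = raw.drop_duplicates().copy()
    clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
    clean = clean.dropna(subset=["customer_id", "order_amount"])
    clean["month"] = clean["order_date"].dt.to_period("M")
    # Roll up to one row per customer per month, ready for downstream BI use
    return clean.groupby(["customer_id", "month"], as_index=False).agg(
        total_spend=("order_amount", "sum"), page_views=("page_views", "sum"))

if __name__ == "__main__":
    print(convert(acquire()).head())

In practice this acquisition-and-conversion pattern is typically implemented with ETL tools and a data warehouse rather than a single script, but the sequence of steps is the same.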
As much as the acquisition and integration of business intelligence from various
sources is a prerequisite for the utilization of BI capabilities, the outcome of the acqui-
sition and conversion through integration helps to achieve higher organizational
performance. For instance, customer centric activities require acquisition of busi-
ness intelligence regarding customer behavior and experience, which in turn provides insights regarding goals and requirements. Second, the gathering and
aggregation of data from different types of sources across the organization and beyond
enables the organization to leverage BI to adequately respond to market and envi-
ronmental changes. Hence BI can provide insights regarding the nature of change to
which the organization needs to adapt, as well as the internal changes required to do
so. Third, aggregation, cleansing and transformation of this data can make this data
more substantive and insightful, thereby making subsequent decisions faster and
more effective. Thus, a BI integration capability that facilitates the gathering and cleaning of data from disparate data sources and provides decision-makers with timely and usable information will make BI more effective.
Biographies
management, and Ecommerce. His work appears in such journals as MIS Quarterly,
Decision Support Systems, and Computers in Human Behavior.
References
Barney J (1991) Firm resources and sustained competitive advantage. J Manag 17:99–120
Bharadwaj AS (2000) A resource-based perspective on information technology capability and firm
performance: an empirical investigation. MIS Q 24(1):169–196
Bhatt GD, Grover V (2005) Types of information technology capabilities and their role in competi-
tive advantage: an empirical study. J Manag Inform Syst 22(2):253–277
Buchanan L, O’Connell A (2006) A brief history of decision making. Harv Bus Rev 84(1):32–40
Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology.
Commun ACM 54(8):88–98
Chen H (2011) Design science, grand challenges, and societal impacts. ACM Trans Manag Inform
Syst 2(1):1:1–1:10
Cooper BL, Watson HJ, Wixom BH, Goodhue DL (2000) Data warehousing supports corporate
strategy at First American Corporation. MIS Q 24(4):547–567
Davenport TH (2006) Competing on analytics. Harv Bus Rev
Elbashir MZ, Collier PA, Davern MJ (2008) Measuring the effects of business intelligent systems:
the relationship between business process and organizational performance. Int J Account Inf
Syst 9:135–153
Gebauer J, Schober F (2006) Information system flexibility and the cost efficiency of business
processes. J Assoc Inform Syst 7(3):122–145
Gold AH, Malhotra A, Segars AH (2001) Knowledge management: an organizational capabilities
perspective. J Manag Inform Syst 18(1):185–214
Harding W (2003) BI crucial to making the right decisions. Financ Exec 19(2):256–268
Henschen D (2008) Special report: business intelligence gets smart. Inf Week
Herschel RT, Jones NE (2005) Knowledge management and business intelligence: the importance
of integration. J Knowl Manag 9(4):45–55
Hostmann B, Herschel G, Rayner N (2007) The evolution of business intelligence: the four worlds.
Gartner report. http://www.gartner.com/DisplayDocument?id=509002
Hwang H, Ku C, Yen DC, Cheng C (2004) Critical factors influencing the adoption of data warehouse technology: a study of the banking industry in Taiwan. Decision Support Syst 37:1–21
Isik O, Jones MC, Sidorova A (2013) Business intelligence success: the roles of BI capabilities and
decision environments. Inf Manag 50:13–23
Kim G, Shin B, Kim KK, Lee HG (2011) IT capabilities, process-oriented dynamic capabilities,
and firm financial performance. J Assoc Inform Syst 12(7):487–517
Li S, Shue L, Lee S (2008) Business intelligence approach to supporting strategy-making of ISP
service management. Exp Syst Appl 35:739–754
Lusch RF, Liu Y, Chen Y (2010) The phase transition of markets and organizations: the new intel-
ligence and entrepreneurial frontier. IEEE Intell Syst 25(1):71–75
Massa S, Testa S (2005) Data warehouse-in-practice: exploring the function of expectations in
organizational outcomes. Inf Manag 42:709–718
Melville N, Kraemer K, Gurbaxani V (2004) Information technology and organizational perfor-
mance: an integrative model of IT business value. MIS Q 28(2):283–322
Moss LT, Atre S (2007) Business intelligence roadmap. Pearson Education Inc., Boston
Olszak CM, Ziemba E (2003) Business intelligence as a key to management of an enterprise. In:
Proceedings of Informing Science and IT Education, Santa Rosa, CA
Parikh AA, Haddad J (2008) Right-time information for the real-time enterprise. DM review.
http://www.dmreview.com/dmdirect/2008_92/10002003-1.html?portal=data_quality
Park Y (2006) An empirical investigation of the effects of data warehousing on decision perfor-
mance. Inf Manag 43(1):51–61
Petrini M, Pozzebon M (2009) Managing sustainability with the support of business intelligence:
integrating socio-environmental indicators and organizational context. J Strateg Inf Syst
18:178–191
Popovic A, Hackney R, Coelho PS, Jaklic J (2012) Towards business intelligence systems suc-
cess: effects of maturity and culture on analytical decision making. Decision Support Syst
54:729–739
Pugh DS (1990) Organization theory: selected readings. Penguin, Harmondsworth
Ramakrishnan T, Jones MC, Sidorova A (2012) Factors influencing business intelligence (BI) data
collection strategies: an empirical investigation. Decision Support Syst 52:486–496
Rockmann R, Weeger A, Gewald H (2014) Identifying organizational capabilities for the
enterprise-wide usage of cloud computing. In: Proceedings of the Pacific Asia Conference on
Information Systems
Sabherwal R, Kirs P (1994) The alignment between organizational critical success factors and
information technology capability in academic institutions. Decis Sci 25(2):301–330
Sahay BS, Ranjan J (2008) Real time business intelligence in supply chain analytics. Inf Manag
Comput Secur 16(1):28–48
Sambamurthy V, Zmud R (1997) At the heart of success: organizational wide management compe-
tencies. In: Sauer C, Yetton P, Alexander L (eds) Steps to the future: fresh thinking on the man-
agement of IT-based organizational transformation. Jossey-Bass, San Francisco, pp 143–163
Schaefferling A (2013) Determinants and consequences of IT capability: review and synthesis of
the literature. In: Proceedings of the nineteenth American conference on information systems,
Chicago, IL
Sukumaran S, Sureka A (2006) Integrating structured and unstructured data using text tagging and
annotation. Bus Intell J 11(2):8–17
Tabbitt S (2013) BI services market predicted to double by 2016. Information Week
Teece DJ, Pisano G, Shuen A (1997) Dynamic capabilities and strategic management. Strat Manag
J 18(7):509–533
Thamir A, Poulis E (2015) Business intelligence capabilities and implementation strategies. Int
J Glob Bus 8(1):34–45
Turban E, Sharda R, Aronson JE, King D (2008) Business intelligence: a managerial approach.
Prentice Hall, Upper Saddle River
Watson HJ, Wixom H (2007) Enterprise agility and mature BI capabilities. Bus Intell J 12(3):13–28
Watson HJ, Abraham D, Chen D, Preston D, Thomas D (2004) Data warehousing ROI: justifying
and assessing a data warehouse. Bus Intell J:6–17
White C (2005) The next generation of business intelligence: operational BI. Information Management
Magazine. http://www.information-management.com/issues/20050501/1026064-1.html
Wixom BH, Watson HJ (2001) An empirical investigation of the factors affecting data warehous-
ing. MIS Q 25(1):17–41
Wixom BH, Watson HJ, Werner T (2011) Developing an enterprise business intelligence capability: the Norfolk Southern journey. MIS Q Exec 10(2):61–71
Chapter 4
Big Data Capabilities: An Organizational
Information Processing Perspective
Öykü Isik
Abstract Big data is at the pinnacle of its hype cycle, offering big promise.
Everyone wants a piece of the pie, yet not many know how to start and get the most
out of their big data initiatives. We suggest that realizing benefits with big data
depends on having the right capabilities for the right problems. When there is a
discrepancy between these, organizations struggle to make sense of their data.
Based on information processing theory, in this research-in-progress we suggest
that there needs to be a fit between big data processing requirements and big data
processing capabilities, so that organizations can realize value from their big data
initiative.
4.1 Introduction
At the World Economic Forum in Davos every year, many public figures, politicians, and some of the brightest minds come together to discuss the world's most important new developments. Since 2012, data has been discussed as a “critical new form of economic
currency” (World Economic Forum Briefing 2012). Thanks to the availability of
new data sources, the hyper-connectivity through social media and the internet of
things, and digitalization of our business processes, big data is revolutionizing the
way we interact with not only the businesses around us, but also the governments.
In addition to the mind-boggling increase in data volumes, the rise of analytics and quan-
tification has also motivated organizations to leverage data for competitive
Ö. Isik (*)
Information Systems Management, Vlerick Business School,
Vlamingenstraat 83, Leuven 3000, Belgium
e-mail: [email protected]
advantage. As many organizations try and fail, it has become evident that organizations often lack not only the technological means but also the necessary organizational capabilities to process this voluminous and differently structured strategic resource.
Every now and then we hear success stories, such as how Macy’s can optimize
pricing of their 73 million items for sale in near-real time (Davenport and Dyché
2013), and how Tesco can do proactive maintenance by running analytics on 70 mil-
lion refrigerator data points coming off its units (Goodwin 2013), or how Netflix
managed to defeat its biggest competitor, Blockbuster, with only an algorithm and
petabytes of data (Madrigal 2014). But, organizations cannot expect to ‘go from
zero to Netflix overnight’ (Simon 2014). First, they need to figure out what business
uncertainties they desire to address by processing big data. Then, it takes a well-
planned process to organize the big data initiative, from developing the business case and understanding the business requirements to figuring out the necessary capabilities to address those requirements. When there is a discrepancy between
these requirements and capabilities, organizations struggle to make sense of their
data. Hence, we suggest that there needs to be a fit between big data processing
requirements and big data processing capabilities, so that organizations can realize
value from their big data initiative. To achieve the objective of finding support for
our arguments, three research questions were formulated: (1) What elements consti-
tute big data processing capabilities? (2) What elements constitute big data process-
ing requirements? (3) How does the fit between big data processing capabilities and
requirements impact value realization from the big data initiatives?
We intend to address the research questions through a quantitative analysis of
survey data. After refining the research model based on expert interviews, an instru-
ment will be developed. Following the survey-based data collection phase, quantita-
tive data analysis will be conducted. As a result, we expect to observe different
levels of big data processing requirements as well as capabilities that can be clus-
tered in four groups, based on high and low capabilities, as well as high and low
levels of uncertainty. We expect to observe different levels of performance among
these configurations.
This study not only addresses a very relevant topic, but does so rigorously. It
contributes to the literature by improving the current understanding around what
capabilities organizations need to have for big data success. Thus far, business lit-
erature has merely pointed to a number of variables that can play an important role
for big data (e.g., hiring the right data scientists (Davenport and Patil 2012), top management support (McAfee and Brynjolfsson 2012), and the right IT infrastructure (LaValle et al. 2011)); however, empirical validation of these variables has been limited, if not
non-existent.
We design this research to empirically test big data capabilities, big data require-
ments and their fit by collecting quantitative data through an online survey. We use
the established theory of organizational information processing as the foundation of
our work. Besides testing the impact of fit on big data value realization, we also
suggest and assess several elements that may contribute to big data processing capa-
bilities as well as requirements.
The definition of big data has been evolving along with the enabling technologies as well as
the expectations surrounding the concept. While the earlier definitions focused on
volume by emphasizing “data sets that can no longer be easily managed or analyzed
with traditional data management tools, methods and infrastructures” (Rogers 2011),
later on the velocity of data (i.e., speed of data) and the variety of data (i.e., the for-
mat and structure of the data) were also added to the discussion and together, they are
referred to as the ‘3 V’s’ of big data. More recently, value (i.e. extraction of benefits
from data) and veracity (i.e., data and data source quality) were included and the defi-
nition was extended to ‘5 V’s’ (Fosso Wamba et al. 2015). We adopt Fosso Wamba
et al.’s (2015) definition and suggest that ‘big data’ is a holistic approach to manage,
process and analyze 5 V’s in order to create actionable insights for sustained value
delivery, measuring performance and establishing competitive advantage.
Big data, coming from social media streams, banking transactions, sensors, GPS
signals and countless other sources, may create new business opportunities in almost
every industry (Gobble 2013). Yet, according to Gartner Group, big data is now in
the “trough of disillusionment” phase of their hype cycle (Sicular 2013). This means
that many organizations, even though they may have great ideas and opportunities,
are disappointed with the difficulty of figuring out how to organize for their initia-
tives as well as the lack of reliable solutions that go beyond traditional vendor offer-
ings. Therefore, a degree of negative hype now surrounds the topic. Organizations are
also grappling with cultural issues. For example, a retail organization heavily
invested in new models and tools to optimize their returns on advertising, only to
find out none of the frontline marketers were using them because they did not under-
stand how the model worked and didn’t believe in its results (Barton and Court
2012). Several other suggestions have been made with regards to why organizations
struggle in their big data initiative; such as the lack of transparency between teams
working on big data (Perrey and Arikr 2014), lack of qualified data scientists in the
organization (Forsyth et al. 2014; Perrey and Arikr 2014) and the necessity of defin-
ing a realistic business case before delving into the data (Menon 2013).
These suggestions imply that the factors that drive big data projects to success are not well understood, and that we still do not have a clear approach that organiza-
tions can take for better performance with big data. Hence, we should look deeper
into organizations that have started gaining positive value out of their big data initia-
tives, and understand how they’re doing it. Current success stories imply that these
organizations think differently about their data management methods as well as
information processing capabilities to take advantage of this new resource (LaValle
et al. 2011; McAfee and Brynjolfsson 2012). Yet, academic literature has yet to
document the critical elements that may make or break an organization’s big data
initiative.
Several BI and analytics success models might be relevant for this research. For
instance, Dinter et al. (2011) build on the DeLone and McLean IS success model to suggest a model for BI success, yet they have a distinct implementation
success focus. On the other hand, Isik et al. (2013) have approached BI success
from a capabilities perspective and suggested that, depending on the different nature
of the decision to be made, certain BI capabilities may be more important than oth-
ers. Another capabilities approach was used by Sidorova and Torres (2014), where
the authors have distinguished between internal and external data, and suggested
that capability building is key to success with BA. Yet none of these models incor-
porate uncertainty as a factor, they also do not assess whether these capabilities are
being utilized for the right reasons. That is why a fit perspective is necessary. Using
organizational information processing theory can help us make the case for certain
capabilities being more critical for certain situations.
The power of big data depends heavily upon the context in which it’s used
(McAfee and Brynjolfsson 2012), and one key to success may be to have the right
capabilities in place for that specific context (Davenport et al. 2012). The capabili-
ties of the organization should be sufficient to meet the requirements of the business
case put forward for big data. One lens that can be used to examine this match
between capabilities and requirements is the Organizational Information Processing
Theory (OIPT). OIPT emerged as a result of an increasing understanding that infor-
mation is possibly the most important element of today’s organizations (Fairbank
et al. 2006; Galbraith 1977). OIPT focuses on information processing requirements
(IPR), information processing capability (IPC), and the fit between them to obtain
the best possible performance in an organization (Premkumar et al. 2005). In this
context, information processing is defined as the gathering, analysis and synthesis
of data for decision making (Tushman and Nadler 1978), and IPR are the means to
reduce uncertainty (Daft and Lengel 1986). Uncertainty is the difference between
information acquired and information needed to complete a task (Galbraith 1977;
Premkumar et al. 2005; Tushman and Nadler 1978). Organizations that face uncer-
tainty must acquire more information to learn more about their environment (Daft
and Lengel 1986). When tasks are non-routine or highly complex (as is mostly the case with big data), uncertainty is high; hence IPR are greater for effective perfor-
mance (Daft and Lengel 1986). Not surprisingly, when it comes to discovery and
experimentation with big data, uncertainty increases significantly. Hence, it is criti-
cal that organizations build the right capabilities to minimize big data related
uncertainty.
Organizations can only benefit from big data if they can manage to process, ana-
lyze and to turn it into useful knowledge, therefore it makes sense to study big data
from an OIPT perspective. Although there is research using OIPT to explain various
IS phenomena (e.g. Fairbank et al. 2006; Premkumar et al. 2005), there is very little
focusing on business intelligence and analytics (e.g. Cao et al. 2015), and none on
big data. This research suggests that the performance of big data initiatives signifi-
cantly depend on the fit between IPC and IPR of the organization, specifically
within the context of their big data initiatives (see Fig. 4.1 for the research model).
We posit that the value realized in big data initiatives depends on how well uncer-
tainty is minimized by the IPC of the organization. In line with the OIPT literature,
we suggest environmental uncertainty is an important factor contributing to pro-
cessing requirements for big data as organizations in high uncertainty environments
[Fig. 4.1 Research model: environmental and contextual uncertainty shape big data processing requirements, technological and organizational capabilities shape big data processing capabilities, and the fit between the two drives big data value realization]
require more information processing, and, in turn, need more data to reduce uncer-
tainty (Daft and Lengel 1986; Karimi et al. 2004; Premkumar et al. 2005). We also
suggest contextual uncertainty as a source of uncertainty pertaining to big data; it
can be defined as the potential biases, ambiguities, and inaccuracies in the data
which need to be identified and accounted for to improve the accuracy of generated
insights (Lukoianova and Rubin 2013; Schroeck et al. 2012). IBM suggests that in
2015, 80% of all available data is uncertain, and this share will continue to increase (Claverie-Berge 2012). While data quality is a significant portion of this issue, enterprise data
that can be subjected to data quality improvement (such as enterprise data quality
solutions) forms only a fraction of the total data enterprises analyze. Most organiza-
tions include external data sources in their big data initiatives, such as social media
accounts. These external data sources significantly increase data uncertainty, both in
terms of content and expression. The ambiguity and lack of verifiability of these
data increases contextual uncertainty. Environmental uncertainty and contextual
uncertainty together influence the amount of processing an organization needs to do
for value generation with big data.
Organizations may target different benefits from their big data initiatives. An organization may use big data to change its products or the way it competes; this makes big data a strategic initiative. How Netflix competes based on
its Cinematch algorithm is a good example for this. On the other hand, the purpose
of the big data projects could be to cut costs and improve operational efficiency of
the organization; such as how Tesco cut its annual refrigeration cooling costs by
20% across the UK and Ireland by analyzing gigabytes of refrigeration data (Goodwin
2013)—this makes the big data project a transactional initiative. Finally, it can be all
about bringing hidden information to light and to actually realize the knowledge
residing in the organization. This would indicate an informational initiative; an example is LinkedIn, which developed several products, including People You May Know and
Who’s Viewed My Profile, based on the data they have already been collecting
(Davenport and Dyché 2013). Acknowledging these different types of benefits an
organization may realize through their big data projects, we prefer a wider defini-
tion of benefit realization and adopt the net benefits concept (DeLone and McLean
2003) which represents the individual and/or organizational impacts a certain IS
investment has.
We suggest that the processing capabilities depend on the technological as well
as organizational capabilities of the organization (Barton and Court 2012).
Technological capabilities include the hardware and software that is being utilized
for big data analytics; these capabilities should be sufficient to handle the ‘5
V’s’ of big data mentioned earlier. Organizational capabilities refer to the availabil-
ity of the right skill set for big data analytics (Viaene and Van den Bunder 2011),
analytical decision-making culture (Popovič et al. 2012), and top management sup-
port and championship of the big data initiative (Barton and Court 2012), which
refers to the extent to which top management believes in the value of and actively
participates in the efforts related to big data initiatives (Liang et al. 2007). Prior
research has confirmed the importance of some of these organizational elements in BI and analytics environments (Isik et al. 2013), but their impact on big data initiatives is yet to be confirmed.
To obtain value from a big data initiative, big data processing requirements
(BDPR) should match the big data processing capabilities (BDPC) of an organiza-
tion. We posit that if organizations do not purposefully match their IPR with the
IPC, configurations of misfit will occur and performance standards will be lower.
For instance, if an organization is interested in strategic benefits from Big Data, it is
more likely that they will include more data from a variety of sources and deal with high uncertainty not only in the environment but also within the data itself. To be able to
manage these data, the technological as well as organizational capabilities should be
on par. If not, less than optimal capabilities will render big data analytics ineffec-
tive. On the other hand, if the organization is interested in a rather isolated applica-
tion of big data, using only internal sources, they would need to deal with less
uncertainty compared to the previous example. Yet, having a high level of IPC in this situation would be resource overkill (Mani et al. 2010). This is the case not only because unused resources will lead to inefficiency but also because their management would be unnecessarily costly. So, while low uncertainty configurations may lead to benefits with low levels of capabilities, high levels of uncertainty would require high
levels of capabilities. The other two configurations (low capability, high uncertainty
and high capability, low uncertainty) represent misfit and will not realize benefits as
much as the fit configurations.
4.3 Methodology
Our approach to empirically evaluating the research model consists of four phases:
(1) research model fine-tuning, (2) instrument design, (3) pilot testing, and (4) final
data collection and analysis.
In the first phase of this research, an extensive literature review is carried out to fine-
tune the conceptual model, using academic literature as well as industry outputs.
Specific attention is paid to success and failure studies.
The required data for this study will be collected through an online survey. The
instrument will measure variables using multiple indicator items derived from vali-
dated instruments used in prior research where possible. For variables that have no
validated measures, items will be developed based on the comprehensive literature
review and the interview round mentioned in the first phase above. These instru-
ments will be tested for clarity of content, scope as well as purpose (representing
content validity) by academicians and practitioners active within the field of big
data.
Operationalization of the concept of fit has long been a topic of discussion in aca-
demia. As Galbraith and Nathanson (1979) observed, “although the concept of fit is
a useful one, it lacks the precise definition needed to test and recognize whether an
organization has it or not” (p. 266). As a lack of correspondence between the fit
concept and how it is statistically formulated may lead to inconsistent research
results (Venkatraman 1989), it is critical to pay adequate attention to its formula-
tion. Venkatraman’s (1989) seminal work on the fit concept in strategy literature
provides a useful classification scheme that groups fit studies under six perspec-
tives; fit as moderation, fit a mediation, fit as matching, fit as covariation, fit as
gestalts and fit as profile deviation; each with distinct theoretical implications and
specific analytical methods.
This research conceptualizes fit as matching, and evaluates the impact of fit on
big data value realization through the impact of the interaction effect between
BDPR and BDPC. This is a proper approach because fit is specified with a reference
to a criterion variable, specifically ‘big data value realization’ in this research
(Umanath 2003). Even though there are various objections to multiplicative models
that measure interaction, this form of operationalization has been shown to be rea-
sonably robust (Kim and Umanath 1992; Umanath 2003; Venkatraman 1989). The
fit between BDPR and BDPC will be examined by evaluating four clusters formed
by the interaction of these two variables. The number of clusters may be adapted in
case additional constructs are added to the model based on the outcome of the
research model fine-tuning phase. The interaction of BDPR and BDPC will be cap-
tured by the interaction effects of the main factors in ANOVA (Premkumar et al.
2005). The interaction effects are expected to have a greater impact on the value
realization compared to the main effects (Premkumar et al. 2005; Venkatraman 1989), because the model posits that it is the notion of fit, and not the individual effects of BDPR or BDPC, that results in better value realization.
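To illustrate the fit-as-matching analysis described here (a sketch under assumed variable names and simulated data, not the chapter's actual analysis), the Python snippet below median-splits BDPR and BDPC scores into high/low levels to form the four configurations and fits a two-way ANOVA in which the interaction term operationalizes fit.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "bdpr": rng.normal(4, 1, n),   # big data processing requirements (survey score)
    "bdpc": rng.normal(4, 1, n),   # big data processing capabilities (survey score)
})
# Simulated outcome in which matching high/low levels (fit) drives value realization
fit = (df["bdpr"] > df["bdpr"].median()) == (df["bdpc"] > df["bdpc"].median())
df["value"] = 3 + 0.2 * df["bdpr"] + 0.2 * df["bdpc"] + fit.astype(float) + rng.normal(0, 0.5, n)

# Median splits form the four fit/misfit configurations
df["req_level"] = np.where(df["bdpr"] > df["bdpr"].median(), "High", "Low")
df["cap_level"] = np.where(df["bdpc"] > df["bdpc"].median(), "High", "Low")
print(df.groupby(["req_level", "cap_level"])["value"].mean())

# Two-way ANOVA: the BDPR x BDPC interaction term captures fit as matching
model = smf.ols("value ~ C(req_level) * C(cap_level)", data=df).fit()
print(anova_lm(model, typ=2))

If fit drives value realization, the interaction term should account for more of the variance than either main effect, mirroring the expectation stated above.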
4.3.2.2 Measures
Big Data Information Processing Capabilities. This study suggests technological and organizational capabilities as the basis for BDPC. Technological capabilities include the hardware and software necessary to deal with high-volume, high-velocity, and high-variety data. Multi-item scales will be sought in the literature and, where necessary, developed for this construct. For organizational capabilities, validated measures already exist. This research considers the availability of the right skill set for analytics, an analytical decision-making culture, and top management support for the big data initiative as critical capabilities. Several studies aiming to identify the right skill set for big data analytics offer measures for this construct, e.g., Wixom et al. (2014) and Liberatore and Luo (2013). Through a literature review, a list of such studies will be finalized and a comprehensive list of items will be generated. Analytical decision-making culture will be measured with three items adapted from Popovič et al. (2012). The level of top management support and championship of the big data initiative will be measured with seven items validated by Bajwa et al. (1998).
Big Data Value Realization. The value an organization gains from big data depends on the benefits it obtains from the initiative. In this research, the net benefits construct will be used to operationalize the value realized with big data. The net benefits concept, first introduced by DeLone and McLean (2003) in their widely studied IS success model, represents the individual and/or organizational impacts of a given IS and thus its level of success. Given that this research has an organizational level of analysis, the instrument developed by Mirani and Lederer (1998) will be adapted to measure the organizational benefits derived from IS projects.
After designing the instrument, a pilot test will be carried out for further refinement and for validity and reliability checks. Data collection for this phase will be completed via our personal contacts in the industry, using snowball sampling. Based on the findings, the instrument will be updated.
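For the reliability checks mentioned above, the internal consistency of each multi-item scale is commonly summarized with Cronbach's alpha. A minimal sketch on hypothetical pilot data follows; the respondent counts and Likert items are placeholders, not the actual instrument.

```python
# Minimal sketch: Cronbach's alpha for a multi-item scale, computed on
# hypothetical pilot-test responses (5-point Likert items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 1))                                  # 60 pilot respondents
responses = np.clip(np.round(3 + latent + rng.normal(0, 0.7, size=(60, 3))), 1, 5)
print(f"alpha = {cronbach_alpha(responses):.2f}")                  # values above ~0.7 are usually deemed acceptable
```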
After finalization of the survey, the required data for this study will be collected by reaching out to small to large organizations that have big data initiatives varying in scope. Our institution's research partnerships with several European financial services institutions will be leveraged for data collection; in this way, we will also control for context and industry.
The frequent emphasis on this as a factor affecting big data analytics confirms the importance of including environmental uncertainty in our model.
Currently, in addition to analyzing the above-mentioned interviews, we are scheduling interviews with senior decision makers in the healthcare industry who are involved in big data projects; we target ten interviews. The intention of this research is to represent, and control for, a wide variety of industries in order to increase the generalizability of our findings.
4.5 Conclusion
The initial projects that have leveraged big data have provided some organizations with big returns and enabled them to disrupt their markets. For instance, shipping companies have improved their fleets' on-time performance by using specialized weather forecast data and real-time information on port availability (Barton and Court 2012), and airlines have improved their estimated times of arrival by combining publicly available weather data, flight schedules, and other proprietary data (McAfee and Brynjolfsson 2012); such cases serve as evidence that, when done right, big data can transform the way organizations do business. Hence, this study has high practical relevance; many C-level executives are placing the topic at the top of their agendas. Instead of sharing only high-level, common best practices with executives, findings from this research will enable us to talk about the specific capabilities they need to develop and the environmental challenges they need to be careful about.
The key contribution of this research is its suggestion that the alignment of capabilities and needs is essential for benefit realization. This is especially important given that many organizations today are trying to do too much without a clear strategy, simply so as not to miss the big data bandwagon. We believe this study will also contribute to the IS literature by (1) providing a comprehensive overview of the internal and external factors influencing the success of big data initiatives, (2) examining both the technological and organizational capabilities necessary to obtain value with big data, and (3) being the first study to examine the big data phenomenon through an OIPT lens.
References
Bajwa DS, Rai A, Brennan I (1998) Key antecedents of executive information system success: a
path analytic approach. Decis Support Syst 22(1):31–43
Barton D, Court D (2012) Making advanced analytics work for you. Harv Bus Rev 90(10):78–83
Cao G, Duan Y, Li G (2015) Linking business analytics to decision making effectiveness: a path
model analysis. Trans Eng Manage 62(3):384–395
Claverie-Berge I (2012) Solutions Big Data IBM. IBM Presentation, 13 Mar 2012. http://www-05.
ibm.com/fr/events/netezzaDM_2012/Solutions_Big_Data.pdf
Daft RL, Lengel RH (1986) Organizational information requirements, media richness and struc-
tural design. Manag Sci 32(5):554–571
Davenport TH, Barth P, Bean R (2012) How ‘Big Data’ is different. MIT Sloan Manag Rev
54(1):21–24
Davenport TH, Dyché J (2013) Big Data in big companies. SAS International Institute for Analytics
Report. http://www.sas.com/resources/asset/Big-Data-in-Big-Companies.pdf
Davenport TH, Patil DJ (2012) Data scientist: the sexiest job of the 21st century. Harv Bus Rev
90(10):70–76
DeLone WH, McLean ER (2003) The DeLone and McLean model of information systems suc-
cess: a ten-year update. J Manag Inf Syst 19(4):9–30
Dinter B, Schieder C, Gluchowski P (2011) Towards a Life Cycle Oriented Business Intelligence
Success Model. AMCIS 2011 Proceedings
Fairbank JF, Labianca G, Steensma HK, Metters RD (2006) Information processing design
choices, strategy and risk management performance. J Manag Inf Syst 23(1):293–319
Forsyth J, Moorman C, Spittaels S (2014) Recruit better data analysts. HBR Blog Network, 14 Feb
2014. http://blogs.hbr.org/2014/02/recruit-better-data-analysts/
Galbraith J (1977) Organizational design. Addison-Wesley, Reading, MA
Galbraith JR, Nathanson D (1979) The role of organizational structure and process in strategy
implementation. In: Schendel D, Hofer CW (eds) Strategic management: a new view of busi-
ness policy and planning. Little, Brown, Boston, pp 249–283
Gobble MM (2013) Big Data: the next big thing in innovation. Res Technol Manag 56(1):64–67
Goodwin B (2013) Tesco uses Big Data to cut cooling costs by up to €20m. Computer Weekly, 22
May 2013. http://www.computerweekly.com/news/2240184482/Tesco-uses-big-data-to-cut-
cooling-costs-by-up-to-20m
Isik O, Jones M, Sidorova A (2013) Business intelligence success: the roles of BI capabilities and
decision environments. Inf Manag 50(1):13–23
Karimi J, Somers TM, Gupta YP (2004) Impact of environmental uncertainty and task characteris-
tics on user satisfaction with data. Inf Syst Res 15(2):175–193
Kim K, Umanath NS (1992) Structure and perceived effectiveness of software development units:
a task contingency analysis. J Manag Inf Syst 9(3):157–181
LaValle S, Lesser E, Shockley R, Hopkins MS, Kruschwitz N (2011) Big Data, analytics and the
path from insights to value. MIT Sloan Manag Rev 52(2):21–31
Liang H, Saraf N, Hu Q, Xue Y (2007) Assimilation of enterprise systems: the effect of institu-
tional pressures and the mediating role of top management. MIS Q 31(1):59–87
Liberatore M, Luo W (2013) ASP, the art and science of practice: a comparison of technical and
soft skill requirements for analytics and OR professionals. Interfaces 43(2):194–197
Lukoianova T, Rubin VL (2013) Veracity roadmap: is Big Data objective, truthful and credible?
Adv Class Res Online 24(1):4–15
Madrigal AC (2014) How Netflix reverse engineered hollywood. The Atlantic, 2 Jan. http://www.the-
atlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/
McAfee A, Brynjolfsson E (2012) Big Data: the management revolution. Harv Bus Rev
90(10):60–68
Menon S (2013) Stop assuming your data will bring you riches. HBR Blog Network, 20 Sept.
http://blogs.hbr.org/2013/09/stop-assuming-your-data-will-bring-you-riches/
Mirani R, Lederer AL (1998) An instrument for assessing the organizational benefits of IS projects.
Decis Sci 29(4):803–838
Perrey J, Arikr M (2014) CMOs and CIOs need to get along to make big data work. HBR Blog Network,
4 Feb. http://blogs.hbr.org/2014/02/cmos-and-cios-need-to-get-along-to-make-big-data-work/
Popovič A, Hackney R, Coelho PS, Jaklič J (2012) Towards business intelligence systems suc-
cess: effects of maturity and culture on analytical decision making. Decis Support Syst
54(1):729–739
Premkumar G, Ramamurthy K, Saunders CS (2005) Information processing view of organiza-
tions: an exploratory examination of fit in the context of interorganizational relationships.
J Manag Inf Syst 22(1):257–294
Rogers S (2011) Big Data is scaling BI and analytics. information management, 01 Sept. http://
www.information-management.com/issues/21_5/big-data-is-scaling-bi-and-analytics-
10021093-1.html
Schroeck M, Shockley R, Smart J, Romero-Morales D, Tufano P (2012) Analytics: the real-world
use of Big Data. How innovative enterprises extract value from uncertain data. IBM Global
Business Services Business Analytics and Optimization Executive Report. http://www.sttho-
mas.edu/gradsoftware/files/BigData_RealWorldUse.pdf
Sicular S (2013) Big Data is falling into the trough of disillusionment. Gartner Blog Network, 22 Jan. http://
blogs.gartner.com/svetlana-sicular/big-data-is-falling-into-the-trough-of-disillusionment/
Sidorova A, Torres RR (2014) Business Intelligence and Analytics: A Capabilities Dynamization
View. AMCIS 2014 Proceedings
Simon P (2014) How to get over your inaction on Big Data. HBR Blog Network, February 24.
http://blogs.hbr.org/2014/02/how-to-get-over-your-inaction-on-big-data-2/
Tushman ML, Nadler DA (1978) Information processing as an integrating concept in organiza-
tional design. Acad Manag Rev 3(3):613–624
Umanath NS (2003) The concept of contingency beyond ‘It Depends’: illustrations from IS
research stream. Inf Manag 40(6):551–562
Venkatraman N (1989) The concept of fit in strategy research: toward verbal and statistical cor-
respondence. Acad Manag Rev 14(3):423–444
Viaene S, Van den Bunder A (2011) The secrets to managing business analytics projects. MIT
Sloan Manag Rev 53(1):65–69
Wamba SF, Akter S, Edwards AJ, Chopin G, Gnanzou D (2015) How ‘Big Data’ can make Big
Impact: findings from a systematic review and a longitudinal case study. Int J Prod Econ
165:234–246
Wixom B, Ariyachandra T, Douglas D, Goul M, Gupta B, Iyer L, Kulkarni U, Mooney JG, Phillips-
Wren G, Turetken O (2014) The current state of business intelligence in academia: the arrival
of Big Data. Commun Assoc Inf Syst 34(1):1–13
World Economic Forum Briefing (2012). Big Data, Big Impact: new possibilities for international
development, January. http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_
Briefing_2012.pdf
Chapter 5
Business Analytics Capabilities and Use:
A Value Chain Perspective
Abstract This paper presents a mapping of the business analytics (BA) capabilities of a firm from a value chain lens similar to Porter's (Harv Bus Rev 79:62–78, 2001) internet capabilities framework. The generally accepted classification of analytics into descriptive, predictive, and prescriptive is used as the basis for mapping BA capabilities. Using an extensive search of the academic and practitioner literature, analytics applications were analyzed and mapped onto the value chain framework. Given the increased interest and investment in BA, it is important to have a good understanding of what analytics capabilities firms use to enhance value through their value chain activities. We illustrate exemplar uses of BA applications, tools, and technologies used by firms. Preliminary results suggest that organizations are focusing on
T. Ghoshal (*)
Naveen Jindal School of Management, University of Texas at Dallas,
800 West Campbell Road, Dallas 75080, TX, USA
Department of Operations and Information Management, Isenberg School of Management,
University of Massachusetts at Amherst, Amherst, MA, USA
e-mail: [email protected]
R.T. Bedeley
Department of Operations and Information Management, Isenberg School of Management,
University of Massachusetts at Amherst, Amherst, MA, USA
UMass, Amherst, 315 Isenberg Building, 121 Presidents Drive, Amherst, MA 01003, USA
e-mail: [email protected]
L.S. Iyer
Walker College of Business, Appalachian State University,
287 Rivers St, Boone, NC 28608, USA
e-mail: [email protected]
J. Bhadury
School of Business Administration and Economics, The College at Brockport,
350 New Campus Drive, Brockport, NY 14420, USA
e-mail: [email protected]
5.1 Introduction
Several current studies in the literature have proposed models, typologies, and domains of BA in organizations (Chen et al. 2012; Holsapple et al. 2014; Wixom et al. 2013). Other studies focus on the supply chain analytics capabilities of organizations (Chae et al. 2014) from resource-based view (Barney 1991) and dynamic capabilities (Eisenhardt and Martin 2000) perspectives (Chae and Olson 2013). To understand the value created by BA in organizations, one important strand of research is to investigate how the value chain activities and processes of firms can be improved by the inclusion of BA. However, based on our review of the extant literature, no study has yet focused on a framework of analytics capabilities spanning the entire value chain or on empirically testing the application of different types of analytics capabilities in different activities of the value chain. In an effort to address this gap, this research accomplishes the first step by providing a framework that maps business analytics capabilities and applications to Porter's (1985) value chain perspective of a firm.
When viewed from the standpoint of applications, analytics can be classified into three major categories (Davenport 2013): descriptive analytics, predictive analytics, and prescriptive analytics. Descriptive analytics is used to answer 'what has happened?', predictive analytics to answer 'what could happen?', and prescriptive analytics to answer 'what should happen?' (Heching et al. 2013). Descriptive analytics can provide information about value chain components; predictive analytics can help in managerial planning, design, and value chain management; and prescriptive analytics can provide decision support tools and optimization based on the outcomes of descriptive and predictive analytics. Business organizations may deploy different combinations of the three types of analytics in their value chain. Hence, this study's research question: What are the prominent analytics capabilities that can be applied in different value chain activities of a firm?
In pursuit of this question, the objective of this study is to develop a framework to classify different types of analytics capabilities from a value chain perspective. To accomplish this, we reviewed both the academic and practitioner literature to uncover documented cases of BA applications in different industries and firms. Thereafter, each application was analyzed separately from two distinct standpoints: its categorical BA classification and its primary placement within Porter's (1985) value chain framework. This analysis was then used to develop the mapping given in the results section of the paper. Based on the findings, our future research will focus on developing a process model for how effective application of analytics can add value to organizations. In the following sections we provide the theoretical background and motivation, research methodology, preliminary analysis and results, and conclusion.
First, a brief description of Porter's (1985) value chain activities and the BA classification is provided, followed by a review of the extant literature on the application of analytics in different value chain activities.
Porter (1985) introduced the concept of the value chain in his book, arguing that every organization performs two distinct sets of activities to create value. The primary activities are those an organization employs in creating the physical product or service, marketing and delivering it, and supporting and servicing it after the sale. The supporting activities are the internal activities of the organization that provide the inputs and infrastructure needed to sustain the primary activities. Porter (1985) describes five generic primary activities in an organization's value chain: inbound logistics, operations, outbound logistics, marketing & sales, and after-sales service. The four supporting activities are procurement, technology development, human resource management, and firm infrastructure.
Analytics capabilities can generally be classified into three categories: descriptive analytics, predictive analytics, and prescriptive analytics (Davenport 2013). A brief discussion of each follows, together with a small illustrative sketch after the list:
• Descriptive Analytics: Descriptive analytics applies statistical analysis to summarize what is evident from the data, revealing current trends and key statistics of the available data. This type of analytics answers the question of what has happened.
• Predictive Analytics: Predictive analytics forecasts the future of a process, product, or activity, typically building on the results of descriptive analytics. This type of analytics answers the question of what could happen.
• Prescriptive Analytics: Prescriptive analytics is the most action-oriented type of analytics, prescribing an optimal course of action based on the results of descriptive and predictive analytics. This type of analytics answers the question of what should happen.
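To make the distinction concrete, the sketch below runs all three on a toy monthly sales series: a descriptive summary, a simple trend-based prediction, and a prescriptive stocking decision derived from that prediction. The data, unit profit, and overstock cost are invented for illustration only.

```python
# Toy illustration (hypothetical data) of the three analytics categories.
import numpy as np

sales = np.array([120, 132, 128, 140, 151, 149, 160, 172, 168, 181, 190, 197])  # monthly units sold

# Descriptive: what has happened?
print("mean monthly sales:", sales.mean(), "| growth, last vs first month:", sales[-1] - sales[0])

# Predictive: what could happen? Fit a linear trend and forecast next month.
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * len(sales) + intercept
print("forecast for next month:", round(forecast, 1))

# Prescriptive: what should happen? Choose the stocking level that maximizes
# expected profit given the forecast (unit profit 5, unit overstock cost 2).
candidates = np.arange(150, 251)
expected_profit = 5 * np.minimum(candidates, forecast) - 2 * np.maximum(candidates - forecast, 0)
print("recommended stock level:", candidates[expected_profit.argmax()])
```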
5.3 Methodology
The results of the literature search are summarized in Tables 5.1 and 5.2. As indicated therein, most organizations are focusing on blends of descriptive, predictive, and prescriptive analytics in their value chain activities. Tables 5.1 and 5.2 present the different analytics capabilities in the primary and supporting activities of the value chain, respectively. Descriptive analytics is the first level of analytics conducted in any firm, before predictive and prescriptive analytics (P&G 2012; Watson 2014). Given that all predictive and prescriptive analytics have underlying descriptive analytics (Davenport 2013), Tables 5.1 and 5.2 focus primarily on identifying prescriptive and predictive applications in order to highlight the most prominent uses of analytics.
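A minimal sketch of how such a mapping can be tabulated is shown below: each documented application is coded by value chain activity and analytics type, and a cross-tabulation yields the kind of grid reported in Tables 5.1 and 5.2. The three example records are placeholders drawn from citations in those tables, not a reproduction of the study's full coding.

```python
# Sketch: coding documented BA applications by value chain activity and
# analytics type, then cross-tabulating them into a mapping grid.
import pandas as pd

applications = pd.DataFrame([
    {"application": "Dynamic inventory balancing (Happonen 2012)",
     "activity": "Inbound Logistics", "type": "Predictive"},
    {"application": "Churn and upsell modeling (Gillespie 2012)",
     "activity": "Marketing & Sales", "type": "Predictive"},
    {"application": "ORION route optimization (Rosenbush and Stevens 2014)",
     "activity": "Outbound Logistics", "type": "Prescriptive"},
])

# Count of documented applications per activity and analytics type.
mapping = pd.crosstab(applications["activity"], applications["type"])
print(mapping)
```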
Owing largely to its relative ease of implementation and the long-standing, widespread availability of managerial statistics software, the adoption of descriptive analytics by businesses has been well established for the past few decades. The same cannot be said of predictive and prescriptive analytics in today's firms. Nonetheless, the most noteworthy applications of analytics in recent times, in terms of value creation, have been in the use of predictive and prescriptive analytics. Therefore, while Tables 5.1 and 5.2 list applications of all three types of analytics, the remaining discussion focuses on illustrating examples of the use of predictive and prescriptive analytics by organizations.
Table 5.1 Analytics capabilities in primary activities of a value chain

Inbound Logistics
• Descriptive analytics: Ad hoc query and search-based BI (Gifford 2013); interactive visualization of goods in transit (Chen et al. 2012).
• Predictive analytics: Dynamic inventory balancing model to adjust inventories based on demand prediction (Happonen 2012); warehouse location planning based on capacity and demand analysis (Nelson 2012).
• Prescriptive analytics: Inventory optimization using tools such as IBM ILOG Inventory and Product Flow Analyst (Nelson 2012); telematics technology to optimize transportation, e.g., UPS uses telematics in its logistics (Levis 2011).

Operations
• Descriptive analytics: Multi-objective analysis for warehouse location selection (Nelson 2012); pattern recognition for supply chain performance improvement (Chae and Olson 2013); interactive visualization of processes and transformation of goods (Chen et al. 2012).
• Predictive analytics: Radio frequency identification (RFID) analytics to improve inventory management by predicting inventory counts (Bertolucci 2014); neural networks for various scheduling and planning tasks (Smith and Gupta 2000); clustering techniques such as the k-means algorithm to find causes of faults and process variations in production systems (Chien et al. 2007).
• Prescriptive analytics: Network and sourcing optimization using tools such as IBM ILOG LogicNET Plus XE (Nelson 2012); tabu search for optimization of distribution networks (Gündüz 2015); genetic algorithms for assembly-line balancing and production scheduling (Che and Chiang 2012); cloud analytics to enable faster processing by providing capabilities to analyze large volumes of data (BAH 2012).

Outbound Logistics
• Descriptive analytics: Cluster analysis for network design (Gifford 2013); dispatch tool interface (Gifford 2013); Delivery Information Acquisition Device (DIAD) of UPS for efficient parcel delivery (Levis 2011).
• Predictive analytics: Network flow model for vehicle routing (Gifford 2013); fleet progress prediction using interactive dashboards (Gifford 2013); capacity analysis for prediction of segmentation and delivery of orders (Gifford 2013).
• Prescriptive analytics: Transportation optimization using analytics, e.g., UPS's ORION software (Rosenbush and Stevens 2014); self-organizing map clustering to determine the best ways of delivering products in terms of profitability and supply chain optimization (Chae and Olson 2013).

Marketing & Sales
• Descriptive analytics: Data mining of sales invoices to gain pattern information, such as association rule mining to identify products bought together (Bhattacharjee 2012); customer lifetime value (CLV) analytics using stochastic models (Schweidel 2013); consumer heterogeneity analysis using neural networks (Hayashi et al. 2010).
• Predictive analytics: Machine learning techniques such as partial recurrent neural networks for sales forecasting (Müller-Navarra et al. 2015); customer churn analytics and upsell prediction models using logistic regression, random forests, and decision trees (Gillespie 2012); predictive modeling techniques to identify and retain the most profitable customers (nGenera 2008); profitability prediction using activity-based analytics (Sadovy 2010).
• Prescriptive analytics: Price optimization using revenue analytics (Koushik et al. 2012); profit-optimizing search engine advertising tools such as PROSAD (Skiera and Nabout 2012); integrated web data analytics to customize product offerings to customers (Franks 2011); agent-based modeling for marketing optimization (Ragusa 2013); landing page optimization using analytics tools such as Google Website Optimizer (King 2008).

After-sales Service
• Descriptive analytics: Speech analytics in call centers to identify customer concerns (ITS 2015); location intelligence to provide real-time customer service by combining geospatial data with business data (Steiner 2015); web-based, unstructured content analytics such as information retrieval and extraction, opinion mining, web intelligence & analytics, social media analytics, social network analysis, and spatial-temporal analysis (Chen et al. 2012).
• Predictive analytics: Text mining and sentiment analysis to understand consumers' attitudes and feedback (Gan and Yu 2015; Mayes 2015); customer propensity modeling using decision trees, logistic regression, neural networks, and support vector machines (Leventhal and Langdell 2013).
• Prescriptive analytics: Real-time analytics for faster customer service (HBR 2014); customer relationship management using logistic regression, decision trees, and neural networks (Leventhal and Langdell 2013); consumer-centric website development using analytics (Albert et al. 2004); social media and other Web 2.0 analytics to understand customer concerns and act accordingly (Watson 2014).
Table 5.2 Analytics capabilities in supporting activities of a value chain

Firm Infrastructure
• Descriptive analytics: Financial analytics to give the organization better visibility into the factors that drive revenues, costs, and shareholder value (Oracle 2011); fuzzy analytic network process to assess ERP post-implementation success (Moalagh and Ravasan 2013); relational DBMS and data warehousing; ETL & OLAP (Chen et al. 2012).
• Predictive analytics: Monte Carlo simulation to better understand all possible scenarios for planning, making decisions, and mitigating risk (Underwood 2014); modeling predictive relationships between corporate strategy, short-run financial health, and company performance using neural networks (Smith and Gupta 2000).
• Prescriptive analytics: Energy informatics to go green, e.g., UPS used this technology to reduce CO2 emissions (Levis 2011); optimization modeling for financial planning and budgeting (Underwood 2013).

Human Resource Management
• Descriptive analytics: Human resource dashboards/scorecards (Watson 2009, p. 496); mobile and sensor-based content analytics such as location-aware analysis, person-centered analysis, context-relevant analysis, and mobile visualization (Chen et al. 2012).
• Predictive analytics: Hierarchical demand forecasting to help balance workload across teams (Heching et al. 2013); talent analytics to recommend unique organizational practice improvements (Davenport et al. 2010).
• Prescriptive analytics: Attrition modeling and compensation optimization (Mojsilovic 2013); capacity scenario analysis to optimize skill capacity considering constraints on skill requirements, agent utilization, and service level agreements (Heching et al. 2013); Optimatch technology used by IBM to match tasks with professionals (Mojsilovic 2013).

Technological Development
• Descriptive analytics: Heat maps to identify potential technological problems across the organization (ITS 2015); data mining workbenches (Chen et al. 2012).
• Predictive analytics: Real-time risk prediction by monitoring potential fraudulent activities through analysis of customer data (SAS 2012); real-time predictive analytics to improve business processes (Gartner 2013); text analytics to gain novel insights into various business processes (Mcneill 2015).
• Prescriptive analytics: Social media analytics to design product features tailored to customers' choices (Chau and Xu 2012); content and text analytics for faster processing of large datasets (Chen et al. 2012).

Procurement
• Descriptive analytics: Spend analysis for intelligent classification of spending and risk identification in the supply base (Aberdeen Group 2014); interactive visualization analytics (Thomas and Cook 2006); web intelligence and analytics such as social media analytics (Chen et al. 2012).
• Predictive analytics: Procurement and spend analytics to develop strategies to rationalize the supply base and to compare the organization's practice with the industry's (Aberdeen Group 2014); pattern recognition to find hidden patterns and potential problems in purchase orders (Chae and Olson 2013).
• Prescriptive analytics: Adaptive modeling for real-time bidding based on logistic regression (Kumar 2013); search engine advertisement optimization (Skiera and Nabout 2012); supplier/firm and firm/customer relational analytics (Chen et al. 2012).
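As one hedged illustration of the predictive techniques listed in Tables 5.1 and 5.2 (e.g., the churn and upsell models attributed to Gillespie 2012), the sketch below fits a logistic regression and a random forest to synthetic customer data; the features, sample size, and churn mechanism are invented placeholders standing in for whatever a firm's own customer records would supply.

```python
# Illustrative sketch: customer churn prediction with logistic regression and
# a random forest, two of the predictive techniques cited above.
# The customer data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
tenure = rng.integers(1, 60, size=n)            # months as a customer
monthly_spend = rng.normal(70, 20, size=n)
support_calls = rng.poisson(2, size=n)
# Churn is more likely for short tenure and many support calls (toy assumption).
logits = -0.05 * tenure + 0.4 * support_calls - 1.0
churn = rng.random(n) < 1 / (1 + np.exp(-logits))

X = np.column_stack([tenure, monthly_spend, support_calls])
X_train, X_test, y_train, y_test = train_test_split(X, churn, test_size=0.3, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(type(model).__name__, "test AUC:", round(auc, 3))
```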
Based on the preliminary findings, it is evident that analytics is mostly used in the primary activities of the value chain. Building on Porter's value chain, a framework that illustrates how the different types of analytics are being applied in the various stages of the value chain is now proposed (Figs. 5.1 and 5.2); the examples shown therein are generalizations of the specific applications cited in Tables 5.1 and 5.2. Based on the preliminary data analysis, we can convincingly argue that most organizations use analytics more in the primary activities of the value chain than in the supporting activities. The reason perhaps is that the outputs generated from primary activities are easy to measure and quantify.
Firm Infrastructure: Predictive: Monte Carlo simulation for planning; modeling relationships between corporate strategy, short-run financial health, and the performance of a company using neural networks.
Human Resource Management: Predictive: Hierarchical demand forecasting to balance workload across teams; capacity scenario analysis; talent analytics to recommend unique organizational practice improvements.
Technology Development: Predictive: Real-time risk assessment by monitoring potential fraudulent activities through analysis of customer data; customer disengagement analysis; self-service analytics environment for internal users.
Procurement: Predictive: Procurement decision making based on outcome analysis; procurement and spend analytics; pattern recognition to detect hidden patterns and potential problems in purchase orders.
Inbound Logistics: Predictive: Dynamic inventory balancing model based on demand prediction; warehouse management analytics; warehouse location planning based on capacity and demand analysis.
Operations: Predictive: RFID analytics to improve inventory management in the overall supply chain; cloud analytics for faster processing; neural networks for quality control.
Outbound Logistics: Predictive: Network flow model for routing; fleet progress prediction; capacity analysis for segmentation and delivery of orders.
Marketing & Sales: Predictive: Social media analytics to gain insight from customer big data; partial recurrent neural networks for sales forecasting; customer churn analytics and upsell prediction models using logistic regression, random forests, and decision trees.
Service: Predictive: Sentiment analysis; customer propensity modeling using decision trees, logistic regression, neural networks, and support vector machines.
Fig. 5.1 Application of predictive analytics tools and techniques in a firm's value chain
Firm Infrastructure: Prescriptive: Energy informatics for green initiatives; optimization modeling for financial planning and budgeting.
Human Resource Management: Prescriptive: Staffing optimization; attrition modeling; compensation optimization; capacity scenario analysis to optimize skill capacity considering constraints on skill requirements, agent utilization, and service level agreements; Optimatch technology used by IBM to match tasks with professionals.
Technology Development: Prescriptive: Social media analytics to design product features tailored to customers' choices; staffing allocation for call centres using volume analysis.
Procurement: Prescriptive: Adaptive modeling for real-time bidding based on logistic regression; search engine advertisement optimization.
Fig. 5.2 Application of prescriptive analytics tools and techniques in a firm's value chain
Future research can examine the reasons behind the highlighted gaps in BA use. For practitioners, this framework helps not only to catalog BI&A systems according to different value chain activities, but also allows managers to evaluate how different BI&A systems in different value chain activities can impact and create overall value in their organizations.
In the future, the framework will be evaluated empirically to identify actual usage of analytics in organizations' value chains. This should begin with studies that focus on delineating the specific organizational culture, analytics infrastructure, and capability variables that are germane to the different parts of the value chain (marketing & sales, after-sales service, firm infrastructure, etc.). For example, it is quite conceivable that infrastructure/capability variables such as "skills" will differ between "marketing & sales" and another aspect of the value chain such as "technological development"; exploring this for each part of the value chain will substantially enhance research on the key success factors for BI&A in terms of contribution to the value chain of an organization.
References
Aberdeen Group (2014) State of marketing automation 2014: processes that produce, A research
report, Sept 19, 2014
Albert BTC, Goes PB, Gupta A (2004) Gist: a model for design and management of content and
interactivity of customer-centric web sites. MIS Q 28(2):161–182
Barney J (1991) Firm resources and sustained competitive advantage. J Manag 17(1):99–120
Bertolucci J (2014) Radio tags generate vast quantities of information, but enterprises need to
find ways to ingest, analyze, and archive that data. http://www.informationweek.com/bigdata/.
Accessed 30 Apr 2015
BAH (2012) Improving intelligence analysis through cloud analytics. http://www.boozallen.
com/media/file/Improving-Intelligence-Analysis-Through-Cloud-Analytics-fs.pdf. Retrieved
May 1, 2015
Bhattacharjee S (2012) Quantifying growth and product assortment decisions across multiple
retail stores: combining data analytics and optimization to connect global patterns with local
constraints. In: Klampfl E (ed), Proceedings of INFORMS conference on business analytics &
operations research. Huntington Beach, CA, pp 1–27
Chae B, Olson DL (2013) Business analytics for supply chain: a dynamic-capabilities framework.
Int J Inf Technol Decis Mak 12(1):9–26
Chae B, Olson D, Sheu C (2014) The impact of supply chain analytics on operational performance:
a resource-based view. Int J Prod Res 52(16):4695–4710
Chau M, Xu J (2012) Business intelligence in blogs: understanding consumer interactions and
communities. MIS Q 36(4):1189–1216
Che ZH, Chiang T-A (2012) Designing a collaborative supply-chain plan using the analytic hierar-
chy process and genetic algorithm with cycle-time estimation. Int J Prod Res 50(16):4426–4443
Chen H, Chiang RHL, Storey VC (2012) Business intelligence and analytics: from Big Data to Big
Impact. MIS Q 36(4):1165–1188
Chien C-F, Wang W-C, Cheng J-C (2007) Data mining for yield enhancement in semiconductor
manufacturing and an empirical study. Expert Syst Appl 33(1):192–198
Davenport TH (2013) Analytics 3.0. Harv Bus Rev 91(12):1–14
Davenport TH, Harris J, Shapiro J (2010) Competing on talent analytics. Harv Bus Rev 88(10):1–13
Eisenhardt KM, Martin JA (2000) Dynamic capabilities: what are they. Strateg Manag J 21(1):1105–1121
Franks B (2011) Optimizing customer analytics: how customer level web data can help. In:
Leonhardi DR (ed) Proceedings of INFORMS conference on business analytics & operations
research. Chicago, IL, pp 1–35
Gan Q, Yu Y (2015) Restaurant rating: industrial standard and word-of-mouth a text mining and
multi-dimensional sentiment analysis. In: Sprague RH Jr, Bui T, Laney S (eds) Proceedings of
the 48th annual Hawaii international conference on system sciences. Kauai, HI, pp 1332–1340
Gartner Report (2012) Gartner says worldwide business intelligence, analytics and performance
management software market surpassed the $12 Billion mark in 2011. http://www.gartner.com/
newsroom/id/1971516. Accessed 1 May 2015
Gartner (2013) Gartner says by 2016, 70 percent of the most profitable companies will manage
their business processes using real-time predictive analytics or extreme collaboration. http://
www.gartner.com/newsroom/id/2349215. Retrieved October 25, 2015
Gartner Report (2014) Gartner says advanced analytics is a top business priority. http://www.gart-
ner.com/newsroom/id/2881218. Accessed 1 May 2015
Gifford T (2013) Integrated analytics in transportation and logistics. In: Williams JT (ed) Proceedings
of INFORMS conference on business analytics & operations research. San Antonio, TX, pp 1–39
Gillespie J (2012) Understanding customer behavior through marketing analytics: case studies
from online gaming and chain restaurants. In: Klampfl E (ed) Proceedings of INFORMS con-
ference on business analytics & operations research. Huntington Beach, CA, pp 1–36
Gündüz HI (2015) Optimization of a two-stage distribution network with route planning and time
restrictions. In: Sprague RH Jr, Bui T, Laney S (eds) Proceedings of the 48th annual Hawaii
international conference on system sciences. Kauai, HI, pp 1088–1097
5 Business Analytics Capabilities and Use: A Value Chain Perspective 53
Happonen A (2012) Adjusting inventories based on demand prediction using dynamic inven-
tory balancing model. In: Proceedings of technology management for emerging technologies
(PICMET), Proceedings of PICMET’12: IEEE, pp 3549–3565
Hayashi Y, Hsieh M-H, Setiono R (2010) Understanding consumer heterogeneity: a business intel-
ligence application of neural networks. Knowl-Based Syst 23(8):856–863
HBR (2014) The new path to customer engagement: real-time analytics. https://hbr.org/resources/
pdfs/comm/sap/18764_HBR_SAP_Telcom_July_14.pdf
Heching A, Lin P, Pratsini E (2013) Smarter workforce analytics for customer fulfillment trans-
action centers. In: INFORMS conference on business analytics & operations research. San
Antonio, TX, April 7–9, 2013
Holsapple C, Lee-Post A, Pakath R (2014) A unified foundation for business analytics. Decis
Support Syst 64:130–141
ITS (2015) Analytics application in information technology services department. A focus group
Interview at a large US Public University
King A (2008) Website optimization Nutshell handbook. O’Reilly Media. https://books.google.
com/books?id=f8-7pWbn9KEC
Koushik D, Higbie JA, Eister C (2012) Retail price optimization at intercontinental hotels group.
Interfaces 42(1):45–57
Kumar M (2013) Predictive analytics in social media and online display advertising. In: Williams
JT (ed) Proceedings of INFORMS conference on business analytics & operations research, San
Antonio, TX, pp 1–37
Leventhal B, Langdell S (2013) Adding value to business applications with embedded advanced
analytics. J Market Anal 1(2):64–70
Levis J (2011) Brown turning green delivering sustainability through data and technology. In
Leonhardi DR (ed) Proceedings of INFORMS conference on business analytics & operations
research. Chicago, IL, pp 1–58
Mayes M (2015) Sentiment analysis: more than a good idea monitoring and managing perceptions
in social media, pp 1–3. http://www.b-eye-network.com/view/11395
Mcneill F (2015) Text analytics: deriving meaning from the deluge of documents and purging con-
tent chaos how organizations turn text into gold three keys to success, pp 1–4. http://www.b-
eye-network.com/view/14437
Moalagh M, Ravasan AZ (2013) Developing a practical framework for assessing ERP post-
implementation success using fuzzy analytic network process. Int J Prod Res 51(4):1–22
Mojsilovic A (2013) Smarter workforce changing the landscape of workforce management. In:
Williams JT (ed) Proceedings of INFORMS conference on business analytics & operations
research. San Antonio, TX, pp 1–23
Müller-Navarra M, Lessmann S, Voß S (2015) Sales forecasting with partial recurrent neural networks:
empirical insights and benchmarking results. In: Sprague RH Jr, Bui T, Laney S (eds) Processing
of the 48th annual Hawaii international conference on system sciences, Kauai, HI, pp 1108–1116
Nelson D (2012) Multi-objective optimization for strategic network design. In: Klampfl E (ed)
Proceedings of INFORMS conference on business analytics & operations research. Huntington
Beach, CA, pp 1–18
nGenera (2008) Business analytics six questions to ask about information and competition. nGenera
Corporation report. http://www.sas.com/resources/whitepaper/wp_5483.pdf
Oracle (2011) Oracle knowledge for web self service. pp 1–4. http://www.oracle.com/us/products/
applications/knowledgemanagement/oracle-knowl-web-self-service-2168374.pdf. Retrieved
May 1, 2015
P&G (2012) Driving competitive advantage with OR. In: Klampfl E (ed) Proceedings of INFORMS
conference on business analytics & operations research. Huntington Beach, CA, pp 1–35
Phillips-Wren G, Iyer LS, Kulkarni U, Ariyachandra T (2015) Business analytics in the context of
big data: a roadmap for research. Comm AIS 37:23
Porter ME (1985) Competitive advantage. The Free Press, New York
Porter ME (2001) Strategy and the Internet. Harv Bus Rev 79(3):62–78
54 T. Ghoshal et al.
Ragusa D (2013) Bringing the consumer in the mix: using agent-based modeling to power market-
ing mix optimization. In: Williams JT (ed) Proceedings of INFORMS conference on business
analytics & operations research. San Antonio, TX, pp 1–38
Rosenbush S, Stevens L (2014) At UPS, the algorithm is the driver—turn right, turn left, turn right:
inside orion, the 10-year effort to squeeze every penny from delivery routes. Wall Street J: 14–17
Sadovy L (2010) Better decisions through profitability analysis, pp 1–4. http://www.b-eye-
network.com/view/12489
SAS (2012) Banks, big data and high-performance analytics. http://www.teradatauniversitynetwork.
com/assetmanagement/DownloadAsset.aspx?ID=475e4f52-caba-4ca7-acd1-51abe13c70d1&versi
on=f6b950445fed4b46ab3aa5cd5eaf8c4e1.pdf. Retrieved May 1, 2015
Schweidel DA (2013) Stochastic models for customer analytics. In: Williams JT (ed) Proceedings of
INFORMS conference on business analytics & operations research. San Antonio, TX, pp 1–37
Skiera B, Nabout NA (2012) PROSAD: a bidding decision support system for profit optimizing
search engine advertising. In: Klampfl E (ed) Proceedings of INFORMS conference on busi-
ness analytics & operations research. Huntington Beach, CA, pp 1–32
Smith KA, Gupta JND (2000) Neural networks in business: techniques and applications for the
operations researcher. Comput Oper Res 27(11–12):1023–1044
Steiner J (2015) Business intelligence and GIS, systems within systems, and ubiquity how the
world becomes part of every application in the 21st century, pp 1–4. http://www.b-eye-network.
com/view/7956
Thomas JJ, Cook K (2006) A visual analytics agenda. IEEE Comput Graph Appl 26(1):10–13
Underwood J (2013) Beginning prescriptive analytics with optimization modeling. http://www.b-
eye-network.com/view/17152. Accessed 1 May 2015
Underwood J (2014) Prescriptive analytics: making better decisions with simulation, pp 4–9.
http://www.b-eye-network.com/view/17224. Accessed 1 May 2015
Watson H (2009) Tutorial: business intelligence—past, present, and future. Commun Assoc Inf
Syst: 487–510
Watson HJ (2014) Tutorial: Big Data analytics: concepts, technologies, and applications tuto-
rial: Big Data analytics: concepts, technologies, and applications. Commun Assoc Inf Syst
34:1246–1269
Wixom BH, Yen B, Relich M (2013) Maximizing value from business analytics. MIS Q Exec 12(2):111–123
Zafar H, Ko MS, Clark JG (2014) Security risk management in healthcare: a case study. Commun
Assoc Inf Syst 34(1):737–750
Zhao Y, Zhu Q (2012) Evaluation on crowdsourcing research: current status and future direction.
Inf Syst Front 1(18):417–434
Chapter 6
Critical Value Factors in Business Intelligence
Systems Implementations
Abstract Business Intelligence (BI) systems have been rated as a leading technology for the last several years. However, organizations have struggled to ensure that high-quality information is provided to and from BI systems. This suggests that organizations have recognized the value of information and the potential opportunities available but are challenged by the lack of success in Business Intelligence Systems Implementation (BISI). Therefore, our research addresses the preponderance of failed BI system projects, brought about by a lack of attention to Systems Quality (SQ) and Information Quality (IQ) in BISI. The main purpose of this study is to determine how an organization may gain benefits by uncovering the antecedents and critical value factors (CVFs) of SQ and IQ necessary to derive greater BISI success. We approached these issues by adopting 'critical value factors' (CVF) as a conceptual 'lens'. Following an initial pilot study, we undertook an empirical
6.1 Introduction
Research evidence shows that spending on business intelligence (BI) systems has comprised one of the largest and fastest growing areas of information technology (IT) expenditure (Luftman and Ben-Zvi 2010). In spite of these investments, only 24% of the 513 companies surveyed in a study conducted by Howson (2008) considered their BI implementations to be very successful. Furthermore, Marshall and de la Harpe (2009) noted that 80% of the time spent on BI support involves investigating and resolving information quality (IQ) issues which, if inadequately addressed, will severely affect organizations through decreased productivity, regulatory problems, and reputational issues.
It is apparent that pre-implementation activities for BI projects, particularly
addressing system quality (SQ) and information quality (IQ) requirements are of
paramount importance to business intelligence systems implementations (BISI) suc-
cess (Howson 2008; Marshall and de la Harpe 2009; Negash and Gray 2008; Power
2008; Watson et al. 2002). Moreover, there has been a significant body of research
that seeks to determine the role of SQ and IQ in information systems (IS) success
(DeLone and McLean 2003; Petter and McLean 2009). However, very little atten-
tion has been given in the literature to addressing the role of SQ and IQ in the success
of BISI (Arnott and Pervan 2008; Nelson et al. 2005; Ryu et al. 2006). Also, little
attention has been given to the user’s perceived value of SQ and IQ characteristics
that have an impact on BISI success (Nelson et al. 2005; Popovic et al. 2012). Nelson
et al. (2005) acknowledged the importance of identifying the appropriate SQ and IQ
factors for BI success and indicated that some factors may not be stable across tech-
nologies or applications. Researchers in BI success have also suggested constructs
and associated measurement items that consider the decision support environment
and its maturity in BISI success (Dinter et al. 2011; Isik et al. 2013). However, few
empirical studies have sought to uncover SQ and IQ characteristics that are of value
to users of BI analytical systems, as measured by user satisfaction from BISI.
The relationships between the constructs of user perceived value (level of impor-
tance) and user satisfaction in the context of understanding the SQ and IQ necessary
for BISI success have also received little attention in the literature. Research has been limited to studies that rely only on specific SQ and IQ factors for BI drawn from prior research, rather than on the universal set of antecedents of SQ and IQ that have been subjected to empirical analysis (Nelson et al. 2005). Thus, in the context of emerging technologies such as BI, it is important to focus on objectives and decisions that are of value, which often requires exposing underlying or hidden values that allow researchers and practitioners to be proactive and hence create more alternatives instead of being limited by available choices (Dhillon et al. 2002; Keeney 1999). According to Sheng, Siau, and Nah (2005), it is important to elicit and organize values in "developing constructs in relatively new and under-studied areas" (p. 40).
Therefore, our research addresses the preponderance of failed BI system projects, brought about by a lack of attention to SQ and IQ in BISI (Arnott and Pervan 2008; Jourdan et al. 2008). The main purpose of this study is to determine how an organization may gain benefits in the context of BISI by uncovering the antecedents and critical value factors (CVFs) of SQ and IQ necessary to derive greater BISI success. Furthermore, this study will empirically assess the crossover relationships between the perceived SQ and IQ of BISI and perceived system and information satisfaction to address any ambiguity in BI analytical users' perceptions in distinguishing between the system (SQ) and the output (IQ) of the BI system, as identified by Nelson et al. (2005).
In cognitive value theory, value refers to the individual’s perceived level of impor-
tance (Rokeach 1969). According to Rokeach (1973), a value is “an enduring belief
that a specific mode of conduct or end-state of existence is personally or socially
preferable to an opposite or converse mode of conduct or end-state of existence”
(p. 5). The concept of value is often referenced in various fields of social research
but mainly in the context of economic value, thereby neglecting the applications of
user perceived cognitive value (Levy 2006). According to Levy (2008), “several
scholars have suggested that although it is important to investigate the nature of
attitudes and opinions, it is more fundamental to investigate the nature of value
since attitudes and opinions can often change based on experience, while value
remains relatively stable over time” (p. 161). Keeney (1992) stated that values are
what one desires to achieve. Bailey and Pearson (1983) measured the value (or level
of importance) of information system (IS) characteristics using a scale featuring the
semantic differential pair, important to unimportant (Levy 2003). These measures
provided a deeper understanding of satisfaction with the IS (Etezadi-Amoli and
Farhoomand 2011; Levy 2003; Sethi and King 1999). Levy (2009) defined user
perceived value as a “belief about the level of importance that users hold for IS
characteristics” (p. 94).
The measurement of IS success has been a top concern of researchers and practi-
tioners for some time. Several models have been proposed to define and identify
the causes of IS success. However, a universally agreed definition of IS success has
not emerged due to differences in the needs of stakeholders who assess IS success
in an organization (Urbach et al. 2009). For the purposes of this study, IS success
is defined as a multi-dimensional phenomenon comprised of the technical, seman-
tic, and effectiveness levels. Based on this definition, research models applicable to
the specific requirements of a corresponding problem domain may be devised. The
need for a general but comprehensive definition of IS success was recognized by
DeLone and McLean (1992) in their review of existing definitions of IS success
and their associated measures. This led to a multidimensional and interdependent model that classified IS success into six major categories: system quality, information quality, user satisfaction, use, individual impact, and organizational impact. Since the
publication of the DeLone and McLean (1992) IS success model, many researchers
[Figure: IS success model constructs, including System Quality, Service Quality, Intention to Use, Use, and User Satisfaction]
Clark et al. (2007) followed the guidance of the DeLone and McLean IS success
model (1992, 2003) to study the underlying threads of commonality with BISI
success. Their study suggested that BISI success was theoretically grounded in IS
success research. While much attention has been paid to IQ, SQ, and user satisfac-
tion in IS success literature, little research has focused on the constructs of IS suc-
cess in the domain of BISI. This may be related to a lack of understanding of BI
technologies caused, in part, by the multifaceted nature of BI which combines a
nonconventional application-based set of systems with infrastructure related proj-
ects (e.g. ERP and CRM) in an analytical user based decision support environment.
Dinter et al. (2011) suggested alternatives for establishing BI specific success mod-
els to assist organizations in understanding the maturity of their BI decision envi-
ronment by taking into consideration their BI capacity and capabilities. For instance,
an organization may use the report writing and query capability of the BI implemen-
tation more than the analytical functionality in their implementation while another
organization may use the analytical features of the BI system, such as predictive
analytics, as their primary reason for implementing BI systems. In essence, BI suc-
cess will be measured differently depending on the BI maturity level of the organi-
zation. Recognizing the differences in BI system maturity, Dinter et al. (2011)
adopted and extended the updated DeLone and McLean (2003) IS success model in
the BI domain thereby broadening its scope by adding additional constructs and
items that have a causal relationship to the existing constructs in the BI decision
environment.
Isik et al. (2013) examined the maturity of the required decision environment of
BI to assess what capabilities are necessary to achieve success. They suggested
technical, functional, and organizational elements of the decision environment that
could lead to BISI success. Moreover, Isik et al. (2013) concluded that while the
technical capabilities of the BI system represented a necessary foundation for BI
success, organizational capabilities that support flexibility in decision making should
also be managed in relation to the decision environment in which the BI is employed.
Nelson et al. (2005) addressed a gap in the literature involving confusion in dif-
ferentiating between SQ and IQ factors in the context of user satisfaction when
using BI analytical tools in a data warehouse environment. Their model, which
extended the DeLone and McLean (1992, 2003) success model, studied factors of
SQ and IQ identified in the literature and their relationships with the constructs of
system satisfaction and information satisfaction. The results of the Nelson et al.
(2005) study suggested that “crossover or interaction effects may exist between the
two constructs" (p. 207). They found that while the crossover effect of SQ on information satisfaction was significant within the context of BI analysis tools, the path
leading from IQ to information satisfaction in the same context was surprisingly not
significant. They concluded that future research was necessary to understand the
characteristics of BI that led to the user perception that IQ did not strongly influence
information satisfaction in the BI analytics domain. Nelson et al. (2005) expressed
concern regarding this finding and offered the explanation that, from the user’s
perspective, it may be difficult to differentiate the BI system from the output it
produces, leading to potential over-reliance on the system for IQ while ignoring the
responsibility for user interaction with the interface and the generation of output.
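A minimal sketch of how such crossover effects can be examined is shown below: on hypothetical survey scores, system satisfaction and information satisfaction are each regressed on both perceived SQ and perceived IQ, so that the off-diagonal (crossover) coefficients can be compared with the direct ones. This is an OLS simplification of the path-model logic, with invented data and variable names rather than the actual constructs or results.

```python
# Sketch: examining direct and crossover quality-to-satisfaction paths with OLS
# on hypothetical survey scores (SQ, IQ, system satisfaction, information satisfaction).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
sq = rng.normal(5, 1, n)                              # perceived system quality
iq = 0.5 * sq + rng.normal(2.5, 1, n)                 # perceived information quality
df = pd.DataFrame({
    "sq": sq,
    "iq": iq,
    "sys_sat": 0.6 * sq + 0.2 * iq + rng.normal(0, 1, n),
    "info_sat": 0.3 * sq + 0.5 * iq + rng.normal(0, 1, n),
})

# Direct paths: sq -> sys_sat and iq -> info_sat; crossover paths: iq -> sys_sat and sq -> info_sat.
for outcome in ("sys_sat", "info_sat"):
    fit = smf.ols(f"{outcome} ~ sq + iq", data=df).fit()
    print(outcome, fit.params.round(2).to_dict())
```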
While there is no commonly agreed definition of BI and BISI, there is some
agreement in the literature on categorizing BI using the process, technological, and
product perspectives. As a specific IS problem domain, BISI success falls within the
context of key business analytics and processes that lead to decisions and actions
that result in improved business performance. According to Chee et al. (2009), the organizational, functional, and process perspectives of BI focus on the gathering of data from internal and external sources, followed by the generation of relevant information for decision making. BI success relies on multi-dimensional factors, which also include those related to technology. Chee et al. (2009) acknowledged the similarities and differences in the interpretations of BI success and suggested that the technological aspect of BI be considered as the BI system, while the process perspective be regarded as the implementation of BI systems. Moreover, the product perspective is associated with the requirement for actionable information produced with established tools. For the purpose of this study, we assessed BI success with the understanding that the process lifecycle approach addresses success in the implementation of BI systems, following the premise that BI success has roots in technical and process capabilities in a decision environment.
Nelson et al. (2005) derived a model, depicted in Fig. 6.2, that identified, inte-
grated, and assessed the dimensions of SQ and IQ as antecedents of the constructs
of perceived user systems satisfaction and perceived user information satisfaction in
their model titled “Determinants of information and system quality” (p. 208). Their
model assumed that user satisfaction may be a reasonably good surrogate for net
benefits if measures are confined to decision performance (Iivari 2005). Therefore,
in this study the underlying theory of the DeLone and McLean (2003) model was
explored with emphasis on the user satisfaction construct as the dependent variable
for success (Iivari 2005). Furthermore, the BISI was considered effective when
users perceived the characteristics of SQ and IQ to be of value or highly important
and were also highly satisfied with these same characteristics. This study also
uncovered the SQ and IQ characteristics that are of value in BISI as measured by
user satisfaction. Participants in the study implemented BI analytical systems which
represent a higher level of organizational BI system maturity in comparison to those
who primarily perform report and query generation. The model expanded the user
satisfaction construct and suggested that user perceived system satisfaction and user
perceived information satisfaction could be considered as dependent variables and
as a combined surrogate for user satisfaction. In essence, this study tested a pro-
posed BI SQ and IQ research model which was based on the DeLone and McLean
(1992, 2003) IS success model as extended by Nelson et al. (2005) and specifically
tested the influence of the CVFs of SQ and IQ in BISI with user satisfaction from
BISI in a decision support environment that leveraged BI analytics to improve and
optimize decisions.
Fig. 6.2 Nelson et al. (2005) Determinants of information and system quality

Various frameworks have been developed for categorizing and measuring IQ, SQ, and user satisfaction leading to IS success. The framework for IQ developed by Lee et al. (2002), for instance, provided four different categories used to assess IQ
in IS. These categories were based on an empirical study of characteristics of a
group of conventional IS. Moreover, Nelson et al. (2005) suggested a framework
for the measurement of SQ for BI system satisfaction based on five dimensions of
system quality.
Past confusion in differentiating SQ from IQ factors in BISI success suggested that crossover or interaction effects may exist between the two constructs, leading Nelson et al. (2005) to explore the possibility that more complex quality/satisfaction relationships may exist. Thus, Nelson et al. (2005) studied the determinants of SQ and IQ, which included the study of crossover relationships from quality (information and systems) to satisfaction (systems and information) as well as the interaction effect of information satisfaction and systems satisfaction. They suggested that future research should explore the relationship of SQ, IQ, and perceived user satisfaction in the context of BI analytical systems to address the surprising result of their empirical analysis that the influence of SQ on user perceived information satisfaction was stronger than the influence of IQ on user perceived information satisfaction.
It was, therefore, necessary to understand what dominant SQ and IQ characteristics
are deemed important in BI to guide the design of BI systems and distinguish the
system from its output. To address the surprising results of Nelson et al. (2005), the universal set of antecedents was empirically studied and, once identified, data were gathered to determine what BI analysts valued in BI analytical systems, with the expectation that the proposed CVFs of BISI could change after exploratory factor analysis (EFA) when subjected to confirmatory factor analysis (CFA). Therefore,
this study used the BI SQ and IQ research model of Nelson et al. (2005) with the
proposed CVFs of BISI depicted in Fig. 6.3.
The first specific goal of our research, following Keeney’s (1992) methodology,
was to gather a list of user perceived SQ and IQ characteristics from literature and
augment it with input from an expert panel. The second research aim was to use the
SQ and IQ characteristics to uncover the CVFs of SQ and IQ associated with
BISI. The third specific goal of this research was to test the impact of the CVFs of
SQ on perceived SQ of BISI and the CVFs of IQ on perceived IQ of BISI. The
fourth research goal was to test the impact of perceived SQ of BISI on perceived
user system satisfaction from BISI and perceived SQ of BISI on perceived user
information satisfaction from BISI. The impact of perceived IQ of BISI on per-
ceived user information satisfaction and perceived IQ of BISI on perceived user
system satisfaction from BISI was also tested using the BI SQ and IQ research
model based on the DeLone and McLean (1992, 2003) model for IS success as
extended by Nelson et al. (2005).
The main research questions addressed in this study were:
RQ1: What SQ characteristics are valued in BISI by users? What IQ characteris-
tics are valued in BISI by users?
RQ2: What are the CVFs for SQ that users value in BISI? What are the CVFs for IQ that users value in BISI?
Stemming from the research questions, this study then addressed the following
specific hypotheses:
H1a–d: The CVFs of SQ will have a positive significant impact on perceived SQ
of BISI.
H2a–d: The CVFs of IQ will have a positive significant impact on perceived IQ of
BISI.
H3: The perceived SQ of BISI will have a positive significant impact on perceived
user system satisfaction from BISI.
Fig. 6.3 BI SQ and IQ research model based on DeLone and McLean (1992) IS Success Model as extended by Nelson et al. (2005). In the model, the proposed CVFs of SQ (reliability SQ, response time SQ, flexibility SQ, and integration SQ; H1a–d) and of IQ (contextual IQ, intrinsic IQ, accessible IQ, and representational IQ; H2a–d) lead to the perceived SQ and IQ of BISI, which in turn lead to perceived user system satisfaction and perceived user information satisfaction from BISI (H3–H6), with the SystemSat × InfoSat interaction effects hypothesized as H7a and H7b.
H4: The perceived IQ of BISI will have a positive significant impact on perceived
user information satisfaction from BISI.
H5: The perceived SQ of BISI will have a positive significant impact on perceived
user information satisfaction from BISI.
H6: The perceived IQ of BISI will have a positive significant impact on perceived
user system satisfaction from BISI.
H7a: The interactions of perceived user system satisfaction from BISI and the per-
ceived user information satisfaction from BISI will have a positive significant
impact on perceived user system satisfaction from BISI.
H7b: The interactions of perceived user system satisfaction from BISI and the per-
ceived user information satisfaction from BISI will have a positive significant
impact on perceived user information satisfaction from BISI.
6.3 Methodology
Our study used a mixed method approach following the work of Keeney (1999),
utilizing both qualitative and quantitative research methods. Using value theory and
IS success theory, the study validated empirically a model for IS success that inves-
tigated how an organization may gain user satisfaction in the context of BISI by
uncovering the CVFs of SQ and IQ necessary to derive BISI success. Hanson et al.
(2005) stated that quantitative and qualitative data could be complementary when
variances are uncovered that would not have been found by a single method.
Qualitative research could be used to discover and uncover evidence, while quanti-
tative methods are often used to verify the results, thereby improving the integrity
of the findings of the study (Shank 2006). Additionally, qualitative and quantitative methods each carry their own capabilities to uncover the underlying meaning
of phenomena in research (Straub 1989).
The qualitative process (Phase I) began with the creation and distribution of an
open-ended questionnaire designed to elicit SQ and IQ characteristics considered to
be important in BISI. Development of the instrument followed the process proposed
by Straub (1989). The open-ended questionnaire was developed to uncover new
characteristics of SQ and IQ for BISI. An expert panel was formed, consisting of a
small group of six individuals with experience in business analytics. The expert
panel members had an average of 20 years’ experience implementing business ana-
lytics systems in large organizations. Four experts were Business Analysts with
leading financial institutions in banking, pension finance, and brokerage services.
Two of these experts have also managed departments devoted to analytics. The
remaining two experts, in addition to implementing business analytics systems, were
also responsible for BI system infrastructure and implementation services for orga-
nizations providing systems services. All experts have performed business analyst
functions and have been responsible for decision making using BI system output.
SQ and IQ characteristics drawn from the expert panel’s responses to the open-
ended questionnaire and the literature review of validated sources (Arazy and Kopak
2011; Goodhue 1995; Jarke and Vassiliou 1997; Lee et al. 2002; Nelson et al. 2005;
Wand and Wang 1996; Wang and Strong 1996) were analyzed using Keeney’s
(1999) approach. Similar SQ and IQ characteristics identified from literature as well
as responses from the expert panel were grouped into the four main proposed SQ
categories of reliability SQ, response time SQ, flexibility SQ, and integration SQ, as
well as the proposed four high level IQ categories of intrinsic IQ, contextual IQ,
representational IQ, and accessibility IQ. These SQ and IQ characteristics were
evaluated for inclusion in an updated list of SQ and IQ items. Items that did not
appear to relate to any category were investigated for inclusion in a new SQ or IQ
category. After considering the grouping of similar responses as well as the feedback from the expert panel using Keeney’s (1999) approach, 33 SQ and IQ characteristics were identified, consisting of 16 SQ items and 17 IQ items grouped under the appropriate SQ and IQ category. This included nine SQ and
IQ items identified by the expert panel that did not correspond with any of the initial
sources of BI success identified in the literature. As a result, the following nine
measurement items were added to the survey instrument: functionality and features
of the BI system are dependable, frequency of data generation and refresh in the BI
system are flexible, the BI system accommodates remote access, the BI system is
scalable, the BI system has an intuitive user interface, the BI system provides
6.3.2 Phase II: Instrument, Data Collection, and Exploratory Factor Analysis (EFA)
The quantitative process (Phase II) began with the development of a two-part quantitative survey instrument to collect data. This preliminary survey instrument was based on the results of Phase I. The quantitative assessment of the SQ and IQ characteristics found in the literature, augmented by additional SQ and IQ characteristics uncovered in Phase I of the study, was performed using value theory under Keeney’s (1999) methodology. After a further review by the expert panel, an instrument was
developed that had content validity, construct validity, and reliability. The feedback
from the expert panel was used to adjust the proposed instrument and included the
removal of unnecessary items and the modification of questions, language, and the
layout of the instrument (Straub 1989). The final survey instrument that emerged from this process was distributed to a larger group of users of BI systems to assess
the perceived value attributed to the items using a 7-point Likert scale ranging from
not important to highly important. Our study used the revised quantitative survey
instrument to collect data in order to empirically determine the CVFs of SQ and IQ
for BISI success. Hair et al. (1994) suggested 15–20 observations
for each variable for the results of a study to be generalizable. This study targeted
250 participants as an appropriate sample size (Schumacker and Lomax 2010).
Approximately 1300 survey invitations were sent to analysts through a SurveyMonkey service to achieve the response rate necessary to reach the targeted sample size of 250 participants. After completion of pre-analysis data screening, 257 responses were available for analysis, a 20.8% response rate; 176 (68.5%) were completed by females and 81 (31.5%) by males. Analysis of the ages of respondents indicated that 217 (84.4%) were above the age of 30. Additionally, 55 (21.4%) of the respondents considered themselves novices in the use of BI systems, 115 (44.7%) considered themselves average users, 77 (30%) considered themselves advanced users, and only 10 (3.9%) considered themselves expert users.
Respondents with graduate degrees comprised 35% of the subject population.
Overall, 198 respondents or 77% had a university degree.
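The percentages just reported follow directly from the raw counts; the short Python sketch below reproduces that arithmetic using only the counts given in the text (the variable names and layout are ours, not the authors').

```python
# Reproduce the respondent-profile percentages reported above from the raw counts.
# All counts come from the chapter's text; only the variable names are ours.
total = 257  # usable responses after pre-analysis data screening

profile = {
    "female respondents": 176,
    "respondents over age 30": 217,
    "novice BI users": 55,
    "average BI users": 115,
    "advanced BI users": 77,
    "expert BI users": 10,
    "university degree holders": 198,
}

for label, count in profile.items():
    print(f"{label:<26}{count:4d}  ({count / total:.1%})")
```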
The study used EFA techniques to uncover the CVFs of SQ and IQ of
BISI. Factorial validity assessed whether the measurement items corresponded to
the theoretically anticipated CVFs of SQ and IQ in a successful BISI. Principal
component analysis (PCA) was used as the extraction method to provide variances
of underlying factors (Mertler and Vannatta 2001). The perceived SQ and IQ CVFs
of BISI were identified by conducting EFA via PCA using Varimax rotation. PCA
was used to extract as many factors as indicated by the data.
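The chapter does not state which statistical software performed these steps. Purely as an illustration, the following minimal Python sketch (using the third-party factor_analyzer package) shows one way to apply PCA-based extraction with Varimax rotation and the Kaiser criterion to an item matrix; the DataFrame name sq_items and its input file are hypothetical.

```python
# A minimal sketch of the extraction described above: PCA-based EFA with Varimax
# rotation and the Kaiser criterion. The DataFrame `sq_items` (257 respondents x
# 16 SQ items) and its source file are hypothetical.
import pandas as pd
from factor_analyzer import FactorAnalyzer

sq_items = pd.read_csv("sq_items.csv")  # hypothetical item-level survey responses

# Kaiser criterion: retain factors whose eigenvalues exceed 1.0.
probe = FactorAnalyzer(rotation=None, method="principal")
probe.fit(sq_items)
eigenvalues, _ = probe.get_eigenvalues()
n_factors = int((eigenvalues > 1.0).sum())  # two SQ factors in the reported study

# Extract the retained factors, apply Varimax rotation, and inspect the loadings.
efa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
efa.fit(sq_items)
loadings = pd.DataFrame(efa.loadings_, index=sq_items.columns)
print(loadings.round(3))
print("Cumulative variance explained:", efa.get_factor_variance()[2].round(3))
```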
In Phase III, hypotheses were tested to validate the proposed BI SQ and IQ research
model based on IS success theory and the DeLone and McLean (1992, 2003) IS
success model as extended by Nelson et al. (2005). This study then gathered data
regarding the perceived SQ and IQ of BISI as it relates to perceived user system
satisfaction and perceived user information satisfaction from BISI. Since SQ and IQ
can separately influence user satisfaction, after determining the CVFs for SQ and IQ
of BISI, this study tested each construct of the proposed BI SQ and IQ research
model for reliability followed by the testing of the entire model. In addition to the
data analysis performed in Phase II of the study that established the CVFs for SQ and IQ of BISI, data were also analyzed in Phase III for the conceptual model con-
structs of perceived SQ of BISI, perceived IQ of BISI, perceived user system satis-
faction from BISI, and perceived user information satisfaction from BISI.
After conducting EFA via PCA using Varimax rotation, the Kaiser criterion was applied to the SQ factor analysis. Based on the Kaiser criterion, the results of the PCA factor analysis suggested that two SQ factors with a cumulative variance of 61.9% should be retained. Using the factor loadings, survey items were scrutinized for low loadings (<0.4) or for medium to high loadings (~0.4 to 0.6) on more than one factor. The results of this review indicated that five items could be eliminated from further analysis. Furthermore, the Cronbach’s Alpha analysis indicated that all remaining items supported the reliability of the items and the factors. Moreover, the Cronbach’s Alpha of each factor was 0.83 or higher, indicating very high reliability. As a further test of reliability, the Cronbach’s Alpha “if item is deleted” was calculated to test the reliability of the items for all SQ factors. Based on this analysis, it was concluded that the appropriate number of SQ factors for extraction was two, as represented in Table 6.1, comprising 12 items.
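As a companion to the reliability figures reported above, a minimal sketch of Cronbach's Alpha and the "alpha if item is deleted" diagnostic is shown below; the item matrix and file name are hypothetical, and the formula is the standard one rather than anything specific to this study.

```python
# A minimal sketch of the reliability checks described above: Cronbach's Alpha for
# the items retained on one factor, plus the "alpha if item is deleted" diagnostic.
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    """k/(k-1) * (1 - sum of item variances / variance of the summed scale)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    scale_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / scale_variance)


def alpha_if_item_deleted(items: pd.DataFrame) -> pd.Series:
    """Recompute alpha with each item removed in turn; a rise flags a weak item."""
    return pd.Series(
        {col: cronbach_alpha(items.drop(columns=col)) for col in items.columns}
    )


factor_items = pd.read_csv("reliability_sq_items.csv")  # hypothetical item matrix
print(f"Cronbach's Alpha: {cronbach_alpha(factor_items):.3f}")
print(alpha_if_item_deleted(factor_items).round(3))
```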
As a result of the analysis, integration flexibility SQ was found to explain the
largest variance in the SQ data collected and consisted of characteristics that
addressed the ability of the BI system to combine information using compatible
systems that supported integrated communications and transmissions among a vari-
ety of systems and the associated data in various functional areas. The new factor of
integration flexibility SQ was also comprised of the BISI SQ characteristics of
extendibility, expandability, modularity, and configurability, as well as adaptability
and scalability with an intuitive user interface. In particular, the characteristic of data portability was considered to be very important to BI users. It is clear that flex-
ibility in integrated systems is important to BISI success. Reliability SQ explained
the remaining variance in the data collected and represented a combination of the
characteristics of system dependability, recoverability, and low downtime. In
essence, BI users found the technical quality of the system to be important. The list
of SQ characteristics of BISI is provided in Table 6.2.
The results of the IQ EFA under PCA using Varimax rotation and the Kaiser criterion suggested that three IQ factors with a cumulative variance of 75.3% should be retained. Using the factor loadings, survey items were scrutinized for low loadings (<0.4) or for medium to high loadings (~0.4 to 0.6) on more than one factor. The results of this review indicated that three items could be eliminated from further analysis. The Cronbach’s Alphas of the individual factors indicated high reliability: representation IQ (0.896), intrinsic IQ (0.957), and accessibility IQ (0.852). Based on this analysis, it was concluded that the appropriate number of IQ factors for extraction was three, as represented in Table 6.3, comprising 14 items.
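The item-screening rule applied in both the SQ and IQ analyses can be expressed compactly. Under one reading of that rule (an item is dropped if it loads below 0.4 on every factor, or loads between 0.4 and 0.6 on more than one factor), a small Python sketch looks like this; the rotated-loadings table and its file are hypothetical.

```python
# One reading of the item-screening rule applied above: flag an item for elimination
# if it loads below 0.4 on every factor, or loads between 0.4 and 0.6 on more than
# one factor. The rotated-loadings table `loadings` and its file are hypothetical.
import pandas as pd


def items_to_eliminate(loadings: pd.DataFrame) -> list[str]:
    flagged = []
    for item, row in loadings.abs().iterrows():
        low_everywhere = (row < 0.4).all()
        cross_loading = ((row >= 0.4) & (row <= 0.6)).sum() > 1
        if low_everywhere or cross_loading:
            flagged.append(item)
    return flagged


loadings = pd.read_csv("iq_loadings.csv", index_col=0)  # items x rotated factors
print("Items flagged for elimination:", items_to_eliminate(loadings))
```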
Representation IQ was found to explain the largest variance in the IQ data col-
lected and consisted of characteristics that addressed the representation of information in BI systems, which rely on the user to ensure that IQ is retained as information from various sources is joined, aggregated, updated, configured, manipulated, and mapped into suitable representations and formats. The item IQC4
“traceability and verifiability of the source of information in BISI” loaded high on
the CVF of representation IQ. Accessibility IQ explained the next largest variance
in the data collected and included items representing a combination of ease of
access to locatable, obtainable, and searchable information. In essence, BI users
Integration flexibility SQ, item SQI2: the compatibility of BI system software with other software and hardware (excerpt from the table of SQ characteristics).
Representation IQ, item IQR4: information is reproducible in the BISI (excerpt from the table of IQ characteristics).
The strength and direction of the hypothesized relationships (Fig. 6.4) in the con-
ceptual model were validated using the partial least squares (PLS) method, a sub-type of structural equation modeling (SEM) used in performing CFA. The
bootstrapping resampling method (5000 samples) was also employed. As a result of
Phase II factor analysis, the model was revised to replace the proposed theoretically
anticipated CVFs of BISI with the empirically determined CVFs of BISI. The paths
from the two empirically assessed CVFs of SQ to the perceived SQ of BISI have
been named H1.1 and H1.2. Likewise, the paths from the three empirically assessed
CVFs of IQ to the perceived IQ of BISI have been named H2.1, H2.2, and H2.3. The
paths from user perceived SQ and user perceived IQ of BISI to perceived user sys-
tem satisfaction and perceived user information satisfaction from BISI as hypothe-
sized in the proposed BI SQ and IQ research model, based on the DeLone and McLean (2003) IS success model as extended by Nelson et al. (2005), were tested in
the overall context of BISI success.
The PLS-generated loadings for the SQ and IQ selected items and CVFs from Phase II EFA were again considered high, with the lowest item loading at 0.675. Moreover, the SQ CVFs of reliability SQ and integration flexibility SQ loaded at 0.836 and 0.898, respectively. The IQ CVFs of intrinsic IQ loaded at 0.957, accessibility IQ at 0.868, and representational IQ at 0.897. PLS was then used to empirically
test the conceptual model path coefficients to determine the significance of the relationships. As indicated in the conceptual model in Fig. 6.4, all CVFs of BISI for SQ and IQ had significant positive impacts on the perceived SQ and IQ of BISI.

Fig. 6.4 Structural equation model testing results of the conceptual model (*p < 0.05, **p < 0.01, ***p < 0.001). Path coefficients: integration flexibility SQ to perceived SQ of BISI (H1.1) 0.290***; reliability SQ to perceived SQ (H1.2) 0.151*; representational IQ to perceived IQ of BISI (H2.1) 0.164*; accessibility IQ to perceived IQ (H2.2) 0.158*; intrinsic IQ to perceived IQ (H2.3) 0.119*; perceived SQ to perceived user system satisfaction (H3) 0.263**; perceived IQ to perceived user information satisfaction (H4) 0.682***; perceived SQ to perceived user information satisfaction (H5) 0.129*; perceived IQ to perceived user system satisfaction (H6) 0.552***; SystemSat × InfoSat interaction to system satisfaction (H7a) 0.029 (n.s.) and to information satisfaction (H7b) 0.038 (n.s.). R2 values: perceived SQ of BISI 0.164, perceived IQ of BISI 0.143, user system satisfaction 0.576, user information satisfaction 0.589.
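The chapter does not publish the PLS estimation code. Purely to make the 5,000-sample bootstrapping idea concrete, the sketch below bootstraps a single path using a plain bivariate OLS slope on already-computed composite scores; the column names, the file, and the OLS stand-in are ours, not the authors' procedure.

```python
# An illustrative bootstrap of one structural path (H3: perceived SQ -> perceived
# user system satisfaction), assuming composite construct scores have already been
# computed. A bivariate OLS slope stands in for the PLS estimate purely to make
# the 5,000-resample significance test concrete; names and file are hypothetical.
import numpy as np
import pandas as pd

scores = pd.read_csv("construct_scores.csv")  # hypothetical composite scores


def standardized_slope(df: pd.DataFrame, x: str, y: str) -> float:
    """Standardized regression slope of y on x (equals their correlation)."""
    xs = (df[x] - df[x].mean()) / df[x].std()
    ys = (df[y] - df[y].mean()) / df[y].std()
    return float(np.polyfit(xs, ys, 1)[0])


estimate = standardized_slope(scores, "perceived_sq", "system_sat")
boot = np.array([
    standardized_slope(scores.sample(len(scores), replace=True),
                       "perceived_sq", "system_sat")
    for _ in range(5000)
])

# Percentile confidence interval: the path is deemed significant if 0 lies outside.
low, high = np.percentile(boot, [2.5, 97.5])
print(f"H3 path estimate = {estimate:.3f}, bootstrap 95% CI = [{low:.3f}, {high:.3f}]")
```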
6.5 Findings
The results of the testing of the hypotheses clearly indicated support for the empiri-
cally determined CVFs of SQ and IQ of BISI as depicted in Table 6.5. Moreover,
these results provided evidence that many of the antecedents uncovered in the litera-
ture and by the expert panel in the qualitative phase of the study were highly valued
by BI users and contributed to the strength of the relationships between the CVFs of
BISI and perceived SQ and IQ of BISI. Furthermore, seven of nine items recom-
mended for inclusion in the survey by the expert panel were reliable and grouped
accordingly within the retained CVFs.
The results confirm that there is a significant positive impact of perceived SQ on perceived user system satisfaction as well as a significant positive impact of perceived IQ on perceived user information satisfaction. The results also provided confirmation of significant positive crossover relationships from the perceived SQ and IQ of BISI to perceived user system satisfaction and perceived user information satisfaction from BISI.
Table 6.5 Summary of hypotheses testing results
H2.1-3 (Supported): The CVFs of representational IQ, accessibility IQ, and intrinsic IQ will have a positive significant impact on IQ for BISI success.
H3 (Supported): The perceived SQ of BISI will have a positive significant impact on perceived user system satisfaction from BISI.
H4 (Supported): The perceived IQ of BISI will have a positive significant impact on perceived user information satisfaction from BISI.
H5 (Supported): The perceived SQ of BISI will have a positive significant impact on perceived user information satisfaction from BISI.
H6 (Supported): The perceived IQ of BISI will have a positive significant impact on perceived user system satisfaction from BISI.
H7a (Not supported): The interactions of perceived user system satisfaction from BISI and the perceived user information satisfaction from BISI will have a positive significant impact on perceived user system satisfaction from BISI.
H7b (Not supported): The interactions of perceived user system satisfaction from BISI and the perceived user information satisfaction from BISI will have a positive significant impact on perceived user information satisfaction from BISI.
It is also noted that the interaction effect did not have a significant positive impact on either
perceived user information satisfaction from BISI or perceived user system satisfac-
tion from BISI. These results were shared with members of the expert panel who
expressed their agreement and support of the findings.
6.6 Discussion
The main goal of this study was to validate empirically a model for IS success that
investigated user satisfaction in the context of BISI by uncovering the CVFs of SQ
and IQ necessary to derive BISI success. The study found that a BISI project should
place emphasis on the CVFs of integration flexibility SQ and reliability SQ as the
primary drivers for SQ of BISI success. Emphasis should also be placed on the
CVFs for IQ of representational IQ, intrinsic IQ, and accessibility IQ, as the primary
drivers for IQ of BISI success.
The CVF of integration flexibility SQ had the most significant effect on the SQ
of BISI as greater emphasis was placed on the capability of the BI system to easily
combine information from multiple sources while retaining compatibility with
other software and hardware. This is important to users of BI analytics as the abil-
ity of the BI system to communicate and transmit a variety of data to and from other
systems supporting different functional areas is necessary for BISI success. This
had previously been understood to be merely a relevant attribute and expected in
BI systems that leveraged data warehouse technologies (Nelson et al. 2005). The
results of this study also confirm the importance of integration flexibility SQ to
facilitate integration of changing information from various sources to support
business decisions. The system must be flexible in supporting ad hoc and unplanned
requests for information in various representations. Reliability SQ was also con-
sidered as an important CVF as system dependability, recoverability, and low
downtime are valued by BI users. On the other hand, the SQ CVF of response time
SQ was not a reliable CVF in BISI success. It may be that response time for BISI
was considered less important as a separate CVF but was assumed to be available
in reliable and flexible BI systems. It might also be possible that due to the analyti-
cal nature of BI systems, response time does not carry the same level of impor-
tance as would be necessary in a transaction-based system.
The CVF of representation IQ had the most significant effect on IQ as the repre-
sentation of information in BI analytical systems, as with most analytical based
applications, relies on the user to ensure that IQ is retained as information from vari-
ous users and sources are joined, aggregated, updated, configured, manipulated, and
mapped into suitable representations and formats. Of particular interest was the
high level of importance placed on traceability, verifiability, and the ability to repro-
duce information in BISI. This may point to user recognition of the need for
accountability for the output produced by the user in BI analytical systems. The
CVF of accessibility IQ was also considered important in successful BISI as empha-
sis was placed on the importance of ease of access to locatable, obtainable and
searchable information as well as the security of the accessed information and the
ability to navigate within the BI system. Intrinsic IQ was also a reliable CVF as
information accuracy, consistency, reliability, and correctness have generally been a
cornerstone to BI success. The CVF of contextual IQ, however, was not a reliable
CVF of perceived IQ of BISI. This may be due to the nature of BI systems which
often rely on historical data to perform analytics and, as with response time expecta-
tions and assumptions, the contextual characteristics of currency, timeliness, suffi-
ciency, and relevancy of information may be assumed to be of less importance than
in systems that are more time dependent and transaction oriented.
The effects of perceived SQ and IQ of BISI on perceived user system and infor-
mation satisfaction from BISI were also of particular interest in the study. The
perceived IQ of BISI had a significant positive impact on perceived user informa-
tion satisfaction from BISI. Perceived IQ of BISI also had a significant positive
impact on perceived user system satisfaction from BISI. While the perceived SQ of
BISI had a significant positive impact on perceived user system satisfaction from
BISI there was less of an impact on perceived user information satisfaction from
BISI, thereby highlighting the differences between the BI system and the informa-
tion produced. It is apparent that BI analytical systems provide advanced interfac-
ing capabilities that may influence the users’ perception that the interaction with the
interface has an impact on the output produced, thereby making it difficult to dif-
ferentiate between the system interface and the user’s responsibility for the quality
of the output. This study also confirms that while the empirically determined CVFs
of SQ and IQ of BISI and their crossover effects are perceived to be important to
user perceived SQ and IQ satisfaction from BISI, the strength of the impact of IQ on system satisfaction corresponds to the importance users place on the output in analytical BISI. Moreover, this finding emphasizes the differences between the BI system
tools and the output that is produced as well as the need for BI system implementers
to accept responsibility for IQ. The results of this study and particularly the cross-
over effects found in the research model shed light on our understanding of quality
and highlight a continuum of interactivity in BISI that distinguishes SQ and IQ
characteristics and their effect on output and user perceived satisfaction.
Our study has several implications in the field of BI, particularly for practitioners. First,
it contributes to the body of knowledge by empirically identifying the CVFs and char-
acteristics of SQ and IQ that users find important in successful BISI. Second, this
study empirically addressed the relationship between the quality of the BI system (SQ)
and its output (IQ). The study determined that there was a significant positive impact
from perceived SQ and IQ of BISI on perceived user system and information satisfac-
tion from BISI. Previous studies in BISI placed emphasis on the use of a data warehouse (DW) within the BISI domain. However, while a DW can be used with varying levels of importance in BI systems, BI systems can also exist without a DW. There had also been some ambi-
guity between the system (SQ) and its output (IQ) whereby the strength of the relationship
between SQ and information satisfaction was stronger than the relationship between IQ
and information satisfaction. The empirically developed findings of this study are in line
with expectations for system success as theorized in the BI SQ and IQ research model,
based on the DeLone and McLean IS success model (1992, 2003) as extended by
Nelson et al. (2005). Lastly, this study identified characteristics of SQ and IQ that are
valued or important in BISI, thereby assisting practitioners in determining the best areas
of focus for BISI success. This study provided compelling evidence that the antecedents
and CVFs of integration flexibility SQ and reliability SQ are important to BISI success.
Moreover, this study also provided compelling evidence that the antecedents and CVFs
of representation IQ, accessibility IQ, and intrinsic IQ are important to successful BISIs.
This study represents the first empirical analysis of CVFs that affect SQ and IQ for
BISI success and has uncovered important factors and characteristics for BISI success
that will enable BI stakeholders to better optimize scarce resources.
The primary limitation of this study concerns the possibility that participants may
have varying degrees of exposure to analytical BI systems. While BI systems are
associated with decision making, the complexity of the implemented system and the
interpretation of its output could require skill levels that may not be consistent among
all participants. It is, therefore, assumed for the purposes of this study that partici-
pants had, at a minimum, BI or analytical system implementation experience. The
gender differences among BI users may also be examined more closely, as twice as many females as males participated in the survey. Another limitation concerns the lack of consistency in the BI technologies used. For example, one
participant may have experienced BI using the IBM Cognos tool. Another participant
may have experienced BI using systems that were integrated in an ERP system.
Our study provided a solid theoretical foundation from which future studies can
originate. Firstly, it was designed to empirically validate a model for IS success for
user satisfaction in the context of BISI and although the individual CVFs of SQ and
IQ necessary to derive BISI success were significant, future studies may be war-
ranted to examine and assess other constructs and items that are important to BI
systems users and that lead to BISI success, such as governance and service quality.
Moreover, BI systems are expected to accommodate the big data phenomenon, which represents additional, unusual, and complex sources of data in BISI (Wixom
et al. 2014). Furthermore, future research could assess the needs of BISI in a big
data environment whereby information is often unstructured. With more attempts to
manipulate input streams, many issues have been raised in the field of big data,
accompanied by a wide variety of potential failures. There have been few attempts
to actually apply big data analytics to the validation of big data, particularly in the
analysis of data streams (Wixom et al. 2014). Social media, for instance, is open to a
wider range of validation techniques. This could explain, in part, the high degree of
importance placed by BI users in this study on validity of data sources. This finding
may also point to the need to establish tailored systems development methodologies
with emphasis on testing and verification for the delivery of BI systems in the future.
6.9 Conclusion
This study provided further evidence that the antecedents of integration flexibility
SQ and reliability SQ are important to BISI success. Moreover, it demonstrated
compelling evidence that the antecedents and CVFs of representation IQ, accessi-
bility IQ, and intrinsic IQ are important to successful BISI. These findings confirm
the widely held view that BISI is not a conventional application-based IT project but
a complex undertaking requiring an appropriate infrastructure over a lengthy period
of time. The findings also confirm that successful BISIs require a robust and easy-to-use interface for user-driven representation of information from multiple integrated heterogeneous sources in an analytical, user-based decision support system context
(Goodhue and Thompson 1995; Yeoh and Koronios 2010). Our study also reported
that there is a significant effect in the relationships of perceived IQ of BISI to per-
ceived user information and system satisfaction, thereby confirming the importance
BI system users place on information and the system output produced.
References
Arazy O, Kopak R (2011) On the measurability of information quality. J Am Soc Inf Sci Technol
62(1):89–99
Arnott D, Pervan G (2008) Eight key issues for the decision support systems discipline. Decis
Support Syst 44(3):657–672
Bailey JE, Pearson SW (1983) Development of a tool for measuring and analyzing computer user
satisfaction. Manag Sci 29(5):530–545
Boynton AC, Zmud RW (1984) An assessment of critical success factors. Sloan Manag Rev
25(4):17–27
Chee T, Chan L-K, Chuah M-H, Tan C-S, Wong S-F, Yeoh W (2009) Business intelligence sys-
tems: state-of-the-art review and contemporary applications. In: Symposium on progress in
information & communication technology, p 96–101
Clark TD, Jones MC, Armstrong CP (2007) The dynamic structure of management support sys-
tems: theory development, research focus, and direction. MIS Q 31(3):579–615
DeLone WH, McLean ER (1992) Information systems success: the quest for the dependent vari-
able. Inf Syst Res 3(1):60–95
DeLone WH, McLean ER (2003) The DeLone and McLean model of information systems suc-
cess: a ten-year update. J Manag Inf Syst 19(4):9–30
Dhillon G, Bardacino J, Hackney R (2002) Value-focused assessment of individual privacy con-
cerns for internet commerce. In: Proceedings of the Twenty-Third international conference on
information systems, p 705–709
Dhillon G, Torkzadeh G (2001) Value-focused assessment of information system security in orga-
nizations. In: Proceedings of the twenty-second international conference on information sys-
tems, p 561–565
Dinter B, Schieder C, Gluchowski P (2011) Towards a life cycle oriented business intelligence suc-
cess model. In: Proceedings of the Americas conference on information systems
Etezadi-Amoli J, Farhoomand AF (1991) On end-user computing satisfaction. MIS Q 15(1):1–5
Gatian AW (1994) IS user satisfaction: a valid measure of system effectiveness? Inf Manag
26(1):119–131
Goodhue DL (1995) Understanding user evaluations of information systems. Manag Sci 41(12):
1827–1844
Goodhue DL, Thompson RL (1995) Task-technology fit and individual performance. MIS Q
19(2):213–236
Hair JF, Anderson RE, Tatham RL, Black WC (1994) Multivariate data analysis. Prentice Hall,
Upper Saddle River, NJ
Hanson WE, Plano-Clark VL, Petska KS, Creswell JW, Creswell JD (2005) Mixed methods
research designs in counseling psychology. J Counsel Psychol 52(2):224–235
Howson C (2008) Successful business intelligence: secrets to making BI a killer application.
McGraw-Hill, New York
Iivari J (2005) An empirical test of the DeLone-McLean model of information system success.
ACM SIGMIS Database 36(2):8–27
Isik O, Jones MC, Sidorova A (2013) Business intelligence success: the roles of BI capabilities and
decision environments. Inf Manag 50(1):13–23
Jarke M, Vassiliou Y (1997) Data warehouse quality: a review of the DWQ project. In: Proceedings
of the conference on information quality, p 299–313
Jourdan Z, Kelly RK, Marshall TE (2008) Business intelligence: an analysis of the literature. Inf
Syst Manag 25(2):121–131
Keeney RL (1992) Value-focused thinking. Harvard University Press, Cambridge, MA
Keeney RL (1999) The value of internet commerce to the customer. Manag Sci 45(4):533–542
Lee YW, Strong DM, Kahn BK, Wang RY (2002) AIMQ: a methodology for information quality
assessment. Inf Manag 40(1):133–146
Levy Y (2003) A study of learner’s perceived value and satisfaction for implied effectiveness of
online learning systems. Diss Abstr Int A65(03):1014
Levy Y (2006) Assessing the value of e-learning systems. Information Science, Hershey, PA
Levy Y (2008) An empirical development of critical value factors (CVF) of online learning activi-
ties: an application of activity theory and cognitive value theory. Comput Educ 51(4):1664–1675
Levy Y (2009) A value-satisfaction taxonomy of IS effectiveness (VSTISE): a case study of user
satisfaction with IS and user-perceived value of IS. Int J Inform Sys Service Sect 1(1):93–118
Luftman J, Ben-Zvi T (2010) Key issues for IT executives 2009: difficult economy’s impact on
IT. MIS Q Exec 9(1):49–59
Marshall L, de la Harpe R (2009) Decision making in the context of business intelligence and data
quality. SA J Inform Manage 11(2):1–15
Mertler CA, Vannatta RA (2001) Advanced and multivariate statistical methods: practical applica-
tion and interpretation. Pyrczak, Los Angeles, CA
Nah F, Siau H, Sheng H (2005) The value of mobile applications: a study on a public utility com-
pany. Commun ACM 48(2):85–90
Negash S, Gray P (2008) Business intelligence. In: Burstein F, Holsapple CW (eds) Handbook on decision support systems 2. Springer, Berlin
Nelson RR, Todd PA, Wixom BA (2005) Antecedents of information and system quality: an empir-
ical examination within the context of data warehousing. J Manag Inf Syst 21(4):199–235
Petter S, McLean E (2009) A meta-analytic assessment of the DeLone and McLean IS success
model: an examination of IS success at the individual level. Inf Manag 46(3):159–166
Petter S, DeLone W, McLean E (2013) Information systems success: the quest for the independent
variables. J Manag Inf Syst 29(4):7–61
Popovic A, Hackney R, Coelho PS, Jacklic J (2012) Towards business intelligence systems
success: effects of maturity and culture on analytical decision making. Decis Support Syst
54(1):729–739
Power DJ (2008) Understanding data-driven decision support systems. Inf Syst Manag 25(2):
149–154
Rai A, Lang SS, Welker RB (2002) Assessing the validity of IS success models: an empirical test
and theoretical analysis. Inf Syst Res 13(1):50–69
Rokeach MJ (1969) Beliefs, attitudes, and values. Jossey-Bass, San Francisco, CA
Rokeach MJ (1973) Nature of human values. The Free Press, New York, NY
Ryu KS, Park JS, Park JH (2006) A data quality management maturity model. ETRI J 28(2):191–204
Schumacker RE, Lomax RG (2010) A beginner’s guide to structural equation modeling. Routledge,
New York, NY
Seddon PB (1997) A respecification and extension of the DeLone and McLean model of IS suc-
cess. Inf Syst Res 8(3):240–253
Sethi V, King RC (1999) Nonlinear and noncompensatory models in user information satisfaction
measurement. Inf Syst Res 10(1):87–96
Shank G (2006) Six alternatives to mixed methods in qualitative research. Qual Res Psychol
3(4):346–356
Sheng H, Nah K, Siau K (2005) Strategic implications of mobile technology: a case study using
value-focused thinking. J Assoc Inf Syst 9(6):344–376
Sheng H, Siau K, Nah FF (2010) Understanding the values of mobile technology in education: a
value-focused thinking approach. ACM SIGMIS Database 41(2):25–44
Siau K, Nah F, Sheng H (2004) Value of m-Commerce to customers. In: Proceedings of the tenth
Americas conference on information systems, p 2811–2815
Straub D (1989) Validating instruments in MIS research. MIS Q 13(2):147–169
Todd G (2009) The imperative of analytics. Inf Manag 19(2):44–47
Urbach N, Smolnik S, Riempp G (2009) The state of research on information systems success: a
review of existing multidimensional approaches. Busin Inf Syst Eng 1(4):315–325
Wand Y, Wang RY (1996) Anchoring data quality dimensions in ontological foundations. Commun
ACM 39(11):86–95
Wang RY, Strong DM (1996) Beyond accuracy: what data quality means to data consumers.
J Manag Inf Syst 12(4):5–34
Watson HJ, Goodhue DL, Wixom BH (2002) The benefits of data warehousing: why some organi-
zations realize exceptional payoffs. Inf Manag 39(1):491–502
Wixom BH, Ariyachandra T, Douglas D, Goul M, Gupta B, Iyer L, Kulkarni U, Mooney JG,
Phillips-Wren G, Turetken O (2014) The current state of business intelligence in academia: the
arrival of Big Data. Commun Assoc Inf Syst 34(1):1–13
Yeoh W, Koronios A (2010) Critical success factors for business intelligence systems. J Comput
Inf Syst 50(3):23–32
Chapter 7
Business Intelligence System Use in Chinese
Organizations
Abstract Chinese business has developed exponentially in the last few decades
and Chinese firms are highly influential in world trade. Business intelligence (BI)
systems are large-scale decision support systems (DSS) that analyze enterprise data
to generate business insights. BI was developed in the West and is integral to con-
temporary Western management practices. It is generally assumed that Western BI systems are usable and effective in a Chinese context. No study has been under-
taken to investigate the use behavior of large-scale DSS in Chinese organizations.
We conducted two exploratory case studies in large indigenous Chinese organiza-
tions. The case analysis shows that a complex cultural factor (provisionally termed
Factor X) affects BI systems use in China. A set of propositions are formulated from
the analysis. They will be used as a foundation for future research on Chinese BI.
7.1 Introduction
Guanxi is a universal and unique Chinese cultural norm (CN). Guanxi refers to ‘a
whole complex of social practices, strategies and ethics of the exchange and reciproc-
ity of gifts, favors and banquets’ (Davis 2005, p. 232). It has been practised for centu-
ries and remains highly relevant in Chinese society today. Guanxi has affected the
evolution of Chinese society, economy, and business environment. Guanxi is not as
simple as Western relationships; Western relationships grow out of deals, while
Chinese deals grow out of relationships. Gu et al. (2008) identified eight key differ-
ences from Western perspectives when considering guanxi as a construct and gover-
nance strategy in business environments. Guanxi has attracted research attention.
Many studies have been conducted in foreign companies that operate their business
in China (for example, Millington et al. 2005).
Business intelligence (BI) is a large-scale decision support systems (DSS)
approach that analyzes enterprise data using specialized techniques to generate busi-
ness insights. Managers, at different levels, may leverage these business insights to
make decisions in order to improve organization performance. The “mega BI ven-
dors” (Microsoft, Oracle, IBM, and SAP) have increased their investments in BI over
recent years. Gartner Inc. reported that BI and analytics is the top technology invest-
ment priority for CIOs, and has been in the top five technology priorities over the last
decade (Gartner 2015). The Chinese economy attracts considerable attention from IT
vendors that wish to sell contemporary Western technology, such as BI. The assump-
tion is that Western IT is appropriate to Chinese uses. However, research has lagged
behind vendor efforts, and the nature of the use of BI systems in large indigenous Chinese
organizations remains unknown. This is the research gap addressed by this project.
This chapter describes an exploratory study that develops a set of propositions of
BI systems use in indigenous Chinese organizations. It is explicitly concerned with
Chinese CN, the nature of Chinese organizations, and Chinese business decision-
making. The remainder of this chapter is structured as follows: the next section
outlines relevant literature, and identifies the research constructs and concepts that
will inform the empirical research. This is followed by the research design discus-
sion, which in turn is followed by the case study analysis and results. This analysis
leads to the formulation of the Chinese BI systems use propositions.
The aim of this section is to identify research constructs and concepts that may be
relevant to Chinese BI systems use. The section discusses literature that is relevant
to the project in three subsections: IS and BI research in China, Chinese CN, and
research constructs and concepts.
The literature review scope was broadened to information systems (IS) in general but restricted to mainland Chinese IS studies. Davison and Martinsons (2016) argued that the particularism of a research design is critical to its validity. Studies that were conducted of the Chinese diaspora, that is, where data were
collected in Hong Kong, Macau, and Taiwan, were excluded from the sample due to
variations in their CN.
A keyword search was conducted in the ‘basket of eight’ journals. Two elite IS journals, Information & Management and Decision Support Systems, were then added to the sample because of their A* rating by the Australian Business Deans Council and their relevance to DSS. Within the sample of 162 published papers,
only eight papers (5.4%) focused on DSS. The only Chinese BI study, Li et al.
(2013), focused on intrinsic and extrinsic motivation as a lens to investigate both
routine and innovative use of BI systems in the post-acceptance period. Cultural
aspects of BI systems use were not investigated in this study. Two of the Chinese
DSS papers, Zhang et al. (2007) and Lowry et al. (2010), employed Hofstede’s cul-
ture dimensions in their investigation of how national cultural differences affect group decision-making. Hofstede’s (1980) theory concerns comparisons between nations and was formulated between 1967 and 1973 from research conducted at IBM. Hofstede’s theory is not appropriate for this project because its focus is on cultural comparisons between countries, whereas this project concerns a single culture and country: the People’s Republic of China.
There have been many debates about applying theories, guidelines, and other forms of
research outcomes that have been derived from studying Western business to Eastern
contexts. In China, IS use is constantly shaped by Chinese business culture (Martinsons
and Westwood 1997). Cultures determine how people react and behave, and research
results should only be applied to other cultural contexts with caution (Mao and Palvia
2008). Western IS theories may or may not be applicable to Chinese practice.
Researchers have been aware of cultural impacts in research conducted in China.
Many researchers briefly mentioned cultural impacts in their research limitations dis-
cussion; examples include Liu et al. (2013) and Wang et al. (2013). A small number
of researchers have considered culture as a factor in their studies (14 of 162 or 8.64%).
Two of these 14 papers studied guanxi as a research concept: Shin et al. (2007) pointed
out that guanxi and collectivism had a stronger influence on in-group than on external
information sharing, while Confucian dynamism appeared to have less effect on in-
group information sharing. Davison et al. (2013) argued that guanxi strengthened
transactive memory networks, and facilitated knowledge sharing. Nevertheless, guanxi has not been investigated in the context of BI or the wider DSS discipline.
Technology adoption and use represent different phases of the utilization of technol-
ogy, and are affected by different factors. Karahanna et al. (1999) concluded that
adoption and use are sequential phases of technology acceptance. Adoption is deter-
mined by normative considerations, while continued usage is determined by attitu-
dinal factors (Karahanna et al. 1999). This is consistent with Deng and Chi’s (2012)
identification of different BI use patterns between the initial usage phase and the
continued usage phase. Researchers have used the Technology Acceptance Model
(TAM) (Davis 1989) and the Unified Theory of Acceptance and Use of Technology
(UTAUT) (Venkatesh et al. 2003) to investigate not only technology acceptance but
also use behaviors in China. For example, Liu and Forsythe (2011) used these
theories to examine the continuous use of online shopping channels and Zhou
(2011) investigated the use of mobile Internet in China. Further, using constructs
from TAM and UTAUT to investigate BI systems use is informed by the adaptation
of TAM and UTAUT for IT continuance by Bhattacherjee and Lin (2015) and the
adaptation of UTAUT for clinical DSS by Shibl et al. (2012).
In UTAUT, use behavior (UB) is the actual use of a system, and behavioral intention (BeI) is an individual’s attitude towards using the system. Early research indi-
cated that BeI is a major determinant of UB (Davis et al. 1989). Social influence
refers to the extent to which important others believe the person should use or not
use a particular technology. This factor involves investigating how CN directs social
interactions between people, and how people will react to these social interactions.
CN is a subjective factor that could influence both BeI and UB. Leidner and
Kayworth (2006) reviewed 82 IS journal articles, and concluded that CN has signifi-
cant impacts, especially in IS development and use. In China, CN consists of many
dimensions but is especially informed by guanxi.
Perceived usefulness (PU) and perceived ease of use (PEOU) and their effect on
BeI has been extensively researched in IS (Taylor and Todd 1995). Perceived facili-
tating conditions (FC) refers to the degree to which an individual believes that an
organizational and technical infrastructure exists to support the use of the system
(Venkatesh et al. 2003). Govindarajan and Trimble (2012) argued that the degree of
IT innovation is higher in emerging economies than in Western countries. This is
because organizations in an emerging economy do not have extensive legacy sys-
tems. This means that there are fewer constraints on the level of IS innovation. As a
result, FC may have a higher influence in a Chinese context than in the West.
Gender and age are self-explanatory constructs. Generational differences are
derived from, and shaped by, political, socioeconomic, and cultural events. Unlike in the West, the major events that caused generational differences in China were the
foundation of the PRC in 1949, the Great Leap Forward in the 1960s, the Cultural
Revolution in the 1970s, and the One Child and Open Policies in the 1980s. Erickson’s
(2009) framework of classifying Chinese generations has been adopted in this
research, namely Traditionalists (1928–1945), Boomers (1946–1964), Generation X
(1965–1979), and Generation Y (1980–1995). Age affects an individual’s status in an
organization, as respecting one’s elders is a key moral principle in China.
Emerging economies do not follow Western trajectories of economic develop-
ment because of their different infrastructures, geographies, cultures, languages,
and governments (Govindarajan and Trimble 2012). This means that managers who
have worked or trained in the West can have difficulty in adapting to current Chinese
circumstances. Therefore in this research, experience will have at least two dimen-
sions: category (work or education) and place (domestic or overseas).
Voluntariness of use refers to the degree to which BI users are forced to use the
system. Chinese managers may have less discretion in their use of BI systems than
those in the West. In addition, managers who experienced the Cultural Revolution
may lack technology and mathematics literacy. They may use assistants or interme-
diaries to operate BI systems at a higher rate than Western managers do.
Table 7.1 summarizes the research constructs and definitions that were discussed.
The research scope excludes Hong Kong, Macau, and Taiwan. Taking Hong Kong as
an example, executives from Hong Kong may be considered as Chinese insiders by
Westerners but they will be treated as outsiders by indigenous Chinese business peo-
ple (Fock and Woo 1998). As a result, this research focuses only on large business
organizations in mainland China. Further, these organizations are indigenous to
China and are not the Chinese subsidiaries of foreign multinational corporations. The
majority of general IS research that has been conducted in China has used a survey
method. However, survey data is not rich enough for an exploratory study of BI sys-
tems use in a Chinese context. A multiple case study research design was adopted
from Yin (2014) with semi-structured interviews as the data collection technique.
The case selection criteria were based on the above discussion. First, the organi-
zations involved were indigenous mainland Chinese. Second, the case organizations were large, as BI systems provide large-scale decision support. Large
organizations are more likely to have a complex managerial hierarchy, which will
help to understand BI systems use behavior across different managerial levels.
Third, researchers had to obtain access and receive sufficient support from high-
level management in the case organizations. There were two sets of participants:
developers who could provide sufficient details about BI systems, and managers
and senior professionals who use BI systems to support their decision tasks.
Based on the identified research constructs and concepts, interview protocols were
developed in English and translated into Chinese for data collection. This research
adapted a back-translation technique based on Brislin (1970). These adaptations
were inspired by Jones et al. (2001) and Sousa and Rojjanasrirat (2011). A pilot study
to pre-test the interview protocols was conducted with professionals and academics
who have extensive experience in the BI field.
Two large indigenous Chinese organizations were selected for this research. The
first case is the Chinese Insurance Company (CIC). The name of the company has
been disguised for ethics approval reasons. CIC has more than 5,000 employees.
Though CIC was founded by a local Chinese group with foreign investment in 2002,
CIC transferred to total Chinese ownership in 2010. Most employees are mainland
Chinese and do not have overseas qualifications or work experience. CIC sells
insurance policies all over China via traditional physical sales offices in many prov-
inces and online transactions.
The second case is Alibaba Group (AG), which is the largest Internet business in
China. AG is representative of a group of newly established Internet companies in
China. AG has a Western-like appearance that is significantly different from traditional Chinese companies like CIC. AG had about 35,000 employees in 2015. AG was founded by a local Chinese group led by Jack Ma in 1998; it later floated on the New York Stock Exchange.
Empirical data collection was carried out from July to September, 2014. Data was col-
lected in three branches of CIC, located in two cities, and three campuses of AG that
are located in two cities. No incentive was offered to any participant. Detailed notes
were taken during all interview sessions at CIC, while at AG, in addition to notes, all
interview sessions were audio recorded. All field notes were recorded, stored, and
sorted by interview session, and all audio recordings were transcribed. This research
adopted recommendations from Miles et al. (2013) to guide data analysis.
All field notes and transcripts were loaded into a qualitative data analysis software
package—NVivo. The first cycle of coding used techniques from Miles et al. (2013),
namely provisional coding (an exploratory method) and simultaneous coding (a
grammatical method). Provisional codes were generated from the literature review.
This set was later expanded to include new codes. Under simultaneous coding one
piece of a transcript could be coded under multiple codes. These techniques assisted
in creating codes from emerging themes and in preparing for the second cycle of coding.
This section discusses the removal of some constructs from further consideration, the refinement of relevant constructs, and the addition of emerging constructs to the propositions. This section also identifies relationships among constructs according to their strength and importance. It is important to mention that only constructs and concepts were taken from the literature review; the relationships between constructs were developed from the BI case study data.
After the case study analysis, gender, voluntariness of use, FC, and BeI were removed from the proposition development on BI systems use. No significant
difference was found between male and female use patterns in Chinese BI systems.
Most BI users stated that the use of BI was mandatory, while voluntary use was typi-
cal for managers and senior professionals who did not have extensive analysis
needs. All users confirmed that managers were in favor of staff using BI systems,
and both organizations provided sufficient support to BI systems development and
enhancement. There was no budget constraint on BI systems in either case. UB in
Chinese BI systems does not depend on, and is not affected by, BeI. This is partly a con-
sequence of the lack of voluntary usage patterns. Based on the data analysis, gender,
voluntariness of use, FC, and BeI did not appear to affect the relationships among
important research constructs.
The case data analysis yielded five propositions to guide further investigation on
Chinese BI use.
Proposition 1. BI systems are developed for supporting specific decision tasks
in the organizations. This proposition involves two important emerging constructs
from the data analysis. A decision is a commitment to an action. A decision task is
part of a manager’s job that requires decision-making. The nature of a decision task
(NT) has been categorized differently in DSS research with the most influential
classifications being Simon (1960) and Anthony (1965). Simon’s decision types are
based on the levels of understanding or structure that is perceived in a decision situ-
ation. Anthony’s managerial activity categories help to classify decision tasks in
terms of operational, tactical, and strategic management activities. In the seminal article of the DSS field, Gorry and Scott Morton (1971) proposed a two-dimensional IS task framework based on the Simon and Anthony decision typologies.
The decision tasks of CIC and AG managers were applied to the Gorry and
Scott Morton (1971) decision task framework. In the case studies no BI systems
were used to support unstructured or strategic tasks. Semi-structured tactical tasks
were the most commonly reported decisions in both CIC and AG. The low propor-
tion of structured tasks may be due to the seniority of the participants, but it also
may be the situation that structured tasks are supported by other IT systems.
Examples of decision tasks include a structured operational task, after-sale support responding to clients’ further enquiries about purchased insurance policies (CIC); a structured tactical task, identifying which developmental phase of electronic business has more laws and regulations established, an analysis that may later feed strategic decisions (AG); a semi-structured operational task, financial data analysis support that consolidates data from alternative source systems (CIC); and a semi-structured tactical task, identifying and consolidating real-life business cases and providing them to external researchers (AG).
The nature of a BI system (NS) refers to the overall characteristics of a BI sys-
tem, which are determined by the technology environment, data quality, and system
quality. Data with high quality are current, well maintained, and at the appropriate
level of detail. High system quality occurs when BI systems are highly reliable and
dependable, and available on the platforms that users need. In both CIC and AG it
was often difficult to locate where and when the needed data were available.
Understanding NS is essential in identifying its impacts on the alignment of deci-
sion tasks with the BI system (TSA) and PEOU.
CIC has several generations of DSS/BI that were developed in house, including
the Management Information Systems (MIS), Key Performance Indicator (KPI),
Customer Management Systems (CMS), and the new core BI system. MIS was
copied directly from a Western product and was developed for regulatory reporting
to the Insurance Supervision Department. CIC built most decision support for other
decision tasks on the MIS, and the system became inefficient. The two most men-
tioned BI applications are KPI and CMS, where KPI mainly focuses on audit opera-
tions and summary purposes, and CMS is mainly used to investigate the details of
policies. These two systems were built to solve the shortcomings of MIS. KPI and
CMS also assisted with CIC’s compliance with industrial standards. Data stored in
these systems need to be manually exported and consolidated, and inconsistencies
were frequently reported due to different extract, transform, and load (ETL) processes and data definitions among systems, branches, and reports. In October 2014, CIC tested a new system that was planned to replace all existing BI systems in late 2015.
AG’s technological environment is unique. Data are often only collected and
used within the same business unit (BU) to maintain a high level of data security,
though AG has a consolidated data platform in operation. If anyone requests to
review data from another BU, then the request needs to be assessed by many layers
of management. As a result, different BUs have developed their own BI systems
based on their data analysis needs. At AG, all developed platforms, applications,
tools and systems are assigned to individual system managers who take responsibility for monitoring the status of the systems, processing user requests, and refining the systems. The most common decision tasks were summarizing, monitoring, and predicting business operation performance.
In both case studies BI systems were developed for one fundamental reason—
supporting specific decision tasks. This could be a point of difference with Western
BI. The data from the cases show that NT determines NS.
Proposition 2. Decision task—BI system alignment is a superior research con-
struct to perceived usefulness in Chinese BI use. According to past research,
PU is important in determining BeI and therefore UB (Taylor and Todd 1995). PU
refers to an actual user’s perception of subjective probability that using a specific
application system will increase their performance. PU is one of the key factors in
utilization-focused IS models (e.g. TAM). BI systems were used for multiple pur-
poses at CIC and AG. Internal users focused on comparing, managing, reporting, and
analysis. For external users, BI systems were used to report on industrial standards,
respond to data requests by government officials, and to assist business partners.
TAM was built by testing word processor use and TTF was built by testing different applications in industry in the 1990s. Technology has advanced significantly since then, and contemporary BI use differs from use in the TAM and TTF era. Moreover, TAM is a utilization-focused model, while TTF is a fit-focused model. A utilization-focused model has two major limitations. First, system use is not always voluntary; system use was more about how a job was designed than the usefulness
of the systems or the attitude of the user (Goodhue and Thompson 1995). Second, more utilization does not necessarily lead to better performance. Increasing use of a
poor system may even produce negative impacts on organizations. TAM and TTF
have significant overlapping constructs. The dependent constructs of both models
are related to the actual use of IT, and the aims of both models concern understand-
ing users’ choices and evaluation of information systems.
Many users conveyed that it was not important how useful BI systems were but
whether BI systems could support completion of the decision tasks. Goodhue and
Thompson (1995) proposed the task technology fit model to explain how task per-
formance and technology performance interact with each other, and the individual’s
role in using technology to perform tasks. More importantly, the expression “deci-
sion task—BI system alignment” was repeatedly mentioned by interviewees at both
CIC and AG, even though the interviewer had not explicitly mentioned the phrase
or the concept of alignment during any interview session. The emergent construct,
Decision task—BI system alignment (TSA), refers to the degree to which a system
assists an individual in performing his or her portfolio of tasks. This definition is
adapted from Goodhue and Thompson’s (1995) task technology fit idea. Therefore,
replacing PU with TSA offers a more comprehensive way to explore, explain, and
predict UB in Chinese organizations. Importantly, using TSA, as a construct, over-
comes the limitations of utilization-focused and fit-focused models in this project.
Proposition 3. BI system characteristics affect level of perceived ease of
use. Perceived BI system ease of use (PEOU) refers to the degree to which the
actual user experiences the BI system to be free of effort, during the process of gath-
ering system requirements, developing and maintaining, learning and training, using,
and understanding and communicating. BI system development was not the focus of
this research and a simpler conception of PEOU emerged. Many users discussed BI
ease of use in terms of learning, training, understanding, and communicating. This is
due to the intuitive nature of the BI systems interfaces, as well as data and system
quality. Hence, the case studies indicate that NS affects levels of PEOU.
Proposition 4. Use behavior is affected by decision task—BI system alignment
and perceived ease of use. BI system use behavior (UB) refers to the actual hands-
on use of a BI system. Interview data helped to tease out the UB construct in terms
of user types, usage, and use satisfaction. Most users fell into the category of direct
users, that is, they had direct access to the BI systems. Assisted users are often
senior managers in the organization, and they do not personally access BI systems.
The demands of the required data analysis and inadequate technical skills were often the reasons for receiving assistance in using BI systems. There were far fewer assisted users than
direct users at both organizations, but all assisted users could have direct access to
BI systems if they wished.
For semi-structured and tactical decisions, users did not feel that they were ade-
quately supported by BI. Users encountered technical as well as business operation
issues. When users encountered technical issues, they contacted the person in charge of the particular BI system. When users encountered business operation issues, they
would communicate with BI team members in order to find data that would support
solving those problems. Other use issues included inconsistent data feeds among different systems and departments, duplicated functions across BI systems, decisions that required more advanced analytic support, and demands for better performance from the systems. All of these use issues were caused by lower levels
of TSA or lower levels of PEOU.
Usage was measured by the frequency and duration of sessions with the BI sys-
tems. All CIC interviewees, except five developers, reported daily usage of BI sys-
tems. Most AG interviewees reported daily use. Three AG managers revealed less
use frequency; the nature of their tasks did not require a significant amount of data.
Spending additional time using BI systems did not necessarily lead to higher quality information being discovered or sourced. Finally, most CIC and AG users reported medium to high satisfaction with BI systems use, in terms of sufficient data feeds,
adequately formatted reports, improved decision logic, and overall satisfaction.
Proposition 5. Factor X, a composite of trust, closeness, experience, and gen-
erations, affects decision task—BI system alignment and perceived ease of use.
Higher levels of FX lead to higher decision task—BI system alignment and
perceived ease of use. A number of concepts relating to Chinese CN emerged
from the analysis of the case study data and the literature review. This group of
concepts is best conceived as a multi-attribute construct. There is no obvious name
for this factor and to avoid biasing the analysis it was given a provisional title of
“Factor X” or FX. This is a similar neutral naming approach to the System 1/System
2 terminology of cognitive systems in behavioral economics (Kahneman 2011).
Guanxi may influence users’ decisions to use BI systems and affect their system
use patterns. Social cognitive theory (SCT) suggests that future behaviors are
shaped by past behavior and beliefs about ability and the environment. Applying
SCT to the post-adoption use of IS, users’ beliefs are more likely to be shaped by
repeatedly presented opportunity and the outcomes of using IS (Craig et al. 2010).
Unlike guanxi in business negotiations, guanxi inside an organization is bounded by
hierarchical positions and tasks. Workplace activities, such as collaboration, often initiate, maintain, and utilize a guanxi dyad. Chinese managers make decisions accord-
ing to specific circumstances that are determined by the people involved, the
occasion of the event, and place where the event takes place (Fu et al. 2006).
Relationships are based on the interaction between guanxi hu (two individuals in
one guanxi dyad), and trust is one of the common measures of guanxi quality. For
instance, people generally perceive a lower level of trust when they work for different organizations. For example, an AG manager assisted external researchers by providing them with data from AG; however, he explained that many researchers requested sensitive data that he could not offer due to security and ethical concerns. In this case TSA was low because the manager had lower trust in the external researchers, although NT could be supported by NS. Another AG manager reported a high level of trust in his subordinates when acquiring extra resources; this manager requested extra technical and analysis support from the BI team to assist his subordinates’ use of the BI system. This means that PEOU might have been low for the manager’s subordinates, but trust helped to increase PEOU although NS was constant. Trust is therefore important in investigating BI systems use.
The decision tasks in the case studies usually required at least two users to work
in a collaborative manner. This is not a common use pattern in Western organiza-
tions. Closeness is a concept that was introduced by multiple interviewees who
were responding to follow-up questions related to xinren (trust) in describing their
relationships with their colleagues. For example, an AG analyst’s superior was in
charge of all data products in an AG BU. Due to a shortage of resources, his superior
had to prioritize development tasks. The analyst asked his superior for help when he
needed his superior to promote a particular development task. The analyst declared
that his superior trusted him with development decisions because they had worked closely together for some years.
Often, subordinates reported their obedience to their supervisors. However, many interviewees believed that this obedience does not lead to a closer relationship or a higher
level of trust. Guanxi closeness comprises two components, trust and feelings, where
trust is more cognitive based and feeling is more affect based (Chen and Peng 2008).
A higher level of trust may motivate a closer relationship, while the degree of close-
ness may impact on the level of trust. One AG analyst described that he was close to
his immediate supervisor, so he trusted that his supervisor would provide additional
resources when required. In this research, closeness refers to the relational distance a person feels between themselves and another person. Therefore, closeness is important in
understanding and assessing the level of trust, and as a result has been added to FX.
None of the TAM or UTAUT articles explicitly define experience. Presumably
experience means the users’ experience with using a particular technology. A deci-
sion maker is born with a natural endowment, and by dint of practice, learning, and experience they develop that endowment into a mature skill (Simon 1960). This
means that, to some extent, decision-making skills can be learned through education
and work experience. The majority of employees from CIC and AG held bachelor’s degrees. AG had more employees with graduate degrees (two with PhDs), while CIC had more employees with diploma-level education or below. Some employees had studied for graduate degrees part-time while working. From the case analysis, the level of qualification did not clearly affect employees’ competence in completing assigned decision tasks. However, the fields of their degree majors and their work experience
did lead to different ways of looking at, and analyzing, data. For example, some
managers only considered BI as a data extraction tool, while others believed that BI
systems were essential in assisting and supporting their decision-making process. In
this research, experience consists of education and work experience. Both dimen-
sions of experience are critical to TSA and PEOU, and therefore UB.
The different generations in China have received considerably different educa-
tion. None of the interviewees from CIC and AG were Traditionalists or Boomers.
Generation X was the first generation to receive high quality education after the
Chinese education system recovered from the Cultural Revolution. Generation Y
was born when China opened trade with international companies in the 1980s.
Modern business IT was introduced around the same time. Generation Y was able
to be educated and adopt innovative technology whereas Generation X did not have
the same opportunities. Both CIC and AG have more Generation Y than Generation X employees.
In recent years, the Chinese economy has grown to be one of the largest and most
influential in the world. BI systems contribute to overall organization performance
and may lead to competitive advantage. However, no indigenous Chinese organiza-
tion has been investigated regarding BI systems use nor has any study considered
CN and BI use. This project attempts to fill this research gap.
The contribution to knowledge of the project is a set of five research propositions
based on empirical research of BI systems use in Chinese organizations. These
propositions introduce guanxi into BI use theories. This set of propositions is a
foundation for future BI research in China. For practitioners, these propositions
contribute to the understanding of Chinese management and decision support. This
improved understanding may help achieve more effective use of large-scale DSS,
and in turn lead to higher organization performance.
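For readers who want a compact summary, the relationships asserted by the five propositions can be written down as a small directed structure; the sketch below is only a reading aid based on our interpretation of the propositions, not the authors' formal model.

```python
# Directed relationships among the constructs, as proposed in Propositions 1-5.
# NT = nature of the decision task, NS = nature of the BI system,
# TSA = decision task-BI system alignment, PEOU = perceived ease of use,
# UB = use behavior, FX = Factor X (trust, closeness, experience, generations).
PROPOSED_LINKS = {
    "P1": [("NT", "NS")],                      # tasks determine what BI systems are built
    "P2": [("TSA", "UB")],                     # TSA (rather than PU) explains use
    "P3": [("NS", "PEOU")],                    # system characteristics shape ease of use
    "P4": [("TSA", "UB"), ("PEOU", "UB")],     # alignment and ease of use drive use behavior
    "P5": [("FX", "TSA"), ("FX", "PEOU")],     # Factor X raises alignment and ease of use
}

def influences_on(construct: str) -> set[str]:
    """Collect every construct proposed to influence the given construct."""
    return {src for links in PROPOSED_LINKS.values() for src, dst in links if dst == construct}

print(influences_on("UB"))    # {'TSA', 'PEOU'}
print(influences_on("PEOU"))  # {'NS', 'FX'}
```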
This research is subject to the normal limitations of exploratory research. The
BI use propositions were based on two case studies, and the selected organiza-
tions operated in quite specific industrial environments. It is not possible to gen-
eralize the research outcomes to all Chinese organizations. Another common
concern raised about case study research is a lack of objectivity in interviews (Eisenhardt 1989). To mitigate this concern, the project adopted the most rigorous case study methods and techniques possible.
Factor X is a complex, multi-attribute construct. It is an umbrella construct that repre-
sents a group of important concepts that have influence from, or on, guanxi. Analysis
suggested there are impacts from this group of concepts on TSA and PEOU. In
addition, these concepts have effects on each other. For example, the level of close-
ness may be interrelated with the level of trust. The case study data also suggested
that there may be a complex interaction between two groups of attributes (genera-
tion and experience, and closeness and trust). However, these complex interactions
remain ambiguous.
The next stage of this research project will have two aims. The first is to investi-
gate the detailed nature of FX. The second aim is to formulate a model of BI use in
Chinese organizations. It is planned to return to CIC and AG to conduct further
investigations.
Biographies
References
Anthony RN (1965) Planning and control systems: a framework for analysis. Harvard University,
Boston, MA
Bhattacherjee A, Lin CP (2015) A unified model of IT continuance: three complementary perspec-
tives and crossover effects. Eur J Inf Syst 24(4):364–373
Brislin RW (1970) Back-translation for cross-cultural research. J Cross-Cult Psychol 1(3):185–216
Chen XP, Peng S (2008) Guanxi dynamics: shifts in the closeness of ties between Chinese cowork-
ers. Manag Organ Rev 4(1):63–80
Craig K, Tams S, Clay P, Thatcher J (2010) Integrating trust in technology and computer self-efficacy within the post-adoption context: an empirical examination. In: Proceedings of the Americas Conference on Information Systems
Davis ELE (2005) Encyclopedia of contemporary Chinese culture. Routledge, New York, NY
Davis FD (1989) Perceived usefulness, perceived ease of use, and user acceptance of information
technology. MIS Q 13(3):319–340
Davis FD, Bagozzi RP, Warshaw PR (1989) User acceptance of computer technology: a compari-
son of two theoretical models. Manag Sci 35(8):982–1003
Sousa VD, Rojjanasrirat W (2011) Translation, adaptation and validation of instruments or scales
for use in cross-cultural health care research: a clear and user-friendly guideline. J Eval Clin
Pract 17(2):268–274
Taylor S, Todd PA (1995) Assessing IT usage: the role of prior experience. MIS Q 19(4):561–570
Venkatesh V, Morris MG, Gordon BD, Davis FD (2003) User acceptance of information technol-
ogy: toward a unified view. MIS Q 27(3):425–478
Venkatesh V, Thong JYL, Xu W (2012) Consumer acceptance and use of information technology:
extending the unified theory of acceptance and use of technology. MIS Q 36(1):157–178
Wang Y, Wang S, Fang Y, Chau PYK (2013) Store survival in online marketplace: an empirical
investigation. Decis Support Syst 56(12):482–493
Yin RK (2014) Case study research: design and method, 5th edn. Sage, Los Angeles, CA
Zhang DS, Lowry P, Zhou L, Fu X (2007) The impact of individualism-collectivism, social pres-
ence, and group diversity on group decision making under majority influence. J Manag Inf Sys
23(4):53–80
Zhou T (2011) Understanding mobile internet continuance usage from the perspectives of UTAUT
and flow. Inf Dev 27(3):207–218
Chapter 8
The Impact of Customer Reviews on Product
Innovation: Empirical Evidence in Mobile Apps
8.1 Introduction
A product innovation strategy is critical for firms to survive and prosper in a dynamic
business environment (Alegre and Chiva 2008; Holahan et al. 2014; Wales et al.
2013). Dunk (2011) defines product innovation as an innovation process that
conceives new and better products, which are unique or different in some ways from
existing products (Nakata and Sivakumar 1996). The ability of firms to develop inno-
vative products is key to their competitive advantages (Cankurtaran et al. 2013;
Jayaram et al. 2014). Evidence suggests that product innovation can assist firms to
enter an emerging industry and strengthen their competitiveness in the corresponding
market (Keupp et al. 2012; Kotabe et al. 2011). Therefore, product innovation is criti-
cally important to a firm’s performance (Prajogo and Ahmed 2006; Yao et al. 2013).
Existing innovation literature suggests several different channels of information
acquisition for the product innovation process (von Hippel 1998; O’Hern and
Rindfleisch 2008; Ramaswamy and Prahalad 2004). The traditional perspective
suggests that firms dominate the product innovation decisions (Porter 1980). It
views product innovation as a firm-centric activity, with most information flowing
one way from the firm to its customers (Ramaswamy and Prahalad 2004). When customers are treated as passive recipients of product innovation, firms have a very limited understanding of customers’ perceptions and opinions before the release of a new product. Firms target only the presumed “right” customers and cannot accurately capture their customers’ needs. Recent studies show that customers have become more
involved in product co-creation processes. For example, the 3M company took
advantage of identifying lead users before creating breakthrough products in order
to avoid a market decline (Von Hippel et al. 1999). In addition, Cohen et al. (2002)
show evidence that customers can offer useful ideas for new R&D projects and
contribute substantially to the improvement of existing R&D projects. However,
firms, still taking a dominant position in the innovation process, present their ideas
to customers and gather customers’ needs and feedback from only a small fraction
of customers (Di Gangi and Wasko 2009; von Hippel and Katz 2002). Firms tend to
be biased towards listening to their current customers, and even among these, to
their most important customers or those who speak the most (Sawhney et al. 2005).
Recent literature shows that the product innovation process is shifting from a firm-centric view to a customer-driven perspective. Because of the limitations of developing new knowledge internally, integrating and using external knowledge is critical. Most existing literature looks for external knowledge in other companies and alliances and finds evidence that sourcing this knowledge benefits a firm’s product innovation decisions, while other scholars identify customers as the most important source of information for new product development. Customer-oriented innovation literature also illuminates why and how this external knowledge is significant and potentially valuable. O’Hern and Rindfleisch (2008) consider users to be central and vital participants in the product innovation process. In particular, with the fast development and proliferation of online customer review communities, customers today willingly contribute and share their thoughts and opinions online. Zhang et al. (2013) show that innovative users share common interests and ideas in online communities. Because product innovation aims to provide higher quality products and greater benefits to users, customer-driven product innovation, enabled by Web 2.0 technologies, is receiving more and more attention from researchers and practitioners.
Online customer reviews have become an important new channel to acquire cus-
tomers’ feedback about product features and potential product defects (Abrahams
et al. 2013; Lee and Seo 2013; Mudambi and Schuff 2010). Existing studies show that
online customer reviews have a significant impact on other customers’ adoption decisions and firms’ sales performance due to the word-of-mouth (WOM) effect (Chen and
Xie 2008; Duan et al. 2008; Ghose and Ipeirotis 2011). In addition to being useful for
customers and marketing purposes, online customer reviews also contain useful
information for product development and improvement. Jin et al. (2015) find that
online customer reviews are an important information source for collecting customer
feedback and new requirements for product developers or designers. Some research-
ers have developed text analysis methods in order to extract and measure aggregated
customers’ preferences and feedback on product features (Decker and Trusov 2010;
Xiao et al. 2015). However, to the best of our knowledge, no existing literature shows empirical evidence of the impact of online customer reviews on product innovation decisions. Our study aims to find empirical evidence that online
product reviews can affect product developers’ innovation decisions.
The mobile app industry provides an ideal research context for our research question. First, unlike physical products or enterprise software products, mobile apps are updated much more frequently (Syer et al. 2013). This provides more observations of product innovation activity than traditionally developed products, which usually have a much longer development cycle. Second, mobile apps have a large user base that generates a large number of user reviews through online app stores. Our research can potentially help improve the communication
between app users and developers. Knowing the impact of user reviews on develop-
ers, the app users will be more motivated to contribute reviews. App developers can
quickly identify those reviews that have high impact on their innovation decisions.
Our study can also benefit app store providers such as the Google Play Store by mak-
ing a better ecosystem in support of product innovation for mobile apps.
This study contributes to literature in several ways. First, our research enriches
existing customer-driven product innovation literature. Prior studies suggest that
firms have to design the right toolkits in order to get users involved in the product
innovation process. Our study shows that product developers can learn from the
widely available online customer reviews without developing specialized tools.
Specifically, we reveal how online customer reviews can affect the product innova-
tion cycles. By analyzing online customer reviews, product developers can obtain a more complete view of customer feedback and feature requests than traditional collection methods, such as surveys, provide. The second contribution of our study is to complement the online customer reviews literature, which mainly shows the impact of online customer reviews on the perceptions and purchasing decisions of future customers (Chen and Xie 2008; Duan et al. 2008; Ghose and Ipeirotis 2011). Our study will be the first to empirically show the impact of online customer reviews on product developers and designers. The third major contribution lies in a deeper understanding of how innovation works in the emerging mobile app industry. According to Silvias (2014), the global mobile app market will reach $187 billion in 2017. Examining the innovation activities in this emerging industry will be economically significant.
From a managerial perspective, our study underscores the business strategy value
of online customer reviews that executives struggle to quantify. Our results indicate
that investments made in analyzing online customer reviews would pay off over time
in terms of better product quality and a higher customer retention rate. In addition,
based on marketing literature, it is important to understand customers’ needs so that
product managers can allocate resources to more productive and promising product
innovation activities.
The rest of the chapter is organized as follows. We first review the Elaboration
Likelihood Model, a persuasion theory that we use to guide our research design. We
then develop our research hypotheses followed by our empirical study. We provide
conclusions, discussions, and future directions at the end.
Existing literature indicates that online customer reviews provide important external
knowledge for product developers to identify new user requirements, detect product
defects, and incorporate user solutions (Abrahams et al. 2013; Lee and Seo 2013;
Mudambi and Schuff 2010). Therefore, online customer reviews have not only a
word-of-mouth (WOM) effect for fellow customers, but also an implicit persuasion
effect on product designers and developers.
The Elaboration Likelihood Model (ELM) has been commonly used to explain
how a message can possibly change the perception of the message recipient. The
theory suggests that a message recipient has a continuum of elaboration methods to
deal with persuasive messages (Tam and Ho 2005). The essence of elaboration pro-
cessing goes beyond simply focusing on comprehending the arguments embedded
in the text content of the received message. When a message recipient does not have
the motivation or ability to read and understand the arguments in a received mes-
sage, persuasion occurs through the peripheral route rather than the central route of argument quality, according to ELM. In most cases, however, both central and peripheral routes work collectively in shaping message recipients’ decisions.
The central route of persuasion requires a message recipient to carefully scruti-
nize the arguments in a received message; thus, the recipient’s cognitive effort in argument processing determines the message’s influence (Zhang 1996). Existing studies have
found that argument quality, such as information completeness and accuracy, has a
significant impact on the message recipient’s perception on information usefulness
and willingness to adopt the message (Sussman and Siegal 2003).
The peripheral route relies on simple cues that are content-irrelevant indicators
reflecting a recipient’s perception of the credibility of the message source (Chaiken
1980). ELM researchers find that source credibility becomes an important predictor
of the recipient’s attitude change especially when the recipient cannot comprehend
the arguments embedded in the received message (Petty et al. 1981). When a mes-
sage recipient cannot or is not willing to scrutinize the message arguments, he or she relies on peripheral cues instead.
(Research model: the central route cues, the amount of information (H1+) and review readability (H2-), together with the peripheral route cues, review valence (H3-) and review sentiment strength (H4-), are hypothesized to affect product innovation.)
In our research context, source credibility is difficult to assess because many user reviews are anonymous. As we discussed earlier, positive sentiment or mood conveyed through messages may positively influence the attitude of message recipients (Petty et al. 1993), i.e., the app developers in our context. In the rest of this section, we describe our research hypotheses in detail.
The amount of information directly influences the capability of the message recipi-
ent to scrutinize the arguments embedded in a message. When the amount of infor-
mation in a message is low, the recipient has few opportunities to elaborate because
the motivation to elaborate is low (Palmer and Griffith 1998). Previous literature
shows that the amount of information in product reviews, which is measured by the
review length, has a positive influence on customers’ adoption decisions due to the
WOM effect (Chen and Turut 2013; Duan et al. 2008). Mudambi and Schuff (2010)
also suggest that long reviews are perceived as being more useful than short ones
because readers consider the word count as the depth of information usefulness and
comprehensiveness.
A greater amount of information in app reviews presumably has more value to
app developers. Lee (2007) suggests that customer reviews reflect customer needs.
Some studies show that customer reviews contain critique about existing product
features and suggestions about new product features (Mudambi and Schuff 2010;
Troy et al. 2001). Customer feedback and suggestions on product features can help
app developers reduce the uncertainty in customers’ perception on new products,
increase their confidence in product innovation decisions, and increase the fre-
quency of product innovation (Dougherty and Dunne 2011; Zirger and Maidique
1990). Therefore, we propose:
Hypothesis 1: Mobile apps that have received a higher amount of information in their user reviews are more likely to have a new product release.
Review readability is another central route cue: text that is difficult to read requires readers to devote more time and effort to identify and extract relevant information.
By contrast, easy-to-read text improves the message recipient’s reading speed, com-
prehension, and memory retention (Ghose and Ipeirotis 2011). Existing studies
have shown that the readability of product reviews can be used to predict the useful-
ness and impact of the reviews (Ghose and Ipeirotis 2011). Similarly, we propose:
Hypothesis 2: Mobile apps that have received reviews with higher readability are
more likely to have a new product release.
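To make the two central route cues concrete, the sketch below computes review length (the amount-of-information proxy) and a Gunning Fog readability score for a single review; the syllable heuristic and regular expressions are our own simplifications for illustration, not the chapter's exact measurement procedure.

```python
import re

def _syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def central_route_cues(review: str) -> dict:
    """Return word count and Gunning Fog index for one review text."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    words = re.findall(r"[A-Za-z']+", review)
    complex_words = [w for w in words if _syllables(w) >= 3]
    n_words, n_sentences = len(words), max(1, len(sentences))
    fog = 0.4 * (n_words / n_sentences + 100.0 * len(complex_words) / max(1, n_words))
    return {"length": n_words, "fog_index": round(fog, 2)}

print(central_route_cues("Great app. The new update crashes constantly and drains my battery."))
```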
Review sentiment, which includes both sentiment valence and extremity, reflects the
subjectivity of the reviewers. Although it is derived from the message content, senti-
ment is often considered as a peripheral cue because it reveals the mood or affection
status of the author (Bardzil and Rosenberger 1996). Past research shows that review
sentiment can affect the perceived value or usefulness of product reviews. For
example, Schindler and Bickart (2012) find that a moderate proportion of positive
evaluative statements in product reviews positively relates to consumers’ perceived
helpfulness. Sen and Lerman (2007) find that review readers are more likely to con-
sider negative opinions as being helpful for utilitarian products. Cheung et al. (2012)
conclude that fair reviews are perceived more favorably when they cover both posi-
tive and negative aspects of the reviewed product. Existing studies do not provide
consistent conclusions for the impact of review sentiment because of the modera-
tion effects of product category and different message recipients. In our research,
we aim to study the effect of review sentiment on the perceived usefulness of prod-
uct reviews for app developers, not for fellow consumers. We use the SentiStrength
method proposed by Thelwall et al. (2010) to automatically identify and classify the
emotional information of customers. SentiStrength estimates the strength of posi-
tive and negative sentiment in informal short text messages using sentiment word
dictionaries. We consider negative app reviews to be the major concerns for app
developers due to the negative word-of-mouth (nWOM) effect, which has been shown to have both short-term and long-term effects on firms’ financial performance (Luo 2009). Moreover, Schindler and Bickart (2012) find that a product review with more descriptive statements is considered more helpful. Reviews with extreme sentiment do not add value and may decrease readers’ perceptions of their helpfulness. In terms of review sentiment strength, this suggests that more useful reviews lean toward neutral polarity. Therefore, we hypothesize the following:
Hypothesis 3: Mobile apps that show negative sentiment in their reviews are more
likely to have a new product release.
Hypothesis 4: Mobile apps with a lower review sentiment strength are more likely to
have a new product release.
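The chapter measures sentiment with SentiStrength (Thelwall et al. 2010), which is a standalone tool; as a rough stand-in, the sketch below uses NLTK's VADER lexicon to derive a review's valence and an approximate sentiment strength. This substitution is our assumption for illustration, not the authors' pipeline.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
_sia = SentimentIntensityAnalyzer()

def review_sentiment(review: str) -> dict:
    """Approximate valence (sign of the compound score) and strength (its magnitude)."""
    scores = _sia.polarity_scores(review)  # keys: 'neg', 'neu', 'pos', 'compound'
    return {
        "valence": "negative" if scores["compound"] < 0 else "positive",
        "strength": abs(scores["compound"]),
    }

print(review_sentiment("The latest update is terrible, it keeps crashing on my phone."))
```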
Our primary interest is to study the impact of app user reviews on product innova-
tion, i.e., the probability of having a new product release. More specifically, we
would like to know if app reviews might shorten the time to the next product release.
If we define an app update to be an event, we can use survival analysis to model this
“time to event” data. In addition, survival analysis is capable of incorporating
time-independent explanatory variables, which fits our scenario since our hypothe-
sized explanatory variables are time invariant. Moreover, mobile apps usually have
several updates over time. Therefore, events are recurrent and event time order mat-
ters (i.e., for each mobile app, an update at time t1 is different from that at time t2).
We choose the Stratified Cox Proportional Hazard (SCPH) model (Cox 1972) for
our empirical analysis, which makes no assumption about the form of the baseline
hazard function. The SCPH model does not depend on distributional assumptions of
survival time and defines the hazard ratio as the relative risk based on comparison
of event rates. Thus, we employ the SCPH model to examine the relative association
between the effects of independent variables (i.e., amount of information, review
readability and review sentiment) and a subsequent product release event.
The hazard function, h(t), represents the occurrence rate of a product release event per unit
time (t). We use T to denote the time to event. The hazard function has the following
form:
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}    (8.1)
The SCPH model assumes that the elapsed time to event T is conditional on the
independent variables (X1, X2, … , Xj). In our study, T measures the time between the
product launch date or the previous product update date and the date of the event of
interest—a new product release—or the end of the observation period. Thus, our
hazard ratio represents the “risk” of having a new product release within a time unit
(where the time is measured in days). The SCPH model is expressed as:
h_g(t, X) = h_{0g}(t) \times e^{\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_j X_j}    (8.2)
where β1, β2, ⋯, βj is the vector of regression parameters to be estimated. The baseline hazard function h0g(t) corresponds to the case where all Xj = 0; it involves time but not the independent variables and is estimated separately for each stratum g = 1, ⋯, k∗. The second component is the exponential function of the sum of the βjXj terms, which involves the independent variables’ effects but not time.
The exponentiated coefficient can be interpreted as the hazard ratio associated with a one-unit increase in a single covariate Xi, holding the other covariates fixed:

\frac{h_g(t, X_i + 1 \mid X_j,\, j \ne i)}{h_g(t, X_i \mid X_j,\, j \ne i)} = e^{\beta_i}    (8.3)
8.4.2 Data
As of Q1 2015, the Google Play store was the largest mobile app provider in terms of the number of app downloads, surpassing the first mobile app store, the Apple App Store. We collected data for 1215 mobile apps from the Google Play store
between April 8, 2014 and April 5, 2015. All apps have appeared at least once in
the top charts of new game apps. Because our study is focused on text analysis,
we dropped 63 apps that did not receive any user reviews or had only empty reviews
during the data collection period. The final dataset contains 281,202 customer
reviews for 1152 mobile apps. Data collected include basic app attributes, the
business model (free or paid apps), app release/update dates, user ratings, user
reviews, and developer information. Table 8.1 shows the basic summary statistics
for collected data.
8.4.3 Variables
The dependent variable. We retrieved and analyzed the reviews that app users
posted for each mobile app before the app’s next update. We used the variable
Hazard Rate to represent whether the app update event happened and how long the update interval was (i.e., the time between the previous update, or the initial release date, and the next update date). If a mobile app had not yet been updated, Hazard Rate represents the (instantaneous) rate at which an update occurs in the next instant of time. Some mobile apps were not updated at all during the observation
period. Therefore, the right censoring problem occurs in our data set. To solve the
problem, we used an event variable to indicate whether an observation is censored
(i.e., event is 1 for a complete observation and 0 for a censored one).
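A minimal sketch of estimating a stratified Cox model of this kind with the open-source lifelines package is shown below; the column names, the strata choice (update order), and the simulated data are illustrative assumptions rather than the authors' exact specification.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulated toy data: one row per app update cycle (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "review_length": rng.poisson(60, n),         # average words per review
    "fog_index": rng.normal(9, 2, n),            # Gunning Fog readability
    "sentiment_strength": rng.uniform(0, 1, n),  # absolute sentiment strength
    "update_order": rng.integers(1, 4, n),       # stratum: 1st, 2nd, or 3rd update
})
df["time_to_update"] = rng.exponential(40, n).round() + 1  # days until the next release (T)
df["event"] = rng.binomial(1, 0.9, n)                      # 1 = update observed, 0 = right-censored

cph = CoxPHFitter()
cph.fit(df, duration_col="time_to_update", event_col="event", strata=["update_order"])
cph.print_summary()  # the exp(coef) column is the hazard ratio, as in Eq. (8.3)
```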
8.4.4 Results
Table 8.2 shows the descriptive statistics and pairwise correlations of our variables.
In our dataset, 97% of mobile apps had released at least one update during the observa-
tion period. For those mobile apps where at least one update occurred, the average
Time to Update is 39.3 days. Fifty percent of mobile app update cycles received user
reviews before an update occurred. All pairwise correlations between independent
variables are below 0.5 except the correlation between review valence and review
sentiment strength. We also checked the variance inflation factor (VIF) values for all
independent variables in our model. The result indicated that multicollinearity was
not a concern (Zhang et al. 2013).
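The multicollinearity check mentioned above is commonly run with variance inflation factors; a sketch using statsmodels is given below, with column names that are hypothetical stand-ins for the independent variables.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor for each independent variable (constant added)."""
    Xc = sm.add_constant(X)
    vifs = {col: variance_inflation_factor(Xc.values, i) for i, col in enumerate(Xc.columns)}
    vifs.pop("const", None)  # the intercept's VIF is not informative here
    return pd.Series(vifs)

# Usage (column names are hypothetical):
# print(vif_table(df[["review_length", "fog_index", "review_valence", "sentiment_strength"]]))
```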
Table 8.3 presents the estimates of our research model. Review length was found to
positively influence the likelihood that a future mobile app update would occur (β =
0.0083, p < 0.01). This suggests that mobile apps receiving longer user reviews on
average are more likely to receive a new update. However, we found that the total
number of words was negatively related to the possibility of having a future mobile
app update. Hypothesis 1 is therefore only partially supported, and we plan to conduct further analysis on this hypothesis in future work.
The Fog index, used to indicate review readability, was found to be negatively
associated with a future app update (β = −0.038, p < 0.01). Mobile apps that receive
user reviews with a high Fog index (i.e., more difficult to read) are less likely to
receive a new update. The observation supports Hypothesis 2.
As predicted by Hypothesis 3, mobile apps that had received positive user
reviews were less likely to receive a new update (β = −0.10, p < 0.1). Mobile apps
that have received user reviews with extreme sentiment were less likely to receive a
future update (β = −0.084, p < 0.05). Therefore, Hypothesis 4 is also supported.
User generated product reviews have been found to have a word-of-mouth effect as
a new element of marketing communication. However, their implications for product innovation cycles have not been studied before. Guided by the ELM
persuasion theory, we examined the central and peripheral cues of online mobile
app reviews and their impact on app developers’ product innovation decisions. Our
empirical study shows that easy-to-read user reviews with a high average review length and mildly negative sentiment can increase the likelihood of a future app
update. Our findings highlight the need for researchers to explore user generated
reviews in the context of customer-centered product innovation.
Our work has theoretical and practical implications. First, our research enriches
customer-centered product innovation literature and is the first paper to empirically
examine the impact of online product reviews on product innovation in the mobile
app industry. Second, our research can benefit different stakeholders in the mobile
app industry. For customers, our research encourages them to continue contributing
reviews because those reviews do matter in getting better products in return.
Moreover, our study provides specific guidelines for writing online product reviews
that can be better perceived by app developers. For app developers, our approach can be used to automatically process user reviews and extract useful information for their product innovation processes. Lastly, our study can also benefit mobile app
platform providers such as the Google Play Store and Apple App Store by promot-
ing useful user reviews and making a better ecosystem for product innovation.
Our work also has several limitations. First, our data set may contain mobile apps that were updated only once during our observation period, which may introduce anomalies into our analysis. Second, our model can be improved by including important control variables such as mobile app rank, app category, and app tenure. Third, our findings apply only to the data set that we collected; additional analysis on other data sets would be needed to assess their generalizability.
Biographies
Weiguo Fan is a Full Professor of Accounting and Information Systems and Full
Professor of Computer Science (courtesy) at Virginia Tech. He is also a L. Mahlon
Harrell Research Fellow. He received his Ph.D. in Business Administration from the
Ross School of Business, University of Michigan, Ann Arbor, in 2002, a M.Sc. in
Computer Science from the National University of Singapore in 1997, and a B. E. in
Information and Control Engineering from the Xi’an Jiaotong University, P.R. China,
in 1995. His research interests focus on the design and development of novel infor-
mation technologies—information retrieval, data mining, text/web mining, business
intelligence techniques—to support better business information management and
decision making. He has published more than 140 refereed journal and conference
papers. His research has appeared in journals such as Information Systems Research,
Journal of Management Information Systems, Production and Operations
Management, IEEE Transactions on Knowledge and Data Engineering, Information
Systems, Communications of the ACM, Journal of the American Society on
Information Science and Technology, Information Processing and Management,
Decision Support Systems, ACM Transactions on Internet Technology, Pattern
Recognition, IEEE Intelligent Systems, Pattern Recognition Letters, International
Journal of e-Collaboration, and International Journal of Electronic Business. His
research has been cited more than 3600 times (h-index: 33, i10-index: 72) according to Google Scholar. His research has been funded by five NSF grants, one PWC grant, and one
KPMG grant.
References
Abrahams AS, Jiao J, Fan W, Wang GA, Zhang Z (2013) What’s buzzing in the blizzard of buzz?
Automotive component isolation in social media postings. Decis Support Syst 55(4):871–882
Dunk AS (2011) Product innovation, budgetary control, and the financial performance of firms. Br Account Rev 43(2):102–111
Alegre J, Chiva R (2008) Assessing the impact of organizational learning capability on product
innovation performance: an empirical test. Technovation 28(6):315–326
Bardzil J, Rosenberger P III (1996) Atmosphere: does it provide central or peripheral cues. Asia
Pacific Adv Consum Res 2:73–79
Batra R, Stayman DM (1990) The role of mood in advertising effectiveness. J Consum Res
17:203–214
Bloomfield RJ (2002) The ‘incomplete revelation hypothesis’ and financial reporting. Account
Horiz 16(3):233–243
Cankurtaran P, Langerak F, Griffin A (2013) Consequences of new product development speed: a
meta-analysis. J Prod Innovat Manag 30(3):465–486
Chaiken S (1980) Heuristic versus systematic information processing and the use of source versus
message cues in persuasion. J Pers Soc Psychol 39(5):752
Chen Y, Turut Ö (2013) Context-dependent preferences and innovation strategy. Manage Sci
59(12):2747–2765
Chen Y, Xie J (2008) Online consumer review: word-of-mouth as a new element of marketing
communication mix. Manage Sci 54(3):477–491
Nakata C, Sivakumar K (1996) National culture and new product development: an integrative review. J Mark 60(1):61
Cheung MY, Sia C-L, Kuan KKY (2012) Is this review believable? A study of factors affect-
ing the credibility of online consumer reviews from an ELM perspective. J Assoc Inf Syst
13(8):618–635
Cohen WM, Nelson RR, Walsh JP (2002) Links and impacts: the influence of public research on
industrial R&D. Manage Sci 48(1):1–23
Cox DR (1972) Regression models and life-tables. J R Stat Soc B Methodol 34:187–220
Decker R, Trusov M (2010) Estimating aggregate consumer preference from online product
reviews. Int J Res Mark 27(4):293–307
Di Gangi PM, Wasko M (2009) Steal my idea! Organizational adoption of user innovations from a
user innovation community: a case study of Dell IdeaStorm. Decis Support Syst 48(1):303–312
Dougherty D, Dunne DD (2011) Organizing ecologies of complex innovation. Organ Sci
22(5):1214–1223
Schindler RM, Bickart B (2012) Perceived helpfulness of online consumer reviews: the role of
message content and style. J Consum Behaviour 11(3):234–243
Sen S, Lerman D (2007) Why are you telling me this? An examination into negative consumer
reviews on the web. J Interact Mark 21(4):76–94
Silvias (2014) Mobile Load: Performance Testing for Mobile Applications. https://community.
saas.hpe.com/t5/LoadRunner-and-Performance/Mobile-Load-Performance-Testing-for-
Mobile-Applications/ba-p/273396#.WUjnXzOB000
Stewart DW, Zhao Q (2000) Internet marketing, business models, and public policy. J Public
Policy Mark 19(2):287–296
Sussman SW, Siegal WS (2003) Informational influence in organizations: an integrated approach
to knowledge adoption. Inf Syst Res 14(1):47–65
Syer MD, Nagappan M, Hassan A E, Adams B (2013) Revisiting prior empirical findings for
mobile apps: an empirical case study on the 15 most popular open-source android apps. In:
Proceedings of the 2013 conference of the Center for Advanced Studies on Collaborative
Research (CASCON), pp. 283–297
Tam KY, Ho SY (2005) Web personalization as a persuasion strategy: an elaboration likelihood
model perspective. Inf Syst Res 16(3):271–291
Thelwall M, Buckley K, Paltoglou G, Cai D, Kappas A (2010) Sentiment strength detection in
short informal text. J Am Soc Inf Sci Technol 61(12):2544–2558
Troy LC, Szymanski DM, Varadarajan PR (2001) Generating new product ideas: an initial inves-
tigation of the role of market Information and organizational characteristics. J Acad Mark Sci
29(1):89–101
Von Hippel E (1998) Economics of product development by users: the impact of ‘sticky’ local
information. Manage Sci 44(5):629–644
Von Hippel E, Katz R (2002) Shifting innovation to users via toolkits. Manage Sci 48(7):821–833
Von Hippel E, Thomke S, Sonnack M (1999) Creating breakthroughs at 3M. Harv Bus Rev
77:47–57
Wales WJ, Parida V, Patel PC (2013) Too much of a good thing? Absorptive capacity, firm per-
formance, and the moderating role of entrepreneurial orientation. Strat Manag J 34:622–633
Wu C, Shaffer DR (1987) Susceptibility to persuasive appeals as a function of source credibility
and prior experience with the attitude object. J Pers Soc Psychol 52(4):677
Xiao S, Wei C-P, Dong M (2015) Crowd intelligence: analyzing online product reviews for prefer-
ence measurement. Inf Manag 53(2):169–182
Yao Z, Yang Z, Fisher GJ, Ma C, Fang EE (2013) Knowledge complementarity, knowledge absorp-
tion effectiveness, and new product performance: the exploration of international joint ventures
in China. Int Bus Rev 22(1):216–227
Zhang Y (1996) Responses to humorous advertising: the moderating effect of need for cognition.
J Advert 25(1):15–32
Zhang C, Hahn J, De P (2013) Continued participation in online innovation communities: does
community response matter equally for everyone? Inf Syst Res 24(4):1112–1130
Zhou M, Du Q, Fan W, Qiao Z, Wang G, Zhang X (2015) Money talks: a predictive model on
crowdfunding success using project description. In: Twenty-first Americas conference on
information systems, Puerto Rico
Zirger BJ, Maidique MA (1990) A model of new product development: an empirical test. Manag
Sci 36(7):867–883
Chapter 9
Whispering on Social Media
Juheng Zhang
Abstract Using Twitter as the primary social media platform, we study the predic-
tive relationship of social media buzz in quiet periods and the IPO’s first-day return,
liquidity, and volatility. We compare social media buzz with conventional press
news coverage and show that social media buzz is a stronger predictor of first-day
returns than conventional press news.
9.1 Introduction
With the ease and speed of the Internet, individual investors, also called retail or
unsophisticated investors, have been attracted to financial markets. New technolo-
gies have streamlined the procedures to trade stocks and have provided investors
with an improved information environment. Individuals can access information
about companies online much more easily now than they could in the era when the
mainstream communication channels were television and radio broadcasting only.
Current information technologies also provide individuals with social media
platforms (e.g., Twitter, Facebook) that enable them to be socially connected and to
exchange opinions and information. Social media as a means of disseminating
information differ from traditional media in several ways: cost, speed, impact, and
reach. For instance, tweets can reach a large number of people almost immediately
and at a negligible cost. Social media have become a major platform for individuals
to access news and for investors to learn about investing opportunities. Social media
is changing how people find and interpret news, and has been identified as the most
powerful outlet of information (Greenslade 2014). Bennett (2013) found that 73% of online adults are active on at least one social network. Through social
media, people can get the latest corporate news, market trends, investment
information, etc. They can also use social media to discuss stocks and the markets
with other investors and to research information about companies and brokers.
J. Zhang (*)
Department of Operations and Information Systems, University of Massachusetts Lowell,
One University Ave., Lowell, MA 01854, USA
e-mail: [email protected]
Research related to our study is found in the IPO literature. We refer readers to the
review paper (Ritter and Welch 2002), which overviews various theories and rea-
sons for going public and includes a detailed discussion of the two stylized char-
acteristics of IPOs: first-day return (underpricing) and long-run underperformance,
both of which are related to the behaviors of retail investors. Liu et al. (2014) study
media coverage of IPOs. They examine conventional media coverage (press news)
prior to an IPO to predict the IPO firm’s long-run liquidity and its following analysts
and institutional investors. They measure the pre-IPO news coverage of a company
by the number of articles mentioning the company name during the 30 days prior to
the IPO date, and find a positive correlation between the pre-IPO press coverage, the
firm’s long-run following analysts, institutional investors, and the stock’s liquidity.
Da et al. (2011) use Google search trend as the index of retail investors’ attention in
predicting stock returns. Our paper differs from these studies in that we focus on
social media content rather than on conventional press news and in that we use
social media content to study both the IPO’s first-day performance and its long-term
underperformance rather than the long-term attention from analysts and institu-
tional investors.
In the information systems (IS) literature, social media content has been exam-
ined (e.g., Bollen et al. 2011; Luo et al. 2013; Zhang 2014, 2015a, b; Zhang et al.
2015). Bollen et al. (2011) use the mood of daily Twitter feeds to predict the Dow
Jones Industrial Average (DJIA) over time. Two mood-tracking tools, OpinionFinder
and Google Profile of Mood States, are used to detect the mood of daily tweets.
They find that the inclusion of public mood can improve the ability to predict stock
prices. Luo et al. (2013) use sentiment analysis to study the impact of the senti-
ment of social media content towards a company on the company’s equity value.
These IS studies focus on whether a message about a company is positive or nega-
tive and whether the sentiment affects the firm’s stock price. They use sentiment
analysis to study the impact of positive and negative social media content on firm
performance.
Cook et al. (2006) suggest that 99% of the news about IPOs is non-negative, so
we use the simple count of tweets mentioning a stock as the measure of the volume
of social media buzz about the stock, and then study the impact of that buzz amount
on the IPO’s first-day return, liquidity, and volatility. As for the content of tweets,
informedness and consensus measure the value of an information release (Holthausen and Verrecchia 1990). We capture the informedness and consensus of each tweet jointly through its numbers of favorites and retweets, and study their impact on IPOs' initial returns.
We used three categories of media content in the quiet period—press news, business
tweets, and microbloggers’ tweets—to study the first-day return, liquidity, and vola-
tility. Individual investors are often influenced by press news about IPOs or recom-
mendations available on social media platforms. The volume of social media buzz
accumulated during the quiet period may have an impact on the IPO's first-day return, trading volume, and bid-ask spread, and it may even lead to unjustified upward price pressure on IPOs. Hence, we formulate our hypothesis as follows.
Hypothesis 1: Social media buzz about an IPO during the quiet period is positively correlated with the IPO's first-day return, volatility, and liquidity.
Twitter users read and evaluate feeds posted on Twitter, and the number of retweets and favorites indicates the agreement among users on the information contained in a tweet and the degree of its information value. A Twitter user who retweets or favorites a tweet typically finds that the tweet contains useful information or could be helpful to others, and also agrees, at least to some extent, with the information it conveys. We therefore use the number of retweets and favorites of a tweet to jointly capture the informedness and consensus of the tweet. The informedness and consensus of the tweets about IPOs during the quiet period may influence investors' investment decisions and further amplify the impact of the tweets on IPOs' initial return, volatility, and liquidity. Therefore, we conjecture the following hypothesis:
Hypothesis 2: The information value of tweets about an IPO during the quiet period is positively correlated with the IPO's first-day return, volatility, and liquidity.
We began with the list of IPOs in the year 2014 and used them as our company sample. We downloaded the offer prices, open prices, and closing prices of the IPOs from IPOScoop.com. We retrieved the IPO companies' pre-IPO accounting figures from the Compustat database. Around twenty of the companies have missing data. We provide the basic statistics of the IPOs in Table 9.1.
We downloaded tweets by searching either ticker symbol or company name of
each IPO. Twitter users use variant company names when mentioning a company:
for example, “Alibaba Group Holding Ltd.,” “Alibaba Group Holding,” or “Alibaba”
for Alibaba company. We allowed variations of a company name when searching in
Twitter, but refined the search keywords and removed noisy or ambiguous ones: for
example, “King” for King Digital Entertainment PLC. We downloaded the tweets
along with each tweet’s posted date, number of times being retweeted, times being
“favorited,” and user account name.
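For illustration only, the keyword refinement described above could be implemented roughly as sketched below; the file names, column names, and variant list are assumptions made for this sketch, not artifacts of the original study.

```python
# Hypothetical sketch: filter a raw tweet dump by cashtags and curated
# company-name variants while excluding ambiguous keywords such as "King".
import re
import pandas as pd

name_variants = {
    "BABA": ["Alibaba Group Holding Ltd", "Alibaba Group Holding", "Alibaba"],
    # King Digital Entertainment PLC: the bare keyword "King" is too ambiguous and is dropped.
}

tweets = pd.read_csv("raw_tweets.csv")  # assumed dump of downloaded search results

for ticker, variants in name_variants.items():
    keywords = variants + ["$" + ticker]  # cashtag plus accepted name variants
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    matches = tweets[tweets["text"].str.contains(pattern, na=False)]
    matches.to_csv(f"tweets_{ticker}.csv", index=False)
```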
We checked the account names of downloaded tweets and found that the tweets
were mainly posted by individual Twitter users. We separated those tweets from the
ones posted by the companies themselves. We refer to the tweets by individual Twitter users as “microbloggers' tweets” and to the tweets from businesses as “businesses' tweets” in our analysis. Following the IPO literature (e.g., Liu et al. 2014), we
defined the 30 days prior to the IPO as the quiet period. We then used the volume of
tweets in the quiet period to study the effect of social media buzz on an IPO’s first-
day performance.
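A minimal sketch of this aggregation step is given below; it is not the author's code, and the file and column names ('ticker', 'tweet_date', 'ipo_date', 'retweets', 'favorites') are assumptions.

```python
# Sketch of the quiet-period aggregation: per IPO, count tweets posted in the
# 30 days before the IPO date and sum retweets and favorites as a joint
# informedness/consensus proxy.
import pandas as pd

tweets = pd.read_csv("tweets.csv", parse_dates=["tweet_date"])  # assumed input
ipos = pd.read_csv("ipos.csv", parse_dates=["ipo_date"])        # assumed input

df = tweets.merge(ipos[["ticker", "ipo_date"]], on="ticker")
df["tweet_age"] = (df["tweet_date"] - df["ipo_date"]).dt.days

quiet = df[(df["tweet_age"] >= -30) & (df["tweet_age"] < 0)]    # 30-day quiet period

buzz = quiet.groupby("ticker").agg(
    n_tweets=("tweet_date", "size"),   # volume of social media buzz
    retweets=("retweets", "sum"),
    favorites=("favorites", "sum"),
)
buzz["info_value"] = buzz["retweets"] + buzz["favorites"]       # joint proxy
```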
We plot the number of downloaded tweets mentioning IPO ticker symbols in
Fig. 9.1. The tweet age is the number of days between the tweet posted date and the
tweeted stock's IPO date. Figure 9.1 shows that users start to tweet actively about the IPOs around 30 days before the IPO date. The volume of tweets peaks on the IPO day, with 15,000 tweets mentioning the sampled IPOs in total. After the IPO date, the stocks are tweeted less frequently than on the IPO date but more frequently than during the quiet period, with around 1000 tweets per day about the IPOs in total.
We collected detailed information about the IPO companies' Twitter accounts. In Table 9.2, we provide the statistics for the Twitter account information. 118 of the 288 IPO sample companies have a Twitter account.
In addition to analyzing the collected Twitter data, we considered the news cov-
erage in the traditional media. We downloaded the articles in the press about IPOs
from the Lexis Nexis database. We used not only stock ticker symbols but also the
variant company names as search keywords to find the IPO news of the companies
during the period from January 1, 2013, to December 31, 2014.
We include the prediction variables for the first-day returns as suggested in the IPO
literature (e.g., Da et al. 2011; Kim and Ritter 1999) to conduct the analysis of vari-
ance. The log of a company’s total assets, the reputation of underwriters, and market
sentiment are included. We also use the variables of media buzz, including the num-
ber of news articles, the log of the number of microbloggers’ tweets (including
retweets and favorites), the log of the number of company’s followers, and the age
of the Twitter account. Table 9.4 shows the analysis of variance results using stepwise selection of variables. The adjusted R-square of the prediction model is 0.27.
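For illustration, a regression of this form could be fitted as follows; variable names are assumptions, and the stepwise selection reported in Table 9.4 is replaced here by a single OLS fit.

```python
# Sketch of the first-day return model; 'ipo_media_merged.csv' is an assumed
# file that merges the IPO, accounting, and media-buzz variables described above.
import pandas as pd
import statsmodels.formula.api as smf

ipo_df = pd.read_csv("ipo_media_merged.csv")  # assumed merged per-IPO data set

model = smf.ols(
    "first_day_return ~ log_total_assets + underwriter_reputation"
    " + market_sentiment + n_news_articles + log_micro_tweets"
    " + log_followers + twitter_account_age",
    data=ipo_df,
).fit()
print(model.summary())  # the chapter reports an adjusted R-square of 0.27
```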
To determine if first-day return is related to long-run underperformance, we looked at the cumulative return over the following periods: 30 days, 45 days, 1 year, and the 1-year window beginning 30 days after the IPO. Table 9.5 lists the correlation coefficients and t-values. As shown in Table 9.5, price reversion is observed as early as 30 days after the IPO. The first-day return is significantly negatively correlated with the 30-day return, with a coefficient of −0.19: a 1% gain in the stock's first-day return is, on average, reversed by a 0.19% loss by the 30th day. The first-day return is also negatively correlated with the cumulative returns over the 45-day period, the 1-year period, and the 1-year-after-30-days period. The finding is consistent with IPO underpricing and long-run underperformance documented in existing studies (e.g., Loughran and Ritter 2004; Ritter and Welch 2002).
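A sketch of this correlation check, using the per-IPO data frame assumed in the previous sketch and assumed column names for the cumulative-return horizons:

```python
# Pearson correlations between first-day return and subsequent cumulative
# returns, plus the implied t-statistics.
import numpy as np
from scipy import stats

def corr_with_t(x, y):
    r, p = stats.pearsonr(x, y)
    t = r * np.sqrt((len(x) - 2) / (1 - r ** 2))
    return r, t, p

for horizon in ["ret_30d", "ret_45d", "ret_1y", "ret_1y_after_30d"]:  # assumed columns
    r, t, p = corr_with_t(ipo_df["first_day_return"], ipo_df[horizon])
    print(f"{horizon}: r = {r:.2f}, t = {t:.2f}, p = {p:.3f}")
```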
9.6 Conclusion
We find that the number of tweets posted by Twitter users about an IPO and the number of times those tweets are retweeted or favorited during the quiet period are significantly positively correlated with the IPO's first-day return, trading volume, and bid-ask spread. The findings suggest that social media buzz in the quiet period is significantly correlated with an IPO's stock performance on the first day.
Biography
References
Bennett S (2013) 73% of online adults now use social media. http://www.mediabistro.com/alltwit-
ter/pew-social-study_b53501. Accessed 30 Dec 2013
Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2:1–9
Cook DO, Kieschnick R, Van Ness RA (2006) On the marketing of IPOs. J Financ Econ 82(1):35–61
Da Z, Engelberg J, Gao P (2011) In search of attention. J Financ 66(5):1461–1499
Granovetter M (1985) Economic action and social structure: the problem of embeddedness. Am
J Sociol 91(3):481–510
Greenslade R (2014) More digital disruption ahead for mainstream news groups, says survey.
http://www.theguardian.com/media/greenslade/2014/jun/12/digital-media-social-media.
Accessed 11 June 2014
Holthausen RW, Verrecchia RE (1990) The effect of informedness and consensus on price and
volume behavior. Account Rev 65(1):191–208
Kim M, Ritter JR (1999) Valuing IPOs. J Financ Econ 53(3):409–437
Liu LX, Sherman AE, Zhang Y (2014) The long-run role of the media: evidence from initial public
offerings. Manag Sci 60(8):1945–1964
Loughran T, Ritter JR (2004) Why has IPO underpricing changed over time? Financ Manag 33(3):5–37
Luo X, Zhang J, Duan W (2013) Social media and firm equity value. Inf Syst Res 24(1):146–163
Newkirk RG (1991) Sufficient efficiency: fraud on the market in the initial public offering context.
Univ Chicago Law Rev 58(4):1393–1422
Ritter JR, Welch I (2002) A review of IPO activity, pricing, and allocations. J Financ 57(4):1795–1828
Zhang J (2014) Information revelation and social learning. Int J Bus Soc Sci 5(2):115–125
Zhang J (2015a) Ensuring trust online through the wisdom of crowd. J Internet e-Bus Stud 2015:886172
Zhang J (2015b) Voluntary information disclosure on social media. Decis Support Syst 73(2015):28–36
Zhang XM, Zhang L (2015) How does the internet affect the financial market? An equilibrium
model of internet-facilitated feedback. MIS Q 39(1):17–37
Zhang J, Khan RM, Shih D (2015) The rating determinants factored in decision-making for hotel
selection. Int J Appl Manag Technol 14(1):1–20
Chapter 10
Does Social Media Reflect Metropolitan
Attractiveness? Behavioral Information
from Twitter Activity in Urban Areas
Abstract The rapid and ongoing evolution of mobile devices has made online handhelds increasingly ubiquitous, boosting the recent growth of social platforms. This development enables an enormous number of individuals to participate in social media independently of time and location. When navigating through a city, and especially when engaging in activities worth sharing with others, people leave traces in both the geographical and the temporal dimension. Using these traces to spot popular areas in a metropolitan region is valuable to a broad variety of applications, ranging from city planning to venue recommendation and investment. We propose a density-based method to determine the attractiveness of areas based solely on spatial and content characteristics of Twitter activity. Furthermore, we show the relation of attached images, videos, or linked places to the activity users are engaged in and assess the explanatory power of Twitter messages in a geographical context.
10.1 Introduction
In recent years, the rapid evolution and development of mobile devices have heavily
boosted the growth of social web services and their users’ online activity. For
instance, in March 2013, Flickr users uploaded more than 3.5 million new images
per day (Jeffries 2013); during December 2013, Facebook had 757 million daily
active users worldwide on average with 556 million daily active users accessing the
service from their mobile devices (Facebook Inc. 2013). On an average day, more
than 500 million Twitter messages are sent (Twitter Inc. 2014), 60 million photo-
graphs are shared on Instagram collecting 1.6 billion likes (Instagram 2014) and
several million people share their locations by checking in on Foursquare (2014).
The ubiquity of smartphones and other devices that enable mobile internet access
facilitates the access to social networks independent of time and location. Wherever
users are, they are able to share their mood, feelings, activities, photographs, and
places. Not simply limited to posting statuses, a wide variety of online platforms are
present where people can additionally rate the place they are visiting. Recommen
dations and reviews from users cover any imaginable location, ranging from restau-
rants and bars, stores and malls to even beaches and other public areas, as well as
companies. For instance, Foursquare is an online service that allows the sharing and rating of arbitrary venues and collects several million user check-ins per day, aggregating an enormous crowd-sourced recommendation database. With all this available geo-referenced data (e.g. tagged photographs or recommended venues), trips to foreign cities have fundamentally changed. When planning a trip, people can rely on hotels others have rated and commented on before. When looking for a round-trip, visitors can check tagged and rated photographs and interesting spots, ranging from unappealing areas that should be left out to must-see locations. One could therefore assume that visiting a foreign city is becoming less exciting because there is nothing left to be explored. Rather, travelling seems to require less preparation thanks to the support of the social web and mobile devices, and it offers new opportunities, such as spontaneously looking up recommendations during the stay. The amount of data generated by users of social services all over the world underlines that references and recommendations from others are key aspects that make information valuable.
One major drawback of social media data with respect to analyzability is that users are aware of what they are posting. The disadvantage of this fact is twofold. On the one hand, people are able to manipulate comments, rankings, and recommendations, whether maliciously or not, and thus can considerably skew the results, possibly even rendering the entire analysis useless. On the other hand, a noticeable dominance of extremes can be observed in online recommendations of arbitrary entities: people tend to comment and rate either when they are positively excited or when they are truly disappointed. Thus, recommendations and comments would be distinctly more reliable if users posted independently of their current
mood and so had no opportunity to maliciously impact ratings on the social web. We
consider the behavior of users who are not aware of what traces they leave behind
to be more honest than publicly posted opinions where users know that they might
have an impact on others’ decisions.
In this research, we aim to explore the possibilities of identifying places of inter-
est in urban environments based on user-generated mass data that is more robust and
reliable than pure recommendations. Complete navigation tracking is impossible because people need the ability to opt out if they do not want their geographical location to be disclosed. Thus, we focus on Twitter as the data source of choice. People use Twitter to show where they are, but also to share feelings, thoughts, and desires with friends and followers. From earlier findings, we have evidence that people use Twitter on an irregular basis, but that status updates are particularly concentrated around places where there is something worth sharing, whether it is interesting, boring, exciting, or disgusting. Analyzing a social activity stream that comes with geo-spatial and temporal information is a promising option for collecting robust data on urban hot spots, even though it may originally have been designed for a purpose other than pure location sharing. From these preliminary
thoughts, we derive our main research questions:
• Can general social media activity in an urban environment be utilized to spot
socially attractive places?
• How does social media blend into already available criteria (e.g. natural or
historical factors) to determine places of interest within a city?
• How do socially attractive places relate to the environmental conditions, such as
known hot spots or establishments in the vicinity?
The remainder of this work is structured as follows. In the first section, we ana-
lyze relevant work from related research streams, namely geographical attractive-
ness, location-based recommendations, and event recognition. The insights found in
related research are used to derive characteristics of social attractiveness in the sub-
sequent section. We describe the characteristics of our Twitter data set and derive a
model to measure attractiveness that solely relies on measurable attributes from
Twitter messages. In the subsequent section we provide a regression analysis to sup-
port and verify our attractiveness model. The resulting findings are then described and visualized. The chapter closes with a concluding section containing a short summary and an outlook on potential future research.
10.2.1 Definition and Measurement of Geo-spatial Attractiveness
Geo-spatial data (e.g. trajectories of people navigating through a city) has long been employed as input for location-based recommendation systems, but has lately received a boost due to the emergence of location-based social networks (Hongzhi et al. 2013; Liu et al. 2013; Ye et al. 2011). Publications on the topic of point-of-interest recommendation systems are manifold, and many of them even propose the implementation of a well-functioning recommender system. Towards understanding the general mobility patterns of individuals, González et al. (2008) carry out a study based on the trajectories of 100,000 mobile phone users, which allows for important insights into human mobility that are useful when proposing location-based recommenders. The types and formats of input data fed into recommendation systems are likewise manifold. They include, for instance, location data, trajectories or GPS
coordinates, data on activities as well as from services, point of interest categories in
the close vicinity, user profiles, personal preferences (on locations), or geo-tagged
photographs (Ballatore et al. 2010; Bao et al. 2012; Waga et al. 2012; Zheng et al.
2010). As baseline approaches, the named researchers choose clusters, matrices, and
collaborative filters. When places are scored for location recommendation, the sys-
tems rely on various metrics, some of them being users’ ratings, experts’ comments,
photo tags, or the distance between a user and a service (Arase et al. 2010; Bao et al.
2012; Waga et al. 2012; Zheng et al. 2010). The context-aware recommendation
system presented by Waga et al. (2012) is accessible via the web and recommends
relevant locations based on user-generated data, as well as based on a dataset consist-
ing of GPS routes, trusted services, and locations from geo-tagged photographs.
The proposed system is divided into three databases and scores locations based on the corresponding three sections: services, photos, and routes.
A different approach for location-based recommendation is proposed by Arase
et al. (2010). The authors focus on detecting patterns from users’ trips and try to
make suggestions on the travel route. Trip patterns are mined by analysis of geo-
tagged photographs publicly posted on the social image sharing platform Flickr by
other users who have travelled to the same geographical area. Not only the geo-tag
itself is used by the system, but the authors additionally rely on each picture’s title
and tags. The large amounts of data backing the geographical scores increase the
trip recommendations' accuracy and can thus be more convincing to users. The collaborative filtering approach proposed by Zheng et al. (2010) is another way of creating geo-spatial ranks. Potentially interesting locations and activities worth
being suggested to a user are mined by analyzing user locations and activity histo-
ries. In their approach, the authors generate further knowledge, such as location
profiles or activity-activity relations from geographical databases and the web.
Based on matrix factorization and grid-based clustering, Zheng et al. (2010) are able
to identify regions with different characteristics according to people’s behavior
within a city.
In addition to the location category, Bao et al. (2012) further take user preferences
and social opinions into account. In their approach, a user’s preference is compared
to the preferences of highly experienced users that are regarded as local experts.
Ballatore et al. (2010) propose a geographical information system called “RecoMap”
to provide personalized recommendations by monitoring social interaction and
context. Yue et al. (2009) rely on user-generated GPS data in order to discover
potentially interesting locations. Furthermore, the authors analyze areas where
users travel and where they rest. The idea behind this is that places of interest can be
inferred from clustering the pick-up and drop-off locations of taxi passengers. The
research of Leung et al. (2011) proposes clustering and recommendation based on
activities drawn from GPS log files. Yoon et al. (2012) analyze user-generated GPS
trails to learn transmission routes from experts and residents in order to propose
itineraries to first-time visitors.
Apart from the identification of location-bound activities and attractions that lie in
the common interest of many social media users, there is an active field of research
in recognition of global, regional, or even local events by analysis of social media
data. For instance, events can be popular sports games, festivals, regional weather
phenomena, epidemics, or natural disasters. The principle of studies by Lee et al.
(2011) and Rattenbury et al. (2007) is to monitor sudden increases or decreases of tweets within a short time period. If the volume exceeds the regular number of messages per day for that geographical area, the spike is considered an event. Lee
et al. (2011) monitor certain geographical areas and are able to identify unexpected
as well as expected activities from Twitter messages. Contrastingly, Rattenbury
et al. (2007) rely on an unstructured set of Flickr tags when extracting events and
places. They focus on photographs from the San Francisco Bay Area and extract
semantics from their assigned tags. The authors are able to detect important spots
and events from within the city.
One aspect that most social-media related research threads have in common is their
focus on the textual content of messages. The widely used abbreviations in internet
speech, as well as the large number of different languages used in Twitter messages
render textual analysis a demanding task. Since the exact semantics (i.e. not only
semantic estimates or sentiment) are not easily extracted and usually contain heavy
uncertainty given the style and pace at which Twitter messages are generated—even
given today’s tools and methods—we intend to explore the value of the faster and
simpler geographic analysis. Furthermore, to be able to use any tweet in any city around the globe, no matter whether it contains information as such, we think that there is a need to rely solely on the geo-spatial and temporal dimensions of the information provided by social media. In contrast to existing research, we also aim to detect popular places by people's social activity without requiring the exact reason "why." We argue that digging for the particular environmental reasons for which people visit places may limit the findings of the research and analysis to pre-defined spots. To the best of our knowledge, there is no existing research that identifies geographical areas by their social attractiveness rather than by environmental conditions. Our proposed method is a more general approach, suitable, for example, for travel research, city planning, and real estate investment.
In order to fill the identified research gap, we propose a method to find hot spots
within a city by mining the geo-spatial information from Twitter messages. In this
section, we present categories of activities representing people’s actions when visit-
ing an area. We provide information on our dataset and draw evidence on the appli-
cability of Twitter data for measuring attractiveness. Finally, we propose a
mathematical formalization to estimate social attractiveness of areas and support it
visually. The model then is validated in the subsequent section.
Wherever there is an increased number of people in a definable geographical
vicinity, we expect something to be there that attracts their attention and thus lies in
their common interest. Identifying the position of people using social web services
with geo-tagging, we equate their common interest with the social attractiveness of the location. From a tourist's perspective, Gearing et al. (1974) and Enright and
Newton (2004) identified five groups of factors that can be used to measure the
touristic attractiveness of an area. Furthermore, 17 criteria were identified and
assigned to the five major groups, as outlined in Table 10.1. The authors have laid
out their work together with an assessment of the criteria and a calculation of
weights to fit larger geographic areas that may or may not attract tourists, such as
entire cities.
In this research, we focus on a finer grained resolution since we want to identify
attractive places within single cities. Nevertheless, the findings of Gearing et al. (1974) can be adopted as a guideline. Because we want to measure not only touristic attractiveness but also attractiveness for the social and daily lives of residents, the original assignments may be inappropriate. To adapt the findings of
Gearing et al. (1974) to modern urban living, we rearrange the criteria to be geared
towards civil living and businesses, in addition to the touristic alignment. This
requires a more general definition of attractiveness that complies with the habits of
social service users. According to the research carried out by Leung et al. (2011),
activities can be the key aspect towards measuring location-based attractiveness.
Where many people are engaged in activities in a certain area, the density of indi-
viduals can possibly reflect the attractiveness of that region and thus serve well as a
measure.
Table 10.1 Groups and criteria to judge touristic attractiveness according to Gearing et al. (1974)
(1) Natural factors: Natural beauty; Climate
(2) Social factors: Artistic and architectural features; Festivals; Distinctive local features; Fairs and exhibits; Attitudes toward tourists
(3) Historical factors: Ancient ruins; Religious significance; Historical prominence
(4) Recreational and shopping facilities: Sports facilities; Educational facilities; Facilities conducive to health, rest, and tranquility; Nighttime recreation; Shopping facilities
(5) Infrastructure and food and shelter: Infrastructure above "minimal touristic quality"; Food and lodging facilities above "minimal touristic quality"
Similar research is proposed by Jaffe et al. (2006), who use densities to
build an algorithm that is able to cluster geo-referenced images into collections.
Inspired by their research, our starting point is to model human attention and action
using official Twitter data at hand. In order to utilize Twitter status updates as a proxy for attention and action, the following assumptions are made: (1) Twitter status messages are posted in real time when a user experiences something he or she assumes will attract others' attention and interest; (2) Twitter status messages with geo-tags are especially posted near locations that offer some important activity or landmark; (3) the density of tweets reflects to some extent the number of people moving around in a specific location, cf. Jaffe et al. (2006).
We refrain from questionnaire-based research as carried out by Gearing et al.
(1974) and employ a quantitative methodology instead. Due to the tools and meth-
ods available today, we can efficiently mine large amounts of data gathered from
Twitter, extract user activity and locations, and calculate densities on temporal and
geographical bases. Adjusting the categories outlined above, we propose a mapping according to Gearing et al. (1974), outlined in Table 10.2.
[Fig. 10.1: Twitter volume per hour of day (x-axis: Hour of Day; y-axis: number of geo-tagged tweets)]
People are engaged in activities: tourists on a city trip can, for example, follow their pleasure when sight-seeing, while residents go about their everyday business. Independently of the particular activity, people encounter various circumstances and conditions in their routines in a city. Among those people, active Twitter users post status messages to inform their friends and followers about their experiences, thoughts, feelings, or impressions. Irrespective of the kind of status update people post, they reveal that they are engaged in an activity at a certain location at that very time. In
this research, we rely solely on geo-tagged tweets in order to use spatial informa-
tion to determine patterns of activities. The patterns identified in the following
provide evidence of Twitter being an appropriate data source to measure geograph-
ical attractiveness.
We have gathered more than 600,000 geo-tagged tweets from the city area of San
Francisco that were posted from August to October 2013. The data was obtained directly from Twitter and as such can be considered to represent the full extent of geo-tagged tweets from within the observation period. The data points at hand contain the tweet's user and text, the geographical location, as well as additional information, such as contained URLs and images. The Twitter messages from our three-month period show a robust and stable pattern over a 24-h period, as depicted in Fig. 10.1. With generally low activity at around 50–100 geo-tagged tweets per hour, the night hours draw the baseline. The hourly volume of tweets increases during the morning hours and peaks around noon, followed by a slight drop until
3 p.m. The afternoon and early evening cover the most active phase of the day with
a peak at around 6 p.m. After that peak, the tweet volume steadily drops until it
reaches the nightly baseline. Given the steadiness of this pattern, we can assume
Twitter users to act in daily patterns as well. As found by Bendler et al. (2014), there
is a causal link between the time of day and the Twitter activity in the vicinity of
certain points of interest. For example, Twitter activity is above average around
restaurants in the evening, around bars and night clubs during late-night hours, and
around cafés during the forenoon. Thus, given the official Twitter data at hand, we are confident that we can derive attractiveness measures for certain areas of a city, potentially differentiated by hour of day.
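The 24-h profile of Fig. 10.1 can be reproduced with a simple aggregation; the sketch below assumes a CSV export of the geo-tagged tweets with a 'created_at' timestamp column and is not the authors' code.

```python
# Average number of geo-tagged tweets per hour of day over the observation window.
import pandas as pd

tweets = pd.read_csv("sf_tweets.csv", parse_dates=["created_at"])  # assumed export
tweets["hour"] = tweets["created_at"].dt.hour
n_days = tweets["created_at"].dt.date.nunique()
hourly_avg = tweets.groupby("hour").size() / n_days
print(hourly_avg)
```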
Figure 10.2 illustrates the difference in averaged Twitter volume by hour of day
in four different districts of San Francisco. On the left-hand panel, the absolute aver-
age is plotted for the administrative districts South of Market, Pacific Heights, the
Golden Gate Park, and Sunset District. The right-hand panel shows the offset
between each district and the average Tweet volume of the entire city. The districts
are chosen in a manner to represent areas that possibly offer different activities and
thus yield different social user behaviors at different times of day. South of Market, a business district, shows a Twitter volume pattern that is close to the city-wide average but still lies above average during daytime and below average in the evening hours. Pacific Heights and Sunset District are residential areas, which is most likely the cause of their similar patterning. We can identify an above-average volume during morning and evening hours and a below-average volume during working hours. The Golden Gate Park, a recreational area, has an entirely different pattern. It lies below average during most of the day but shows a high peak from noon to the early evening. These different patterns for different districts in San Francisco
provide some evidence that Twitter data can represent societal habits and thus may
be consulted as a source for measuring the attractiveness of urban areas.
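Continuing the sketch above, the district comparison of Fig. 10.2 could be computed as follows, assuming a 'district' column obtained, for example, by spatially joining tweets with district polygons; the city-wide average is approximated here by the mean over districts.

```python
# Average hourly volume per district and its offset from the city-wide hourly average.
per_district = (
    tweets.groupby(["district", "hour"]).size().unstack("hour", fill_value=0) / n_days
)
city_avg = per_district.mean(axis=0)
offsets = per_district.sub(city_avg, axis=1)  # cf. the right-hand panel of Fig. 10.2
```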
The Twitter data enables the identification of distinctive usage patterns with respect
to city district and time of day. Based on this perception, we develop a model in this
section that allows for estimating the social attractiveness of regions within a city.
The outline of our approach is given in Fig. 10.3. Driven by the dataset of Twitter
messages, we formulate estimations of popularity and activity based on a grid.
Fig. 10.3 Calculating scores from estimated popularity and attractiveness on a grid
These two estimated measures will then be combined into a score that indicates the
social attractiveness of the area covered by the respective cell. Once performed for
the entire grid, the method allows comparison of the estimated social attractiveness
among grid cells, effectively identifying hot spots of a city.
The social attractiveness model proposed in this section targets the attractiveness
estimation for regions based on Twitter activity. Our approach is different from
recent relevant literature in various ways. Lee et al. (2011) attempt to detect unusual
events by dividing an area into a grid and employ clustering methods to discover
unusual regional social activities. The authors monitor the tweet count and Twitter users within a specific time horizon to spot exactly such unusual growth. Bao et al. (2012) employ a location-based social network to gather the location history data of users, combine it with rating information from local experts, and are therefore able to provide personalized recommendations. In contrast to these approaches, we neither rely on events that can be isolated in the temporal dimension, nor do we track location histories of certain users or consult experts. Instead, we carry out analyses that rely solely on the social activity on Twitter, using a grid that covers the city
area of San Francisco. We further refine the characteristics of activities by calculating additional key metrics, such as the number of unique users.
Following Tobler’s first law of geography “Everything is related to everything
else, but near things are more related than distant things” (Tobler 1970), we employ
a grid-based approach for analysis of attractiveness within an urban environment,
efficiently splitting the major problem into sub-problems of smaller regional diver-
sity and complexity. As Lee et al. (2011) note, determining an adequate cell size is
very difficult for grid-based approaches. If analyses are performed based on admin-
istrative regions, the results can be considered oversimplified because large regions
most likely span multiple areas with different focuses of interest. Furthermore,
accuracy and plausibility are questionable. The same applies to the case of choosing
a grid with a very high resolution. A single area of interest could be split up into
multiple grid cells yielding different results though they belong together from an
activity point of view. Thus, we leave the grid size variable and carry out our
analyses with different resolutions. The grid’s definition is given in Eq. (10.1).
$$G(\lambda, \phi, \Delta\lambda, \Delta\phi, x, y) = \begin{pmatrix} g_{1,1} & g_{2,1} & \cdots & g_{x,1} \\ g_{1,2} & & & g_{x,2} \\ \vdots & & \ddots & \vdots \\ g_{1,y} & g_{2,y} & \cdots & g_{x,y} \end{pmatrix} \tag{10.1}$$
with
x, y ∈ ℕ: grid dimensions
λ, ϕ: longitude and latitude describing the grid's origin
Δλ, Δϕ: longitude and latitude edge lengths of the cells
gi,j ↦ (λg, ϕg): grid cells
Let a grid G be defined as a matrix of x × y grid cells gi,j. Each grid cell g is defined as a tuple of longitude λg and latitude ϕg corresponding to the center of the respective cell. The area covered by a single cell can thus be described by $\left(\lambda_g \pm \tfrac{\Delta\lambda}{2},\; \phi_g \pm \tfrac{\Delta\phi}{2}\right)$.
Since we attempt to measure social activity, we denote all Twitter messages a as
representatives of the set of all activities A, as outlined in Eq. (10.2). Each Twitter
message maps to an 8-tuple consisting of the point in time t at which the message was published, longitude λa and latitude ϕa describing the exact location, a unique user
identifier u, the message text m and sets I , V , P containing all images, videos, or
places attached to the tweet, respectively.
$$A = \{a_1, a_2, \ldots, a_{|A|}\},\qquad a \in A,\quad a \mapsto (t, \lambda_a, \phi_a, u, m, I, V, P) \tag{10.2}$$
with
t: point in time of publication
λa , ϕa: longitude, latitude describing location of publication
u: unique user identifier
m: message text
I = {i1 … i|I|}: set of all images attached to the tweet
V = {v1 … v|V|}: set of all videos attached to the tweet
P = {p1 … p|P|}: set of all places attached to the tweet
Let the geographic reference between a Twitter message and a grid cell be
denoted by the ~ operator. Following the principles of topological spaces in math-
ematics, if a is geographically enclosed, i.e. embedded in the area covered by g,
then the unary operation ~a yields the respective cell g as defined in Eq. (10.3). Note
that ~ does not represent proportionality in this context.
$$\sim\,: A \to G,\quad a \mapsto g \in G \;\Big|\; |\lambda_a - \lambda_g| \le \frac{\Delta\lambda}{2} \;\wedge\; |\phi_a - \phi_g| \le \frac{\Delta\phi}{2} \tag{10.3}$$
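A minimal sketch of the grid and the cell-assignment operator from Eqs. (10.1) and (10.3) is given below; the origin, edge lengths, and dimensions are illustrative values chosen for the sketch, not those used by the authors.

```python
# Map a geo-tagged tweet to the enclosing grid cell of a rectangular grid
# defined by its south-west origin, cell edge lengths, and dimensions.
import math

LON0, LAT0 = -122.52, 37.70   # assumed grid origin for San Francisco
D_LON, D_LAT = 0.005, 0.004   # assumed cell edge lengths in degrees
NX, NY = 50, 50               # the chapter reports a 50 x 50 grid

def tweet_to_cell(lon_a, lat_a):
    """Return the (i, j) cell enclosing a tweet, or None if it lies outside the grid."""
    i = math.floor((lon_a - LON0) / D_LON)
    j = math.floor((lat_a - LAT0) / D_LAT)
    if 0 <= i < NX and 0 <= j < NY:
        return (i, j)
    return None
```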
For each cell g within a grid G a score value σg is calculated as outlined in
Eq. (10.4). Our proposed score consists of two main parts: a popularity estimate φg
reflecting the relative social activity in terms of published Twitter messages, and
an activity estimate αg that describes the likelihood of contents being shared via
Twitter.
$$\sigma_g = \begin{cases} \varphi_g^{\,1/\alpha_g} & \text{if } \alpha_g > 0 \\ 0 & \text{else} \end{cases} \tag{10.4}$$
with
αg: activity estimate, likeliness of contents being shared
φg: popularity estimate, relative social activity
σg ↦ ℝ ≥ 0 , αg ↦ ℝ ≥ 0 , φg ↦ ℝ ∈ [0, 1]
Our baseline is to estimate the attractiveness of urban areas based on online
social activity. Both estimates, activity and popularity, should have a positive impact
on the score with growing values. A small value of α along with a small value of φ
should yield a low score. Accordingly, both estimates in the upper region of their
respective domains should result in a high score. Since both values can hardly be
compared directly to each other due to their different domains and distributions,
the popularity is raised to the power of the activity's inverse to obtain a score value. Note that it is crucial to employ the inverse, given the domain of α, in order to preserve the monotonic increase of σ for increasing values of both α and φ. With increasing popularity value φ, the slope with respect to α gets steeper. This means that for areas with a higher popularity, a higher activity estimate leads to increased
score values, whereas for lower popularity areas, a high activity estimate only yields
lower scores. This behavior is intended to constrain high scores caused by single
users in solitary areas. Eqs. (10.5) and (10.6) provide detailed descriptions on how
φ and α are being calculated.
The popularity estimate φg of a specific grid cell, as given in Eq. (10.5), is defined as the share of all available Twitter messages that relate to that cell. Thus, the resulting value reflects the relative activity
weight of the covered district, which can be directly compared among different
cells. Whether people only traverse the cell and publish a tweet in the meantime or
whether they actually follow some interest in the cell, the popularity value can be
regarded as the density of activities.
$$\varphi_g = \frac{\bigl|\{a : \;\sim a = g\}\bigr|}{|A|} \tag{10.5}$$
with
φg ↦ ℝ ∈ [0, 1]
The activity estimate αg is outlined in Eq. (10.6). It is defined as the sum of the
Twitter messages’ characteristics of unique users Ug, image attachments Ig, video
attachments Vg, and linked places Pg, divided by the total amount of Twitter mes-
sages in the respective cell.
$$\alpha_g = \frac{U_g + I_g + V_g + P_g}{\bigl|\{a : \;\sim a = g\}\bigr|} \tag{10.6}$$
with
Ug = |{ua : ~ a = g}|: the amount of distinct users within the cell
Ig = ∑a : ~ a = g|Ia|: the amount of images posted from within the cell
Vg = ∑a: ~ a = g|Va|: the amount of videos posted from within the cell
Pg = ∑a : ~ a = g|Pa|: the amount of places posted from within the cell
αg ↦ ℝ ≥ 0
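For illustration, the three quantities of Eqs. (10.4)–(10.6) could be computed per cell from a tweet table as sketched below; 'cell', 'user', 'n_images', 'n_videos', and 'n_places' are assumed column names, not part of the original study.

```python
# Per-cell popularity (Eq. 10.5), activity (Eq. 10.6), and score (Eq. 10.4).
# Cells without any tweet do not appear in the result and implicitly score 0.
import pandas as pd

def cell_scores(tweets: pd.DataFrame) -> pd.DataFrame:
    total = len(tweets)                       # |A|, all Twitter messages
    per_cell = tweets.groupby("cell").agg(
        n_tweets=("user", "size"),
        unique_users=("user", "nunique"),     # U_g
        images=("n_images", "sum"),           # I_g
        videos=("n_videos", "sum"),           # V_g
        places=("n_places", "sum"),           # P_g
    )
    per_cell["phi"] = per_cell["n_tweets"] / total
    per_cell["alpha"] = (
        per_cell[["unique_users", "images", "videos", "places"]].sum(axis=1)
        / per_cell["n_tweets"]
    )
    per_cell["sigma"] = per_cell["phi"] ** (1.0 / per_cell["alpha"])
    return per_cell
```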
The value Ug reflects the number of unique users identified in the respective cell
g, i.e. all distinct users that have posted at least one Twitter message in the cell. This
rate supports the distinguishability between areas where people reside longer and
those areas where people only stay for a short period of time, measured by their
actual activity. If people remain in the same area for an extended time-span, then it
is more likely that each unique user publishes more than only one Twitter status
update. For example, a residential area that is home to several Twitter users will
probably contain multiple Twitter activities by each user during the observation
period. Spots with a higher fluctuation of unique users (e.g. airports) are potentially
less likely to see the same user over and over again. The rate of Twitter activities that
contain an image reference, denoted Ig, is essential for estimating the activity in
sharing contents in an area. We expect users to be more likely to publish a tweet that
contains a photograph if something in their vicinity exists that they think is worthy
of being shared. For example, photographs of food and drinks are very common
among Twitter users, as well as pictures of people in front of well-known land-
marks. As a result, not only do the contents of the photographs contain information on the activity the user is pursuing, but the mere fact that a photograph or some other image is attached to the tweet underlines the importance of the activity. Though
videos Vg possibly have a different impact on the attractiveness than images, their
meta-information remains valuable. Finally, the amount of directly linked places Pg
is taken into account as well. Note that no textual analysis is performed and the
places mentioned are not extracted from the Twitter message itself. Instead, links
leading to websites that allow location-sharing are counted, such as Foursquare
check-ins.
Generally, the popularity estimate represents pure social activity, while the activity estimate reflects the willingness of many different users to share additional content. The visualization of estimated popularity φ and activity α on a fine-
grained grid covering San Francisco is shown in Fig. 10.4. Panel (a) depicts the
popularity estimate. The city center can easily be identified, as well as the university
campus in the south-west. While panel (a) does not reveal increased values along
the coastlines, they are easily recognized in panel (b). Furthermore, panel (b) also
highlights the Golden Gate Park and the entire pier area between the Golden Gate
Bridge (north) and the San Francisco-Oakland Bay Bridge (north-east). Bringing
these two estimates—popularity and activity—together by calculating a score as
defined in Eq. (10.4), we obtain the distribution delineated in panel (c). Darker areas
account for areas that are likely to attract more people, who in turn are more likely
to attach additional contents to their Twitter messages.
however that most of them are anchored to a very specific geographical location. In
order to employ points of interest categories for cross-checking our scores, we
assigned each of them to none, one, or more of our activity categories (a)–(e). When
carrying out geo-spatial analyses, the spatial presence of the entity to be employed
should be given in order to obtain valid results. Figure 10.5 shows the presence of
all activity categories over the city of San Francisco.
Visual inspection of the categories' distribution in Fig. 10.5 suggests the presence of strong multicollinearity. This problem is inherent to the approach and requires special consideration. The rationale for the present multicollinearity is twofold. First, there is a large overlap between the points of interest of some categories. This category overlap results from the assignment of multiple categories to the same point of interest, such as, for instance, the additional labeling of restaurants with the category 'food'.
Fig. 10.5 Geo-spatial presence of points of interest for each activity category
The same holds for a souvenir shop, which can be tagged as 'sight-seeing' and 'shop' at the same time. The fact that categories are not disjoint inevitably leads to implicit correlation. The second reason for multicollinearity among points of interest is their natural geographical clustering. Even when a point of interest is described by a single category, it may naturally be more present in the proximity of other categories. In shopping malls or pedestrian zones, for example, different kinds of establishments are often located close to each other, which results in their densities behaving similarly in a spatial manner.
When dealing with multicollinearity, the most common measure is the variance
inflation factor (VIF). However, O’Brien (2007) points out that researchers should
be cautious and should not blindly follow VIFs. According to his remarks, VIF
thresholds are somewhat arbitrary, and attempts to eliminate multicollinearity often
result in more damage than was originally caused. The major effect of a likely
multicollinearity in our model is that variances of coefficient estimates are likely
to be increased, even though the coefficients themselves are generally unbiased.
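As a sketch of such a diagnostic, variance inflation factors for the five category regressors could be computed as follows; 'poi_per_cell.csv' and its columns are assumptions for this sketch.

```python
# VIFs for the category regressors (a)-(e); following O'Brien (2007) they are
# reported rather than used as a reason to drop variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

poi_counts = pd.read_csv("poi_per_cell.csv")            # assumed per-cell POI counts
X = sm.add_constant(poi_counts[["a", "b", "c", "d", "e"]])
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)
```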
Fig. 10.6 Blending points of interest into the model for validation of results
In order to support our estimated attractiveness scores in each grid cell, we test our
measures against the activity categories using points of interest from map data as a
second, independent data source. Despite the present multicollinearity, we are able to employ linear regressions to identify variables with a significant influence (O'Brien 2007).
Regressions carried out in this context are based on our complete dataset consist-
ing of more than 600,000 Twitter status messages and more than 60,000 points of
interest from the city of San Francisco. The Twitter data spans the 3 months of
August through to October 2013. The points of interest are available in many
different categories and are reassigned to our five categories (a)–(e). Figure 10.6
outlines how the external data source for result validation blends into our research
approach.
The classification of points of interest into our activity categories enables us
to test the explanatory power of measures from Twitter messages per cell, such
as unique users, the number of images or videos attached, and the number of
places shared. Each of these measures is employed as the dependent variable in
Table 10.4 Regression results for Twitter measures and activity categories

                                      Unique users      Images attached    Videos attached   Places shared
Intercept                             26.46 (6.73)***   11.38 (5.27)***    0.29 (2.55)*      3.06 (1.73)
(a) Sightseeing, culture, landmarks   −1.40 (−0.98)     0.18 (0.22)        −0.01 (−0.33)     −1.29 (−2.00)*
(b) Nightlife, events, entertainment  36.82 (19.10)***  15.55 (14.68)***   0.17 (3.09)**     15.38 (17.69)***
(c) Shopping, sports, business        0.87 (4.20)***    −0.19 (−1.67)      0.01 (1.09)       0.52 (5.56)***
(d) Restaurant, accommodation         1.77 (2.46)*      1.70 (4.31)***     −0.001 (−0.06)    1.16 (3.591)***
(e) Transportation                    −2.37 (−1.39)     −1.88 (−2.01)*     0.03 (0.57)       −1.22 (−1.59)
Adjusted R²                           0.5147            0.3375             0.0415            0.5205

Stated: OLS coefficients, t-statistics in parentheses, based on 2500 observations
Significance levels: * 0.05 ** 0.01 *** 0.001
a linear regression (cf. Eq. (10.7)) and tested for dependency on the activity categories (a)–(e).

$$y \sim a + b + c + d + e + \epsilon \tag{10.7}$$

with
y: dependent variable; unique users U, attached images I, attached videos V, or shared places P
a … e: independent variables, the number of points of interest assigned to the respective category:
a: sightseeing, culture, landmarks
b: nightlife, events, entertainment
c: shopping, sports, business
d: restaurant, accommodation
e: transportation
ϵ: the residual error
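A sketch of fitting Eq. (10.7) once per Twitter measure (cf. Table 10.4); 'cell_measures.csv' is an assumed per-cell file containing the four measures and the category counts a–e.

```python
# One OLS fit per dependent Twitter measure, as in Table 10.4.
import pandas as pd
import statsmodels.formula.api as smf

cells = pd.read_csv("cell_measures.csv")   # assumed per-cell measures and POI counts

for measure in ["unique_users", "images", "videos", "places"]:
    fit = smf.ols(f"{measure} ~ a + b + c + d + e", data=cells).fit()
    print(measure, fit.params.round(2).to_dict(), "adj. R2:", round(fit.rsquared_adj, 4))
```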
From the regression results shown in Table 10.4, we can identify different combi-
nations of activity categories responsible for the different shapes in Twitter message
characteristics. For instance, category (a) only has a significant influence on places
shared. Category (b), most likely due to its entertainment characteristic, has a highly
significant impact on all of the Twitter measures. The category (c) of shopping, sports, and business shows an increased impact on user fluctuation and on the likelihood of attaching a current location to a tweet. When in restaurants or at their hotel or hostel
(d), people tend to post images as well as link their location, whereas posting videos
from these venues is very uncommon. Finally, the transportation category (e) only has
a slightly significant influence on posting images. These results support the selection
of our measures since we obtain characteristic tweeting behavior based on the respec-
tive geographical environment. According to O’Brien (2007), we can interpret the
significance of our regression results despite the existing multicollinearity. Variables that show a significant impact truly are significant; only their coefficient estimates are not directly comparable. We can identify distinctive combinations of categories being sig-
nificant for each of our measures, effectively meaning that different conditions induce
different Twitter behavior. From Tobler’s first law of geography, we learned that the
impact of geographic relation depends on distance, and from research by Lee et al.
(2011) we know that determining an adequate cell size is difficult for geographical
analyses based on a grid (Tobler 1970). We used a grid resolution of 50 by 50 cells to
cover the city area of San Francisco. Low-resolution grids may blur results and over-
simplify the problem, whereas a very high resolution can lead to inappropriate results
due to over-fitting. The adequate grid resolution depends on the density of observa-
tions, points of interest, population, and presumably additional societal and environ-
mental circumstances.
10.4.2 Findings
Answering the first research question posed at the outset, general social media activity
can be utilized to spot socially attractive places. We propose a model that is capable of
estimating the areal attractiveness within a city by taking solely Twitter data into
account. Our model delivers a score that is a composite of a region's popularity and its visitors' activity density. From the identified activity categories we can infer that social media data implicitly covers aspects of attractiveness captured by other criteria known from the literature. Cross-checking our model with points of interest from map services shows that distinctive combinations of activity categories are responsible for the observed Twitter characteristics. This observation reveals Twitter activity to be closely related to points of interest in the vicinity and furthermore supports Twitter as a suitable proxy for people's activity within a city. Figure 10.7 depicts our results for San Francisco after being smoothed by a Gaussian filter (3 × 3). The left-hand panel shows a 3D plot, the right-hand panel depicts the contours of the same data. Areas of interest can clearly be spotted, while their intensity can be measured by the spike amplitudes. The contiguous areas from the contour plot can be used directly in a broad variety of domains.
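The 3 × 3 smoothing step could look as follows; 'score_grid' is an assumed 50 × 50 array of per-cell scores, and the kernel is a standard discrete 3 × 3 Gaussian used here for illustration.

```python
# Smooth the score grid with a 3 x 3 Gaussian kernel before plotting (cf. Fig. 10.7).
import numpy as np
from scipy.signal import convolve2d

score_grid = np.loadtxt("score_grid.csv", delimiter=",")  # assumed 50 x 50 array of sigma values

kernel = np.array([[1.0, 2.0, 1.0],
                   [2.0, 4.0, 2.0],
                   [1.0, 2.0, 1.0]]) / 16.0
smoothed = convolve2d(score_grid, kernel, mode="same", boundary="symm")
```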
Due to the ongoing increase in velocity and volume, social media is becoming increasingly powerful. In this research, we aimed to explore methods and measures for identifying places of interest within a city, relying solely on metrics from social media data.
Whenever people generate data unconsciously, this data reflects their true activities
and thus is of enormous value in the analysis of behavioral patterns. Furthermore, the
spatial relationship between people’s activities, their recent location, and points of
interest in their vicinity can reveal detailed information on attractiveness of places
within a city. Based on relevant related research, we have identified aspects that render
urban locations attractive to people, whether they are tourists, residents, city planners,
or investors. We identified five different categories describing activities in urban living and, furthermore, were able to verify them using a linear regression model on Twitter message characteristics after obtaining evidence from an initial visual analysis.
We developed a grid-based scoring approach to determine areas of interest within a
city. Using points of interest from mapping services as an independent data source, we
showed the validity of our scores and identified an appropriate grid resolution to work
on when performing analyses on an inner-city level.
With respect to the research questions, we state that social media activity can be
exploited to identify urban hot spots of various kinds. The spatial coordinates that
are available as a part of the messages from social platforms and services reflect
public activity. Additional data from social media messages, such as appended
images, videos, or linked places allow inference on the type of activity dominant in
certain areas within a city. According to our findings, social media activity in an
urban environment can be utilized to spot socially attractive places. Furthermore,
we have shown that social media correlates to certain point of interest categories
and thus can serve as a proxy for environmental characteristics based on a grid.
Socially attractive places relate strongly to environmental conditions, including
tourist hot spots, attractions, restaurants, hotels, and many other establishments.
Our contribution of spotting areas of interest solely based on social media data is a
valuable addition to a broad variety of applications, covering for example city plan-
ning, disaster management, city safety, venue recommendation and trip advice,
launching businesses, and investment strategies.
Our study and findings are limited by the small fraction of tweets providing geo-
spatial information, since most users opt out of disclosing their location when publishing a message. Only around 1% of all Twitter messages sent contain the required geospatial information.
Biographies
References
Arase Y, Xie X, Hara T, Nishio S (2010) Mining people’s trips from large scale geo-tagged photos.
In: Proceedings of the 18th ACM international conference on multimedia. ACM, New York,
pp 133–142
Ballatore A, McArdle G, Kelly C, Bertolotto M (2010) RecoMap: an interactive and adaptive
map-based recommender. In: Proceedings of the 2010 ACM symposium on applied computing.
ACM, New York, pp 887–891
Bao J, Zheng Y, Mokbel MF (2012) Location-based and preference-aware recommendation using
sparse geo-social networking data. In: Proceedings of the 20th international conference on
advances in geographic information systems. ACM, New York, pp 199–208
Bendler J, Wagner S, Brandt T, Neumann D (2014) Taming uncertainty in big data. Bus Inf Syst
Eng 6(5):279–288
Enright MJ, Newton J (2004) Tourism destination competitiveness: a quantitative approach. Tour
Manag 25(6):777–788
Facebook Inc. (2013) Facebook form 10-K annual report
Foursquare (2014) About Foursquare. https://foursquare.com/about
Gearing CE, Swart WW, Var T (1974) Establishing a measure of touristic attractiveness. J Travel
Res 12(4):1–8
González MC, Hidalgo CA, Barabási A-L (2008) Understanding individual human mobility
patterns. Nature 453(7196):779–782
Hongzhi Y, Yizhou S, Cui B, Zhiting H, Chen L (2013) LCARS: a location-content-aware rec-
ommender system. In: Proceedings of the 19th ACM SIGKDD international conference on
knowledge discovery and data mining. ACM, New York, pp 221–229
Hu Y, Ritchie JB (1993) Measuring destination attractiveness: a contextual approach. J Travel Res
32(2):25–34
Instagram (2014) Instagram Press News. http://instagram.com/press/
Jaffe A, Naaman M, Tassa T, Davis M (2006) Generating summaries and visualization for large
collections of geo-referenced photographs. In: Proceedings of the 8th ACM international work-
shop on multimedia information retrieval. ACM, New York, pp 89–98
Jansen-Verbeke M (1986) Inner-city tourism: resources, tourists and promoters. Ann Tour Res
13(1):79–100
Jeffries A (2013) The man behind Flickr on making the service ‘awesome again’. http://www.thev-
erge.com/2013/3/20/4121574/flickr-chief-markus-spiering-talks-photos-and-marissa-mayer
Kozak M, Rimmington M (1999) Measuring tourist destination competitiveness: conceptual
considerations and empirical findings. Int J Hosp Manag 18(3):273–283
Lee R, Wakamiya S, Sumiya K (2011) Discovery of unusual regional social activities using
geo-tagged microblogs. World Wide Web 14(4):321–349
Leung KW-T, Lee DL, Lee W-C (2011) CLR: a collaborative location recommendation framework
based on co-clustering. In: Proceedings of the 34th international ACM SIGIR conference on
research and development in information retrieval. ACM, New York, pp 305–314
Lew AA (1987) A framework of tourist attraction research. Ann Tour Res 14(4):553–575
Liu B, Fu Y, Yao Z, Xiong H (2013) Learning geographical preferences for point-of-interest
recommendation. In: Proceedings of the 19th ACM SIGKDD international conference on
knowledge discovery and data mining. ACM, New York, pp 1043–1051
Niedomysl T (2010) Towards a conceptual framework of place attractiveness: a migration
perspective. Geogr Ann Ser B, Hum Geogr 92(1):97–109
O’Brien RM (2007) A caution regarding rules of thumb for variance inflation factors. Qual Quant
41(5):673–690
Rattenbury T, Good N, Naaman M (2007) Towards automatic extraction of event and place seman-
tics from Flickr tags. In: Proceedings of the 30th annual international ACM SIGIR conference
on research and development in information retrieval. ACM, New York, pp 103–110
Rogerson RJ (1999) Quality of life and city competitiveness. Urban Stud 36(5–6):969–985
Tobler WR (1970) A computer movie simulating urban growth in the Detroit region. Econ Geogr
46:234
Twitter Inc (2014) About Twitter. https://about.twitter.com/company
Waga K, Tabarcea A, Fränti P (2012) Recommendation of points of interest from user generated
data collection. In: 8th IEEE international conference on collaborative computing: networking,
applications and worksharing. IEEE, pp 550–555
Ye M, Yin P, Lee W-C, Lee D-L (2011) Exploiting geographical influence for collaborative point-
of-interest recommendation. In: Proceedings of the 34th international ACM SIGIR conference
on research and development in information retrieval. ACM, New York, pp 325–334
Yoon H, Zheng Y, Xie X, Woo W (2012) Social itinerary recommendation from user-generated
digital trails. Pers Ubiquit Comput 16(5):469–484
Yue Y, Zhuang Y, Li Q, Mao Q (2009) Mining time-dependent attractive areas and movement
patterns from taxi trajectory data. In: 17th international conference on geoinformatics. IEEE,
pp 1–6
Zheng VW, Zheng Y, Xie X, Yang Q (2010) Collaborative location and activity recommendations
with GPS history data. In: Proceedings of the 19th international conference on world wide web.
ACM, New York, pp 1029–1038
Chapter 11
The Competitive Landscape of Mobile
Communications Industry in Canada:
Predictive Analytic Modeling with Google
Trends and Twitter
Michal Szczech and Ozgur Turetken
Abstract Google Trends, the service that illustrates the trends in Google search
activity, has recently received attention from analytics researchers for the prediction
of economic trends and consumer behavior. Previous studies used Google Trends to
estimate consumption and sales for a particular business, or provide general trends
for an economic sector or industry. The study reported here differs from these
attempts as it aims to estimate the performance of a single player in an industry by
not only trends related to that player, but also those of its competitors. Further, these
trends have been modified by Twitter based sentiment scores. It is demonstrated that
the incorporation of competitive factors improves estimates by as much as 5%, while the addition of a Twitter sentiment score is not beneficial. The Twitter-related findings could be due to the fact that tweet volumes in the particular industry that was
examined are low and volatile.
11.1 Introduction
People increasingly turn to the Internet to find information for making decisions and to educate themselves about various topics. At the end of 2014, about 42% of the world's population had experienced at least one Internet-based
M. Szczech
CGI, 750 Hillman Crescent, Mississauga, ON, Canada, L4Y2J2
Ted Rogers School of Management, Ryerson University,
350 Victoria Street, Toronto, ON, Canada, M5B 2K3
e-mail: [email protected]
O. Turetken (*)
Ted Rogers School of Management, Ryerson University,
350 Victoria Street, Toronto, ON, Canada, M5B 2K3
e-mail: [email protected]
service. In North America, the penetration of the Internet is close to 87% of the
population whereas in Canada the figure was even higher at 94% of the population
in 2014 (Internet World Stats 2014). For most people, the Internet is practically
synonymous with the Web. At the time of writing, the number of active websites
on the Internet is estimated to be over one billion (Internet Live Stats 2015). The
sheer amount of web-based content emphasizes the importance of powerful and
effective search engines. Over 65% of all searches on the Internet are conducted by
Google, which amounts to over 3.5 billion searches per day. Like other search
engines, Google stores the search queries that users submit along with geolocation
information. Most of these data are available to the general public through a service
called Google Trends. The data are updated daily, even hourly, through a special
service called Google Trends Hourly. Typically, whatever people search for in a
given region of the world tends to somewhat reflect events in that region. This pro-
vides rich and timely data for predictive analytics, which many modelers have
attempted to use.
Meanwhile, the typical Web content that can be reached through a service like Google search (and hence Google Trends) is static in the sense that it is not updated
and shared as often as Web 2.0 content. Web 2.0, on the other hand, allows for the
creation of dynamic, highly interactive and content rich applications that differ from
static websites that constituted Web 1.0. These new Internet technologies gave rise
to social networking and social media sites. The most popular social networking
application in the world is Facebook, which currently has about 1.3 billion active
monthly users (Statistic Brain 2015a). Other popular social networking and social
media platforms are YouTube, Twitter, LinkedIn, Pinterest, Instagram, and Google+.
Twitter allows posting short messages, not longer than 140 characters, called
“tweets”. It has about 650 million users worldwide. On average, the world creates
about 58 million tweets per day (Statistic Brain 2015b).
While these changes in technology have shaped how individuals find informa-
tion and communicate it, the ability to store information about millions of users and
their online interactions in searchable databases has significant influence on various
aspects of business, especially for analysts who need market data, but suffer from
the fact that market data that are collected through traditional methods arrive too
late and at a high cost. To satisfy their need for timely market indicators and analy-
sis, they utilize the above mentioned databases maintained by search engines such
as Google and Yahoo and social networking platforms such as Twitter and Facebook.
Google search volume data are available to everyone through the Google Trends
interface (https://www.google.com/trends/) and Twitter data are publicly available
to everyone through searchable databases holding almost the entirety of tweets dat-
ing many years back. These immense databases provide a versatile environment for
many forms of descriptive and predictive analytics. As will be further detailed in the
“Literature Review” section of the chapter, Google Trends and Twitter have been
used to capture market trends and used for predicting the demand for various goods
and services. However, somewhat surprisingly, there is little to no research that we
have been able to encounter that shows the usefulness of web search and social
media based analytics on providing a view of the competitive environment within a
certain industry and geographical area. Our research aims to fill that void and deter-
mine if Google Trends and Twitter data can be effectively used to assess the com-
petitive landscape of a certain industry within a given geographical area.
The specific context of our study is the mobile communication service provision
market in Canada. Although each of the main Canadian mobile communication
service providers knows the market share it holds relative to its competition for a
given quarter, their expressed desire is to have insight as to how well they are per-
forming relative to the competition on a more frequent (e.g. weekly or monthly)
basis. To provide such insight for this market, we observe the following characteris-
tics of the mobile telecommunications industry in Canada: (1) there exist a fairly
stable and small number of competitors, and (2) the market is already mature and
the overall consumer growth year by year is small. These characteristics help in the
formulation of our predictive models. By using Google Trends, regression models
for predicting the market share changes of each of the main mobile service provid-
ers in Canada are developed. We then attempt to improve the models by adding
competition variables. Finally, we incorporate Twitter sentiment scores into our
models to detect if those scores explain some of the variance in the dependent vari-
able that is not already explained by Google Trends data.
Armed with this information, decision makers in mobile telecommunication ser-
vice providers can assess the presence and importance of the threat their direct com-
petition poses to their organization. In turn, they can adjust marketing and sales
strategies by quickly assessing their operator’s performance relative to the competi-
tion. Once the competitive threats are identified, a closer analysis of the competi-
tors’ activities may help identify the source of their success, which in turn helps
strategic planning.
The rest of this chapter is organized as follows. In the next section, we provide a
brief review of the relevant academic literature. Section 11.3 details our approach to
the predictive modeling exercise. In Sect. 11.4, we present the results of the data
analysis, which are discussed further in Sect. 11.5. Section 11.6 presents concluding
remarks and directions for future research.
11.2 Literature Review
11.2.1 Consumer Related Research Involving Google Trends Data
Many studies have demonstrated a relationship between online search and offline
sales. Chandukala et al. (2014) have shown that the market potential or demand,
which is expressed by “latent interest”, can be measured by analysing online search.
Their findings have been drawn from relationships between product search and
product sales, search for jobs and unemployment rates, search for a flu medicine and
incidence of flu in a given region and time, and search for cancer and incidence of
cancer.
In 2009, Hal Varian, then Google’s Chief Economist, and Hyunyoung Choi
showed that short-term forecasts of automotive and home sales could be signifi-
cantly improved by using Google search data as provided by the Google Trends
framework. Through this example, they demonstrated the potential of employing
Google Trends search data in research. Later, in 2012, they extended their study and
showed the potential of Google Trends in predicting the sales of vehicles and motor parts (Choi and Varian 2012). It was demonstrated that it is possible to create
Google Trends based consumer consumption indicators that outperform most com-
monly used survey-based consumer consumption indicators such as the Michigan Consumer Sentiment Index (Vosen 2011). It has been confirmed that Google search
data can improve forecasting models in general, and specifically that it can improve
commercial real estate price and demand forecasts (Marian et al. 2014).
Google Search seems to reveal the intentions of user actions. For example, there
is correlation between Google search volume for marijuana and consumption of
marijuana by youth (Cavazos-Rehg et al. 2015). With the use of Google Trends it is
also possible to predict, with a reasonable margin of error, box office movie results
days, or even weeks, before the showings (Goel et al. 2010). It was also shown that
Google Search Volume Index (SVI) can measure the retail market investor desire to
buy certain stocks (Da et al. 2011). Jun et al. (2014) have developed a model for
forecasting the sales of Toyota Prius based on search traffic and some environmental
variables. Our study expands these typical search-volume based models by incorpo-
rating search-volume based competitive variables. For example, to forecast Toyota
Prius sales, it would be beneficial to see where the sales of direct competition such
as Chevrolet Volt and Honda CR-Z are going. In general, such competitive factors
tend to be omitted in most studies utilizing search traffic for consumer choice esti-
mation and forecasting. This study addresses this gap by assessing the impact of
competitive factors on the ability to forecast performance of three main competitors
in an oligopolistic market. The introduction of a competitive factor is especially
important in industries that are characterized as oligopolies, because oligopolies are
characterized by a high level of mutual interdependence between firms (Goel 2007),
thus changes in the offering of one firm may affect the performance of the others. For
example, if a particular business improves its offering or lowers price and the others
do not, then that business increases its sales at the expense of the others. In the
Canadian wireless communications industry there exist a limited number of com-
petitors, and barriers to entry are high. In fact, about 90% of the market is controlled
by three main players. Our models aim to capture the degree of interdependence
between the provider under consideration and its direct competition by adding com-
petition variables to the prediction. If there is a high degree of interdependence
between the providers as expected, then the competitive variables should have sig-
nificant impact on market performance.
One weakness of search volume analytics is the fact that we know the scale of the
interest, but are unaware of the underlying sentiment. One solution to this problem
would be analyzing the combination of search terms with sentiment indicators
(Shawn and Stridsberg 2015). Such an approach can be used to predict financial
market changes such as financial index moves or financial crises (Preis et al. 2013).
Similarly, Google Trends data were used to create an investor sentiment index that
was able to closely reflect investor sentiment as obtained by traditional means (Beer
et al. 2013). Another approach to estimate user sentiment related to their topics of
search interest is the use of the collective wisdom available in social media. The
next section reviews research on that latter stream.
Recently, there have been a number of big data predictive analytics studies that
incorporated Twitter Analytics data. These include predictions of stock market
moves, financial markets, election results, crime volumes, health outbreaks, or even
unemployment. In 2013, Matthew S. Gerber attempted to answer the question of
whether Twitter data can be used to predict crime in a large US city (Gerber 2014).
He argued that such information could assist decision making processes. That study
was not able to identify a significant correlation, but when using Twitter data to sup-
port other crime predicting models, it was able to improve crime prediction in 19
out of 25 crime types. One of the main reasons why it was difficult to obtain an
effective prediction framework could be the fact that tweets cannot be easily grouped
by location on as granular a basis as neighborhood-by-neighborhood in a
given city. Possibly better results could be obtained if GPS-tagged tweets were used
in the study.
Research aiming to predict election results from social media has indicated that
the volume and sentiment of electoral party mentions on Twitter reflect the election
result for that party (Tumasjan 2011). It has also been shown that the size of the
social media network of an electoral candidate can be indicative of an election result
if the election is “closely” contested (Cameron et al. 2015).
In the world of commerce, services and retail, it has been established that a par-
ticularly effective way of affecting sales is word-of-mouth (WOM) (Engel et al.
1969). Not surprisingly, social networking sites such as Twitter and Facebook have
been identified as possible environments where WOM could be shared. It was
established that microblogging websites are widely used as a form of electronic
word-of-mouth (eWOM) with regards to a brand, and they disseminate brand senti-
ments among the members of social networks (Jansen 2009). Thus, it can be
expected that Twitter and other microblogging websites allow sharing of user senti-
ment about a certain product, and can influence user purchasing decisions. Not sur-
prisingly, some studies try to predict sales based on Twitter data analysis. It was
demonstrated that social media data can be used to predict quarterly sales of iPhones
with an average error of 5–10% (Lassen 2014). It is important to note that sentiment
analysis tools and techniques are still evolving and currently there are no conclusive
studies that would indicate which approach to sentiment analysis, especially for
short texts such as tweets, is the most effective (Lak and Turetken 2014).
To reiterate, the primary objective of our research is to show that the big data
archived by Google Trends and Twitter can be utilized to give business managers an
early indication of how well they are doing against their direct competition at any
point in time. The Canadian wireless telecommunications industry is dominated
(90% market share) by three major players, namely Rogers Wireless, Telus Mobility
and Bell Mobility with over eight million subscribers each. The only independent
company that tries to compete with the big three on a Canada-wide front, and is
present in more than four provinces, is Wind Mobile Corporation. Its size is about a
tenth of the size of a single provider from the big three, and currently sits just above
800 thousand subscribers. The other competitors are either constrained geographi-
cally (Sasktel, MTS), or too small to be considered serious contenders (Windmobile,
Mobilicity and others). Therefore, in this study, we consider only Rogers Wireless,
Telus Mobility, and Bell Mobility.
Because the barriers to entry are high, the market is already significantly pene-
trated, and there is a limited amount of new consumers, new subscribers that join
one competitor are likely to be ones that leave another competitor. In this highly
regulated landscape, the management for each wireless telecommunication com-
pany faces three main concerns:
• To retain as many of their existing subscribers as they can,
• To attract existing subscribers from the competition, and to a lesser degree
• To attract first-time wireless subscribers.
Success in all three of the objectives above can be measured by the change in
total subscribers in any given quarter. This indicator is called “net new subscribers”
(NNS). Each of the main wireless services carriers in Canada publishes the number
of net new subscribers they were able to attract quarterly. Therefore, each provider
has the opportunity to compare their net new subscribers against the competition,
more or less, every 3 months. Three months, in some cases, can prove to be too long.
If a provider is not doing well compared to the competition, an early detection of a
threat could lead to early intervention through marketing and service-offering
adjustments. In addition, in this type of industry, what is important besides the
growth of customers, or subscribers, is the actual market share. If a company has a
decent growth in subscribers, but is losing its market share overall, this can hardly
be considered a success. Therefore, it is beneficial for any provider to be able to
determine how well it is doing when compared to the competition. It is not just its
own NNS figure, but also its current position against the competition that is impor-
tant. If a decision maker can identify which competitor has made the greatest prog-
ress in a given period of time (e.g. a month), then (s)he can try to understand what
stands behind their success, and try to balance this with his/her company’s offering
and marketing strategy in a more timely fashion.
Fig. 11.1 NNS over time for Bell, Rogers and Telus (quarterly, 2008Q3–2015Q1)
To serve as the basis of historical values for the dependent variable, the quarterly
results for net new subscribers, for each of the Canadian wireless carriers are widely
available. The data for years 2006 through 2015 can be collected through Canadian
Wireless Telecommunications Association (CWTA). Therefore, for each company,
we can collect net new subscriber data for 37 quarters. The dependent variable that
is used in this study is net new subscribers (NNS):

NNS(q) = TotalSubscribers(q) − TotalSubscribers(q − 1)  (11.1)
Fig. 11.2 SVI over time for Bell, Rogers and Telus
Rogers is, and historically has been, the largest wireless services provider in Canada.
This may cause most of the competitors to aim their offering and marketing cam-
paigns against the leader, resulting in a higher average customer loss for Rogers
than for the other two competitors.
One of the main premises of this study is that trends in market data can be mod-
eled by Google Trend data as was done in previous literature. It is possible to obtain
search volume data for each of the quarters, and for each of the competitors through
Google Trends. Google Trends search volume data can be fine-tuned to a particular
region and industry. One characteristic of the data is the fact that it is approximated
relative to the collection of all searches. In general, what Google Trends returns is a
search volume index (SVI) where it returns a number for a given time range, relative
to the highest search volume for all the included search terms. Therefore, only one
search volume result would have an SVI of 100, that is, the highest one. The rest of
the SVIs would be between 0 and 100. For our research we selected Canada as the
geographical location, “Internet and Telecom” as the category, and “Mobile and
Wireless” as the subcategory. It is important to note that while Google is quite accurate in assigning search queries to geographical locations, the algorithm it uses for
sorting the searches into various categories and subcategories is not guaranteed to
be fully accurate as it infers the category based on the search terms the users enter.
Finally, we used “Bell”, “Rogers” and “Telus” as the search terms we wished to compare. Google Trends only provides weekly data; therefore, we summed the weekly results and grouped them into quarters. Figure 11.2 displays SVI trends for each provider.
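As a hedged illustration of this aggregation step (assuming a hypothetical weekly export with a date column and one SVI column per provider), the weekly values can be summed into quarters roughly as follows.

```python
# Sketch of the weekly-to-quarterly SVI aggregation; file and column names are assumptions.
import pandas as pd

svi = pd.read_csv("google_trends_weekly.csv",    # assumed columns: week, Bell, Rogers, Telus
                  parse_dates=["week"]).set_index("week")

quarterly_svi = svi.resample("Q").sum()          # sum weekly SVI values into calendar quarters
print(quarterly_svi.head())
```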
Based on previous literature, it is reasonable to expect that there will be a positive
correlation between the number of searches that included a given provider name and
its ability to attract net new subscribers. However, as seen in Fig. 11.2, Google
Trends data do not exhibit the same pattern of seasonality as the NNS data displayed
in Fig. 11.1. This makes it essential to include a seasonality factor in the formulation
of subsequent models. It is possible that other industries do not exhibit such season-
ality, and a seasonal factor is not needed in a more general model whereas in this
Table 11.1 Average seasonal weight factor for each provider (Eq. 11.3)
Bell Rogers Telus
ASWF.Q1 0.153 0.112 0.462
ASWF.Q2 0.757 1.016 1.051
ASWF.Q3 1.493 1.796 1.226
ASWF.Q4 1.596 1.076 1.261
industry, the first quarter of the year seems to be a low season for all the providers
and quarters 3 and 4 are where the highest number of net new subscribers is regis-
tered. The fourth quarter includes December holidays, and is characterized by the
highest levels of consumer spending, which translates to higher NNS in the wireless
industry. The third quarter is the back-to-school and back-to-work quarter, which often leads to higher sales in the retail and services industries. Quarter one is generally the
slowest season for retail and services as the consumers pull back from the high
spending that characterizes quarter four. In addition, quarter one may have the high-
est number of deactivations or cancellations due to some “over-purchasing” in
December.
The Seasonal Weight Factor (SWF) was approximated by the averages for all the
known historical quarters. The average seasonal weight ASW for each of the four
quarters can be obtained by calculating the average NNS for each quarter, (ANNS(q))
and dividing it by the sum of averages for each quarter as follows:
ASW(Qi) = ANNS(Qi) / [ANNS(Q1) + ANNS(Q2) + ANNS(Q3) + ANNS(Q4)]  (11.2)
If there is no seasonality, that is if all four quarter ANNS(q) are equal, then the
formula will always result in a value of 0.25. Therefore, the respective average sea-
sonal weight factor (ASWF) can be calculated by dividing the ASW by 0.25:
ASWF(q) = ASW(q) / 0.25  (11.3)
The average seasonal weight factors for each provider and for each quarter are
shown in Table 11.1. Once again, if there was no seasonality, all of the values in the
table would be equal to 1. Therefore, the average of the ASWF values over the four
seasons for each competitor is 1.
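A minimal sketch of Eqs. (11.2) and (11.3), assuming a hypothetical quarterly NNS file derived from the CWTA figures (a 'quarter' column such as '2010Q3' plus one NNS column per provider), is shown below.

```python
# Minimal sketch of Eqs. (11.2)-(11.3); file layout and column names are assumptions.
import pandas as pd

nns = pd.read_csv("nns_quarterly.csv")           # hypothetical CWTA-derived quarterly NNS data
nns["q"] = nns["quarter"].str[-2:]               # 'Q1' ... 'Q4'

for provider in ["Bell", "Rogers", "Telus"]:
    anns = nns.groupby("q")[provider].mean()     # ANNS(Qi)
    asw = anns / anns.sum()                      # Eq. (11.2)
    aswf = asw / 0.25                            # Eq. (11.3); all values equal 1 if no seasonality
    print(provider, aswf.round(3).to_dict())
```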
To predict NNS using the seasonality factor alone, we formulate our base model, Model 1:

NNSi(q) = B0 + B1 · ASWFi(q)
As the next step, we formulate a model using both time series (ASWF) and
causal (SVI) variables to attempt improving predictive power by capturing both the
seasonality and the trend observed in NNS. Empirical results in previous literature
suggest a linear relationship between SVI and NNS. There is no theoretical reason
to argue that this relationship should be nonlinear either; therefore, Model 2 is formulated as a linear model as follows:

NNSi(q) = B0 + B1 · ASWFi(q) + B2i · SVIi(q)
The next step is to add competitive factors to our model. In an industry like this
where there are a fixed number of competitors and a fairly stable target market, the performance of one player is expected to be affected by the performance of the competition. This leads us to believe that there will be a correlation, very likely negative, between the target company's NNS and the Google search volumes of its competitors.
Following from our discussion of the market, each target provider (i) has two
direct competitors (j, k). We model the relationship between a company's market movement in the presence of its competitors with Model 3:

NNSi(q) = B0 + B1 · ASWFi(q) + B2i · SVIi(q) + B2j · SVIj(q) + B2k · SVIk(q)
where the added term B2j is the coefficient for the search volume index for provider
j (competitor 1), B2k is the coefficient for the search volume index for provider k
(competitor 2), SVIj (q) is the Google search volume index for provider j for quarter
q, and SVIk (q) is the Google search volume index for provider k for quarter q.
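Under the assumption of a quarterly data frame with hypothetical column names, Models 1 through 3 for one target provider (here Bell) could be estimated with ordinary least squares roughly as follows; this is a sketch, not the SPSS procedure used by the authors.

```python
# Sketch of estimating Models 1-3 for one target provider; data frame and column
# names are assumptions for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("model_inputs.csv")   # hypothetical: one row per quarter with NNS_Bell,
                                       # ASWF_Bell, SVI_Bell, SVI_Rogers, SVI_Telus

model1 = smf.ols("NNS_Bell ~ ASWF_Bell", data=df).fit()
model2 = smf.ols("NNS_Bell ~ ASWF_Bell + SVI_Bell", data=df).fit()
model3 = smf.ols("NNS_Bell ~ ASWF_Bell + SVI_Bell + SVI_Rogers + SVI_Telus", data=df).fit()

for name, fit in [("Model 1", model1), ("Model 2", model2), ("Model 3", model3)]:
    print(name, "adjusted R2:", round(fit.rsquared_adj, 3))
```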
The sentiment (ST) expressed by consumers about a provider in their online activity
could indicate whether the volume of search expressed in the trends is a positive or
negative indicator. By their very nature, web search tools are indexing engines: they
only store pointers to content in their databases. As is more than typical, when web
content changes (or is altogether removed) over time, so do the results to a query. As
a result, it is impossible to replicate the results to a historic web query, which means
one could not reproduce the content to which a consumer was exposed when they
searched for a certain term in the past. Therefore, as noted in the literature
review section, a shortcoming of using Google Search Volume (SV) as an indicator
for sales or consumption is the fact that the volume information does not carry with
it the sentiment of the search results. Even though the volume does reflect interest
in a product or a brand, we miss information about the nature, i.e. the sentiment, of
the interest.
Meanwhile, content on social media, especially on Twitter, is not overwritten
unless a user specifically chooses to do so. The Twitter Analytics database holds all the "non-deleted" tweets going back to the year 2006. Therefore, past sentiments regarding a
given topic can be identified by analyzing Twitter data. Historical data from Twitter
is not freely available; however, the Twitter website makes it possible to run search
queries specifying dates and key search terms. This led us to collect historical tweets
from the Twitter website through a series of individual searches. Although very
time-consuming, this process proved to be viable for collection of tweets. One
weakness of the Twitter data is the fact that very few of the historical tweets can be
localized, which means the Twitter searches we performed were “global” in nature.
To narrow down the results, full names of the companies were included in the search
query instead of just the short name. For example, instead of just searching for
“Bell”, the search included “Bell Mobility” as a search term. By examining the content and the authors of the returned tweets, it was verified that the results were relevant and originated from Canadian users.
After the tweets for each of the three competitors for every quarter were col-
lected, it was noted that the volume of tweets that were returned for years prior to
quarter 3 (Q3) of 2008 were rather negligible. This could be due to the fact that
Twitter adoption in Canada did not reach significant scale before this date. Therefore,
the time span of the analysis was narrowed down to 27 quarters, from Q3 of 2008 to
Q1 of 2015.
After the initial processing of the tweets as described, sentiment analysis of the
tweet contents was performed. For the purposes of this study, a sentiment analysis
tool called “SentiStrength” was used. SentiStrength returns two scores for each
Tweet it analyzes: a negative score, a number between −1 and −5, and a positive
score, a number between 1 and 5 where “–5” is the most negative score and “5” is
the most positive score. For each tweet, these two scores were added to obtain a
single sentiment score in the range of −4 to 4. About 1000 scores were reviewed
manually and it was observed that scores between −1 and 1 were mostly associated
with tweets that were perceived to be “neutral”. Therefore, a decision was made to
classify the tweets into Negative, Positive and Neutral categories. Any tweet with a
score below −1 was considered Negative, a score between −1 and 1 (inclusive) was
Neutral, and a score greater than 1 was Positive. Positivity to negativity ratio, which
is simply the number of positive tweets divided by the number of negative tweets
(Pos/Neg), is a commonly used measure of the general sentiment of a collection of
tweets. Another way of tallying sentiment scores is to use the ratio of positive tweets
to tweets with opinions. This approach ignores neutral tweets for sentiment calculation:

Sentiment ST = Pos / (Pos + Neg)  (11.4)
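A minimal sketch of the classification rule and the ratio in Eq. (11.4), using a few invented (positive, negative) SentiStrength-style score pairs as dummy data, is given below.

```python
# Sketch of combining SentiStrength-style scores and computing the ratio in Eq. (11.4).
scores = [(3, -1), (1, -4), (2, -1), (1, -1), (4, -2)]   # pos in 1..5, neg in -5..-1 (dummy data)

pos = neg = 0
for p, n in scores:
    combined = p + n                  # single score in the range -4 .. 4
    if combined > 1:
        pos += 1                      # Positive
    elif combined < -1:
        neg += 1                      # Negative
    # scores between -1 and 1 (inclusive) are Neutral and ignored here

st = pos / (pos + neg) if (pos + neg) else None           # Eq. (11.4)
print("ST =", st)
```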
In the last model we formulated, we included this sentiment variable in an addi-
tive fashion as none of the multiplicative models yielded desirable results.
Formulating a linear model also simplifies the subsequent analyses and makes this
model comparable to the previous three. The resulting model, Model 4, is as follows:

NNSi(q) = B0 + B1 · ASWFi(q) + B2i · SVIi(q) + B2j · SVIj(q) + B2k · SVIk(q) + B3 · STi(q)
where the new term STi(q) is the Twitter sentiment for provider i in quarter q as
displayed in Eq. (11.4).
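Continuing the earlier sketch (the same hypothetical data frame, now assumed to carry an ST column computed per Eq. 11.4), Model 4 would simply add the sentiment term to the regression formula.

```python
# Sketch of Model 4; the data frame and column names remain assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("model_inputs.csv")             # assumed to include ST_Bell
model4 = smf.ols("NNS_Bell ~ ASWF_Bell + SVI_Bell + SVI_Rogers + SVI_Telus + ST_Bell",
                 data=df).fit()
print("Model 4 adjusted R2:", round(model4.rsquared_adj, 3))
```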
11.4 Results
IBM SPSS Statistics software was used to analyze the data. Normal P-P plots showed
that the distribution of the NNS variables is reasonably normal (Fig. 11.3). The
models were estimated separately for all three wireless service providers: Bell,
Rogers and Telus through multiple linear regression models based on historical
quarterly NNS data. The relationships between these players are complex and
diverse, which implies competitor one may be affecting competitor two in a com-
pletely different way than it affects competitor three. The predictive power of the
models for Bell, Rogers and Telus are in Tables 11.2, 11.3 and 11.4 respectively,
while Tables 11.5, 11.6, and 11.7 provide the regression analysis results. Analysis of
the independent variables indicates a high level of collinearity between BellSVI,
RogersSVI and TelusSVI. The collinearity of provider SVIs is not a significant issue
for this study, because we are not trying to assess how much each of the variables
affects the final output, but rather their combined impact on our ability to predict
NNS for each quarter. In other words, our models are predictive rather than exploratory; therefore, collinearity is not a severe violation. However, this also implies that
the regression coefficients for the competitor models (Models 3 and 4) should not
be used to make any conclusions about the effect of competition in the market.
For Bell, as seen in Table 11.2, if only the seasonal weight factor is used as an
independent variable, the adjusted R2 is 0.73 indicating that seasonality plays a
major role in determining market share. When we add Google Trends SVI to the
model, the adjusted R2 increases to 0.838; this is a substantial change. When vari-
ables representing competition (TelusSVI and RogersSVI) are added, we see a
minor increase in the adjusted R2 to 0.84. Finally, the addition of the Twitter senti-
ment decreases the adjusted R2 to 0.839.
Interestingly, for Rogers, the model with the seasonal factor only (RogersASWF)
explains only about 0.419 of the variance (R2) in NNS (Table 11.3). When Google
Trends SVI for Rogers is added to the model, the adjusted R2 is more than doubled
to 0.861, and standard error of the estimate is more than halved to 33,297. After
adding the variables representing competition to the model, we notice a sizeable
improvement in adjusted R2 to 0.916, and standard error of the estimate goes down
to 25,894. Yet again, after adding the Twitter sentiment factors, the adjusted R2 goes
down to 0.913.
Finally, for Telus, it is observed that the Seasonal Weight Factor alone explains
about 80% of the variance in the dependent variable (Table 11.4). Seasonality seems
to play a very significant part in Telus’s ability to attract new users and retain exist-
ing ones. Once again, we see that adding the Google Trends SVI for Telus improves
the model and increases the adjusted R2 to 0.943. The addition of competition vari-
ables, as expressed by RogersSVI and BellSVI, is able to improve the model, but
only slightly, to an adjusted R2 of 0.951. Finally the addition of Twitter Sentiment
(TelusST) has no impact on the model.
11.5 Discussion
Our results show that Bell NNS performance is the least susceptible to competition
variables or to the performance of the competition in general. Generally speaking,
Bell is the oldest and most renowned telecommunications provider in Canada;
therefore it has a solid base of loyal customers that are unlikely to switch to other
competitors. In addition, Bell is the largest wired telecommunications provider in
Canada, and it has the most advanced fiber optics base TV service. Hence customers
can enjoy additional benefits when bundling their wireless and wired solutions with
the same provider.
Rogers’ performance, on the other hand, seems to be fairly sensitive to the per-
formance of the other two providers. This can be explained by the fact that both
Telus and Bell are aiming to capture market share from the leader, which is Rogers. Their marketing efforts and diverse offers are often targeted at Rogers' customers, which leads to higher Rogers deactivation and churn rates. It is interesting
to observe that the characteristics of the competitive landscape in the Canadian
wireless telecommunications industry are reflected in the impact of the competitive
factor on the prediction accuracy of our models.
The four models tested in this study are fairly robust and easy to justify. Yet, we
are aware that there are many more variations of the presented models that could
have been developed and compared. The argument for our choice of linear models
was made before. Another variation of the models could be created by aggregating
the time variant variables differently noting that there can possibly be a time lag
between searching for information on the Web and deciding to choose a wireless
service provider. To test whether this time lag had any effect on the models, we
rebuilt the models with 4-week, 2-week and 1-week time lags between the inde-
pendent and dependent variables. When we compare these results with the original
(no time lag) results, we observe that longer time lags result in poorer predictive
capability in terms of adjusted R2. This suggests that the time lag between searching
for information and choosing a wireless solution is less than a week (Table 11.8). It
is not possible to apply time lags that are less than a week because Google search
volume data is available in a week-by-week form. All of the three providers studied
here maintain e-commerce sites through which users can subscribe to the service
immediately after conducting an online search. We do not have the statistics on the
Table 11.8 Time-lagged adjusted R square for Bell, Rogers and Telus (all four models)
Provider—Model   No time-lag   1 week time-lag   2 week time-lag   4 week time-lag
Bell—Model 2 0.838 0.835 0.834 0.829
Bell—Model 3 0.840 0.834 0.832 0.825
Bell—Model 4 0.839 0.834 0.833 0.827
Rogers—Model 2 0.861 0.853 0.836 0.797
Rogers—Model 3 0.916 0.913 0.912 0.898
Rogers—Model 4 0.913 0.911 0.910 0.898
Telus—Model 2 0.943 0.943 0.943 0.939
Telus—Model 3 0.951 0.949 0.944 0.935
Telus—Model 4 0.951 0.949 0.945 0.935
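A rough sketch of the time-lag comparison behind Table 11.8, reusing the earlier hypothetical files and column names, is shown below; it shifts the weekly SVI series before quarterly aggregation and refits one model per lag.

```python
# Sketch of the time-lag test: shift weekly SVI, aggregate to quarters, refit.
# File and column names are assumptions carried over from the earlier sketches.
import pandas as pd
import statsmodels.formula.api as smf

weekly = pd.read_csv("google_trends_weekly.csv", parse_dates=["week"]).set_index("week")
nns = pd.read_csv("nns_quarterly.csv", parse_dates=["quarter_end"]).set_index("quarter_end")

for lag_weeks in [0, 1, 2, 4]:
    lagged = weekly.shift(lag_weeks).resample("Q").sum()   # lag SVI, then aggregate to quarters
    df = nns.join(lagged, how="inner").dropna()
    fit = smf.ols("NNS_Bell ~ ASWF_Bell + Bell + Rogers + Telus", data=df).fit()
    print(lag_weeks, "week lag, adjusted R2:", round(fit.rsquared_adj, 3))
```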
11.6 Conclusions
In this study, we studied the success of readily available web search metadata along
with social media content in predicting the market share of the three major mobile
service providers in Canada. The main contribution of this work is the use of these
data both for the provider of interest (target company) and its competitors, thereby improving prediction accuracy. The impact of the competition variables varies
depending on the competitor. Nevertheless, the results suggest that when nowcast-
ing or predicting operational results for a company with a known and limited set of
direct competitors, it is beneficial to include competition variables based on Google
Trends data. So far, all of the Google Trends studies that analyzed company perfor-
mance such as sales in retail, real estate, or car sales, focused only on Google Trends
or Twitter data for the target company, omitting the data for direct competition. This
study demonstrates that when trying to predict sales of, for example, Toyota
Dealerships, it would be beneficial to include Google Trends values for direct com-
petition such as Honda, Hyundai and Nissan. We also observe that, for an industry
with a limited and stable number of competitors, we could expect the market leader
to be most influenced by the inclusion of competition variables in the Google Search
based model. The results show that Rogers’ performance for a quarter is more
dependent on competition performance than, for example, Bell’s performance.
It comes as a surprise to see that the inclusion of Twitter sentiments did not
improve the performance of the models for any of the three competitors. The exist-
ing body of literature would suggest a strong correlation, but our data for the
Canadian Telco sector indicate otherwise. Some of the possible reasons for this situ-
ation are discussed next.
Due to financial limitations, the historical data for Twitter were obtained by man-
ually entering the queries on the Twitter website and by copying all of the resulting
tweets to a spreadsheet. This is a very time-consuming process; hence, it could only
be done once. Therefore, it was not possible to experiment with various sets of que-
ries to select one that has the best fit for the model. With better access to Twitter
data, this issue could be revisited in the future.
In addition, the Twitter data that were obtained through this laborious process
seemed to have significant levels of variance, where in some quarters the number of
returned tweets would be counted in hundreds, and in others, in thousands. It could
be that the adoption levels of Twitter in Canada were not sufficient in some of the
historical periods to provide an adequate dataset for modelling. Future research
could further explore the impact of sentiments by incorporating additional sources
of data such as Facebook posts, other microblogging website posts, or posts from
forums with wireless provider reviews.
It would also be desirable to include more historical quarters into our model but
the Twitter data prior to year 2008 were nearly non-existent for the queries we used
in our study. It would be beneficial to update the model in the future with new quar-
terly data and to verify that the findings contained herein can be confirmed with a
larger dataset. Ideally, we would wish to utilize monthly, rather than quarterly, data
for the study, but monthly subscriber figures for each of the competitors are not
publicly available. Therefore, for our Model 4 with five predictors, the findings
could not be based on a dataset that is large enough. It is possible that the findings
related to Twitter sentiments could be revised and reinterpreted after applying the
model to a more sizeable dataset. Research on sentiment analysis suggests using context- or topic-specific corpora for automated sentiment analysis (Lak and Turetken
2014). In this study we used the general purpose sentiment analysis engine
SentiStrength. In the future, the tool can be modified to better fit the specific indus-
try being studied.
Another direction for future research is to confirm our findings in other similar
industries around the globe before making any generalizations. There are a number
of other industries that are focused on services and have oligopoly characteristics in
certain geographical areas. For example, the Canadian retail banking industry is dominated by five top players: RBC, TD, Scotiabank, CIBC and BMO. Likewise, the US
wireless provider market is dominated by four top players: Verizon, AT&T, Sprint,
and T-Mobile. Our models can be replicated for these markets to identify whether
they are generalizable or have simply captured the idiosyncrasies about the context
of the current study.
The purpose of our research was purely prediction; therefore we did not specifi-
cally emphasize each independent variable’s contribution and significance in the
model. This is mainly due to the fact that the independent variables in the study
cannot be (substantially) manipulated by the managers of the companies explored.
As such, collinearity and dependence of error terms were not a specific concern as
long as the overall predictive power of the models was satisfactory. Future studies
where some of the social media related variables can be significantly manipulated
should ensure avoiding the violation of regression assumptions to be able to explain
the individual factors’ role in influencing the overall change in the dependent
variables.
One advantage of using Google Trends as a surrogate for potential consumer
interest is the fact that Google Trends captures nonlinearities in the consumer search
patterns, hence making simple linear models such as those presented in this chapter
feasible. As a result, the predictive power of our models was very high. However,
market share may not be as easily predictable in other contexts with longer time
periods and market volatility. Further research should explore whether nonlinear
time variant models such as Markov chains may be needed for web and social media
based prediction of market success.
Biographies
References
Beer F, Hervé F, Zouaoui M (2013) Is big brother watching us? Google, investor sentiment and the
stock market. Econ Bull 33(1):454–466
Cameron MP, Barrett P, Stewardson B (2015) Can social media predict election results? Evidence
from New Zealand. J Polit Market:1–17
Cavazos-Rehg P, Krauss M, Spitznagel E, Buckner-Petty S, Grucza R, Bierut L (2015) Monitoring
marijuana use and risk perceptions with Google Trends data. Drug Alcohol Depend
146:e242–e243
Chandukala SR, Dotson JP, Liu Q, Conrady S (2014) Exploring the relationship between online
search and offline sales for better “nowcasting”. Cust Need Solut 1(3):176–187
Choi H, Varian H (2012) Predicting the present with Google Trends. Econ Rec 88(s1):2–9
Da Z, Engelberg J, Gao P (2011) In search of attention. J Financ 66(5):1461–1499
Engel JF, Blackwell RD, Kegerreis RJ (1969) How information is used to adopt an innovation.
J Advert Res 9(4):3–8
Gerber MS (2014) Predicting crime using Twitter and Kernel density estimation. Decis Support
Syst 61:115–125
Goel RK (2007) Oligopoly. In: Kaliski BS (ed) Encyclopedia of business and finance. Macmillan,
Detroit, MI, pp 558–559
Goel S, Hofman JM, Lahaie S, Pennock DM, Watts DJ (2010) Predicting consumer behavior with
web search. Proc Natl Acad Sci U S A:17486–17490
Internet Live Stats (2015) Total number of websites. http://www.internetlivestats.com/total-
number-of-websites
Internet World Stats (2014) World Internet users statistics. http://www.internetworldstats.com/
stats.htm
Jansen BJ (2009) Twitter power: Tweets as electronic word of mouth. J Am Soc Inf Sci Technol
60(11):2169–2188
Jun S-P, Park D-H, Yeom J (2014) The possibility of using search traffic information to explore
consumer product attitudes and forecast consumer preference. Technol Forecast Soc Chang
86:237–253
Lak P, Turetken O (2014) Star ratings versus sentiment analysis—a comparison of explicit and
implicit measures of opinions. In: Proceedings of the 47th Hawaii international conference on
system sciences (HICSS). IEEE, pp 796–805
Lassen NB (2014) Predicting iPhone sales from iPhone tweets. In: IEEE 18th international enter-
prise distributed object computing conference. IEEE, pp 81–90
Marian AD, Nicole B, Wolfgang S (2014) Sentiment-based commercial real estate forecasting with
Google search volume data. J Prop Invest Financ 32(6):540–569
Preis T, Moat HS, Stanley HE (2013) Quantifying trading behavior in financial markets using
Google Trends. Sci Rep 3:1684
Shawn KJL, Stridsberg D (2015) Feeling the market’s pulse with Google Trends. Int Fed Tech
Anal J
Statistic Brain (2015a) Facebook statistics. http://www.statisticbrain.com/facebook-statistics
Statistic Brain (2015b) Twitter statistics. http://www.statisticbrain.com/twitter-statistics/
Tumasjan A (2011) Election forecasts with Twitter: how 140 characters reflect the political land-
scape. Soc Sci Comput Rev 29(4):402–418
Vosen S (2011) Forecasting private consumption: survey-based indicators vs. Google Trends.
J Forecast 30(6):565–578
Chapter 12
Scale Development Using Twitter Data:
Applying Contemporary Natural Language
Processing Methods in IS Research
David Agogo and Traci J. Hess
12.1 Background
In the past two decades, significant advances in computer science and related disci-
plines have unleashed the power of algorithms and advanced hardware on the clas-
sical problem of interpreting spoken and written text. Methods for machine
translation, speech recognition and speech synthesis have been widely applied in
consumer products such as spoken dialogue systems (SDS) (e.g., Apple’s Siri,
Amtrak’s Julie, and Microsoft’s Cortana), and a host of other technologies,
including the mining of social media data for various purposes (Hirschberg and
Manning 2015). There is also a vast amount of digital data now available for the
purpose of understanding human behavior, an emerging area called computational
social science (Lazer et al. 2009). This paper aims to demonstrate a potentially use-
ful application of big data in information systems (IS) research, with applicability
for social science research in general.
Like many other academic disciplines, the IS field has a growing interest in the
use of big data (Agarwal and Dhar 2014). Further, the IS field is known for its best
practices in the development and validation of measurement scales in empirical
research (e.g., Boudreau et al. 2001; MacKenzie et al. 2011; Straub et al. 2004). The
purpose of this paper is to demonstrate how big data can be used to develop better
measurement scales using an IS scale as an example. While IS research has focused
on scale validation issues such as construct validity and reliability, content validity
has received little attention. Content validity refers to “the degree to which items in
an instrument reflect the content universe to which the instrument will be general-
ized” (Straub et al. 2004, p. 424). Content validity is a property of a set of items or
measures taken together (Anderson and Gerbing 1991), and can be difficult to
establish due to challenges in sampling the domain of interest (Hinkin 1995, 1998;
Rossiter 2002).
In this paper, natural language processing (NLP) methods are applied to data collected from the microblogging website Twitter, with the objective of identifying
frequently occurring themes in affective evaluations of technology, in order to sup-
port the development of a content valid measurement scale. Both positive and nega-
tive evaluations are considered, thus we refer to this scale as the technology hassles
and delights scale (THDS). Further, semi-structured interviews on affective evalua-
tions of technology are also carried out to evaluate the relative performance of both
approaches at identifying the most prevalent elements in this domain. By employing
both Twitter data and interview data, separately and collectively, towards the devel-
opment of a measurement scale, we hope to demonstrate the capacity for contempo-
rary big data and big data methods to contribute to scale development practices in
IS research. After all, as acknowledged by Agarwal and Dhar (2014, p. 445), the big
data opportunity in IS research enables us to “address the same types of questions
as we have in the past but with significantly richer data sets”.
Beyond the methodological contributions of this paper, the IS scale being devel-
oped (THDS), is also important. This scale captures both positive and negative
events experienced when interacting with computers and phones, and has the poten-
tial to improve our understanding of the process-based factors that lead to affective
evaluations of technology (e.g. computer anxiety, technostress, flow and enjoyment)
(Zhang 2013). THDS may also shed light on the user experiences that lead to
switching from one platform to the other (e.g. from Windows to Mac or from
Android to Apple iOS). In the remainder of the paper, a background on scale devel-
opment and the social media data source are presented, after which the NLP meth-
ods used in this paper are introduced. The analysis and results are then reported,
followed by a discussion of the next steps in this research project.
This paper employs data obtained from a leading social networking and micro-
blogging service, Twitter. Founded in 2006, Twitter is a service that allows users
to post and read short messages (maximum of 140 characters). These “tweets”,
as they are called, are typically public and visible to anyone online. The success
of Twitter can be attributed in part to its wide global reach and support for
establishing weak ties, with the “promise of transcending distance, connecting
everyone with anyone” (Takhteyev et al. 2012, p. 81). Beyond that, however,
Twitter creates ambient awareness between people in social networks and pro-
vides a platform for virtual exhibitionism and voyeurism for active contributors
and passive observers (Kaplan and Haenlein 2011). Ambient awareness is a form
of awareness facilitated by the exchange of fragments of information online
which can result in high levels of social presence and media richness between
individuals on social media (Kaplan and Haenlein 2011). The support for ambi-
ent awareness and virtual exhibitionism/voyeurism makes Twitter a unique source
of data for researchers seeking an unfettered glimpse into the daily lives of peo-
ple around the world.
The public nature of tweets enables users to engage in self-disclosure and self-
presentation by sharing their experiences with the world (i.e., while users can
restrict access to their tweets to those who follow them, this is not a common
practice) (Kaplan and Haenlein 2011). These motivations have led to extensive
use of Twitter to disclose information on personal experiences, attitudes, etc. One
important aspect of Twitter is the length restriction imposed by the creators of the
service. By restricting the length of a tweet, information shared by users is more
likely to be focused on a single idea or topic, which enables easy interpretation by
researchers. However, this also leads Twitter users to (1) compress sentences by
omitting words and using unusual spellings to fit within character limits, and (2)
send out multi-part tweets i.e. multiple tweets on a single subject. In addition,
limited context information is made available, since tweets are generally meant to
be read at that instant, and the sender assumes that the audience is already part of
the conversation. In addition to these idiosyncrasies with Twitter data, there are
other issues such as an abundance of spam, automated posts and the relative ano-
nymity of users.
Nevertheless, tweets have been reliably used in aggregate to analyze and even
predict events/trends such as weather (Lampos and Cristianini 2012), box-office
revenues (Asur et al. 2010), consumer confidence and political polls (O’Connor
et al. 2010), politics and election outcomes (Beauchamp 2013; Gayo-Avello
2013), public health concerns such as a flu pandemic (Lampos and Cristianini
2010), and unemployment (Llorente et al. 2015), as well as a myriad of applica-
tions in online marketing. On the individual level, tweets have been used to iden-
tify gender, age, regional origin (Rao et al. 2010), political affiliation
(Pennacchiotti and Popescu 2011), and post-partum changes in affect (De
Choudhury et al. 2013).
information about the use of language can be obtained. For instance, the verbs or
adjectives co-occurring most frequently with a particular noun can be identified and
used to infer the way an object is most frequently described. Such methods have
been widely used for more advanced purposes such as language translation. Lastly,
interpretation is applied to select semantically accurate (i.e., meaningful) phrases from
the anticipated large set of syntactically valid phrases collected. Given
this introduction to the methods to be used, the scale to be created is now introduced
and preliminary results and analysis are presented.
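As a concrete illustration of this kind of co-occurrence analysis, the following minimal Python sketch counts the verbs appearing in tweets that contain a target noun; the tweet representation, coarse tag labels, and example data are illustrative assumptions rather than the study's actual pipeline.

```python
from collections import Counter

# Minimal sketch: count verbs that co-occur with a target noun in POS-tagged tweets.
# Each tweet is assumed to be a list of (word, tag) pairs; "V" marks verbs and "N" nouns,
# following the coarse tag set used by Twitter-specific taggers such as TweetNLP.
def verb_cooccurrence(tagged_tweets, target_noun="phone"):
    counts = Counter()
    for tweet in tagged_tweets:
        words = [w.lower() for w, _ in tweet]
        if target_noun in words:
            counts.update(w.lower() for w, tag in tweet if tag == "V")
    return counts

# Illustrative data only.
tweets = [
    [("my", "D"), ("phone", "N"), ("froze", "V"), ("again", "R")],
    [("love", "V"), ("my", "D"), ("new", "A"), ("phone", "N")],
]
print(verb_cooccurrence(tweets).most_common(5))  # e.g. [('froze', 1), ('love', 1)]
```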
subsequent behavior. Despite this, the study of cognitions has dominated IS and
only recently have efforts been undertaken to identify pertinent categories of
affective concepts (e.g., Loiacono and Djamasbi 2010; Zhang 2013). Recent
conceptual work such as the Affective Response Model (ARM) (Zhang 2013)
provides the foundation for classifying existing affective evaluations towards
technology objects and use, and shedding light on how they come about. For
instance, experiences during use are a source of enduring positive or negative
evaluations of particular technologies or technology in general (Propositions 6
and 7 in Zhang 2013). ARM identifies these factors as process-based factors
(Category 5.1 and 6.1 in Zhang 2013), and thus provides an emerging theoretical
space which can serve as the foundation for THDS.
When creating measurement scales in the affective domain, there needs to be
clarity about the target, intensity and direction of the characteristic (McCoach
et al. 2013). Target refers to the object, behavior, or idea the feeling is directed at,
intensity refers to the degree or strength of the feeling and direction reflects
whether the feeling is positive, neutral or negative. In this case, the target is the
user experience with hands on use of specific technology objects (mobile phones,
tablets and computers). The intensity refers to feelings strong enough to be
expressed, and the direction includes both positive and negative expressions. The
technology hassles and delights scale (THDS) is being developed to test hypotheses related
to at least two important IS research areas (1) usage continuance/switching inten-
tions (Bhattacherjee 2001; Bhattacherjee et al. 2012) and (2) deep structure usage
(Burton-Jones and Straub 2006), answering the call to open the black box of con-
structs (perceived ease of use and usefulness) which drive usage (Benbasat and
Barki 2007). This paper takes the perspective that improving our understanding of
process-based factors is a new approach that can help shed more light on these
principal constructs and may eventually lend itself to theory creation (Goodhue
2007). Leaning on best practices for creating scales in the affective
domain (McCoach et al. 2013), tweets and semi-structured interviews will be
used. The objective of the analysis is to identify major themes from a corpus of
tweets in which people are speaking about three categories of technology: com-
puters, mobile phones and tablets. In parallel, semi-structured interviews will also
be used to generate items, and the resultant themes identified will be compared
and condensed into a single THDS.
The previously discussed methods are now employed towards identifying the main
themes in a corpus of tweets using the steps illustrated in Fig. 12.1 below. Each step
is explained in the sections that follow.
Fig. 12.1 Steps in the Twitter analysis: full collection of tweets → pre-filtering → POS tagging → (word, POS) tweets → syntax-aware n-gram selection → tri-gram lists → identifying themes
The dataset used for this study consists of 146,315,059 tweets (Nf) from Jan 1, 2014
to March 31, 2015 (455 days). The keywords used to select tweets included: com-
puter, pc, laptop, desktop, phone, iphone, cellphone, smartphone, tablet, ipad. This
dataset represents an average of 321,571 tweets per day (range of 226,266–704,067).
The maximum number of tweets was on Sept 9, 2014, the day of the iPhone 6 launch.
Initial models and data cleansing approaches were developed on a smaller subset
(N1 = 2 million) due to the large processing requirements of repeated analysis using
the full dataset and are reported alongside the final analysis with the full dataset.
The first level of filtering involved excluding tweets that were retweets (i.e., dupli-
cates of someone else’s tweet), not written in English, or were automatic posts (e.g.,
by game apps). Tweets were filtered using the rich metadata that is part of each
tweet (e.g. time of tweet, the application the tweet was sent from, the language
encoding of the tweet, etc.). Unfortunately, these tags are not always accurate and
may result in a substantial number of false negatives, i.e., incorrectly discarded tweets. The over-
all size of the dataset makes this a less severe issue. This initial filtering yielded
38,076,612 usable tweets (26% of Nf), which were then tagged with TweetNLP
(Owoputi et al. 2013) for subsequent analysis. To verify that TweetNLP appropriately
tagged keywords of interest, tagger accuracy was verified on a random subset of two
million tweets. The nine keywords appeared a total of 1,573,477 times, each being
tagged as either a proper or common noun in at least 70% of appearances. Details
are contained in Table 12.3 below.
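The first-level filtering described above can be sketched as follows. This is a minimal illustration assuming tweets arrive as dictionaries in the standard Twitter API JSON format (with `lang`, `retweeted_status`, and `source` fields); the blacklist of automated sources is a hypothetical example, and the study's exact rules and TweetNLP tagging step are not reproduced here.

```python
# Minimal sketch of the first-level filtering, assuming tweets are dictionaries
# in the standard Twitter API JSON format.
AUTOMATED_SOURCES = ("fitbit", "4sq", "game")  # hypothetical examples of automated app sources

def keep_tweet(tweet):
    if tweet.get("lang") != "en":                          # not written in English
        return False
    if "retweeted_status" in tweet:                        # duplicate of someone else's tweet
        return False
    source = tweet.get("source", "").lower()
    if any(app in source for app in AUTOMATED_SOURCES):    # automatic posts (e.g., game apps)
        return False
    return True

def prefilter(tweets):
    return [t for t in tweets if keep_tweet(t)]

# Usage: usable = prefilter(raw_tweets), where raw_tweets is a list of tweet dicts.
```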
Following this, 5-grams which met the syntactic requirements were selected from
the corpus of tweets. One grammatical syntax structure, the verb phrase, was used
to identify themes. The verb phrase has an action word, a verb, at its core (e.g.,
walking, crash, froze), and also includes a complement or modifiers (e.g., walking
slowly back home, crash my computer, my phone froze). At its simplest, a verb
phrase can capture the principal action in a sentence (e.g., “My phone crashed” is
the verb phrase in the tweet {My phone crashed the second it turned midnight}). A
more complete discussion of the different forms of verb phrases is beyond the scope
of this paper. Since a single tweet may contain multiple verb phrases, adjacency of
the verb phrase to one of the keywords was also a selection criterion. The sequence
for selecting 5-grams from a sample tweet is shown in Table 12.4, and all selected
5-grams, all verb-phrases, and the full source tweets were saved separately. From
the initial set of two million tweets, about half of the tweets (908,367) were found
to meet these criteria and were further analyzed. Some additional filtering was done
at this stage to exclude spam tweets not previously detected. After refining this algo-
rithm on the smaller dataset, the analysis was run on the full data set, leading to
13,089,522 tweets (34% of the usable tweets retained).
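A simplified sketch of the syntax-aware selection step is shown below. It assumes tweets have already been POS-tagged with a coarse tag set (e.g., "V" for verbs) and reduces the verb-phrase criterion to a window that contains both a keyword and a verb; the study's full grammatical rules are more elaborate.

```python
KEYWORDS = {"computer", "pc", "laptop", "desktop", "phone", "iphone",
            "cellphone", "smartphone", "tablet", "ipad"}

def select_ngrams(tagged_tweet, n=5):
    """Return n-grams that contain a keyword and at least one verb ("V" tag).

    This is a simplified stand-in for the syntax-aware verb-phrase selection
    described in the text; the real study applies fuller grammatical rules.
    """
    tokens = [(w.lower(), tag) for w, tag in tagged_tweet]
    selected = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        words = [w for w, _ in window]
        tags = [t for _, t in window]
        if set(words) & KEYWORDS and "V" in tags:
            selected.append(tuple(words))
    return selected
```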
Broadly, the verb phrases identified through this filtering process were selected as
primary themes for preliminary evaluation as either hassles or delights. Different
lists of verb phrases were identified for each technology category, i.e., computers,
phones, and tablets. Running the full sequence of steps in Table 12.4 on the full
dataset of tweets (Nf) yielded 140,942 (phone), 58,649 (computer), and 20,977
(tablet) unique tri-grams that appeared more than once in the dataset. The top 200
of these tri-grams for each category were analyzed for themes. Where necessary, a
random subset of tweets was retrieved to enable better interpretation. Due to space
constraints, only themes from phone-related tri-grams are reported.
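The tri-gram counting and top-200 selection can be sketched as follows; the drop-singletons rule mirrors the description above, while the input format is an assumption of this illustration.

```python
from collections import Counter

def top_trigrams(ngrams_by_category, top_k=200):
    """Count tri-grams per technology category and keep the most frequent ones.

    ngrams_by_category maps a category name (e.g. "phone") to an iterable of
    tri-gram tuples extracted from tweets; tri-grams seen only once are dropped,
    mirroring the filtering applied in the text.
    """
    result = {}
    for category, trigrams in ngrams_by_category.items():
        counts = Counter(trigrams)
        repeated = Counter({t: c for t, c in counts.items() if c > 1})
        result[category] = repeated.most_common(top_k)
    return result
```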
Tri-grams fell under meaningful themes such as operating the device, user
clumsiness with the device, etc. as well as clear affective expressions (hate my
phone…, love my phone…). In both the computer and tablet categories, meaning-
ful themes were also identified. A cursory scan of the tweets under themes such as
user rage and affective expression reveals that more detailed information about the
reasons for these affective evaluations can be obtained and is planned as future
research. The validation conducted using semi-structured interviews is presented
in the following section.
In order to cross-validate the themes derived from the NLP analysis, qualitative data
from semi-structured interviews was collected independently. Participants were
sought from the crowdsourcing platform, Amazon Mechanical Turk, an increas-
ingly common source of data for social science research (Buhrmester et al. 2011;
Steelman et al. 2014). A total of 45 participants completed the survey, responding to
open ended questions asking them to list the most frequently occurring delights and
hassles they had experienced using technology (21 for computers only, the rest for
mobile phones and tablets). The sample was 56% female, 62% had a 4-year college
degree or greater, and had an age range of 19–66 (mean = 36, S.D. = 14). As with
the Twitter data, only the results for phones are presented due to space constraints.
An example question provided is “During daily interaction with smartphones, some
things happen that annoy and irritate you. Using short sentences, list some examples
below.” The responses were coded to identify dominant themes. The themes
Fig. 12.2 Word cloud showing themes from semi-structured interviews for THDS scale. Red:
hassle, Green: delight, Yellow: both hassle & delight; Size represents frequency
Fig. 12.3 Comparison of themes for phone hassles & delights. % represents proportion of
1,711,038 tweets (in top 200 tri-grams) and proportion of 110 interview statements. Shading rep-
resents common themes across the two data sources
identified for phone use are shown in the word cloud in Fig. 12.2 below, with the
size of the theme representing the frequency of occurrence and the color indicating
the direction of the theme.
Finally, the themes from both methods were compared, with the semi-structured
interviews yielding ten distinct themes and the Twitter NLP analysis yielding 12
themes as shown in Fig. 12.3. There was significant overlap between these themes,
with six of ten themes from the semi-structured interviews present in the Twitter NLP
themes. This overlap suggests that Twitter data may be a viable source for conducting
content analysis for new scale development and for validating existing scales. The
non-overlapping themes in the two data sets also provided insight. Unique themes
from the Twitter data were Desire to Own/Purchase and Emotions towards Phone,
which are more affective in content, as compared to Network Quality, Access to
Apps, and Storage Space, which were the themes unique to the semi-structured inter-
views. While the themes presented below represent preliminary analysis of the two
data sources, these findings seem appropriate given the inherent impulsive vs. reflec-
tive nature of these data sources. Twitter supports ambient awareness by enabling
quick exchanges of tweets that describe real-time experiences of users with their
phones. Thus, the Twitter data can be expected to reflect more impulsive, unfiltered
expressions of affect, both positive and negative. In comparison, semi-structured
interviews are initiated with an explanation of the context, and then participants are
asked to express their beliefs or general experiences with their phones. In formulating
responses, interviewees seem to provide recollections based on reflection, which pro-
duce more cognitive, functional aspects of their phone experiences.
Further examination of themes across and within the two data sources may yield
additional insight. For example, the theme of Ease of Use from the semi-structured
interview data and the theme of Operating the Device from the Twitter data, shared
some common concepts. The affective themes from the Twitter data could be ana-
lyzed and decomposed into additional themes based on positive and negative evalu-
ations, or based on some of the dimensions in the ARM framework. Other differences
in themes could be examined based on the nature of the data source (i.e., size limita-
tions), and the relative percentage of total themes across the data sources.
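A simple way to operationalize the comparison of theme lists from the two data sources is sketched below; the theme labels in the usage example are illustrative only (the actual themes appear in Fig. 12.3).

```python
def theme_overlap(interview_themes, twitter_themes):
    """Report shared and unique themes across the two data sources."""
    shared = set(interview_themes) & set(twitter_themes)
    return {
        "shared": sorted(shared),
        "interview_only": sorted(set(interview_themes) - shared),
        "twitter_only": sorted(set(twitter_themes) - shared),
        # share of interview themes also found in the Twitter-derived list
        "coverage": len(shared) / len(set(interview_themes)),
    }

# Illustrative labels only; see Fig. 12.3 for the actual themes.
print(theme_overlap(["ease of use", "battery", "network quality"],
                    ["operating the device", "battery", "emotions towards phone"]))
```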
The foregoing sections introduce a range of NLP techniques and discuss their appli-
cation to a corpus of tweets for the purpose of extracting themes associated with
technology use that can form a process-based experience scale—the THDS. By
applying a syntax-aware filtering approach, lists of tri-grams that capture expressions
related to specific technologies have been identified. The top 200 tri-grams related to
mobile phones were analyzed and condensed into 12 themes. Following that, semi-
structured interviews on the same topic were conducted independently and that data
was also analyzed to arrive at a set of ten themes. Finally, themes from both sources
were compared and a reasonable amount of overlap was discovered. Further, the
potential for even more themes to be identified using further NLP analysis was noted.
These findings are reported for a single category of technology, mobile phones.
This project started from a desire to put big data to the test in the context of scale
creation, as well as a desire to create a much needed measure of process-based fac-
tors associated with technology use. Based on the progress reported thus far, the
analysis of Twitter data using basic NLP techniques does yield themes which
potentially represent delights and hassles in using technology. These themes reasonably
overlap with themes derived independently from semi-structured interviews, the
traditional method of generating items for scale development. Deeper exploration
of relevant tweets is needed, after which a more formal comparison of the themes
identified in the Twitter data and semi-structured interviews will be conducted.
Upon completing this work, the extent to which Twitter data and semi-structured
interviews can be used in combination should be more evident. NLP methods may
in fact become a meaningful tool for scouring large datasets in support of scale and
theory development.
Another critical aspect of this ongoing project is the rationale for how Twitter
data may provide broader coverage than semi-structured interviews. One possible
explanation, supported by preliminary analysis of themes, is that responses to
questions asked in semi-structured interviews are fundamentally different from
unprompted expressions posted to social media. Surveys rely on recollections and
are subject to reflection and thought while tweets are more impulsive. When asked
to recall feelings, survey instruments and interviews are inherently priming the
participant, which may lead to inaccurate evaluations related to the prime (Russell
2003) and even bias the information retrieval process from memory
(Ratcliff and McKoon 1988). On the other hand, extracted tweets which occur
unprompted are likely to represent feelings with high levels of activation at the
time the tweet was created, and are therefore less likely to be subject to priming or
memory biases. Given Twitter users’ desire to create and sustain ambient aware-
ness and virtually exhibit themselves, we expected our Twitter data set to result in
a broad set of events with high levels of activation. The large number of tweets
which contain explicit affective expressions is evidence of this. People are fre-
quently tweeting about loving their phones, wishing to smash their phones, need-
ing or missing their phones, etc., but such affective recollections do not emerge
from the semi-structured interviews.
Finally, an exploration of how these methods may be applied beyond the THDS
is necessary. Guidelines for affective scale development (McCoach et al. 2013) or
the C-OAR-SE scale development method in marketing (Rossiter 2002) might be
informative in determining the boundaries within which this approach might be
suitable. Affective scales need to have a clear target, intensity and direction
(Anderson and Bourke 2000; McCoach et al. 2013). The C-OAR-SE method
requires the specification of the object, attribute and rater-entity associated with a
scale. In the example above, the object was clearly defined (i.e., technology objects
in three categories), the attributes were filtered (i.e. hassles or delights) and the
rater-entity was individual users (who use Twitter). Future work on this subject will
therefore explore how to expand this NLP-based approach to other areas of interest
to IS researchers and other fields. This can potentially contribute significantly to
broader utilization of big data sources in traditional research practices.
Practical applications of the THDS and associated concepts also abound. For
instance, the insights gained from such analyses can inform design priorities when
designing new features of software or hardware. Outside of this specific application,
organizations can develop latent measures of consumer sentiment or consumer atti-
tudes of relevance to business success and track these measures as part of customer
Acknowledgements The authors would like to thank anonymous reviewers from the SIG DSA
2015 Business Analytics Congress for comments that helped streamline and enhance this version
of the paper. In addition, the authors would like to thank Brendan O’Connor, Computer Science
Department, UMass Amherst, and co-creator of TweetNLP, for his guided support during the
development of an earlier manuscript of this paper.
References
Agarwal R, Dhar V (2014) Editorial—big data, data science, and analytics: the opportunity and
challenge for IS research. Inf Syst Res 25(3):443–448
Anderson JC, Gerbing DW (1991) Predicting the performance of measures in a confirmatory factor
analysis with a pretest assessment of their substantive validities. J Appl Psychol 76(5):732–740
Anderson LW, Bourke SF (2000) Assessing affective characteristics in the schools. Routledge,
New York
Asur S, Huberman B et al (2010) Predicting the future with social media. In: 2010 IEEE/WIC/
ACM international conference on web intelligence and intelligent agent technology (WI-IAT),
vol 1. IEEE, pp 492–499
Beauchamp N (2013) Predicting and interpolating state-level polling using Twitter textual data. In:
New directions in analyzing text as data workshop
Beaudry A, Pinsonneault A (2010) The other side of acceptance: studying the direct and indirect
effects of emotions on information technology use. MIS Q 34(4):689–710
Benbasat I, Barki H (2007) Quo vadis, TAM? J Assoc Inf Syst 8(4):211–218
Bhattacherjee A (2001) Understanding information systems continuance: an expectation-
confirmation model. MIS Q 25(3):351–370
Bhattacherjee A, Limayem M, Cheung CMK (2012) User switching of information technology: a
theoretical synthesis and empirical test. Inf Manag 49(7):327–333
Boudreau M-C, Gefen D, Straub DW (2001) Validation in information systems research: a state-
of-the-art assessment. MIS Q 25(1):1–16
Brill E (2000) Part-of-speech tagging. In: Handbook of natural language processing. CRC Press,
Boca Raton, pp 403–414
Buhrmester M, Kwang T, Gosling SD (2011) Amazon’s mechanical turk a new source of inexpen-
sive, yet high-quality, data? Perspect Psychol Sci 6(1):3–5
Burton-Jones A, Straub DW (2006) Reconceptualizing system usage: an approach and empirical
test. Inf Syst Res 17(3):228–246
Churchill GA Jr (1979) A paradigm for developing better measures of marketing constructs.
J Mark Res 16(1):64–73
Clark LA, Watson D (1995) Constructing validity: basic issues in objective scale development.
Psychol Assess 7(3):309
De Choudhury M, Counts S, Horvitz E (2013) Predicting postpartum changes in emotion and
behavior via social media. In: Proceedings of the SIGCHI conference on human factors in
computing systems. ACM, New York, pp 3267–3276
Gayo-Avello D (2013) A meta-analysis of state-of-the-art electoral prediction from Twitter data.
Soc Sci Comput Rev
Ghiselli EE, Campbell JP, Zedeck S (1981) Measurement theory for the behavioral sciences: origin
& evolution. WH Freeman & Company
Goodhue DL (2007) Comment on Benbasat and Barki’s ‘Quo vadis TAM’ article. J Assoc Inf Syst
8(4):15
Hinkin TR (1995) A review of scale development practices in the study of organizations. J Manag
21(5):967–988
Hinkin TR (1998) A brief tutorial on the development of measures for use in survey questionnaires.
Organ Res Methods 1(1):104–121
Hirschberg J, Manning CD (2015) Advances in natural language processing. Science
349(6245):261–266
Hudiburg RA (1989) Psychology of computer use: XVII. The computer technology hassles scale:
revision, reliability, and some correlates. Psychol Rep 65(3f):1387–1394
Hudiburg RA (1992) Factor analysis of the computer technology hassles scale. Psychol Rep
71(3):739–744
Kaplan AM, Haenlein M (2011) The early bird catches the news: nine things you should know
about micro-blogging. Bus Horiz 54(2):105–113
Lampos V, Cristianini N (2010) Tracking the flu pandemic by monitoring the social web. In: 2010
2nd International workshop on cognitive information processing (CIP). IEEE, pp 411–416
Lampos V, Cristianini N (2012) Nowcasting events from the social web with statistical learning.
ACM Trans Intell Syst Technol 3(4):72
Lazer D, Pentland A, Adamic L, Aral S, Barabási A-L, Brewer D, Christakis N, Contractor N,
Fowler J, Gutmann M, Jebara T, King G, Macy M, Roy D, Alstyne MV (2009) Computational
social science. Science 323(5915):721–723
Llorente A, Garcia-Herranz M, Cebrian M, Moro E (2015) Social media fingerprints of unemploy-
ment. PLoS One 10(5):e0128692
Loevinger J (1957) Objective tests as instruments of psychological theory: monograph supplement
9. Psychol Rep 3(3):635–694
Loiacono E, Djamasbi S (2010) Moods and their relevance to systems usage models within organi-
zations: an extended framework. AIS Trans Hum-Comput Interaction 2(2):55–72
MacKenzie SB, Podsakoff PM, Podsakoff NP (2011) Construct measurement and validation
procedures in MIS and behavioral research: integrating new and existing techniques. MIS Q
35(2):293–334
Manning CD (2011) Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In:
Computational linguistics and intelligent text processing. Springer, pp 171–189
Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press,
Cambridge
McCoach DB, Gable RK, Madura JP (2013) Instrument development in the affective domain.
Springer
Nunnally J (1978) Psychometric theory, 2nd edn. McGraw-Hill, New York
O’Connor B, Balasubramanyan R, Routledge BR, Smith NA (2010) From Tweets to Polls: linking
text sentiment to public opinion time series. In: Proceedings of the fourth international AAAI conference on weblogs and social media (ICWSM), pp 122–129
Owoputi O, O’Connor B, Dyer C, Gimpel K, Schneider N, Smith NA (2013) Improved part-of-
speech tagging for online conversational text with word clusters. In: HLT-NAACL, pp 380–390
Pennacchiotti M, Popescu A-M (2011) Democrats, republicans and starbucks afficionados: user
classification in twitter. In: Proceedings of the 17th ACM SIGKDD international conference on
knowledge discovery and data mining. ACM, New York, pp 430–438
Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in twitter. In:
Proceedings of the 2nd international workshop on search and mining user-generated contents.
ACM, New York, pp 37–44
Ratcliff R, McKoon G (1988) A retrieval theory of priming in memory. Psychol Rev 95(3):385
Rossiter JR (2002) The C-OAR-SE procedure for scale development in marketing. Int J Res Mark
19(4):305–335
Russell JA (2003) Core affect and the psychological construction of emotion. Psychol Rev
110(1):145
Smith NA (2011) Linguistic structure prediction. Synth Lect Hum Lang Technol 4(2):1–274
Steelman ZR, Hammer BI, Limayem M (2014) Data collection in the digital age: innovative alter-
natives to student samples. MIS Q 38(2):355–378
Straub D, Boudreau M-C, Gefen D (2004) Validation guidelines for IS positivist research. Commun
Assoc Inf Syst 13(1):63
Takhteyev Y, Gruzd A, Wellman B (2012) Geography of Twitter networks. Soc Networks
34(1):73–81
Zhang P (2013) The affective response model: a theoretical framework of affective concepts and
their relationships in the ICT context. MIS Q 37(1):247–274
Chapter 13
Information Privacy on Online Social
Networks: Illusion-in-Progress in the Age
of Big Data?
Shwadhin Sharma and Babita Gupta
Abstract In the age of big data where vast amounts of data are collected, stored,
and analyzed from all possible sources, the growth of social media and the culture
of sharing personal information have created privacy and security related issues.
Drawing on the prospect theory and rational apathy theory, we present a research
model to investigate why people disclose personal information on Online Social
Networks. This paper analyzes the impact of situational factors such as information
control, ownership of personal information, and apathy on the privacy concerns of
users on Online Social Networks. We describe the proposed research design for col-
lecting our data and for analyzing the data using structural equation modeling.
The findings and conclusions will be presented after the data is analyzed. This work
contributes to network analytics by developing new constructs using Prospect
Theory and Rational Apathy Theory from the fields of behavioral economics and
social psychology, respectively.
13.1 Introduction
The proliferation of social media and web 2.0 is enabling individuals and compa-
nies to engage with digital technologies at an unprecedented scale generating vast
amounts of data, also referred to as “big data”. Big data is characterized by higher
volume, velocity, and variety (the three V’s) of data that usually cannot be handled
by traditional database management tools (Zikopoulos et al. 2012); it is often described
as a massive volume of both structured and unstructured data that is
generated at high velocity, with veracity that adds value to the intended process
(Demchenko et al. 2013; Kshetri 2014).
An Online Social Network (OSN) is an online platform that allows members to
create public profiles within a bounded system, share texts, photos and videos, and
other personal information, and thus, connect, develop, and maintain relationships
(Boyd and Ellison 2007; Ellison et al. 2007). People may use OSNs for many different
reasons including socialization, fun and enjoyment, usefulness in communicating and
interacting with friends, bridging, bonding, and maintaining social capital. Use of
OSNs has created an enormous amount of structured and unstructured data. Indeed,
the recent attention that big data is garnering can be largely credited to the rapid
development of the Online Social Networks (OSNs). As OSNs have provided addi-
tional channels for interpersonal and business communication, huge volumes and
variety of data are being generated for collection, storage, and aggregation from
OSNs. These data can be used by the governments, business organizations, research
agencies, marketing companies, etc. Manyika et al. (2011) estimated the value of big
data for the U.S. medical industry alone to be $300 billion. Companies in various industry
sectors such as healthcare, retail, services market, supply chain and transportation,
entertainment, and marketing and advertising have started to pay close attention to the
big data phenomenon and thus, to OSNs as one of the primary sources of the big data
(Tan et al. 2013). It is also important to note that despite several benefits of OSNs in
the big data environment, the ability of organizations to collect, store, and analyze big
data poses privacy and security related risks for the users.
The interaction of OSNs with users and the generation of big data on OSNs
through these interactions are presented in Fig. 13.1 below. The interactions cre-
ated in OSNs are accessed by many parties such as government, big organizations,
third parties, and consumer and service firms. Figure 13.1 presents a simplified
view of how OSNs act as a source of big data and thus as a source of privacy and
security issues.
The growth of social media and the culture of sharing information have fueled
the proliferation of OSNs such as Facebook, Instagram, Twitter, Google+, and
Pinterest in individuals’ daily life. These OSNs are becoming an important social
platform for computer-mediated communication (Nadkarni and Hofmann 2012) at
an exponential rate—be it for bonding, bridging, or maintaining social capital
(Ellison et al. 2007), or using it as a medium of social interaction and exchanges
(Boyd and Ellison 2007). With Facebook alone having more than one billion mem-
bers (Sharma and Crossler 2014), it is no surprise that almost four-fifths of
Internet users use one OSN or another (Conroy and Williams 2014). The expo-
nential growth of OSNs has brought an intense focus on the privacy and security
issues of its users. OSNs have been plagued with privacy risks such as
surveillance, secondary use of information, and collection of irrelevant information
(Sharma and Crossler 2014). In the context of the big data, this already complex
issue becomes further complicated.
Individuals may feel a threat to their privacy when they lose control of their
personal information. In an online environment, where users feel a certain amount
of anonymity and the OSN providers have the freedom to aggregate and share the
information easily, the issues of privacy and security may be more predominant. In
a social network environment, information privacy may imply the level of identifi-
able information collected by the organization and the possible unauthorized uses of
that information. These privacy concerns can range from information threats such as
digital aggregation and improper access of personal data by third parties to dangers
arising from the social environment such as online stalking, bullying, or leaking of
private data to the world (Hogben 2007). The level of sophistication of technologies
analyzing the big data generated by the OSNs has increased greatly over the last few
years. In addition, cost-effective and innovative forms of collection and processing
of high-volume, high-velocity, and high-variety information assets have brought
privacy and security issues to the forefront (Kshetri 2014). As we start capturing life
in digital reality in online social networks, it becomes easier for people and organi-
zations with the right skills set to build an accurate portrait of our past, present, and
future behavior, without our knowledge. Software such as Rapid Information
Overlay Technology developed for the U.S. defense department (theguardian.com
2013) uses ‘extreme-scale analytics’ to gather information about individuals’ online
social network habits to predict their future behaviors. Internet giants like Google
and Facebook (including Instagram) have been criticized for a long time for the lack
of transparency on what’s being done with the users’ data they collect. An example
of volume, velocity, and variety of data that Facebook stores and can retrieve is the
Facebook Graph Search function that was launched in March 2013. This function
can give answers to user’s natural language queries by combining the big data
acquired from its billions of members and external data into a search engine. These
results can link Facebook activities such as pictures liked, relationship status, and
comments made between a user’s friends from the time they joined Facebook. Big
data and its tools and techniques as used by many OSN companies are
opaque, masked by layers of technical, legal, and physical design (Richards and
King 2013), making the way data are collected and used by these companies question-
able. On top of this, several third-party applications on OSNs also collect
user information in real time. As such, real-time structured and unstructured data
provided and shared on OSNs such as Facebook, Instagram, Twitter, and Foursquare
generally carry privacy risks.
However, even though social media is taking on the role of primary communica-
tion, people, especially the millennials, may be in a state of indifference when it
comes to their privacy (Yoo et al. 2012). Some of these individuals using OSNs may
not be aware of the risk associated with the release of personal information. Others
may have experienced privacy invasion and thus, may not consider their information
to be private anymore (Solove 2008) becoming apathetic towards their own privacy
over time. Some people are comfortable giving up their privacy for patriotic reasons
such as for national security while others believe that they have nothing to hide
anyway as all of their information is already collected by big organizations like
Google or the government (Goitein 2013).
Information Systems (IS) research has focused on privacy and its value. However,
we do not yet fully understand why people, despite valuing privacy, still choose to
freely share their personal information online. Thus, the concept of privacy on OSN
is an interesting one to study as the value of privacy for each individual is situational
in nature (Acquisti et al. 2013)—some users may modulate their privacy boundar-
ies; for others, the definition of privacy itself might vary from one situation or
time to another. Using prospect theory and rational apathy theory, this paper
analyzes the impact of situational factors such as information control, ownership of
personal information, benefits of information disclosure, and apathy on users’
privacy concerns on OSNs.
information on a social network (Hugl 2011; Rosenblum 2007). Despite the privacy
risk, the users still use OSNs and share their personal information (Acquisti and
Gross 2006; Tufekci 2008). This study explores how the introduction and rise of big
data on OSNs would affect the perception of the users toward the privacy concerns
and affect their OSN usage behavior.
It is important to study privacy in relation to big data as most of the hacking and
privacy violations are now on bigger and broader terms. In 2010, Julian Assange
used WikiLeaks to upload 90,000 documents related to the Afghan War and started an
unprecedented big data leak in U.S. military history. Edward Snowden followed
the trend by publishing 20 times as many documents. The data that was leaked pro-
vided a glimpse of how the U.S. government has been performing surveillance
activities on its own citizens as well as leaders around the world such as Angela
Merkel, Germany’s chancellor. Recent big data breaches in Anthem Inc. and Ashley
Madison are bringing a lot of attention to privacy violations as well. Big data has
allowed people to extract implicit, previously unknown, and potentially personally
identifiable information about the individuals.
Prospect theory states that while making decisions, individuals appraise a set of
decision alternatives based on personal heuristics, and then select the alternative
that brings the highest satisfaction and outcome (Keith et al. 2012). However, such
personal decision heuristics may demonstrate bounded rationality (Simon 1982) as
the theory assumes that utility is derived from gains and losses (returns) rather than from
final asset positions. Thus, an individual’s reference point would strongly affect the
choice of their heuristics (Kahneman and Tversky 1979). This phenomenon has
fascinating implications for individuals’ decision to share their personal informa-
tion on OSNs as users compare the utility derived from information sharing to the
loss of information through privacy risk.
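To make the reference-point logic concrete, the following sketch implements a prospect theory value function; the parameter estimates are those commonly cited in the prospect theory literature and are used here only for illustration, not as part of the proposed study.

```python
def prospect_value(outcome, reference=0.0, alpha=0.88, beta=0.88, loss_aversion=2.25):
    """Reference-dependent value function from prospect theory.

    Outcomes are evaluated as gains or losses relative to a reference point,
    with losses weighted more heavily than equivalent gains. The parameter
    values are commonly cited estimates and serve only as an illustration.
    """
    x = outcome - reference
    if x >= 0:
        return x ** alpha
    return -loss_aversion * ((-x) ** beta)

# A perceived privacy loss outweighs an equally sized disclosure benefit.
print(prospect_value(10), prospect_value(-10))   # roughly 7.6 vs. -17.1
```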
In many cases, individuals are rationally apathetic towards a cause. When a voter
feels that his vote would not have any real influence on the conclusion of an election
or change the political scenario, he could develop apathy towards the election.
Similarly, a rational shareholder would not put in the extra effort to go through the
length and complexity of proxy statements unless he feels that his effort will make
a difference (Karuitha et al. 2013). Apathy is basically defined as a state of indiffer-
ence or reasoned assessment where an individual has an absence of interest or
tial for a research model as it removes any confounding variables (Ormond 2014).
Thus, for this research model, age, gender, OSN experience, past privacy invasion,
number of OSN friends, number of years of experience on Facebook, and time spent
on OSN were included as the control variables to see if they impact the dependent
variable.
While privacy apathy is a relatively newer concept in IS, the concept has been gain-
ing momentum as a way to gauge the indifference of a user towards privacy con-
cerns (Sharma and Crossler 2014). With big data being collected and stored by
millions of web sites, applications, agencies, and third parties, individuals may
believe that there is no such thing as privacy in the age of Web 2.0 technologies. As
stated by Mark Zuckerberg, the co-founder of Facebook, in January 2010, privacy is
no longer a “social norm”. Similar sentiments were echoed by the United States
Senate majority leader Harry Reid when he advised people to “just calm down and under-
stand that National Security Agency’s (NSA) PRISM isn’t anything that is brand
new” (csmonitor.com 2013). Similarly, a recent survey showed that almost half of
the Americans take NSA’s PRISM program of data surveillance as “no big deal” as
these people believe that “they’re being tracked all over the Internet by companies
like Google and Facebook” (csmonitor.com 2013). Thus, it is reasonable to hypothesize
that users with privacy apathy place a lower value and price on their personal informa-
tion and thus care less about information disclosure (Yoo et al. 2012).
H1: Privacy apathy would positively influence intention to disclose information on
OSNs despite the threat of big data.
Privacy protection belief is the subjective possibility that consumers believe that
their private information is protected as anticipated (Metzger 2004). In an online
setting, users who hold higher protection beliefs are believed to have more
control over their information and its disclosure, and are thus
more likely to disclose their personal information (Raschke et al. 2014).
Thus, it is predicted that:
H2: Privacy protection belief would positively influence intention to disclose infor-
mation on OSNs despite the threat of big data.
Privacy risk belief implies the probability of potential loss because of disclosure
of personal information (Malhotra et al. 2004). It is deemed to be the cost of privacy
as disclosing information is often considered risky. Such costs and risks associated with
OSNs can range from unintended third parties receiving users’ personal information
to hacking of personal accounts based on information shared on OSN (Hogben
2007). Several studies have verified the negative effect of perceived privacy risk on
people’s intention to disclose personal information on online transactions and activ-
ities (Li et al. 2010; Malhotra et al. 2004).
H3: Privacy risk belief would negatively influence intention to disclose information
on OSNs despite the threat of big data.
Perceived benefits refer to a user’s overall expectation of positive outcomes from
an OSN without any significant privacy threats (Bulgurcu 2012). Individuals are
likely to give up a degree of privacy in return for potential benefits related to OSNs.
In an OSN environment, the user’s fear in the form of losing control of personal
information is compensated by the several benefits such as information, enjoyment,
and convenience (Hogben 2007). Thus, the following is hypothesized:
H4: Perceived benefit would positively influence intention to disclose information
on OSNs despite the threat of big data.
Privacy and control have often been linked together in prior work (Westin 1967).
The ability of people to control their information has been emphasized as critical in
any concept of privacy (Wolfe and Laufer 1974). Thus, it is no surprise to see that
there has been an outcry regarding how users have lost control of their information
on OSNs (Boyd 2008; Hoadley et al. 2010). When people share information
on an OSN, it is often broadcast to their network of friends. Sometimes such informa-
tion is accessed by third-party applications installed by the users. An individual
who perceives lower information control on OSNs would believe that such information
has been collected and stored by OSNs and third parties and thus would have higher
privacy apathy. Similarly, a sense of higher information control would lead to a
positive privacy protection belief and a lower privacy risk belief. Thus, we posit:
H5: Perceived information control would negatively influence privacy apathy.
H6: Perceived information control would positively influence privacy protection
belief.
H7: Perceived information control would negatively influence privacy risk belief.
Perceived ownership implies a sense of possession and entitlement (Furby 1978).
In the case of information and OSNs, perceived ownership implies the sense of
entitlement, possession, and attachment towards the information shared on OSNs
(Feuchtl and Kamleitner 2009; Sharma and Crossler 2014). When individuals
believe that the information shared on OSN is their information and contains some
level of attachment with their identity and privacy, it positively influences their pri-
vacy risk belief (Sharma and Crossler 2014). Thus, it is hypothesized that:
H8: Perceived ownership would positively influence privacy risk belief.
Away from OSNs, there are tremendous opportunities for people to maintain
relationships, enjoy life, consume information, and develop an offline real-life
image. Users of an OSN will perceive lower benefits from OSN use when they
already enjoy many similar benefits offline in their real life. Thus, existing benefits
decrease the perceived benefits of future disclosure (Keith et al. 2012). Thus, we
propose:
H9: Existing offline benefits would negatively influence perceived benefits of OSNs.
The proposed conceptual model will be evaluated using survey design. An online
questionnaire survey has been developed to collect data and perform empirical tests
of the relationships proposed in the research model. The survey design technique
fits the phenomenon being studied, as the objective of this
research is to explore users’ information disclosure behavior on OSNs. Also, a sur-
vey design provides the benefit of generalizability to the study as data could be
collected from a wider range of respondents.
All the items used in the survey instrument are adapted from previous studies.
The items are reflective Likert-scale measures that have been adapted to fit the context of this
study. The items, along with their respective original sources, are
presented in Table 13.1 below:
Although constructs adopted from earlier studies have been rigorously tested
for reliability and validity, additional content validation using a multi-stage itera-
tive procedure is recommended (Churchill 1979). Podsakoff et al. (2003) also have
suggested using an ex-ante approach such as expert panel review and a pilot test to
control common method variance (CMV). Thus, a preliminary investigation consist-
ing of expert panel review, pretest, and pilot test will be conducted to ensure mea-
surement validity of the instrument. The changes suggested by the expert panel
review and pre- and pilot tests such as revisions to wordings to improve clarity and
precision, dropping items to make the survey fatigue-free, revision of items to
make them unambiguous, etc. will be incorporated. This will ensure content valid-
ity of our survey instruments and also reduce CMV. Similarly, to reduce CMV we
will keep our survey anonymous, optional, and relatively short. We will also assess
the extent of common method variance with two statistical tests. First, we will
perform Harman’s single factor test by loading all of the items in a principal com-
ponent factor analysis (Podsakoff et al. 2003). If the results show that no single
factor accounts for the majority of the covariance, this would suggest the
absence of CMV in our study. However, as Harman’s single factor test is increas-
ingly contested for its ability to detect common method bias, we will also use
Lindell and Whitney’s (2001) test that uses a theoretically unrelated construct
(termed a marker variable) to assess CMV. We will use “Perceived effectiveness of
credit card guarantees” as our marker variable construct for this study and will use
it to adjust the correlations among the principal constructs (Pavlou et al. 2007).
The absence of high correlations among any of the items of the study’s principal
constructs and perceived effectiveness of credit card guarantees would indicate
that the study doesn’t have serious concerns about common method bias as the
construct perceived effectiveness of credit card guarantees is expected to be weakly
related to the study’s principal constructs.
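A minimal sketch of the Harman's single-factor check is given below, assuming the survey items are available as a pandas data frame; an unrotated principal component stands in for the single factor, and the informal 50% benchmark is an assumption of this illustration rather than a prescription from the cited sources.

```python
import pandas as pd
from sklearn.decomposition import PCA

def harman_single_factor(items: pd.DataFrame) -> float:
    """Return the share of variance captured by the first unrotated component.

    If a single component accounts for the majority of the variance across all
    survey items, common method variance is a plausible concern.
    """
    standardized = (items - items.mean()) / items.std(ddof=0)
    pca = PCA(n_components=1).fit(standardized.values)
    return float(pca.explained_variance_ratio_[0])

# Usage (hypothetical item responses on a 7-point Likert scale):
# items = pd.read_csv("survey_items.csv")
# print(harman_single_factor(items))  # values well below 0.5 are reassuring
```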
Undergraduate students from different classes within a public university in
California will be invited to complete the survey. The invitations will be sent through
emails as well as through classroom visits.
Table 13.1 (continued)

Perceived information control (Xu 2007)
  - I believe I have control over the amount of my personal information collected on OSN.
  - I believe I have control over who can get access to my personal information on OSN.
  - I believe I have control over my personal information that has been released on OSN.
  - I believe I have control over how my personal information is being used by OSN.
  - I believe I have control over my personal information that I provided on/to OSN.

Perceived benefits (Ellison et al. 2007; Krasnova et al. 2010)
  - OSN is useful to exchange personal information with my friends.
  - OSN is useful for me to monitor what others share about themselves.
  - Sharing personal information on OSN is fun.
  - By sharing personal information on OSN, I get more popular with my OSN-friends.
  - I share personal information via OSN because it's better than the alternatives.

Existing offline benefits (self-developed)
  - I have more time to spend with my family and friends around me.
  - Staying offline has several benefits over staying online.
  - I can build real relationships and stay happy and healthy when I am offline.
  - I have more time to pursue my hobbies and pursuits and form networks with people I know.

Behavioral intent to disclose information (BINT) (Xu and Teo 2004)
  - I am likely to provide my personal information on/to OSN.
  - I plan to provide my personal information on/to OSN.
  - I intend to provide my personal information on/to OSN.
As educated college students in the 18–25 age group are among the heaviest users of OSNs
(Lenhart et al. 2010), it is appropriate to have undergraduate students as the sample for this study.
A primary investigation consisting of reliability and validity testing, a model fit
test (i.e., goodness of fit), a common method bias test, and t-tests will be conducted to
ensure the validity of the structural model. We will use SmartPLS 2.0, SPSS
along with AMOS for our instrument validation and testing of the structural
model. SmartPLS uses a Partial Least Squares (PLS) regression technique that
employs a component-based approach for estimation and places minimal restric-
tions on sample size, measurement scales and residual distributions (Chin and
Todd 1995) and it does not impose normality requirements on the data. We will
also use AMOS, a covariance-based structural equation modeling tool that pro-
vides various overall goodness-of-fit indices for assessing model fit and method
variance.
Before testing the hypothesized structural model, we evaluate the psychometric
properties of the measures. All the constructs in this study are measured with mul-
tiple items. A PLS confirmatory analysis will be conducted to examine convergent
validity, discriminant validity, and reliability using commonly accepted guidelines
(Churchill 1979). Reliability for the constructs will be measured using composite
reliability score and Cronbach’s alpha. The Cronbach’s alpha and composite reli-
ability examine the internal consistency of the data. For all the constructs in our
study, we will also compute descriptive statistics, including
means and standard deviations, and the level of each item’s contribution to the over-
all factor.
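For illustration, Cronbach's alpha and composite reliability can be computed as sketched below; the response matrix and standardized loadings shown are hypothetical values, not study data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) matrix of one construct."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def composite_reliability(loadings: np.ndarray) -> float:
    """Composite reliability from standardized indicator loadings."""
    numerator = loadings.sum() ** 2
    error = (1 - loadings ** 2).sum()
    return numerator / (numerator + error)

# Hypothetical responses (4 respondents x 3 items) and loadings for one construct.
responses = np.array([[5, 6, 5], [3, 4, 4], [6, 7, 5], [2, 3, 3]])
print(round(cronbach_alpha(responses), 2),
      round(composite_reliability(np.array([0.82, 0.78, 0.85])), 2))  # e.g. 0.96 and 0.86
```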
To further examine the validity of the measurement model, we will analyze how
well the model fits the data with the help of model fit statistics available through
AMOS (Anderson and Gerbing 1988). The goodness of fit index (GFI), compara-
tive fit index (CFI), normed fit index (NFI) and incremental fit index (IFI) all assess
the goodness of fit of the model with the data and should be above 0.90 to show
model fit. Root mean square error of approximation (RMSEA) and standardized
root mean square residual (SRMR), which measure the “badness of fit”, should
both be below 0.05. Similarly, we will also assess whether the relative chi-square (i.e.,
CMIN/df), which is also a “badness of fit” measure, is below the threshold of 3 (Kline 1998).
Together, the results would indicate whether our hypothesized
measurement model is “fitting” the observed data.
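Two of these indices can be computed directly from the model chi-square, as the following sketch shows; the chi-square, degrees of freedom, and sample size in the usage line are hypothetical values, not outputs of this study.

```python
import math

def relative_chi_square(chi2: float, df: int) -> float:
    """CMIN/df; values below about 3 are commonly read as acceptable fit."""
    return chi2 / df

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation from the model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Hypothetical output from a covariance-based SEM run (e.g., in AMOS):
print(relative_chi_square(chi2=412.5, df=180), rmsea(chi2=412.5, df=180, n=300))
```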
The hypotheses and the relationships used for this study will be tested by
examining the structural model of our study. A bootstrapping resampling proce-
dure will be performed to assess the significance of the path coefficients within
the structural model. The proposed hypotheses for this research model will be
tested using t-statistics (p-value) for the standardized path coefficients. The
t-statistics (p-values) provided by the PLS structural model analysis would show
whether each hypothesis is supported, while the standardized path coefficients
would determine the direction and strength of the relationship between exoge-
nous and endogenous variables. The study will have a satisfactory and substan-
tive model if the dependent factors have R-square (the variance explained by the
independent variables) greater than 0.10 (Falk and Miller 1992). Thus, we will
examine whether the proposed paths are significant and how well our model
explains the variance in our endogenous variables. Also, we will analyze the
effect of our control variables age, gender, education, experience in social net-
working, and experience on the Internet on the intention to disclose information
on OSNs.
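The bootstrapping logic can be sketched as below, using a single standardized bivariate path as a stand-in for one structural relationship; SmartPLS applies the same resampling idea to the full PLS model, and the variable names are hypothetical.

```python
import numpy as np

def bootstrap_path_t(x: np.ndarray, y: np.ndarray, n_boot: int = 5000, seed: int = 42):
    """Bootstrap t-statistic for a single standardized path coefficient.

    A simple standardized bivariate regression stands in for one structural
    path; the t-statistic is the original estimate divided by the bootstrap
    standard error of the resampled estimates.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    zx, zy = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()
    estimate = float(np.mean(zx * zy))            # standardized coefficient
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # resample respondents with replacement
        bx, by = zx[idx], zy[idx]
        boots.append(np.mean(((bx - bx.mean()) / bx.std()) * ((by - by.mean()) / by.std())))
    se = float(np.std(boots, ddof=1))
    return estimate, estimate / se                # coefficient and its t-statistic

# Usage: x and y would hold, e.g., privacy apathy and disclosure intention scores.
```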
13.6 Conclusion
The objective of this research is to explore the factors that may affect user’s infor-
mation disclosure behavior on online social networks. There has been limited
research on why consumers choose to disclose personal information on OSNs despite
valuing privacy. With big data analytics taking center stage and OSNs becom-
ing a primary source of big data, privacy and security in social networks have
become increasingly important. This research looks at factors that may explain indi-
vidual’s information disclosure behavior. We use prospect theory and rational apa-
thy theory in our research model. As outlined in our research model, the information
disclosure decision of the individual depends on variables such as the privacy risk,
protection beliefs, and perceived benefits to disclose information on OSNs. The
users’ information disclosure decisions would also be affected by their belief about
who owns the information being provided and how the information that has already
been collected and stored by social media companies is being used by these com-
panies. Thus, this study may help us expand the concepts of apathy, risk belief, and
privacy calculus in the context of information disclosure behavior.
We will be using survey research as our research methodology as this helps in
increasing the generalizability of this study. The survey instrument for this research
will be hosted in Qualtrics. The main data will be collected from students as they
represent the demographic that uses online social networks the most. Prior to
collecting the data, a preliminary investigation will include expert panel reviews,
pre-test, and pilot studies to confirm the reliability and validity of the survey instru-
ment. The loadings, cross-loadings, content and face validity, and reliability will
also be examined during the pilot study. Then the structural model will be used to
test the path coefficient and t-values for our hypotheses.
We will discuss the key findings based on our data analysis in a future publication.
This paper has several theoretical and practical implications. First, this paper brings
together the concept of big data and OSNs to analyze the privacy behavior of OSN
users. Previous research on privacy and information disclosure has focused on inter-
net transactions, eCommerce, and social networks (Dinev and Hart 2006; Keith
et al. 2012) but the concept of big data and its impact on privacy behavior of OSN
users has been studied by very few researchers. Second, this paper brings the con-
cepts of privacy apathy and perceived ownership into focus. Users of OSNs are
believed to be worried about losing privacy in the age of big data. This paper seeks
to study whether some users would care less about privacy if they believe that they
have already lost ownership of their information on the internet. Third, we are
also extending prospect theory and rational apathy theory, and the concept of the refer-
ence point, to explain how OSN users decide about their privacy-related behavior.
IS research has regularly faced the criticism of lacking relevance to practice
(Baskerville and Myers 2004; Benbasat and Zmud 1999). As such, this paper
provides value to practitioners in many different ways. First, this study is helpful
to OSN providers and third parties as these organizations would better understand
how consumers’ information disclosure behavior works. Second, this study helps
us understand why people tend to disclose too much of their personal information
on OSNs.
McGrath (1995) stated that all research methods are inherently flawed, though each is flawed differently. Thus, the role of the researcher is always to minimize the flaws associated with the research by maximizing the three criteria of good research: generalizability, precision, and realism. This research is no exception and thus has its limitations. Some of the limitations of this study pertain to the generalizability of the study due to the sample frame used, the theoretical constructs excluded to achieve a parsimonious research model, the research method used for testing the proposed model, and the use of self-reported scales. However, understanding these limitations also provides opportunities for future research. As this is research-in-progress, understanding these limitations will also help us strengthen the next steps in our research plan and our future research.
Biographies
References
Acquisti A, Gross R (2006) Imagined communities: awareness, information sharing, and privacy
on the Facebook, privacy enhancing technologies. Springer, Berlin, pp 36–58
Acquisti A, John LK, Loewenstein G (2013) What is privacy worth? J Leg Stud 42(2):249–274
Afroz S, Islam AC, Santell J, Chapin A, Greenstadt R (2013) How privacy flaws affect consumer
perception. Socio-Technical Aspects in Security and Trust (STAST), 2013 Third Workshop on:
IEEE, pp 10–17
Anderson JC, Gerbing DW (1988) Structural equation modeling in practice: a review and recom-
mended two-step approach. Psychol Bull 103(3):411–423
Baskerville R, Myers MD (2004) Special issue on action research in information systems: making
is research relevant to practice foreword. Manag Inf Syst Q 28(3):329–335
Benbasat I, Zmud RW (1999) Empirical research in information systems: the practice of relevance.
Manag Inf Syst Q 23(1):3–16
Boyd D (2008) Facebook privacy trainwreck: exposure, invasion and social convergence.
Convergence 14(1):13–20
Boyd DM, Ellison NB (2007) Social network sites: definition, history, and scholarship. J Comput-
Mediat Commun 13(1):210–230
Bulgurcu B (2012) Understanding the information privacy-related perceptions and behaviors of an
online social network user. University of British Columbia, Vancouver
Chin WW, Todd PA (1995) On the use, usefulness, and ease of use of structural equation modeling
in MIS research: a note of caution. Manag Inf Syst Q 19(2):237–246
Churchill GA (1979) A paradigm for developing better measures of marketing constructs. J Mark
Res 16(1):64–73
Conroy S, Williams A (2014) Use of internet, social networking sites, and mobile technology for
volunteerism. AARP Office of Volunteerism and Service
csmonitor.com (2013) The danger of American apathy on NSA surveillance. The Christian
Science Monitor. http://www.csmonitor.com/Commentary/Opinion/2013/0731/The-danger-
of-American-apathy-on-NSA-surveillance. Accessed 19 Jan 2014
Demchenko Y, Grosso P, De Laat C, Membrey P (2013) Addressing big data issues in scientific
data infrastructure. In: Proceedings of international conference on collaboration technologies
and systems (CTS), San Diego, CA, pp 48–55
Dhami A, Agarwal N, Chakraborty TK, Singh, BP, Minj J (2013) Impact of trust, security and
privacy concerns in social networking: an exploratory study to understand the pattern of infor-
mation revelation in Facebook. In: Proceedings of 3rd international advance computing confer-
ence (IACC), pp 465–469
Dinev T, Hart P (2006) An extended privacy calculus model for E-commerce transactions. Inf Syst
Res 17(1):61–80
Ellison NB, Steinfield C, Lampe C (2007) The benefits of Facebook “friends:” social capital and col-
lege students’ use of online social network sites. J Comput-Mediat Commun 12(4):1143–1168
Falk RF, Miller NB (1992) A primer for soft modeling. University of Akron Press, Akron
Feuchtl S, Kamleitner B (2009) Mental ownership as important imagery content. Adv Consum
Res 36(2):995–996
Furby L (1978) Possession in humans: an exploratory study of its meaning and motivation. Soc
Behav Personal 6(1):49–65
Goitein E (2013) The danger of American apathy on NSA surveillance. The Christian Science
Monitor. Available on November 3 from http://www.csmonitor.com/Commentary/
Opinion/2013/0731/The-danger-of-American-apathy-on-NSA-surveillance. Accessed 31 Jul
2013
Hoadley MC, Xu H, Lee J, Rosson MB (2010) Privacy as information access and illusory control:
the case of the Facebook news feed privacy outcry. Electron Commer Res Appl 9(1):50–60
Hogben G (2007) Security issues and recommendations for online social networks. Enisa Position
Paper 1:1–36
Hugl U (2011) Reviewing person’s value of privacy of online social networking. Internet Res
21(4):384–407
Johnson M, Egelman S, Bellovin SM (2012) Facebook and privacy: it’s complicated. In:
Proceedings of the eighth symposium on usable privacy and security, New York, USA, pp 9–15
Kahneman D, Tversky A (1979) Prospect theory: an analysis of decision under risk. Econometrica
47(2):263–291
Karuitha JK, Onyuma SO, Mugo R (2013) Do stock splits affect ownership concentration of firms
listed at the Nairobi securities exchange? Res J Finan Acc 4(15):105–117
Keith MJ, Thompson SC, Hale J, Greer C (2012) Examining the rationality of information disclo-
sure through mobile devices. In: Proceedings of 33rd international conference on information
systems, Orlando, USA, pp 1–17
Kline RB (1998) Principles and practice of structural equation modeling. Guilford Press, New York
Krasnova H, Spiekermann S, Koroleva K, Hildebrand T (2010) Online social networks: why we
disclose. J Inf Technol 25(2):109–125
Krasnova H, Veltri NF, Günther O (2012) Self-disclosure and privacy calculus on social network-
ing sites: the role of culture. Bus Inf Syst Eng 4(3):127–135
Kshetri N (2014) Big data’s impact on privacy, security and consumer welfare. Telecommun
Policy 38(11):1134–1145
Lenhart A, Purcell K, Smith A, Zickuhr K (2010) Social media and mobile internet use among
teens and young adults. Millennial. Pew Internet and American Life Project, Washington
Li H, Sarathy R, Xu H (2010) Understanding situational online information disclosure as privacy
calculus. J Comput Inf Syst 51(1):62–71
Li H, Sarathy R, Xu H (2011) The role of affect and cognition on online consumers' decision to
disclose personal information to unfamiliar online vendors. Decis Support Syst 51(3):434–445
Lindell MK, Whitney DJ (2001) Accounting for common method variance in cross-sectional
research designs. J Appl Psychol 86(1):114–121
Malhotra NK, Kim SS, Agarwal J (2004) Internet users’ information privacy concerns (IUIPC):
the construct, the scale, and a causal model. Inf Syst Res 15(4):336–355
Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers A (2011) Big data: the next
frontier for innovation, competition, and productivity. McKinsey Global Institute
McGrath E (1995) Methodology matters: doing research in the behavioral and social sciences. In:
Human-computer interaction. Morgan Kaufmann, San Francisco, pp 152–169
Metzger MJ (2004) Privacy, trust, and disclosure: exploring barriers to electronic commerce.
J Comput-Mediat Commun 9(4)
Nadkarni A, Hofmann SG (2012) Why do people use Facebook? Personal Individ Differ
52(3):243–249
O’Brien D, Torres AM (2012) Social networking and online privacy: Facebook users’ perceptions.
Ir J Manag 31(2):63–97
Ormond DK (2014) The impact of affective flow on information security policy compliance. In:
M.S. University (ed) pp 1–178
Pavlou PA, Liang H, Xue Y (2007) Understanding and mitigating uncertainty in online exchange
relationships: a principal-agent perspective. Manag Inf Syst Q 31(1):105–136
Podsakoff PM, MacKenzie SB, Lee J-Y, Podsakoff NP (2003) Common method biases in behav-
ioral research: a critical review of the literature and recommended remedies. J Appl Psychol
88(5):879–903
Raschke RL, Krishen AS, Kachroo P (2014) Understanding the components of information pri-
vacy threats for location-based services. J Inf Syst 28(1):227–242
Richards NM, King JH (2013) Three paradoxes of big data. Stanford Law Review Online
66(41):41–46
Rosenblum D (2007) What anyone can know: the privacy risks of social networking sites. IEEE
Secur Priv 5(3):40–49
Sarfaraz A, Ahmed S, Khalid A, Ajmal MA (2012) Reasons for political interest and apathy among
university students: a qualitative study. Pak J Soc Clin Psychol 10(1):61–67
Sharma S, Crossler RE (2014) Disclosing too much? Situational factors affecting information dis-
closure in social commerce environment. Electron Commer Res Appl 13(5):305–319
Simon HA (1982) Models of bounded rationality. MIT Press, Cambridge
Solmitz DO (2000) The roots of apathy and how schools can reduce apathy. Available from https://
dwaynehoward.wordpress.com/2012/05/14/the-roots-of-apathy/
Solove DJ (2008) Understanding privacy. Harvard University Press, Cambridge
Stieglitz S, Dang-Xuan L, Bruns A, Neuberger C (2014) Social media analytics. Bus Inf Syst Eng
6(2):89–96
Tan W, Blake MB, Saleh I, Dustdar S (2013) Social-network-sourced big data analytics. IEEE
Internet Comput 5:62–69
theguardian.com (2013) Software that tracks people on social media created by defence firm. Available from http://www.theguardian.com/world/2013/feb/10/software-tracks-social-media-defence
Tufekci Z (2008) Can you see me now? Audience and disclosure regulation in online social net-
work sites. Bull Sci Technol Soc 28(1):20–36
Turel O, Serenko A (2012) The benefits and dangers of enjoyment with social networking web-
sites. Eur J Inf Syst 21(5):512–528
Van Dyne L, Pierce JL (2004) Psychological ownership and feelings of possession: three field
studies predicting employee attitudes and organizational citizenship behavior. J Organ Behav
25(4):439–459
Westin AF (1967) Privacy and freedom. Atheneum, New York
Wolfe M, Laufer RS (1974) The concept of privacy in childhood and adolescence. In: Margulis ST
(ed) Privacy as a behavioral phenomenon, symposium presented at the meeting of the environ-
mental design research association, Milwaukee
Xu H (2007) The effects of self-construal and perceived control on privacy concerns. In:
Proceedings of international conference on information systems, Montreal, Canada, pp 1–14
Xu H, Teo HH (2004) Alleviating consumer’s privacy concern in location-based services: a psy-
chological control perspective. In: Proceedings of the twenty-fifth international conference on
information systems, Charlottesville, Virginia, pp 793–806
Xu H, Dinev T, Smith HJ, Hart P (2008) Examining the formation of individual’s privacy concerns:
toward an integrative view. In: Proceedings of the 29th international conference on information systems,
Paris, France, pp 1–16
Xu H, Teo HH, Tan BC, Agarwal R (2009) The role of push-pull technology in privacy calculus:
the case of location-based services. J Manag Inf Syst 26(3):135–174
Xu F, Michael K, Chen X (2013) Factors affecting privacy disclosure on social network sites: an
integrated model. Electron Commer Res 13(2):151–168
Yoo CW, Ahn HJ, Rao HR (2012) An exploration of the impact of information privacy invasion. In:
Proceeding of thirty third international conference on information systems, Orlando, Florida,
pp 1–18
Zikopoulos PC, Eaton C, Deroos D, Deutsch T, Lapis G (2012) Understanding big data: analytics
for enterprise class and streaming data. McGraw-Hill eBook, http://public.dhe.ibm.
com/common/ssi/ecm/en/iml14296usen/IML14296USEN.PDF
Chapter 14
Online Information Processing
of Scent-Related Words and Implications
for Decision Making
14.1 Introduction
With the rise of e-commerce and online shopping, retail e-commerce sales in the U.S. reached 225.5 billion U.S. dollars in 2015 and are predicted to almost double to 434.2 billion by 2017 (Statista 2015).
Online advertising and promotional strategy decisions become all the more
viduals based on olfactory orientations and hence influences online decision making
process.
Findings from this paper provide implications for branding and online advertis-
ing managerial decisions. These decisions involve very different considerations
when compared to offline branding and advertising decisions, particularly for orga-
nizations in the scent product industry. This paper focuses on understanding the
nuances in consumer online information processing and provides additional insight
for supporting managerial and organizational decisions on market segmentation and
targeting strategies.
14.2 Study 1: Individual Differences in Affective Responses to Scent-Related Words
Odor and emotions are strongly connected (Herz et al. 2004), and this connection consequently influences the perceptions and decisions of consumers (Chebat and Michon 2003; Bone and Ellen 1999). Previous studies have found that
in the absence of actual scent, olfactory imagery can play a significant role in induc-
ing sensations similar to that of processing actual odors, as evidenced by neurosci-
ence data (Bensafi et al. 2003). Past research has focused on the effect of odors in
the marketplace and its impact on purchase decisions and behavior. However, the
“experience of odor” can be elicited without the scent being present, as in the form
of imagined odors. Stevenson and Case (2005, p. 244), defined olfactory imagery as
“being able to experience the sensation of smell when an appropriate stimulus is
absent.” They noted how this had resulted from cumulative evidence, mostly self-
reported data, in three forms: (1) participants report such experiences; (2) descrip-
tions of these experiences are similar to those of actual smelling; and (3) their
reactions to certain forms of these experiences involve appropriate behavioral
responses.
Odor valence, pleasant versus unpleasant, is weighted asymmetrically within
individuals. In particular, unpleasant odors have a functional purpose—human sur-
vival. Thus we believe that the effect of odor valence (represented by pleasant or
unpleasant odor-associated words in this study) will vary across the two olfactory
groups: (1) individuals with a normal sense of smell and (2) individuals with a
heightened sense of smell. However, regardless of individual sensitivity to smell,
which can be categorized into: heightened, normal or decreased (Cross et al. 2015),
unpleasant odor associations, represented by unpleasant odor-associated words,
will elicit increased emotions compared to non-odor associations. This reflects the
function of unpleasant odors as a warning against exposure to or ingestion of hazardous or harmful substances. Also, there is a higher probability of activation of the
amygdala for negative emotions, such as fear and disgust, relative to positive emo-
Fig. 14.1 Scalp distribution of the late positive potential (LPP). Displayed are grand averages taken from the read task. Non-olfactory words (blue); pleasant olfactory words (red); unpleasant olfactory words (pink)
a normal sense of smell (or 70% of the population). For these individuals, the ability
to smell is generally taken for granted and is relatively not as meaningful, compared
to those who feel its absence or suffer its heightened presence (Cross et al. 2015).
Odor-related experiences for normal individuals also should not be as strong or as
emotionally charged as those of hyperosmics. Individuals with a normal sense of
smell possess good, but less fluent olfactory imagery ability in comparison to indi-
viduals with a heightened sense of smell. Thus, we do not expect explicit odor-
imagery instructions to further enhance (e.g., ceiling effect) odor-induced emotions,
although there may be a slight increase stemming from the unpleasant odor-
associated word imagery due to the negativity effect.
H2: The effect of olfactory imagery, elicited by mental imagery triggered by
olfactory-related words, on emotions will vary across different olfactory groups.
H2a: For individuals with a normal sense of smell, olfactory imagery will not fur-
ther enhance emotions (LPP mean amplitude) for odor-associated words in com-
parison to the passive reading task.
H2b: For individuals with a heightened sense of smell, olfactory imagery will sup-
press emotions (reflected in lower LPP mean amplitude) in odor-associated
words in comparison to the passive reading task.
The emotional processes occurring during the viewing of scent-related words (vs.
non-olfactory related words) are explored using a neuroscience tool, electroenceph-
alography (EEG), to understand the brain’s responses during olfactory imagery.
A screener survey was distributed across campus to students and staff members
in a large university in the Midwest to recruit participants from the two olfactory
categories for the purpose of the study. A self-reported screener question, validated
by Lin et al. (2017), asked individuals to select a category, out of the four, that
described their sense of smell best: heightened sense of smell, normal sense of
smell, decreased sensitivity to smell, and impaired with no sense of smell. This
resulted in 24 individuals with normal sense of smell and 23 individuals with a
heightened sense of smell. The other two smell categories were not further investi-
gated in this current study due to our interest in understanding the impact of scent-
related words for sensitive individuals, which make up approximately 20–25% of
the population (Aron 1998).
The study was a three valence (non-olfactory words vs. pleasant olfactory words
vs. unpleasant olfactory words) within subject × 2 olfactory ability (normal vs. sen-
sitive) between subject mixed design. In the first task, participants were asked to
silently read the words presented to them on the screen. They were shown a list of
72 words displayed on a computer screen one word at a time. The list of words is
taken from Royet et al. (2003) and supplemented with words from González et al.
(2006). The list consists of 12 non-olfactory related words (e.g., needle, button,
saucer), 36 words with pleasant olfactory associations (e.g., rose, coffee, honey) and
24 words with unpleasant olfactory associations (e.g., dumpster, feces, trash). The
72 words are presented in 3 blocks of 24 words apiece, consisting of 4 non-olfactory
related words, 12 pleasant olfactory words, and 8 unpleasant olfactory words. The
procedure consists of 72 trials: a fixation cross (+) is shown for 1 s, then a word is displayed for 750 ms, followed by a blank screen for the intertrial interval (ITI) of 3 s. This is repeated for the first block of 24 words. Then there is a short pause of 1
min, followed by the next block. This continues until the three blocks are com-
pleted. Block order is randomized across participants. In the second task, partici-
pants were instructed to silently read the word and also form a mental image of the
corresponding smell represented by the word (for example, to form an image related to the smell of garlic for the word “garlic”). Practice trials were included to ensure partici-
pants understood the instructions.
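For illustration, the sketch below assembles the randomized block structure and timing schedule described above in plain Python. It is a simplified sketch rather than the authors' actual presentation script (which would run in dedicated stimulus-delivery software), and the word lists shown are placeholders.

import random

# Placeholder stimuli; the study used 12 non-olfactory, 36 pleasant, and 24 unpleasant words.
NON_OLFACTORY = ["needle", "button", "saucer"] * 4      # 12 words
PLEASANT      = ["rose", "coffee", "honey"] * 12        # 36 words
UNPLEASANT    = ["dumpster", "feces", "trash"] * 8      # 24 words

FIXATION_S, WORD_S, ITI_S, PAUSE_S = 1.0, 0.75, 3.0, 60.0

def build_blocks(seed: int = 42):
    """Split 72 words into 3 blocks of 24 (4 non-olfactory, 12 pleasant, 8 unpleasant each)."""
    rng = random.Random(seed)
    non, plea, unp = NON_OLFACTORY[:], PLEASANT[:], UNPLEASANT[:]
    for lst in (non, plea, unp):
        rng.shuffle(lst)
    blocks = []
    for b in range(3):
        block = non[b*4:(b+1)*4] + plea[b*12:(b+1)*12] + unp[b*8:(b+1)*8]
        rng.shuffle(block)          # randomize trial order within the block
        blocks.append(block)
    rng.shuffle(blocks)             # block order randomized across participants
    return blocks

def trial_schedule(blocks):
    """Yield (event, duration_s, payload) tuples for the read task."""
    for i, block in enumerate(blocks):
        for word in block:
            yield ("fixation", FIXATION_S, "+")
            yield ("word", WORD_S, word)
            yield ("iti", ITI_S, "")
        if i < len(blocks) - 1:
            yield ("pause", PAUSE_S, "")

if __name__ == "__main__":
    events = list(trial_schedule(build_blocks()))
    print(len([e for e in events if e[0] == "word"]), "word trials")   # 72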
14.2.4 Results
We took measurements for the Late Positive Potential (LPP) ERP component, using the window of 600–900 ms recorded at the electrode site Pz (Cacioppo and Berntson 1994; Schupp et al. 2003). Under the passive read task, an ANOVA test of the individuals with a normal sense of smell revealed a strong olfactory word valence effect, F(2, 44) = 13.00, p < 0.001. The LPP is significantly increased for pleasant olfactory words (Mpleasant = 1.02 μV vs. Mnon-olfactory = 0.25 μV, p < 0.05) and significantly increased for unpleasant olfactory words (Munpleasant = 2.17 μV vs. Mnon-olfactory = 0.25 μV, p < 0.05) compared to non-olfactory words (Fig. 14.2). These results confirm H1a: emotions are elicited most strongly by unpleasant odor-associated words, followed by pleasant odor-associated words, in comparison to non-olfactory words.
Olfactory word valence for the LPP is also significant for hyperosmics (F(2, 42) = 5.65, p < 0.05). As predicted, the LPP is not increased for pleasant olfactory words (Mpleasant = 0.78 μV vs. Mnon-olfactory = 0.38 μV, p > 0.1) but is strongly increased for unpleasant olfactory words (Munpleasant = 1.71 μV vs. Mnon-olfactory = 0.38 μV, p < 0.05) compared to non-olfactory words. Results are consistent with H1b.
Fig. 14.2 ERP results for LPP at Pz for the reading task for individuals normal (left) vs. sensitive (right) to smell. Non-olfactory words (blue); pleasant olfactory words (red); unpleasant olfactory words (pink)
Fig. 14.3 Interaction effects of olfactory orientation (normal vs. sensitive individuals) and task on the LPP (μV)
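To make the analysis pipeline concrete, the following sketch shows how mean LPP amplitudes in the 600–900 ms window at Pz could be extracted from epoched EEG data and submitted to a repeated-measures ANOVA. The sampling rate, epoch layout, and simulated condition means are assumptions, and the snippet illustrates the general approach rather than the authors' exact processing chain.

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

SFREQ = 250          # assumed sampling rate (Hz)
EPOCH_START = -0.2   # assumed epoch start relative to word onset (s)
WINDOW = (0.6, 0.9)  # LPP window (s)

def lpp_mean(epochs_pz):
    """Mean amplitude per epoch in the 600-900 ms window.
    epochs_pz: array of shape (n_epochs, n_samples) for channel Pz, in microvolts."""
    i0 = int((WINDOW[0] - EPOCH_START) * SFREQ)
    i1 = int((WINDOW[1] - EPOCH_START) * SFREQ)
    return epochs_pz[:, i0:i1].mean(axis=1)

rng = np.random.default_rng(3)

# Example: extract LPP means from a dummy epochs array (60 epochs, 400 samples at 250 Hz).
demo_lpp = lpp_mean(rng.normal(size=(60, 400)))

# Hypothetical per-subject condition means for the read task (23 subjects x 3 valence levels).
subjects = np.repeat(np.arange(23), 3)
valence = np.tile(["non-olfactory", "pleasant", "unpleasant"], 23)
lpp = rng.normal(loc=np.tile([0.25, 1.02, 2.17], 23), scale=1.0)
df = pd.DataFrame({"subject": subjects, "valence": valence, "lpp": lpp})

# One-way repeated-measures ANOVA on word valence, analogous to the valence effect reported above.
print(AnovaRM(df, depvar="lpp", subject="subject", within=["valence"]).fit())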
For effects due to the olfactory imagery task, there are additional differences
between the groups (Fig. 14.3). Individuals with a normal sense of smell were not
affected by the imagery task instruction. Neither pleasant nor unpleasant olfactory
word stimuli were differentially affected by the more passive read task versus the
more resource-demanding imagery task. In contrast, hyperosmics were affected by the imagery instructions, which resulted in a suppressed LPP. Imagery did not affect the processing of pleasant words; however, under the imagery task, words related to unpleasant smells reduced the LPP magnitude relative to the response to non-olfactory related words.
For individuals with a normal sense of smell, the effect of imagery is not signifi-
cant, Mread = 1.11 μV vs. Mimagery = 1.20 μV; F(1, 22) = 0.012, p > 0.1. This non-effect
is consistent for both valence comparisons across tasks (p’s > 0.1), which confirms
H2a.
In hyperosmics, there is a significant difference in the effect of the task on the
LPP amplitude (Mread = 1.78 μV vs. Mimagery = 0.33 μV; F(1, 21) = 5.1, p < 0.05)
confirming H2b. Individuals sensitive to smell appear to be automatically process-
ing the affect information by just engaging in the reading task. When instructed to
perform olfactory imagery, affect reflected by LPP is suppressed (Fig. 14.2). Further
examination shows no difference under the pleasant olfactory word condition (Mread =
1.2 μV vs. Mimagery = 0.48 μV; F(1, 23) = 0.69, p > 0.1). For the unpleasant olfactory
condition, there is a significant task effect (Mread = 2.72 μV vs. Mimagery = 0.92 μV;
F(1, 23) = 5.0, p < 0.05). Confirming H2b, the olfactory imagery task results in a
suppression effect in hyperosmics, reflected in reduced LPP, in the unpleasant
condition.
14.2.5 Discussion
Summarizing results for the two valence categories of pleasant versus unpleasant
olfactory words during the read task shows there is a clear negativity bias. Unpleasant
olfactory words generated significantly higher LPP amplitudes for both olfactory
groups. This is consistent with the role of smell in warning against unsafe condi-
tions and substances.
However, emotional reactions during the reading of pleasant olfactory words
were not different from non-olfactory words in individuals with a heightened sense
of smell. To further understand the relationship of olfactory imagery fluency and
olfactory orientation, the following analysis was conducted from the data gathered
in the prescreener survey.
The fluency with which individuals perform scent-related imagery, reflected in the reported vividness of their olfactory imagery, varies across individuals. Olfactory imagery
ability can be measured through the Vividness of Olfactory Imagery Questionnaire
(VOIQ; Gilbert et al. 1998); an imagery scale modeled after the visual imagery
scale by Marks (1973). We believe that the ability to perform olfactory imagery,
reflected by the VOIQ scale, will be highly correlated with the level of olfactory
sensitivity in individuals. In other words, hyperosmics have better olfactory imagery
abilities compared to individuals with a normal sense of smell, and hyposmics have
the lowest performance in olfactory imagery. To examine this aspect of individual
differences, we surveyed undergraduates using the VOIQ scale (N = 518). Results
showed that smell category strongly predicts VOIQ scores (F(2, 514) = 17.62, p <
0.001), while gender was not significant. Hyperosmics reported the lowest scores (reflecting the highest vividness of olfactory imagery; Mhyperosmic = 33.56), followed by individuals with a normal sense of smell (Mnormal = 39.82). Those with a diminished sense of
smell (hyposmics) reported the highest scores and the least vividness in olfactory
imagery (Mhyposmic = 46.98). The direction of the association between VOIQ scores and the three smell categories confirms that vividness likely plays a role in the effects of olfactory imagery on emotions.
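A minimal sketch of this group comparison is shown below. The simulated VOIQ totals (lower scores indicate more vivid olfactory imagery) and group sizes are placeholders, and the one-way ANOVA stands in for the authors' full analysis, which also considered gender.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical VOIQ totals per self-reported smell category (lower = more vivid imagery).
voiq_hyperosmic = rng.normal(33.6, 10, size=120)
voiq_normal     = rng.normal(39.8, 10, size=340)
voiq_hyposmic   = rng.normal(47.0, 10, size=58)

# One-way ANOVA of smell category on VOIQ score, analogous to the group test reported above.
f_stat, p_val = stats.f_oneway(voiq_hyperosmic, voiq_normal, voiq_hyposmic)
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")

# Group means, matching the direction reported (hyperosmic < normal < hyposmic).
for name, grp in [("hyperosmic", voiq_hyperosmic), ("normal", voiq_normal), ("hyposmic", voiq_hyposmic)]:
    print(name, round(grp.mean(), 2))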
With these findings in mind, the suppressed affect during imagery in sensitive
individuals is even more likely an effect of automatic suppression. On the contrary,
automatic affect response to reading the olfactory pleasant words is supported by
the fact that individuals who are sensitive to smell implicitly (hence automatically)
process affective associations of olfactory words.
14.3 Study 2: Evaluations and Behavioral Intentions to Scented Brand Names
Two product categories often associated with a scent were selected for creating the
online ads, including a home fragrance product and a food item (cookie). Ads were
pretested for likeability and product association with scent.
Words used to construct brand names in this study were chosen from a database with normative ratings to ensure gender- and mood-neutral words. Scent-related words (e.g., lavender, orange blossom, caramel) and non-scent-related words (e.g., bingo, symphonies, compass) were also selected based on normative ratings of olfactory association levels. The scent-related and non-scent-related brand name versions of the ads were constructed so that the only variation between the two conditions is the brand name (Fig. 14.4).
A total of 256 participants from a mid-size university in the United States were
recruited and given class credit for their participation. A screener question surveying the smell orientation of the individuals was used to group participants into four categories based on their sensitivity to smell: no sense of smell, decreased
sense of smell, normal sense of smell and increased sensitivity to smell. The two
olfactory groups of interest for our study, sensitive and normal, resulted in 66 sensi-
tive individuals and 163 individuals with a normal sense of smell. A total number of
229 participants were included in the analyses.
Participants were randomly assigned into the two ad conditions: scent-related
word brand ad and non- scent-related word brand ad. This resulted in 35 sensitive
individuals in the scent-related word brand condition and 31 in the non-scent-related
word brand. Eighty-four and 79 normal individuals were included in the two condi-
tions respectively. The ads were presented on a computer screen simulating an
online shopping scenario, and participants were instructed to perform olfactory imagery: “Please take a minute or two to try to form a mental image in your mind of what the product must smell like.” Questions related to vividness of imagery (e.g., “please identify the strength of the smell that came to mind when thinking about the product”) on a 5-point scale were included in the survey and later used as a manipulation check. These ratings were included as a covariate. Participants were then asked to rate their attitudes towards the ad (Aad), brand (Abrand), product (Aproduct), and purchase intentions (PI). Beliefs related to the functionality/performance of the product were measured for each product (e.g., “judging from the ad and brand, I believe this home fragrance
gets rid of odors effectively”).
14.3.3 Results
Control variables included gender, imagery vividness level, and product involvement. These had only marginally significant effects on the outcome variables and hence were not included in the following analyses.
In a between subject design, approximately half of the participants were ran-
domly given the scent-related ad and the others were shown the non-scent-related
ad. A MANOVA test reveals significant main effects of the scent-related brand (vs. non-scent-related brand) in the home fragrance ad on Abrand (Mscent = 3.31 vs. Mnoscent = 2.98, F(1, 225) = 3.63, p < 0.05) and Aproduct (Mscent = 3.38 vs. Mnoscent = 3.15, F(1, 225) = 2.93, p < 0.05). There were marginal effects on purchase intentions (Mscent = 3.75 vs. Mnoscent = 3.50, F = 1.55, p < 0.08) and on the Belief of how well the product performs (Mscent = 2.73 vs. Mnoscent = 2.42, F = 1.53, p < 0.08). (In the case of the home fragrance, how effectively does it get rid of odors?) There was no significant effect of the brand name on Aad (Mscent = 3.09 vs. Mnoscent = 2.96, F(1, 225) = 0.41, p > 0.1). This was due to the removal of confounding effects through pretesting the ads and brand names. Findings confirm H3.
Smell orientation (normal vs. sensitive) also had significant main effects on the
Aad (Mnormal = 3.16 vs. Msensitive = 2.9, F(1, 225) = 4.03, p < 0.05) and Belief (Mnormal =
2.65 vs. Msensitive = 2.22, F(1, 225) = 5.59, p < 0.01). There was a marginal significant
effect of smell orientation on Abrand (Mnormal = 3.26 vs. Msensitive = 3.05, F(1, 225) =
2.51, p < 0.08). There were no main effects of smell orientation on Aproduct (Mnormal =
3.35 vs. Msensitive = 3.19, F(1, 225) = 1.88, p > 0.1) and PI (Mnormal = 2.52 vs. Msensitive
= 2.36, F(1, 225) = 0.51, p > 0.1). Results in general confirm H4.
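The structure of these between-subjects tests can be sketched as follows. The snippet assumes a hypothetical data frame with one row per participant and runs a univariate two-way ANOVA on a single outcome plus one planned follow-up t-test, so it illustrates the shape of the analysis rather than reproducing the MANOVA reported above.

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(11)
n = 229  # participants retained in the analyses

# Hypothetical participant-level data for the home fragrance ad.
df = pd.DataFrame({
    "brand": rng.choice(["scent", "noscent"], size=n),                    # ad condition
    "smell": rng.choice(["normal", "sensitive"], size=n, p=[0.71, 0.29]),
    "A_brand": rng.normal(3.2, 0.9, size=n),                              # attitude toward the brand
})

# Two-way between-subjects ANOVA: brand condition, smell orientation, and their interaction.
model = ols("A_brand ~ C(brand) * C(smell)", data=df).fit()
print(anova_lm(model, typ=2))

# Planned follow-up within the normal-smell group: scent-related vs. non-scent-related brand name.
normal = df[df["smell"] == "normal"]
t, p = stats.ttest_ind(normal.loc[normal.brand == "scent", "A_brand"],
                       normal.loc[normal.brand == "noscent", "A_brand"])
print(f"t = {t:.2f}, p = {p:.3f}")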
Further, there are marginal interaction effects of smell orientation (normal vs.
sensitive) × fragrance scent (scent vs. no scent) on Abrand (F(1, 225) = 1.88, p > 0.1),
Aad (F(1, 225) = 1.84, p < 0.05), Aproduct (F(1, 225) = 1.88, p > 0.1) and Belief (F(1,
225) = 1.91, p < 0.08). Planned post-hoc tests in individuals with a normal sense of
smell demonstrate that a scent-related brand name, in comparison with non-scent-
related brand name, has a significant impact on Abrand (Mscent = 3.47 vs. Mnoscent =
3.08, t(162) = 2.64, p < 0.01), Aad (Mscent = 3.29 vs. Mnoscent = 3.02, t(162) = 1.92,
p < 0.01), Aproduct (Mscent = 3.49 vs. Mnoscent = 3.24, t(162) = 1.85, p < 0.05) and Belief
(Mscent = 2.98 vs. Mnoscent = 2.66, t(159) = 1.83, p < 0.05). Scent-related brand name
did not have an impact on PI (Mscent = 2.76 vs. Mnoscent = 2.67, t(162) = 0.48, p > 0.1). Findings confirm H4a (Fig. 14.5).
Fig. 14.5 Mean affect-related ratings (Aad, Abrand, Aproduct, and Belief) and behavioral intention ratings (purchase intentions) across the two brand conditions (non-scent-related brand vs. scent-related brand) for the home fragrance ad in the two olfactory orientation groups (normal vs. sensitive); y-axis: level of agreement. **p < 0.01, *p < 0.05
In contrast, the scent-related brand name had marginal influences on Aproduct
(Mscent = 3.3 vs. Mnoscent = 3.0, t(64) = 1.43, p < 0.05) and PI (Mscent = 2.80 vs. Mnoscent =
2.38, t(64) = 1.74, p < 0.05) for individuals with a heightened sense of smell. Scent-
related brand name had no significant impact on affect driven evaluations, Abrand,
Aad, or Belief. Results support H4b (Fig. 14.5).
In a different product category using a cookie ad, similar effects were found in
the evaluation of the ad. Overall, individuals with a normal sense of smell are more
likely to be influenced by the scent-related brand resulting in higher attitude ratings.
Individuals with a normal sense of smell rated Abrand (Mnormal = 3.24 vs. Msensitive =
2.88, F(1, 225) = 4.95, p < 0.01), Aad (Mnormal = 3.35 vs. Msensitive = 3.08, F(1, 225) =
3.14, p < 0.05), PI(Mnormal = 3.82 vs. Msensitive = 3.43, F(1, 225) = 3.14, p < 0.05) and
Belief “believed the cookie would taste better” (Mnormal = 3.70 vs. Msensitive = 3.41,
F(1, 225) = 5.46, p < 0.01) higher than sensitive individuals.
Overall, scent-related brand names in the cookie ad were not rated significantly
higher than non-scent-related brand ads on Abrand (Mscent = 3.14 vs. Mnoscent = 2.97,
F(1, 225) = 1.09, p > 0.1), Aad (Mscent = 3.29 vs. Mnoscent = 3.15, F(1, 225) = 0.86, p >
0.1), Aproduct (Mscent = 3.94 vs. Mnoscent = 3.84, F(1, 225) = 0.63, p > 0.1) and Belief
(Mscent = 3.56 vs. Mnoscent = 3.54, F(1, 225) = 0.02, p > 0.1). Purchase intentions
(Mscent = 3.75 vs. Mnoscent = 3.50, F(1, 225) = 2.42, p < 0.08) were marginally higher
in the scent-related brand ad.
However, individuals with a normal sense of smell were significantly influenced
by the scent-related brand name (vs. non-scent-related brand name) and rated Abrand
(Mscent = 3.43 vs. Mnoscent = 3.05, t(162) = 2.41, p < 0.01) and Aad (Mscent = 3.51 vs.
Mnoscent = 3.13, t(162) = 2.41, p < 0.01) higher in the scent-related brand name condi-
tion. The effects were not significant on PI and belief.
Individuals with a heightened sense of smell are not influenced by the scent-related brand name and thus do not rate Abrand (Mscent = 2.90 vs. Mnoscent = 2.91, t(64) =
−0.08, p > 0.1), Aad (Mscent = 3.09 vs. Mnoscent = 3.15, t(64) = −0.24, p > 0.1), or
Belief (Mscent = 3.50 vs. Mnoscent = 3.36, t(64) = 0.51, p > 0.1) higher in the scent-
related brand name condition (vs. non-scent-related brand name). However, PI are
marginally higher in sensitive individuals (Mscent = 3.66 vs. Mnoscent = 3.25, t(64) =
1.43, p < 0.08) when the scent-related brand name was presented.
14.3.4 Discussion
As congruency theory predicted, and in support of H3, findings overall show that
scent-related brand names are better perceived and rated higher in positive attitudes
towards Abrand, Aad, Belief and PI in the home fragrances ad. The scent-related brand
name did not strongly influence attitudes for the cookie ad, which is likely a result of the product category. Home fragrances are normally more frequently associated
with a scent than cookies, where taste is the determinant attribute.
The main effects for olfactory orientation were significant in both ads for Abrand,
Aad and Beliefs where individuals with a normal sense of smell rated the products
with scent-related brand names more favorably. The image of a product presented in
the ad, which is automatically associated with a scent, triggered lower attitudes
toward the brand, ad, and product in the sensitive individuals. Attitudes are not ele-
vated by the brand name for these individuals, whereas the scent-related brand name
is seen and rated higher by the normal individuals. Findings in Study 2 coincide with and
support the ERP results, revealing suppressed affective reactions and responses in
processing scent-related information.
However, purchase intention was rated higher by sensitive individuals when the scent-related brand name was presented, despite lower attitudes toward the brand, ad, and belief. On the surface, this seems to contradict the suggestions offered by the theory of reasoned action. However, we argue that the attitudinal reaction, reflected in non-significant effects of scented brand names on ad and brand ratings, was masked by emotional suppression for the purpose of overall behavioral function in individuals sensitive to smell. Such affect regulation has been suggested to foster feeling “right” (vs. feeling “good”) based on the demands of the situation (Koole et al. 2008). Further, the online environment was able to mitigate the negative physiological responses that might otherwise yield a different behavioral outcome.
Findings from this study argue against the perception that attitudes (revealed at
the surface level) alone accurately predict beliefs and behaviors. Our results suggest
individual difference factors should be considered in the predictive model. Higher
ratings of attitudes, as observed in normal individuals, might not translate into pur-
chase behaviors. On the other hand, lower ratings of positive attitudes can still result
in increased purchase intentions and beliefs. The explanation for this counterintuitive finding lies in the automatic affect-suppression processes discovered in Study 1.
The results in this paper revealed differentiated underlying emotional processes dur-
ing online purchase decisions. Product decisions that normally use scent as one of
the main attributes in driving purchase decisions are constrained by the online envi-
ronment, as in the case of e-commerce or online ads. Our findings suggest individual differences in sensitivity to smell play a crucial role in purchase decisions related
to scent-related products. Further, strongly correlated with olfactory sensitivity is
the effectiveness of performing olfactory imagery. These differential effects based
on individual difference factors investigated in this paper have ramifications for
managerial decisions and strategy planning. In particular, understanding consumer
responses to online advertising and sensory related information has implications
for organizational branding and online advertising decisions. Customer relation-
ship management and marketing communication efforts should consider (and/or
reconsider) these individual difference elements when communicating with their
consumers.
Normal individuals (vs. sensitive) in general are attracted to scented products
and are less concerned about the “side effects” scented products might have on
individuals sensitive to smell. Further, they rate the ad, product and brand signifi-
cantly higher when a scent-related brand (vs. non-scent-related brand) was used.
This was replicated in both the home fragrance and food item ads. In the case of product performance, normal individuals believed the effectiveness of the home fragrance was enhanced when a scent-related brand name (vs. a non-scent-related one) was used. As evidenced by the ERP study (Study 1), pleasant olfactory words elicited an attenuated emotional response in sensitive individuals. These findings are supported by other studies investigating automatic emotional suppression responses from sensitive individuals as a reaction to reduce unpleasant affect (Lin et al. 2017). Gaining these nuanced understandings of the decision making processes involved in consumers' online shopping experiences and attitudes can enhance the quality of decisions made at the organizational level.
Suppression of affect demonstrated in the ERP study (Study 1) is consistent with
findings from online brand attitudes in Study 2. Our combined findings suggest an
inhibited processing of emotions in individuals sensitive to smell when olfactory
words were presented. This reaction can be viewed as a form of protective mecha-
nism for individuals who have strong memory associations with scent from accu-
mulating experiences in the past. By considering individual difference factors,
implicit emotional reactions to scent were demonstrated. Future research should
consider generalizing these findings in other areas of individual differences, includ-
ing personality research and individual differences in other sensory perceptions.
Additionally, the two studies suggest that the mind (emotional reactions) develops
an automatic emotional suppression mechanism for regulating negative associa-
tions, so that the body (behavior) can perform normally and make cognition-driven
decisions. The balancing mechanism between emotions and cognitions can occur
implicitly and automatically. These findings open doors for future research on emotional regulation and other emotional intelligence-related research streams.
Results from our paper provide insight into online purchase decisions and behaviors relevant to product purchases that are often associated with scent attributes. The paper also demonstrates the advantages of utilizing mixed methods. Fundamental mechanisms underlying the differential effects demonstrated through behavioral experiments and surveys were explained and supported through empirical data utilizing neuroscience methods. Through the combined use of these methods, we investigated the role of valence and of automatic processes during the passive viewing/reading of words and images presented in online advertisements. In particular, ERP findings provide implicit and near real-time data on the emotional processing of scent-related information presented in visual format, which explains the inconsistency between self-reported attitudes and behaviors observed in the behavioral study.
Further work on the interplay between multisensory affective input, attitudes, and behavior is warranted. Clearly, online decision making processes and purchase behaviors may diverge from traditional decisions and behaviors taking place in brick-and-mortar stores. Yet, as this paper shows, the role of scent and the influence of individual differences in sensitivity to scent in online purchase forums remain salient.
One limitation of the paper is that we only included one end of the olfactory sensitivity spectrum in our investigation. The purpose of the study was to understand vulnerable consumers, namely sensitive individuals, and their cognitive and emotional processing of online information. Future studies should consider investigating individuals who fall on the other end of the spectrum, those who have a decreased sense of smell. Previous studies have found that hyposmics reported lower levels of quality of life and lacked enjoyment of many daily consumption activities such as dining in restaurants with friends (Miwa et al. 2001). Others have investigated the full spectrum and discovered that the sense of smell is often viewed as part of the consumer's identity, with many implications for marketers and businesses (Cross et al. 2015). Individuals who deviate from societal norms and expectations about consumption in the marketplace have often been neglected and marginalized by society (Lin et al. 2014). Further, the use of a self-reported measure to screen and recruit individuals for the specific olfactory categories has its disadvantages. However, a separate study validates the scale and supports its effective use (Lin et al. 2017).
Biographies
References
Aron EN (1998) The highly sensitive person: how to thrive when the world overwhelms you. Three
Rivers Press, New York
Bensafi M, Porter J, Pouliot S, Mainland J, Johnson B, Zelano C, Young N et al (2003) Olfactomotor
activity during imagery mimics that during perception. Nat Neurosci 6(11):1142–1144
Bensafi M, Sobel N, Khan RM (2007) Hedonic-specific activity in Piriform cortex during odor
imagery mimics that during odor perception. J Neurophysiol 98(6):3254–3262
Bone PF, Ellen PS (1999) Scents in the marketplace: explaining a fraction of olfaction. J Retail
75(2):243–262
Bosmans A (2006) Scents and sensibility: when do (in) congruent ambient scents influence prod-
uct evaluations? J Mark 70(3):32–43
Cacioppo JT, Berntson GG (1994) Relationship between attitudes and evaluative space: a critical
review, with emphasis on the separability of positive and negative substrates. Psychol Bull
115(3):401
Chebat J-C, Michon R (2003) Impact of ambient odors on mall shoppers’ emotions, Cognition,
and spending: a test of competitive causal theories. J Bus Res 56(7):529–539
Chebat J-C, Morrin R, Chebat D-R (2009) Does age attenuate the impact of pleasant ambient scent
on consumer response? Environ Behav 41(2):258–267
Costafreda SG, Brammer MJ, David AS, Fu CH (2008) Predictors of amygdala activation during
the processing of emotional stimuli: a meta-analysis of 385 PET and Fmri studies. Brain Res
Rev 58(1):57–70
Cross SNN, Lin M-H, Childers TL (2015) Sensory identity: the impact of olfaction on consump-
tion. In: Belk R, Murray J, Thyroff A (eds) Research in consumer behavior series. Emerald,
Bradford
Cunningham WA, Espinet SD, DeYoung CD, Zelazo PD (2005) Attitudes to the right-and left:
frontal ERP asymmetries associated with stimulus valence and processing goals. NeuroImage
28(4):827–834
Djordjevic J, Zatorre RJ, Petrides M, Boyle JA, Jones-Gotman M (2005) Functional neuroimaging
of odor imagery. NeuroImage 24(3):791–801
Doty RL, Shaman P, Dann M (1984) Development of the University of Pennsylvania Smell
Identification Test: a standardized microencapsulated test of olfactory function. Physiol Behav
32(3):489–502
Fishbein M, Ajzen I (1975) Belief, attitude, intention and behavior: an introduction to theory and
research. Addison-Wesley, Reading, MA
Gilbert AN, Crouch M, Kemp SE (1998) Olfactory and visual mental imagery. J Ment Imag
22:137–146
Goldkuhl L, Styvén M (2007) Sensing the scent of service success. Eur J Mark 41(11/12):1297–1305
González J, Barros-Loscertales A, Pulvermüller F, Meseguer V, Sanjuán A, Belloch V, Ávila C
(2006) Reading cinnamon activates olfactory brain regions. NeuroImage 32(2):906–912
Hajcak G, Nieuwenhuis S (2006) Reappraisal modulates the electrocortical response to unpleasant
pictures. Cogn Affect Behav Neurosci 6(4):291–297
Herz RS (1997) Emotion experienced during encoding enhances odor retrieval cue effectiveness.
Am J Psychol 110:489–506
Herz R (2007) The scent of desire. William Morrow, New York
Herz RS, Schankler C, Beland S (2004) Olfaction, emotion and associative learning: effects on
motivated behavior. Motiv Emot 28(4):363–383
Holland RW, Hendriks M, Aarts H (2005) Smells like clean spirit nonconscious effects of scent on
cognition and behavior. Psychol Sci 16(9):689–693
Ito TA, Larsen JT, Smith NK, Cacioppo JT (1998) Negative information weighs more heavily on
the brain: the negativity bias in evaluative categorizations. J Pers Soc Psychol 75(4):887
Kim RS, Seitz AR, Shams L (2008) Benefits of stimulus congruency for multisensory facilitation
of visual learning. PLoS One 3(1):1532
Koole SL, Kuhl J, Shah J, Gardner W (2008) Dealing with unwanted feelings. In: Handbook of
motivation science. Guilford Press, New York, pp 295–307
Krishna A (2010) An integrative review of sensory marketing: engaging the senses to affect per-
ception judgment and behavior. J Consum Psychol 22(3):332–351
Krishna A, Elder R, Caldara C (2010) Feminine to smell but masculine to touch? Multisensory
congruence and its effect on the aesthetic experience. J Consum Psychol 20:410–418
Lang PJ, Bradley MM, Cuthbert BN (1998) Emotion, motivation, and anxiety: brain mechanisms
and psychophysiology. Biol Psychiatry 44(12):1248–1263
Lin MH, Cross SNN, Childers TL (2014) Two ends of the olfactory sensitivity continuum: too
much and too little. In: Proceedings for the American marketing association winter conference
Lin MH, Cross SNN, Childers TL (2017) Sensitive to the servicescape: the impact of individual
differences in sense of smell in response to ambient scent. Working paper
Marks DF (1973) Visual imagery differences in the recall of Pictures. Br J Psychol 64(1):17–24
Mitchell DJ, Kahn BE, Knasko SC (1995) There’s something in the air: effects of congruent or
incongruent ambient odor on consumer decision making. J Consum Res:229–238
Miwa T, Furukawa M, Tsukatani T, Costanzo RM, DiNardo LJ, Reiter ER (2001) Impact of
olfactory impairment on quality of life and disability. Arch Otolaryngol Head Neck Surg
127(5):497–503
Morrin M, Ratneshwar S (2003) Does it make sense to use scents to enhance brand memory?
J Mark Res 40(1):10–25
Royet J-P, Plailly J, Delon-Martin C, Kareken DA, Segebarth C (2003) fMRI of emotional
responses to odors: influence of hedonic valence and judgment, handedness, and gender.
NeuroImage 20(2):713–728
Schupp HT, Markus J, Weike AI, Hamm AO (2003) Emotional facilitation of sensory processing
in the visual cortex. Psychol Sci 14(1):7–13
Schupp HT, Flaisch T, Stockburger J, Junghöfer J (2006) Emotion and attention: event-related
brain potential studies. Prog Brain Res 156:31–51
Ship JA, Weiffenbach JM (1993) Age, gender, medical treatment, and medication effects on smell
identification. J Gerontol 48(1):26–32
Statista (2015). http://www.statista.com/statistics/272391/us-retail-e-commerce-sales-forecast/
Stevenson RJ, Case TI (2005) Olfactory imagery: a review. Psychon Bull Rev 12(2):244–264
Weinberg A, Ferri J, Hajcak G (2013) Interactions between attention and emotion. In: Handbook
of cognition and emotion. Guilford Press, New York, pp 35–54
Chapter 15
Say It Right: IS Prototype to Enable
Evidence-Based Communication
Using Big Data
15.1 Introduction
15.2.1 Building Block 1: Backend Architecture with Big Data Analytics
In our backend architecture, we collect and store a large set of more than 14,000
corporate disclosures from stock-listed companies. For each company, we match the firm news with the corresponding stock price movements and abnormal returns following the release of a new financial disclosure. We then preprocess the financial dis-
closures into a machine-readable format (Manning and Schütze 1999). Subsequently,
we utilize Bayesian learning as a method from Big Data analytics (Hastie et al.
2013; Zou and Hastie 2005). Thereby, we extract decisive words that influence
investors as measured by the stock market reaction (Pröllochs et al. 2015). Finally,
we generate a dictionary containing all the decisive words extracted and assign each
word a positive or negative sentiment score.
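The chapter derives its dictionary with Bayesian learning (Pröllochs et al. 2015). As a simplified illustration of the same underlying idea, the sketch below regresses abnormal returns on word frequencies with a frequentist elastic net (Zou and Hastie 2005, also cited above) and keeps the non-zero coefficients as signed sentiment scores; the toy disclosures and return figures are placeholders, not data from the actual system.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import ElasticNet

# Toy corpus standing in for the ~14,000 preprocessed corporate disclosures.
disclosures = [
    "revenue growth exceeded expectations despite weak demand",
    "profit warning issued after litigation and impairment charges",
    "strong order intake and improved margins in the quarter",
    "restructuring costs and declining sales pressured earnings",
]
abnormal_returns = [0.021, -0.034, 0.018, -0.027]  # placeholder market reactions

# Bag-of-words representation of each disclosure.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(disclosures)

# Regularized regression of abnormal returns on word counts.
model = ElasticNet(alpha=0.001, l1_ratio=0.5)
model.fit(X, abnormal_returns)

# Decisive words: terms with non-zero coefficients; the sign serves as the sentiment score.
dictionary = {
    word: round(coef, 4)
    for word, coef in zip(vectorizer.get_feature_names_out(), model.coef_)
    if abs(coef) > 1e-6
}
print(dictionary)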
Fig. 15.2 User interface for evidence-based communication as an add-in for Microsoft Word
the readability. The analysis further highlights words in the text that investors per-
ceive negatively (red) and positively (blue). Our IS prototype supports users by
proposing alternative words as a replacement. In addition, a dashboard shows an
aggregated review that reports the overall sentiment and readability. The dashboard
also displays the sentiment and readability of each sentence to simplify the identifi-
cation of areas for improvement.
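A minimal sketch of such a per-sentence review might look as follows. The sentence splitting, the additive sentiment aggregation, and the crude words-per-sentence readability proxy are all assumptions for illustration; the prototype's actual scoring and its Microsoft Word integration are not reproduced here.

import re

# Assumed dictionary of decisive words with signed sentiment scores (see the sketch above).
DICTIONARY = {"growth": 0.8, "strong": 0.6, "warning": -0.9, "litigation": -0.7}

def review(text: str):
    """Score each sentence for sentiment and a crude readability proxy (words per sentence)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    report = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        sentiment = sum(DICTIONARY.get(w, 0.0) for w in words)
        report.append({"sentence": sentence,
                       "sentiment": round(sentiment, 2),
                       "words": len(words)})   # shorter sentences read more fluently (proxy only)
    overall = round(sum(r["sentiment"] for r in report), 2)
    return overall, report

overall, per_sentence = review(
    "Strong growth lifted margins. A profit warning followed the litigation."
)
print("overall sentiment:", overall)
for row in per_sentence:
    print(row)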
15.3 Conclusion
15.4 Biographies
Simon Alfano is a Ph.D. student in the Finance Research Group at the Chair of
Information Systems Research at Freiburg University’s Department of Economics.
In his research at the intersection of behavioral economics and data science, Simon
studies how investors process textual information. As a member of the Finance Research Group, Simon also supports the start-up TonalityTech, which provides corporate communications teams with access to knowledge on how investors process qualitative aspects of financial disclosures. Prior to his Ph.D. studies, Simon worked as a
management consultant.
https://www.is.uni-freiburg.de/mitarbeiter-en/team/simon-alfano?set_
language=en
Dirk Neumann is Full Professor with the Chair of Information Systems of the
University of Freiburg, Germany. His research topics include Business Analytics,
Text Mining and Cloud Computing. He studied information systems in Giessen
(Diploma), Economics in Milwaukee, WI, USA (Master) and received a Ph.D. from
Karlsruhe Institute of Technology (KIT) in 2004. He has (co-)authored many
research publications in the European Journal of Operational Research, ACM Transactions on Internet Technology, the Journal of Management Information Systems, and Decision Support Systems.
https://www.is.uni-freiburg.de/mitarbeiter-en/team/dirk-neumann
References
Antweiler W, Frank MZ (2004) Is all that talk just noise? The information content of internet stock
message boards. J Financ 59(3):1259–1294
Cornelissen J (2014) Corporate communication: a guide to theory & practice. Sage, London
Hastie TJ, Tibshirani RJ, Friedman JH (2013) The elements of statistical learning: data mining,
inference, and prediction. Springer, New York
Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press,
Cambridge
Pröllochs N, Feuerriegel S, Neumann D (2015) Generating domain-specific dictionaries using
bayesian learning. In: 23rd European conference on information systems (ECIS 2015),
Münster, Germany
Rennekamp K (2012) Processing fluency and investors’ reactions to disclosure readability.
J Account Res 50(5):1319–1354
Schumaker RP, Chen H (2009) Textual analysis of stock market prediction using breaking financial
news. ACM Trans Inf Syst 27(2):1–19
Tan HT, Ying Wang E, Zhuo BO (2014) When the use of positive language backfires: the joint
effect of tone, readability, and investor sophistication on earnings judgments. J Account Res
52(1):273–302
Tetlock PC (2007) Giving content to investor sentiment: the role of media in the stock market.
J Financ 62(3):1139–1168
Tetlock PC, Saar-Tsechansky M, Macskassy S (2008) More than words: quantifying language to
measure firms’ fundamentals. J Financ 63(3):1437–1467
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J Royal Statist
Soc B 67(2):301–320
Chapter 16
Introduction: Pedagogy in Analytics
and Data Science
Abstract Keeping with the “Exploring the Information Frontier” theme of the
ICIS 2015 conference, the Pre-ICIS Business Analytics Congress workshop sought
forward-thinking research in the areas of data science, business intelligence, analyt-
ics and decision support with a special focus on the state of business analytics from
the perspectives of organizations, faculty, and students. The teaching track aimed to
promote comprehensive research or research-in-progress in teaching and learning
addressing topics including business analytics curriculum development, pedagogi-
cal innovation, organizational case studies, tutorial exercises, and the use of analyt-
ics software in the classroom. This work has been summarized in this chapter.
16.1 Introduction
Emerging technologies in business intelligence and social media are fueling a need
for innovative curricula in online, traditional and hybrid delivery formats that meet
the industry needs. In keeping with the BA Congress theme of “Exploring the
Analytics Frontier”, we sought pedagogical research contributions, teaching materi-
als, and pedagogical practices/cases that address acquisition, application, and con-
tinued development of the knowledge and skills required in the usage of business
analytics in the classroom, with emphasis on business intelligence, social media analytics, big data analytics, high performance analytics, data science, visualization, and other emerging analytic technologies.
N. Evangelopoulos (*)
University of North Texas, 365D Business Leadership Building, 1307 West Highland Street, Denton, TX 76201, USA
e-mail: [email protected]
J.W. Clark
University of Maine, DP Corbett Business Building, Rm. 315, Orono, ME 04469, USA
e-mail: [email protected]
S. Balkan
Portland State University, Fourth Avenue Bldg, 1900, 1900 SW Harrison St., Portland, OR 97201, USA
e-mail: [email protected]
With the explosion of data, the demand for business intelligence and analytics is
increasing at a faster rate now than ever before. The measurable value from data is
created only after its interpretation and implementation in business processes. It is becoming mainstream for large and small businesses alike to use data-driven decision making. This has created a high demand for data scientists and business analysts.
According to a 2015 MIT Sloan Management Review study, 40% of the companies
surveyed were struggling to find and retain data analytics talent (Ransbotham et al. 2015). International Data Corporation (IDC) predicts a need by 2018 for 181,000 people with deep analytical skills, and for five times that number in jobs requiring data management and interpretation skills (Deloitte 2016).
In an effort to close the big talent gap, top business schools are adjusting their
curricula to incorporate state of the art tools and techniques in the fields of business
intelligence and analytics with the objective of training students to meet demand.
The BA Congress Teaching track brought together a community of scholars who
have developed cutting edge curriculum in the areas of pedagogical innovation, orga-
nizational case studies, tutorial exercises, and software. BA Congress received several contributions, ranging from surveys of fields and data types to analytics software tutorials and case studies. In this chapter, a small sample of these is presented.
In this section, we briefly introduce four papers from the teaching track of the BAC
2015 workshop.
In the first paper, titled Tools for Academic Business Intelligence & Analytics
Teaching—Results of an Evaluation, Kollwitz et al. (2017) survey the field of tools
for business intelligence and analytics, systematically evaluating them for class-
room use. Dividing the field into five subdomains that correspond to popular skill
profiles—Big Data, text, web, network, and mobile analytics—they identify and
compare state-of-the-art tools in each major area. Taking into account the practical
importance of free licenses for academic use, the availability of documentation and
training materials for each piece of software, and compatibility with Windows, Mac
OS, and Linux platforms, their disciplined research should provide excellent recom-
mendations to instructors building a curriculum for analytics teaching.
Business analytics tools come in a wide range of algorithmic complexity, plat-
form availability, ease of use, and application domains. As educators prepare to
introduce their students to these tools, brief tutorials can come in handy. The second
paper, titled Neural Net Tutorial (Huguenard and Ballou 2017), provides such a
tutorial for artificial neural networks. Its step-by-step directions guide students to
build a working neural net and train it with a sample data set to make predictions
about horse racing outcomes. Prefaced with a brief but accessible introduction to
the concept of machine learning generally and neural networks specifically, the
tutorial could stand on its own in an undergraduate or graduate analytics course.
Biographies
Joseph W. Clark was born and raised in Maine, then went away for higher educa-
tion, earning his B.A. and Ph.D. from USC and his M.B.A. from Tulane. He was one
of the first generation of web developers during the dot-com boom of 1997–2001. As
an academic, he has held appointments at China Agricultural University, the
Sule Balkan received her Ph.D. in Economics from the University of Arizona Eller College of Management in 1998. She recently moved to Portland, Oregon, after living in Taiwan for 4 years, where she was the Director of the Big Data Certificate Program and an Associate Professor at the Institute of Business and Management of National Chiao Tung University. Her research and teaching interests include business intelligence topics such as predictive modeling, advanced analytics, information-driven campaign management, and Big Data. Before moving to Taiwan, Sule worked as a Clinical Associate Professor of Information Systems in the W.P. Carey School of Business, Arizona State University, for 4 years. She has more than 10 years of professional experience in information management, predictive modeling and analytics, and campaign execution. She worked as Director of Information Management at Ameriprise Financial prior to joining academia. She currently works in the Department of Engineering and Technology Management at Portland State University.
References
Deloitte (2016) Analytics trends: the next evolution. Downloaded on October 31, 2016, from
http://www.deloitte.com/us/AnalyticsTrends
Dunaway MM (2017) An examination of ERP learning outcomes: a text mining approach. In:
Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics and data science: advances in research
and pedagogy. Springer annals of information systems series (http://www.springer.com/
series/7573), Vol. 21, 2016–2017
Huguenard BR, Ballou DJ (2017) Neural net tutorial. In: Deokar A, Gupta A, Iyer L, Jones MC
(eds) Analytics and data science: advances in research and pedagogy. Springer annals of infor-
mation systems series (http://www.springer.com/series/7573), Vol. 21, 2016–2017
Kollwitz C, Dinter B, Krawatzeck R (2017) Tools for academic business intelligence & analytics
teaching—results of an evaluation. In: Deokar A, Gupta A, Iyer L, Jones MC (eds) Analytics
and data science: advances in research and pedagogy. Springer annals of information systems
series (http://www.springer.com/series/7573), Vol. 21, 2016–2017
Ransbotham S, Kiron D, Prentice P (2015) Minding the analytics gap. MIT Sloan Manag Rev
56(3):63–68
Schuff D (2017) Data science for all: a university-wide course in data literacy. In: Deokar A, Gupta
A, Iyer L, Jones MC (eds) Analytics and data science: advances in research and pedagogy.
Springer annals of information systems series (http://www.springer.com/series/7573), Vol. 21,
2016–2017
Chapter 17
Tools for Academic Business Intelligence
and Analytics Teaching: Results
of an Evaluation
C. Kollwitz, B. Dinter, and R. Krawatzeck
Abstract The trend towards big data and business intelligence & analytics (BI&A) continues. In the coming years, the economy will create thousands of new jobs for data scientists, and there is therefore a need for well-educated graduates with deep analytical skills. In order to prepare students for their later profession and to teach them analytics tools relevant to practice, corresponding academic education is required. The BI&A sub-domains and tool categories (such as text mining and web analytics) correspond to popular skill profiles. Since the market for BI&A tools is very large and hence hard to survey, this paper identifies and evaluates a number of tools for each BI&A sub-domain. The tools are evaluated with regard to university-specific requirements (such as expenses and available learning resources) and BI&A category-specific requirements (such as functionality). Based on the evaluation results, recommendations are given for each tool category.
Keywords Business intelligence & analytics • BI&A • Big data • Higher education
• Teaching • Tool evaluation
17.1 Introduction
The demand for professionals with a deep analytical understanding is high, and it might even increase over the next years (Jacobi et al. 2014; Manyika et al. 2011). Furthermore, the interdisciplinary and rapid development of big data technologies creates additional new challenges for education in this field (Jacobi et al. 2014). Al-Sakran (2015) has already indicated a lack of specific knowledge about business intelligence & analytics (BI&A) tools among potential data scientists.
A first approach to overcome this issue from the academic side is the design of more practical and tool-oriented curricula. However, universities face the problem that a wide range of potential vendors and tools (Al-Sakran 2015) could be considered within such curricula. Consequently, faculty should be supported when evaluating and selecting BI&A tools for teaching purposes. Therefore, the main research question of this contribution is: Which BI&A tools are most suitable for practical use in academic education?
Many overviews and evaluations of BI&A tools have been published in the past (e.g., Combe et al. 2010; Davis and Woratschek 2015; for a full list cf. Wang 2015). However, these contributions usually take a broad and rather generic look at BI&A and/or do not take aspects specific to academic education (such as expenses) into account. Since the broad field of BI&A can be further differentiated into various sub-domains (e.g., data analytics, streaming analytics, text analytics, etc.), as proposed, for example, by Chen et al. (2012), there is a need to train students in various tool categories in order to prepare them adequately for practice. Various other requirements—besides the academic-specific ones—related to the characteristics of each BI&A sub-domain need to be identified and considered. Therefore, the paper at hand gives a short overview of tools which are suitable for different BI&A sub-domains and for academic education. To provide further guidance for faculty and academic staff, we additionally propose one specific tool for each BI&A sub-domain. These suggestions are the result of the sub-domain-specific tool evaluations.
The remainder of the paper is organized as follows. The following section, “Theoretical Foundations”, further motivates the need for practical, tool-oriented lessons within academic education and introduces the BI&A framework by Chen et al. (2012), which is used to define the various BI&A sub-domains. Next, we describe the research methodology used for the tool evaluations and deduce the university-specific requirements. Subsequently, we present, for each identified BI&A sub-domain separately, the specific tool requirements, an overview of possible tools, the evaluation report, and—based on the evaluation results—a tool recommendation. Finally, we summarize our findings, discuss the limitations, and provide an outlook on future research.
17.2 Theoretical Foundations
The rise of big data results in an increasing demand for professionals who have in-depth analytical skills and knowledge (Manyika et al. 2011). However, the shortage of skilled data scientists has become a serious challenge for industry (Davenport and Patil 2012). Current curricula in the field of BI&A do not address all skills relevant in practice (Wang 2015). Al-Sakran (2015) showed that graduates in particular lack deep knowledge of technical tools. To overcome this issue, we suggest including more hands-on exercises in academic teaching, using tools which are suitable both for educational use and for potential later use in industry (i.e., covering common tool features required in real-world settings). In general, hands-on exercises are an integral part of BI&A curricula (Wixom et al. 2014), and the relevance of hands-on lessons for education is highlighted by various authors in the literature. Teaching information systems students the necessary skills for using BI&A tools is regarded as one of the key outcomes of academic education (Al-Sakran 2015; Wixom et al. 2014). Therefore, universities should cooperate with software vendors and provide state-of-the-art software in appropriate classes (Wixom et al. 2011). The data sets for practical hands-on analytical experience should come from industry (Schiller et al. 2015) or can be obtained from public sources (for a list of possible sources for data sets cf. Schiller et al. 2015, p. 820). However, since the BI&A tool market is very large and hard to survey, the final selection of specific state-of-the-art tools for teaching out of all possible solutions is challenging.
The subsequent evaluations of software solutions are based on the BI&A research framework introduced by Chen et al. (2012), who differentiate between five technical sub-domains in analytics research and have identified several foundational technologies and emerging research trends for each sub-domain. Foundational technologies encompass a variety of underlying mature technologies. Emerging research describes trends and developments and gives an overview of the current focus of research. We use this framework because it provides a comprehensive structuring of BI&A from which we deduce the different tool categories for our evaluations. Table 17.1 shows an excerpt of the framework (for a complete version see Chen et al. 2012).
Since we evaluate different analytics tools for each sub-domain in the remainder of the paper, a brief overview of the sub-domains is presented in the following.
First, Chen et al. (2012) identified big data analytics as a sub-domain. A commonly used definition of big data, proposed by Gartner (Laney 2001), includes the “three Vs” (Volume, Velocity, and Variety). Within this sub-domain, there is a variety of research areas dealing with different kinds of (structured) data and with a broad range of advanced analytical techniques (Chen et al. 2012) requiring high computing power. Within the big data analytics sub-domain, a wide range of tools addresses different characteristics of big data. Since the big data characteristic “Variety” is already covered by the other BI&A sub-domains (e.g., text analytics uses text data, web analytics uses web data, etc.) and the characteristic “Volume” mainly affects the underlying big data architectures (e.g., using distributed systems for high-volume data computing), we focus on the remaining characteristic, “Velocity”. Since high data generation and processing speed (i.e., velocity) is best addressed by streaming analytics, we consequently focus our tool evaluation for the sub-domain “big data analytics” on streaming analytics tools.
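Streaming analytics processes events as they arrive rather than after they have been stored. Purely as an illustration, and independent of the products evaluated below, the following minimal Python sketch computes a sliding-window average over a hypothetical stream of sensor readings; the values are invented, and the sliding window stands in for the kind of predefined operator the evaluated tools provide as building blocks.

```python
from collections import deque

def sliding_average(stream, window_size=5):
    """Yield the running average over the last `window_size` events."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)          # the oldest event drops out automatically
        yield sum(window) / len(window)

# Hypothetical sensor readings arriving one event at a time.
events = [12.0, 15.5, 11.2, 14.8, 13.1, 16.4, 12.9]
for avg in sliding_average(events, window_size=3):
    print(round(avg, 2))
```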
In addition, Chen et al. (2012) considered text analytics as a sub-domain targeting the analysis of unstructured text data (such as e-mails, corporate documents, and web pages). Following Ghosh et al. (2012), we understand the term text analytics as a synonym for text mining and vice versa. Text analytics is based on information retrieval and computational linguistics, which are already widely applied in industry. For instance, text analytics can provide valuable information about customers from social networks through sentiment analysis. So far, seven application areas for text analytics have been identified: web mining, classification, clustering, natural language processing (NLP), concept extraction, information extraction (IE), and information retrieval (IR) (Ghosh et al. 2012).
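As a minimal illustration of the term counting that underlies information retrieval and simple sentiment scoring, and again independent of the tools evaluated below, the following Python sketch tokenizes two invented customer comments and reports their most frequent terms.

```python
import re
from collections import Counter

def term_frequencies(document: str) -> Counter:
    """Tokenize a document and count how often each term occurs."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(tokens)

# Toy customer comments; classification and sentiment analysis build on counts like these.
comments = [
    "Great service, the staff was very friendly.",
    "Terrible delivery, the package arrived late and damaged.",
]
for comment in comments:
    print(term_frequencies(comment).most_common(3))
```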
17.3 Methodology
In the structuring step (S2), we will derive requirements from our overall goals. While the university-specific requirements (UR, see below), related to the first goal (G1), are the same for all tool categories, the tool-specific requirements (TR), related to the second goal (G2), essentially depend on each sub-domain. Therefore, we will identify relevant criteria for each BI&A sub-domain separately. The third step (S3) corresponds to our actual evaluations, in which we assess and compare the tools against the set requirements (both URs and TRs). The evaluation results can be found in Tables 17.3, 17.4, 17.5, 17.6, and 17.7. In the last step (S4), we will select one tool per sub-domain that best covers the requirements from both perspectives. Based on the evaluation reports we justify our recommendations.
For the information gathering, we will mainly use documentation, publications, and web pages provided by the tool vendors, as well as the BI&A literature.
The education sector is under intense cost pressure and often subject to budget cuts. Hence, universities usually have a limited budget for software, hardware, and their maintenance. One option to address this challenge is the use of free learning resources and tools offered by academic programs (such as the Teradata University Network [TUN], the SAS Global Academic Program, and the IBM Academic Initiative; for a full list see Table 17.2).
Due to the support by academic programs and the option to use open source solutions, we consider free-of-charge usage as one of the crucial requirements in the evaluation process (UR1). We use the following rating to evaluate the criterion UR1:
The tool is not free of charge for academic use.
A downgraded version of the tool is free of charge for academic use.
The tool is free of charge for academic use.
In addition, comprehensive documentation constitutes another key requirement, as most faculty and students will get familiar with the tool on their own and through “learning by doing” (UR2). Moreover, tool maintenance will be considerably easier if good documentation is available. Many vendors also offer a variety of further information sources (such as frequently asked questions sections [FAQs] and tutorials). Tool-related information can be provided either officially by the vendors or unofficially by user communities, as is often the case for open source tools. Consequently, our rating for the criterion UR2 is as follows:
The vendor provides a basic documentation only.
The vendor provides a comprehensive documentation and additional information
sources (like FAQs, tutorials, etc.).
The vendor provides a comprehensive documentation and additional information
sources. In addition, the vendor provides the option to share information with
other users (e.g., via user groups, internet forums, etc.).
According to Wixom et al. (2011) there is a need for a wide range of free learning resources for students and for faculty, since on the one hand they enable students to improve their practical skills through self-instruction and on the other hand they provide additional course content for lecturers. Hence, the availability of free learning resources and teaching materials is represented in our evaluation of the BI&A tools as a university-specific requirement (UR3). We rate the criterion as follows:
The vendor does not provide learning resources.
The vendor provides either learning resources for students or teaching material for faculty.
The vendor provides both learning resources for students and teaching material for faculty.
Finally, the tools should be platform independent or at least available for the major operating systems (i.e., Windows, Linux, and Mac OS). Some tools are web-based and therefore independent of the operating system. The degree of platform independence constitutes the fourth requirement (UR4), which is assessed as follows:
The tool runs only on one of the major operating systems.
The tool runs on two of the three major operating systems.
The tool runs on all major operating systems, namely Windows, Linux, and Mac OS, or is in fact platform-independent.
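The chapter reports these ratings qualitatively in Tables 17.3, 17.4, 17.5, 17.6, and 17.7. Purely as an illustration of how the three-level scales for UR1 through UR4 could be encoded and aggregated when comparing candidates, the following minimal Python sketch uses placeholder tool names and invented ratings; it is not part of the authors' evaluation and the simple unweighted sum is only one possible aggregation.

```python
# Illustrative encoding of the four university-specific requirements (UR1-UR4).
# Tool names and ratings below are placeholders, not evaluation results.
RATING = {"low": 0, "medium": 1, "high": 2}

candidate_tools = {
    "tool_a": {"UR1": "high", "UR2": "medium", "UR3": "high", "UR4": "low"},
    "tool_b": {"UR1": "medium", "UR2": "high", "UR3": "medium", "UR4": "high"},
}

def total_score(ratings: dict) -> int:
    """Sum the ordinal ratings; a weighted sum could be used instead."""
    return sum(RATING[level] for level in ratings.values())

# Rank the candidates by their aggregate score.
for tool, ratings in sorted(candidate_tools.items(),
                            key=lambda item: total_score(item[1]),
                            reverse=True):
    print(tool, total_score(ratings))
```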
The rating of accepted data sources draws on Gualtieri and Curran (2014), who evaluate the supported data sources of streaming analytics tools on a scale from 0 (weak support) to 5 (strong support). Table 17.3 presents the evaluation report for streaming analytics tools, covering all URs and the big data analytics-specific requirements, which are:
TR-BDA-1 Predefined streaming operators and
TR-BDA-2 Accepted data sources.
All vendors offer special programs for academic relations (for a full list see Table 17.2). IBM, Software AG, and TIBCO feature their streaming analytics tools in these programs and offer free-of-charge versions for academic use. All tools are documented in detail. Additional information sources such as FAQs, tutorials, and user guides are provided, as well as community platforms, which are not tool-specific but nevertheless allow the exchange of ideas and experiences with other users. As part of the IBM Academic Initiative, free learning resources are available for InfoSphere Streams. For lecturers, workshops, a faculty guide, and various reports are provided. Students can use a range of video tutorials, white papers, and reports. TIBCO provides a very extensive online learning course as well as freely available reports, white papers, and data sheets. Other material is available via the StreamBase University program. Software AG offers a variety of benefits for both faculty and students. The University Relations program provides an education package which is tailored to the Apama Streaming Analytics tool. For students it includes an online training course, case studies, and the opportunity for free certification. For faculty, teaching material, tutorials, and corresponding information are available. For the Sybase Event Stream Processor, few free learning resources could be found. Although the SAP University Alliances program offers a wide range of learning courses, case studies, and podcasts, no specific material for the Event Stream Processor tool is available. Some resources, such as a starter guide, can be accessed via the SAP InfoCenter.
All tools feature predesigned streaming operators that simplify their use. Regarding the data sources that can be used, the solutions from IBM and Software AG received the highest score, followed by the tool from SAP, with TIBCO receiving the lowest score. A minor disadvantage of InfoSphere Streams is the fact that it runs only on Linux-based operating systems.
We recommend the tool Apama Streaming Analytics from Software AG for use in academic education. Software AG offers the highest added value with regard to learning resources, since the material corresponds well with the tool. As an example, the education package for students includes case studies and a scenario lasting over a period of 12 weeks (90 min/week). In the other categories Apama Streaming Analytics also reaches the highest scores. The tool offers an intuitive graphical user interface and useful predesigned streaming operators. Moreover, it provides the option to develop custom streaming operators or interfaces and includes a component for data visualization. In summary, we see the Software AG solution as the best overall package for teaching a streaming analytics tool in higher education.
Adequate text analytics tools are considered another crucial competitive factor (Zikopoulos et al. 2012). Therefore, their usage in teaching should be aligned with the needs of business (Chiang et al. 2012). With this in mind, we have selected for evaluation three tools from leading vendors in the field of advanced business analytics (Herschel et al. 2014), namely IBM SPSS Modeler Premium, SAS Text Miner (within SAS Enterprise Miner), and RapidMiner Studio (cf. Table 17.4), which are significant in practice and widely used. None of the three solutions is a dedicated text analytics tool; rather, they are comprehensive analytics packages with specific extensions for text analytics. We focus on these extensions in our evaluation.
Text analytics covers a variety of application areas. As mentioned above, Miner et al. (2012) define seven areas which exhibit some overlaps. As the first tool-specific requirement we have selected the range of functionality covered by the tools (TR-TA-1). We rate this criterion by the degree to which the tools cover the seven areas.
Data sources for text analytics can be very heterogeneous. They consist of various text formats (e.g., Word, PDF, CSV, XML), online sources (HTML, RSS, e-mails), and databases (e.g., MySQL, MongoDB). A modern tool should be able to cover as many of these data sources as possible to provide a wide range of application areas. Accordingly, the second tool-specific requirement addresses the range of data sources supported by the text analytics tools (TR-TA-2). We rate the criterion as basic if a selection of common text formats is supported, and as advanced if database connections and/or web content are supported, too.
The results of text analytics can be visualized in different ways, for example by clustering diagrams or decision trees. State-of-the-art tools in this sub-domain should offer a wide range of visualization techniques without the need for additional software. Therefore, the third requirement is the capability to visualize the text analytics results (TR-TA-3). In summary, the following domain-specific requirements were assessed (cf. Table 17.4):
TR-TA-1 Range of functionality,
TR-TA-2 Data sources supported, and
TR-TA-3 Visualization techniques.
All vendors offer their text analytics solutions free of charge for academic use. SAS offers a free on-demand version of SAS Enterprise Miner, which also includes SAS Text Miner, via the SAS Global Academic Program. IBM provides a free-of-charge version of SPSS Modeler Professional, including IBM SPSS Text Analytics for mining unstructured data sources, via the IBM SPSS Mining in Academia program. Another solution offered by IBM is the SPSS Text Analytics for Surveys tool, which can be used for qualitative text analysis in surveys. RapidMiner provides a generally free version of RapidMiner Studio. In addition, a professional version is free of charge for members of the RapidMiner Academia program. In contrast to the general free version, the professional version provides a wider range of extensions (e.g., for Hadoop), more supported data sources, and higher computing capacity. The tools are extensively documented and offer various teaching materials for students and faculty. SAS provides comprehensive documentation including tutorials, user guides, and fact sheets. Furthermore, there is an internet forum which addresses in particular text and content analytics topics. As part of the Global Academic Program, SAS provides various learning resources. Teaching material for faculty is available, which can additionally be accessed online via the SAS Live Web Classroom.
There are special certification programs and tutorials for students as well as a variety of white papers and reports. Furthermore, SAS is a member of the TUN, which is one of the largest providers of learning resources in the BI&A community (Wang 2015). Via TUN a variety of additional free learning resources can be accessed. Besides the general Academic Initiative, IBM has created a special program for data and text mining, the Mining in Academia program. The detailed tool documentation (including an internet forum) is complemented by a series of free learning resources. On the one hand, IBM provides teaching material for faculty, both for the tool specifically and for text analytics in general; on the other hand, students benefit from various resources ranging from white papers and reports to video courses, a special support forum for students, and diverse certification options. RapidMiner provides a manual, a starter guide, and tutorials as documentation. The open source tool has an active community, including an internet forum and a marketplace for extensions and plug-ins. Furthermore, it offers a couple of free learning resources for students, namely white papers, webinars, and reports. For faculty, free certifications and a repository with sample data are available.
Among the evaluated tools, no significant differences could be found with regard to the TRs. All solutions cover a broad range of text analytics application areas, including all areas identified by Miner et al. (2012). The accepted data formats are manifold for all tools and comprise common text formats, web data, and database interfaces. In order to analyze data, SAS Enterprise Miner and SPSS Modeler need to transform it into a proprietary format. In contrast, no such preprocessing is necessary when using RapidMiner for data analysis. The graphical representation of analysis results is supported by all tools.
Overall, we see no clear winner in this sub-domain; the decision depends on the preferences of faculty. If an open source solution with an active community and a variety of extensions is preferred, RapidMiner seems to be an appropriate solution. For a cloud-based solution without installation effort and with a comprehensive online course offering, SAS Enterprise Miner seems to be the tool of choice. The in-house solution from IBM provides the advantage of an additional text analytics tool for surveys and offers a special student forum and other useful free learning resources. Therefore, SPSS Modeler can also be recommended. In summary, we consider all evaluated tools to be suitable solutions for teaching.
The market for web analytics software is large and hence hard to survey. For the evaluation we have chosen a combination of two tools from market-leading vendors in this field (Google Analytics and Quantcast Measure) and two open source solutions, namely PIWIK and Open Web Analytics (cf. Table 17.5). While the proprietary tools are web-based software-as-a-service (SaaS) solutions, the open source tools are in-house solutions, requiring a separate server to store the raw data.
There are different ways to track user activities on the web. In practice, a distinction between web log analysis and page tagging has been established (Nakatani and Chuang 2011). Page tagging, in most cases via JavaScript cookies or PHP, allows a more detailed look at user actions that are not tracked by web log analysis. However, such client-side data collection requires cookies, which must be approved by web site visitors; if the cookies are deleted or expire, the quality of the data collection decreases. Since web log analysis is based on stored data, historical log data can be analyzed. Moreover, it stores data in consistent log files on in-house servers. Both methods are widely applied in practice. We believe students should be taught both methods to get a broad view of web analytics. Therefore we consider the tracking methods as the first tool-specific requirement (TR-WA-1).
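To make the distinction concrete, the following minimal Python sketch parses a single, invented request line in the widely used Common Log Format; it only illustrates the kind of information web log analysis can recover from data a web server records anyway, without any page tagging, and is not tied to any of the evaluated tools.

```python
import re

# Common Log Format: host, identity, user, timestamp, request, status, bytes.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line: str):
    """Return the fields of one log line as a dictionary, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Invented example line; real analyses would iterate over the server's log file.
line = '192.0.2.1 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_line(line))
```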
The way the tracking data is stored in a database is another distinguishing criterion for web analytics tools. On the one hand, most in-house solutions require a separate database (e.g., MySQL), causing additional costs and effort for server deployment and maintenance. On the other hand, keeping the database in-house allows access to raw data and prevents legal problems with data privacy. By means of the second tool-specific requirement (TR-WA-2), we note whether the database is located in-house or in the cloud and which data format is used for storage.
Finally, Nakatani and Chuang (2011) point out that web analytics tools can be distinguished with regard to their capability to provide data analysis in real time. We choose this aspect as the third tool-specific requirement (TR-WA-3) and check whether the tools are able to collect and analyze data in near real time or not. This leads to the following three requirements specific to web analytics tools (cf. Table 17.5):
TR-WA-1 Tracking methods,
TR-WA-2 Data storage, and
TR-WA-3 Real-time functionality.
As stated in Table 17.5, all tools are free of charge. Furthermore, PIWIK and Google Analytics offer an enterprise version as a fee-based SaaS solution and a premium version for extremely high data volumes, respectively. All tools provide detailed documentation in different forms, such as guides, FAQs, and/or wikis. In addition, all tools except Quantcast Measure offer community functionalities in the form of internet forums; Open Web Analytics also offers a user chat via Internet Relay Chat (IRC). All tools come with some free learning resources such as videos and tutorials, whereas only Google Analytics provides a comprehensive selection. In addition to some tutorials and an extensive video library on YouTube, Google offers a variety of online training courses, the so-called Analytics Academy, including practical exercises and a learning community.
Table 17.5 Evaluation report for web analytics tools (columns: Website; Latest version; License; Platforms; Tracking methods; Data storage)
Google Analytics: http://www.google.com/intl/en/analytics; N.A.; Proprietary; Web-based; Page tagging (via JavaScript); Cloud (proprietary)
Open Web Analytics: http://www.openwebanalytics.com; 1.5.7; GNU General Public License (GPL); Linux, Windows; Page tagging (via JavaScript, PHP); In-house (MySQL)
PIWIK: http://piwik.org; 2.14.3; GNU General Public License (GPL); Windows, Linux, Mac OS; Log files, page tagging (via JavaScript, PHP); In-house (MySQL)
Quantcast Measure: https://www.quantcast.com; N.A.; Proprietary; Web-based; Page tagging (via JavaScript); Cloud (proprietary)
With regard to the tracking method, the tools from Google and Quantcast rely on page tagging via JavaScript. Open Web Analytics additionally provides page tagging via PHP, while PIWIK allows both page tagging (via JavaScript or PHP) and log data analysis. Google Analytics and Quantcast Measure rely on a cloud-based database with proprietary data formats, while the open source tools require a separate MySQL database. All tools exhibit the capability for real-time data collection and analysis.
We consider PIWIK a suitable solution for use in academic education. The main advantage of PIWIK is its option to use both data collection methods, page tagging and log data analysis; it therefore offers a broader range of opportunities for teaching than the other tools. Also, in contrast to the proprietary solutions, it uses the popular open source database MySQL for data storage. Case studies and exercises with sample data to convey specific aspects of web analytics can therefore be conducted more conveniently, owing to the popularity of MySQL. Compared to Open Web Analytics, PIWIK exhibits a wider distribution, resulting in a larger community. PIWIK provides a forum where an active community gives assistance and suggests ideas for future development. In addition, various extensions can be downloaded in the form of third-party plug-ins via the PIWIK marketplace.
There is a large pool of potential network analysis tools for use in the academic context. To make a preliminary selection, we have aligned our evaluation with the criteria provided by Combe et al. (2010). Thus, only tools that are able to process networks with at least a five-digit number of nodes and that offer at least basic network analysis functionality are taken into account. In addition, we have checked whether the tools have already been used in an academic context. In consideration of these criteria we have identified the network analytics tools Gephi, Pajek, and UCINET (cf. Table 17.6).
As a tool-specific requirement, we consider a large capacity for processing network data as relevant. Especially in the big data context, it is necessary to analyze large network structures with hundreds of thousands of nodes. For this reason, the maximum number of nodes that can be processed constitutes the first tool-specific requirement (TR-NA-1).
Furthermore, there is great variance in the range of supported network analytics functions (TR-NA-2). For our evaluation we distinguish between basic functionality, which covers the fundamental metrics of network analysis, and advanced functionality, which includes additional algorithms and calculations.
Table 17.6 Evaluation report for network analytics tools (columns: Website; Latest version; License; Platforms; Maximum number of nodes; Range of functionality)
Gephi: http://gephi.github.io; 0.8.2; GNU General Public License (GPL); Windows, Linux, Mac OS; 150,000; Basic
Pajek: http://mrvar.fdv.uni-lj.si/pajek/; 4.05; Closed source; Windows (Linux, Mac OS via Wine); 500,000; Advanced
UCINET: https://sites.google.com/site/ucinetsoftware; 6.586; Closed source; Windows (Linux, Mac OS via Wine); 32,767; Advanced
The third tool-specific criterion indicates whether the software has integrated
visualization capabilities or if additional software is required for this purpose (TR-
NA-3). The following network analytics-specific TRs were assessed (cf. Table 17.6):
TR-NA-1 Maximum number of nodes,
TR-NA-2 Range of functionality, and
TR-NA-3 Visualization techniques.
For use in academic education, we recommend a combination of two tools. The free-of-charge open source tool Gephi provides extensive options for visualizing networks in real time, using a 3D engine (Bastian et al. 2009). Once the network has been visualized, it can be interactively modelled and comprehensively analyzed; it does not matter whether complex, dynamic, or hierarchical graphs are subject to analysis (Bastian et al. 2009). The program features a user-friendly graphical user interface and can process up to 150,000 nodes (Combe et al. 2010). In addition to detailed documentation (e.g., tutorials, FAQs, and a wiki), there is an internet forum for sharing experiences and use cases, and a user chat via IRC. As learning resources, Gephi offers various official and user-generated (video) tutorials and instructions, facilitating self-study for students. For faculty, the vendor provides various sample data sets which can be used for designing teaching exercises. Gephi is Java-based and available for Windows, Linux, and Mac OS. A disadvantage of Gephi is its limited functionality compared to other network analytics solutions. Since Gephi is a fairly new tool and still in the beta phase (latest version 0.8.2), advanced functionalities might become available in the future.
For advanced analysis, we recommend the free-of-charge software Pajek. The tool has been developed at the University of Ljubljana and has continuously been improved and supplemented with various functions since then. Further advantages are its solid performance and the high number of nodes that can be processed (up to 500,000) (Combe et al. 2010). The program includes comprehensive options for visualization, is well documented by the developers, and offers an extensive wiki. Among other things, the Pajek wiki provides a variety of free learning resources (e.g., course documents, sample data, and academic papers). Pajek is designed for Windows, but also works on Linux with the help of a virtual runtime environment (e.g., Wine). Due to its outdated design, Pajek's user interface takes some getting used to (Lombardi 2011). UCINET is not free of charge and cannot handle larger amounts of data. In addition, it requires a separate visualization tool, resulting in additional effort and possibly additional costs.
Overall, we propose Gephi as an entry-level solution, particularly suitable for self-study and for teaching the fundamentals of network analytics. A big advantage is the active community that has emerged around the tool. To get deeper insights into the topic and to perform advanced analyses, we recommend Pajek. The main advantage of Pajek results from its maturity: it has been continuously developed since 1998 and therefore constitutes a very robust tool with many extra resources. A complementary use of both solutions is also recommended, because the native Pajek file format can be imported into Gephi, allowing the design of teaching materials that cross tool boundaries.
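As an illustration of this interoperability, the following minimal Python sketch uses the third-party networkx package (an assumption on our part; it is not discussed in the chapter) to build a toy network and write it in Pajek's native .net format, a file that can be opened in Pajek for advanced analysis or imported into Gephi for interactive visualization.

```python
import networkx as nx

# Build a small toy network of four actors.
G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

# Write the graph in Pajek's native .net format for use in Pajek or Gephi.
nx.write_pajek(G, "toy_network.net")

# A fundamental network metric, computed directly in Python for comparison.
print(nx.degree_centrality(G))
```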
The market for mobile applications is constantly growing as more and more users access the internet via mobile devices. The newest of the five BI&A sub-domains has great similarities with the web analytics sub-domain, since most web analytics tools provide options to analyze data from mobile websites (e.g., Google Analytics). However, mobile analytics relates not only to the analysis of mobile websites but also to the analysis of smartphone and tablet apps. In addition, the mobile market poses specific requirements for analytics, which we present in the next paragraphs. For this reason, we follow Chen et al. (2012) and consider mobile analytics as a separate sub-domain. We evaluate four vendors that have specialized in mobile analytics, i.e., tools which provide mobile analytics only as part of a generic web analytics solution are excluded. The assessed tools are Apsalar, Crittercism, Flurry Analytics, and Localytics (cf. Table 17.7).
First, it is important that the most popular mobile platforms are supported by the tool. Based on their market share, we consider especially Apple's iOS and Google's Android as important, followed by Windows Phone, BlackBerry, and Amazon's Fire OS. Consequently, the first tool-specific requirement is the range of supported platforms (TR-MA-1).
As mentioned in the web analytics subsection, real-time functionality is important for web analytics, and the same is true for the analysis of mobile websites and applications. Real-time information helps to improve app performance and enables timely troubleshooting (Ask 2013). The ability to process data in real time is therefore the second tool-specific criterion (TR-MA-2).
Several metrics are used in practice to analyze mobile apps. The research firm Forrester differentiates five potential core metrics (Ask 2013): besides the analysis of user engagement, Forrester mentions financial, performance, benchmarking, and qualitative metrics. For the evaluation, we summarize the capability to measure indicators in these categories as the range of functionality (TR-MA-3). We distinguish basic functionality (the tool covers 1–3 categories according to Ask 2013) and advanced functionality (4–5 categories according to Ask 2013).
The last tool-specific requirement for mobile analytics tools is the capability for data imports and exports (TR-MA-4). This leads to the following assessment criteria for mobile analytics tools (cf. Table 17.7):
ria for mobile analytics tools (cf. Table 17.7):
TR-MA-1 Supported mobile platforms,
TR-MA-2 Real-time analytics,
TR-MA-3 Range of functionality, and
TR-MA-4 Data interfaces.
All tools are web-based and available as free versions. However, Localytics and Crittercism are free of charge only up to a limit of 10,000 or 30,000 monthly active users, respectively, and the functionality of Crittercism is limited in the free version. All tools offer solid documentation including FAQs and white papers. Crittercism and Apsalar additionally feature community functionalities, the former in the form of an internet forum, the latter in the form of an FAQ section on the website. Free learning resources are offered by all vendors, e.g., in the form of video tutorials and how-tos. In addition, Apsalar offers a range of white papers and a couple of cheat sheets which are suitable for teaching. Localytics and Apsalar also provide e-books, case studies, and webinars. The leading mobile platforms (iOS and Android) are supported by all vendors. All tools except Flurry Analytics have the ability to perform real-time analysis (Ask 2013). The functional range of Apsalar and Crittercism is rated as basic and that of Flurry Analytics and Localytics as advanced. Attention should be paid to Localytics, which covers the full range of functionality (Ask 2013). Concerning the data interfaces, all tools have the capability for data import and export, except for Crittercism, which only supports data export. Access to raw data in Localytics is only available in the commercial enterprise version.
Overall, we recommend Apsalar as the most suitable tool for use in an academic context. The tool offers extensive documentation including community functionalities and a variety of free learning resources. Besides the interfaces for data export and import, the tool provides full access to the data. Apsalar accesses user data only in anonymized form, which improves data privacy. A shortcoming of Apsalar in comparison to the other tools is its limited functionality. However, Apsalar achieved the best results with regard to the other criteria.
The market for BI&A tools is very large and hence hard to survey. Many vendors want to benefit from the ongoing trend towards big data. In addition, more and more open source solutions are gaining maturity and competing with the established vendors. Universities face the challenge of selecting appropriate tools from this large range of offerings. In order to support universities in this challenge, the paper at hand presented different tool evaluations covering all five BI&A sub-domains.
For the evaluation, the BI&A research framework from Chen et al. (2012) was used to deduce different tool categories, from which potential tools were selected as evaluation candidates. The evaluation requirements were derived from the academic context and from specific characteristics of each BI&A sub-domain. We were able to identify appropriate tools for university education in all five BI&A sub-domains and gave a broad overview of their license models, functionalities, and their provision of documentation and free learning resources.
The evaluations cannot provide an exhaustive investigation of all available tools, but we aimed at making a contribution to the selection of suitable tools for education. Currently, our tool recommendations are based only on the evaluation of the available tool documentation and the current BI&A literature. The evidence could be strengthened by surveys of instructors actually using these tools in their teaching sessions. A further limitation is the strict focus on streaming analytics tools within the sub-domain “big data analytics”. In addition, further tool categories, such as data mining tools or in-memory databases, could be evaluated; these remain subject to future work.
Biographies
a member of the agile business intelligence task force of the German chapter of TDWI. He has published in journals such as Information Systems Management and at conferences such as AMCIS, DESRIST, ECIS, ER, HICSS, ICIS, and PACIS.
https://www.tu-chemnitz.de/wirtschaft/wi2/wp/en/team/robert-krawatzeck/
References
Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the
next frontier for innovation, competition, and productivity. McKinsey Global Institute
Miner G, Elder J, Hill T, Nisbet R, Delen D (2012) Practical text mining and statistical analysis for
non-structured text data applications. Academic, Oxford
Nakatani K, Chuang T-T (2011) A web analytics tool selection method: an analytical hierarchy
process approach. Internet Res 21(2):171–186
Schiller S, Goul M, Iyer LS, Sharda R, Schrader D, Asamoah D (2015) Build your dream (not just
big) analytics program. Commun Assoc Inf Syst 37:1. Article 40
Wang Y (2015) Business intelligence and analytics education: hermeneutic literature review and future directions in IS education. In: Proceedings of the Americas conference on information systems (AMCIS 2015), Puerto Rico, pp 1–10
Wixom B, Ariyachandra T, Douglas D, Goul M, Gupta B, Iyer L, Kulkarni U, Mooney JG, Phillips-
Wren G, Turetken O (2014) The current state of business intelligence in academia: the arrival
of big data. Commun Assoc Inf Syst 34(1):1–13. Article 1
Wixom B, Ariyachandra T, Goul M, Gray P, Kulkarni U, Phillips-Wren G (2011) The current state
of business intelligence in academia. Commun Assoc Inf Syst 29(1):299–312. Article 16
Zikopoulos PC, DeRoos D, Parasuraman K, Deutsch T, Corrigan D, Giles J (2012) Harness the power of big data: the IBM big data platform. McGraw-Hill, New York, NY
Chapter 18
Neural Net Tutorial
Brian R. Huguenard and Deborah J. Ballou
Abstract When problems are complex and cannot be solved through conventional
methods such as statistical or management science models, and when human exper-
tise is not sufficient for efficiently finding high-quality solutions, we can consider
the use of machine learning techniques. One such technique is the artificial neural
network (neural net), which can be used for predictive modeling. This chapter pro-
vides a brief introduction to the topic of neural nets, along with a tutorial in which
a working neural net is built and then used to make predictions.
This chapter provides instructions for creating an artificial neural network that will
be used to predict winners in horse races. The same instructions could be used to
create neural networks in other domains. By reading this chapter, the reader should acquire a basic understanding of how to create and use a neural network to aid in a predictive decision-making task. This chapter would also be an
appropriate assignment in a business analytics course.
18.1 Introduction
When problems are complex and cannot be solved through conventional methods
such as statistical or management science models, and when human expertise is not
sufficient for efficiently finding high-quality solutions, we can consider the use of
machine learning techniques. The basic idea behind machine learning is that the
program analyzes historical problems and examples of past solutions, looking for
patterns in the data that can be used to develop strategies and rules for solving future
problems. These learned strategies are then incorporated into the future behavior of
the system, so that new problems can be solved more effectively. One such tech-
nique is the artificial neural network, or neural net (Gurney 1997; Haykin 1999;
Lawrence 1994).
Artificial Neural Networks (ANNs) are loosely based on biological neural networks. An ANN (or just neural net) is implemented as a software simulation of the massively parallel processes that occur in the human brain. Neural net models are
not intended to be accurate representations of real biological systems…they are
more of an analogy to the human brain than an accurate model of it. Because of the
interconnected nature of these networks, neural net models are also often referred to
as connectionist models.
Neural nets have the ability to learn from experience, where the experience is
gained from viewing historical data consisting of sets of inputs and corresponding
solutions to those inputs. This process is called training. Some examples of problem
domains where neural nets have been used successfully include:
• recognizing handwritten characters
• training a computer to pronounce English text
• prediction of fraud in business transactions
• diagnosis of complex medical conditions
The remainder of this chapter begins with an overview of the structure and work-
ings of a neural net, followed by a tutorial involving the creation and use of a simple
neural net.
Before going into any more detail about neural nets, let’s talk a bit about their bio-
logical counterpart. The human brain is composed of billions of special cells called
neurons. Neurons are organized into groups called networks. Each network contains
several thousand neurons that are highly interconnected. Signals are sent from one
neuron to another, and each neuron has the capability to either leave the signal
strength unchanged, decrease the signal strength, or increase the signal strength
before sending it along to other cells. Information is stored in the gaps between the
cells, called synapses.
Fig. 18.1 An artificial neuron (neurode): inputs X1…Xn are multiplied by weights W1j…Wnj, combined by the summation function Σ WijXi, and passed through the transfer function to produce the output Yj
As shown in Fig. 18.1, each input to a neurode (artificial neuron) has its own weight, Wij. These weights serve to represent the relative importance of the input.
output is created. The adjustment of these Wij’s is analogous to the changing of
synaptic strength in biological neurons. The summation function consolidates the
various inputs along with their respective weights into a single weighted sum. This
defines the internal activation level of the neurode. The transfer function processes
this internal activation level and creates the output, usually transforming the internal
activation into a value between 0 and 1. A low internal activation level may result in
an output value of 0 or near 0, and high internal activation levels may result in an
output near 1. The final output result, Yj, would then serve as an input to one or
more other neurodes in the neural net, and the whole process would repeat itself for
those other neurodes.
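The processing performed by a single neurode can be expressed in a few lines of code. The following minimal Python sketch is not part of the EasyNN-plus tutorial; it assumes a sigmoid transfer function and uses invented input and weight values purely for illustration.

```python
import math

def neurode(inputs, weights):
    """One artificial neuron: a weighted sum followed by a sigmoid transfer function."""
    activation = sum(x * w for x, w in zip(inputs, weights))  # summation function
    return 1.0 / (1.0 + math.exp(-activation))                # transfer function, output in (0, 1)

# Hypothetical inputs X1..X3 and weights W1j..W3j.
print(neurode([0.5, 0.1, 0.9], [0.4, -0.6, 0.2]))
```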
In Fig. 18.2 we see the conceptual layout of a neural net. Remember that these com-
ponents are all simulated in a computer program. The basic component of a neural
net is a neurode. Each of the colored ovals in Fig. 18.2 represents a single neurode.
A neural net is composed of a collection of neurodes grouped in layers. A typical
structure is shown in Fig. 18.2, with three layers: an input layer, an intermediate
layer called the hidden layer, and an output layer.
Each input to the input layer corresponds to an attribute of a problem, for exam-
ple: if this were for a loan application problem, the inputs X1, X2, X3, and X4 might
represent the loan applicant’s income level, age, marital status, and gender. The
inputs can be numeric or categorical in nature. A numeric representation of these
attribute values serves as the input to the neural net. The input layer typically doesn’t
change the value of the input it receives…it just passes it along to the hidden layer.
Once the input values arrive at the hidden layer, they are processed as shown in
Fig. 18.1. The results of the hidden layer are then passed on to the output layer,
where final processing occurs and the results of the neural net are computed. The
output of the neural net might contain the answer to the problem, or it might provide
input to another neural net. In the case of a loan application problem, the final
answer coming from the net might be a yes or no.
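A forward pass through the three layers of Fig. 18.2 can likewise be sketched briefly. The following Python example assumes the NumPy package and uses arbitrary random weights; it only illustrates the flow of values from the input layer through the hidden layer to the output layer, and is not the mechanism EasyNN-plus exposes to its users.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Four inputs (e.g., hypothetical loan applicant attributes), three hidden neurodes, one output.
x = np.array([0.8, 0.2, 1.0, 0.0])                     # input layer just passes values on
w_hidden = np.random.default_rng(0).normal(size=(4, 3))  # arbitrary illustrative weights
w_output = np.random.default_rng(1).normal(size=(3, 1))

hidden = sigmoid(x @ w_hidden)    # hidden layer: summation and transfer per neurode
output = sigmoid(hidden @ w_output)  # output layer: e.g., near 1 for "yes", near 0 for "no"
print(output)
```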
A neural net learns through a process called training. Training techniques fall into
two categories: supervised and unsupervised. Supervised learning requires histori-
cal data giving cases of inputs and correct outputs. For example, we might have data
where each case provides loan applicant characteristics, and the corresponding out-
put is the correct classification of that application as being accepted or denied the
loan. The neural net is provided the input, one case at a time, and when it produces
a final answer for that case it compares its answer to the desired answer. If there is
a difference between the two, then the weights of the net are adjusted in an attempt
to correct the net’s accuracy. Unsupervised learning involves using only input stim-
uli. No comparison to a desired output is performed, and the net develops its own
set of categorizations for the input. Humans must then interpret the categorizations
created by the net, and determine if they are useful or not. This approach can be
useful for exploratory data analysis. Supervised learning is the most commonly
used approach, and is the one we will discuss further.
The basic technique of supervised learning is one of repeatedly giving the neural
net examples of solved problems, where the net has to first come up with a solution on
its own, then is allowed to compare its answer to the correct one. If there is a differ-
ence between the two answers, that’s where the learning occurs…the weights of the
net are adjusted in an attempt to improve the performance of the net for future training
problems, and then another example solved problem is provided. This iterative pro-
cess will continue until some threshold value of accuracy has been reached, or until
some limit has been reached in the number of training problems to be processed.
The most commonly used supervised learning algorithm is called backpropaga-
tion. In Fig. 18.3 we once again have a representation of an individual artificial
neuron, with representations of the inputs, the weights, the summation function, the
transfer function, and the output.
If the output shown here, Yj, were a final output of the neural net for one training
case, then during backpropagation the value Yj would be compared to the actual
correct answer, or target answer, contained in the training case. We’ll call that cor-
rect answer Tj. An error calculation is performed that is a function of the difference
between the net’s answer (Yj) and the target answer (Tj). If the net’s answer is close
enough to the target answer (based on some predetermined tolerance), then no fur-
ther action is required for this particular training case, and the next training case
would be started. However, if the error between the net’s answer and the target
answer is large enough (over the designated tolerance level), then the weights of this
neuron will then be adjusted in proportion to the severity of the error. Now this was
all for just one artificial neuron. The error from this neuron would then be propa-
gated backwards through the hidden layers of the entire neural net, until adjust-
ments have been made to weights as needed over the entire network.
The ultimate goal of backpropagation is to arrive at a set of weights that fits the
training data so as to minimize the error between the network’s answers and the
target answers. Once a stable set of weights has been obtained, training is over
and the neural net is ready to accept new input, and its output can be used to help
form a recommended decision.
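For readers who want to see these mechanics spelled out, the following Python sketch implements a bare-bones version of supervised training with backpropagation on a toy problem (the XOR function), assuming the NumPy package. The network size, learning rate, and number of epochs are arbitrary choices, convergence depends on the random initialization, and none of this is specific to EasyNN-plus, which performs the equivalent steps behind its graphical interface.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # training inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # target answers Tj

w1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)    # input -> hidden weights and biases
w2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)    # hidden -> output weights and biases
lr = 0.5                                                     # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # Forward pass: compute the net's answers Yj for all training cases.
    hidden = sigmoid(X @ w1 + b1)
    Y = sigmoid(hidden @ w2 + b2)

    # Error between the net's answers and the target answers.
    error = T - Y

    # Backward pass: propagate the error and adjust weights in proportion to it.
    delta_out = error * Y * (1 - Y)
    delta_hidden = (delta_out @ w2.T) * hidden * (1 - hidden)
    w2 += lr * hidden.T @ delta_out;  b2 += lr * delta_out.sum(axis=0)
    w1 += lr * X.T @ delta_hidden;    b1 += lr * delta_hidden.sum(axis=0)

print(np.round(Y, 2))  # the net's answers should move toward the targets 0, 1, 1, 0
```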
Neural nets can save time and effort in that the manual design and creation of programming-language-based models can be avoided. Since neural nets train themselves, we don’t have to begin with a deep understanding of how the inputs are related to the outputs… we don’t need human experts in the problem domain in order to use neural nets. Neural nets are also naturally adaptable to changing
input, rather than requiring a programmer to modify a more static application. Like
humans, neural nets can process incomplete or inaccurate data and still produce
useful output. In general, neural nets are very useful for pattern recognition (for
example, interpreting handwritten messages), classification (example: classifying
corporate bonds in terms of risk rating), and prediction problems (example: predict-
ing future performance of the economy).
On the negative side, the inner workings of a neural net are a black box. There is
no mechanism for providing an explanation of the decisions that it makes. Although
training can be automated, it can still require a lot of time if there are many hidden
layers. As a result, most neural nets have only one or two hidden layers. Finally,
neural nets are not appropriate for number-crunching type problems…they are best
at pattern recognition and classification.
Before you work on this tutorial you need to download and install the software
EasyNN-plus (the software runs on a PC only, not on a Mac). Go to the website
http://www.easynn.com/dltrial.htm and click on the “Download as a Zip file” link.
When asked whether to open or save the file, you should save it to your desktop.
Once the file has finished downloading (it should be called something like “ennsetup.
zip”), you need to unzip it (you should be able to right-click on it and choose the
“Extract All” or “Unzip” command). After unzipping the file you should now have a
folder on your desktop called “ennsetup”. Go into the “ennsetup” folder and double
click on the file “ennsetup.exe”. When asked if you want to run the file, say yes, and
that should start up the installer. Just give the default answers to questions it asks, and
it should install the EasyNN-plus software on your machine. The free version of this
software will run for 30 days and can accept a maximum of 100 rows of input data.
You can download a copy of the Races98.txt file from [?URL?]. This file will
serve as the input file for this tutorial. Once you have acquired a copy of Races98.
txt, double-click on the file so you can have a look at its content (do not change
anything in the file). You will see that the column titles are on the first line and the
other lines start with the line number in square brackets. We will use the titles for
column names and the numbers for row names. Some of the values in the race data
are integers and some are Boolean (0 or 1).
The Races98.txt file contains data from 98 horse races. The aim is to create a neural
network that can be trained and validated using the horse race data. After the neural
network has been trained and validated it can be used to help predict the winner of
other races. When you are done inspecting the content of the Races98.txt file, close it.
Press the New toolbar button or use the File > New menu command to produce a
blank grid for a new neural network.
An empty Grid with a vertical line, a horizontal line and an underline marker will
appear. Now select File > Import... and the file selection dialog window will appear.
Navigate to your copy of the Races98.txt file, select it, and hit the Open button.
The first Import dialog will appear with the Tab delimiter already checked
under the “Columns” section. Each line in the Races98.txt file is numbered, so those
numbers can be used as the row names. Under the “Example row names” section,
select “Use first word(s) on each line for row names”. Under “Example row
types”, select “Training”. Now press OK.
The second Import dialog will appear. Press the Set names button.
The third import dialog called Input/Output Column will appear. The settings in
this dialog are based on the values being imported for the first row of data from the
input file, but they should be checked for possible errors. In particular the mode and
type may not always be correct. There are seven columns of data to be imported, and
all of them are of type input except the last column, which is of type output (the last
column is the “Win” column, which we are trying to predict). Column 0 (the first col-
umn) is the first one you will set up. The column is called Runners and the first row of
data for this column has a value of 11. The column settings will be correct so press OK.
The second column will now be shown (denoted as column 1). This column is
Distance with a value of 7. The settings are correct so press OK.
Column 2 (the third column) is Handicap with a value of 0. The mode should be
changed to Bool (short for Boolean, meaning this column has values of False (indi-
cated by 0) or True (indicated by 1)). Press OK. Column 3 (the fourth column) is
Class and is another integer. The mode will change back to Integer. Press OK.
Columns 4 and 5 (the fifth and sixth columns) are boolean so the mode should be
set to Bool. Press OK for both columns.
The last column (denoted as column 6, but this is the seventh column) is Win and
is also a boolean so the mode should be set to Bool. The type is not correct and
needs to be changed to Output. Then press OK.
A “Save file as” dialog will now be shown. You will use this dialog to save the
data file in a format used by EasyNN-plus. Save the file in the same directory as the
original Races98.txt input file. Note that the file is saved with a “tvq” extension.
The data will be imported and the grid columns will be set to the correct mode and
type. Those columns that were set with type of “input” are used to predict the value
of the column(s) set with type “output”. In the current exercise we will be using the
first six columns to predict the “Win” column. Note that if the amount of data you are
importing has more than 100 rows then you will be warned that this trial version of
EasyNN-plus can only process the first 100 rows. The “Races98.txt” input file you
are using only has 98 rows, so that won’t be a problem for this exercise.
The Grid of training data is now complete.
To create the neural network press the “Grow new network” toolbar button or use
the Action > New Network menu command. This will open the New Network
dialog. There will be an input node for each of the input columns, and there will be
an output node for each output column. In this dialog window you will specify how
many hidden layers there will be in the network, and how many nodes will be
allowed in each hidden layer. For the simple network used in this exercise we will
use one hidden layer, and we will keep the default values for the min and max num-
ber of nodes allowed in that layer. Under the “Hidden layers” section, check Grow
layer number 1 and press OK. If you get a “Generating new network will reset
learning.” warning message, answer Yes. The neural network will be produced from
the data you imported into the Grid.
In the window saying “Network created”, note that you are told how many nodes
are actually being used in the hidden layer(s) of the network. On the “Network cre-
ated” window, click on the Yes button under the question “Do you want to set the
controls?”. This will open the Controls dialog.
On the Controls dialog box we will leave most of the settings alone. Look under
the “Learning” section and check Optimize for both Learning Rate and Momentum.
[Learning Rate can be changed to any value from 0.1 to 10, representing the size of
weight changes during learning. Very low values will result in slow learning and
values above 1.5 will often result in erratic learning or oscillations. Checking
“Optimize” allows EasyNN-plus to determine a reasonable value for Learning Rate
by running a few learning cycles with different Learning Rate values. Momentum is
used to prevent the neural network from getting “stuck” at a local minimum or
maximum…leave momentum at its default value.] We now need to indicate how
many of the input rows will be kept aside for testing the trained network. We have
98 rows of total input in the “Races98.txt” file, so let’s (arbitrarily) keep 30 of those
rows aside to be used later for testing purposes...go to the “Validating” section and
enter 30 where it says “Select __ examples at random from the...”.
We next have to establish a stopping criterion so that the software knows when to
stop training the neural network. There is no point in using more than 1000 cycles
to train the simple network in this exercise, so look under the “Stops” section and
put a check by “Stop on __ cycles” and enter a value of 1000.
Press OK. Answer Yes if you get the “Optimizing controls will reset learning.”
warning message. The controls will be set and the neural network will be ready to
learn. If you get a warning message in a “Validating Problems” window, click on
the “Change All Out of Range Validating Rows to Training” option.
A summary of the control settings and what is going to happen next will appear.
Answer Yes to the question “Do you want filename to start learning?” (the filename
will be whatever you named it near the end of step (3) earlier). Next the AutoSave
dialog window will open...you don’t need to change anything on that window so just
press OK. This will start the training process for the neural network. [Note: if for
some reason you are not asked if you want to start the learning process, you can
start it manually by using the Action > Start Learning menu option.] Learning and
validating will run for 1000 cycles.
Training for this small amount of data should complete almost instantly...if the
status bar (the bar at the bottom of the main window) says “Fixed cycles stop:
1000 cycles”, then training has completed. You can now use the View > Information
menu command to see the results. Look under the “General” section for the line
labeled “Validating results:”... this value tells you how accurately the neural net
was able to predict the winner in the 30 test rows. If everything has worked as
expected the neural network will have correctly predicted the results of at least
60% of the races.
If the data grid is not in view, use the View > Grid menu command to bring the Grid
to the front. Make sure that the first row of data is selected (just click on the row
number [1]). Then use the Insert > Querying Example Row menu command to
add a new row to the grid where you can enter values for a new horse. The “Example
presets” window will come up...you should be sure the “set all values in row to 0”
radio button is selected, then hit the “Set all row values” button. This should cause
an empty row to appear at the top of the data grid, with 0 (or False for Boolean col-
umns) for each value. You can use this query row to enter values for the input col-
umns for a new horse, and then you can view what the prediction is in the output
column. Use the following values in the query row: Runners = 10, Distance = 15,
Handicap = False, Class = 3, Stake > 5k = True, Odds > 2 = False. The resulting
prediction should be that the horse will win.
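If you would like to reproduce this workflow in code rather than in EasyNN-plus, the sketch below carries out the same steps with Python and scikit-learn: train on 68 rows, hold 30 aside for validation, stop after at most 1000 cycles, then query a new horse. The column names and the read_csv arguments are assumptions about the Races98.txt layout and may need adjusting; this is not the EasyNN-plus algorithm itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumed column names; the exact header text in Races98.txt may differ, and the
# read_csv arguments may need adjusting to match its layout (tab-delimited, with
# bracketed row numbers in the first column).
races = pd.read_csv("Races98.txt", sep="\t", index_col=0)
inputs = ["Runners", "Distance", "Handicap", "Class", "Stake>5k", "Odds>2"]
X, y = races[inputs], races["Win"]

# Keep 30 of the 98 rows aside for validation, as in the tutorial.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=30, random_state=1)

# One hidden layer; stop after at most 1000 training cycles.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=1)
net.fit(X_train, y_train)
print("Validation accuracy:", net.score(X_val, y_val))   # hoping for 60% or better

# Query a new horse using the values from the query row above.
query = pd.DataFrame([[10, 15, 0, 3, 1, 0]], columns=inputs)
print("Predicted Win:", net.predict(query)[0])
```

In practice, scaling the integer inputs (for example with scikit-learn's StandardScaler) usually helps this kind of network converge; it is omitted here to keep the sketch short.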
18.4 Conclusion
Neural networks are flexible tools that can be used for a wide variety of decision tasks,
such as classification, prediction, and pattern recognition. Neural networks do not
require advanced knowledge of statistical techniques, are able to model complex non-
linear relationships, and can detect multiple interactions between predictor variables.
This chapter provides an overview of an implementation of a simple neural network.
The same instructions could be used to create neural networks in a variety of domains.
Biographies
References
Chapter 19
An Examination of ERP Learning Outcomes: A Text Mining Approach
Mary M. Dunaway
Abstract Today’s business colleges are attempting to meet the industry demand by
developing marketable ERP (Enterprise Resource Planning) skills and delivering
exposure to the realities of modern business into the curricula. Role adaptions in
real-world settings such as ERP systems use can enhance students’ ability to learn
conceptual knowledge for practical application. The situated learning theory capi-
talizes on a specified context where the context extensively impacts learning.
Education data text mining is emerging to produce new possibilities for gathering,
analyzing, and presenting student learning outcomes. This chapter aims to reveal
ERP learning patterns and themes as evidence of knowledge transfer in ERP role
adaptions. The results demonstrate amplified learning through role play in a simu-
lated ERP learning environment.
19.1 Introduction
The use of analytics in higher education is a relatively new area of practice and
research. Corporate business practices have led the way in data mining across most
all organizations to leverage competitive and profit-driven strategies (Chen et al.
2012; Holsapple et al. 2014). Data mining is gaining momentum in higher educa-
tion, which is now using a variety of applications, most notably in enrollment, learn-
ing patterns, personalization, and threaded discussion analysis (Norris et al. 2008;
Baepler and Murdoch 2010; Ravishanker 2011; Edgington 2011). By discovering
hidden relationships, patterns, interdependencies, and correlating raw/unstructured
data, data mining is beginning to facilitate not only higher education institutional
decision making but also student learning outcomes and performance.
Learning analytics does not have the same goals in the business stream as in the
academic stream (e.g. Campbell et al. 2007; Baepler and Murdoch 2010; Van
Barneveld et al. 2012; Ferguson 2012). The focus of learning analytics primarily tar-
gets two areas—learning effectiveness and learning operational excellence (Luan
2004; Zhang et al. 2010; Minami and Ohura 2013). The latter refers to the metrics that
provide evidence of how the learning aligns with and meets the goals of the higher
education institution. Learning analytics in the academic domain is focused on the
learner, gathering data from course management and student course data to manage
student success and performance outcomes (Van Barneveld et al. 2012). Learning
analytics differs from other types of analytics in that it is focused specifically on
students and their learning behaviors (Baepler and Murdoch 2010).
Data mining techniques can enable institutions of higher education to rethink
and improve students’ learning experiences. Faculty can streamline their teaching
and learning processes to extract and analyze students’ learning outcomes and
behaviors (e.g. Campbell et al. 2007; Baepler and Murdoch 2010; Edgington 2011).
Data mining techniques can provide greater insights about student learning as their
learning experiences unfold (Hung and Crooks 2009; García et al. 2011). Text min-
ing, a type of data mining technique, provides a combination of explicit knowledge,
analytical skill, and domain knowledge to uncover hidden trends and patterns from
text. Text is a type of unstructured data where row and column attributes are not
distinctive characteristics. Text mining involves information retrieval, lexical analy-
sis to study word frequency distributions, pattern recognition, tagging/annotation,
information extraction, visualization, and predictive analytics.
AACSB (2015, August 1) states that curricula should facilitate and encourage active
student engagement in learning. Most importantly, “in addition to time on task
related to readings, course participation, knowledge development, projects, and
assignments, students engage in experiential and active learning designed to improve
skills and the application of knowledge in practice is expected” (AACSB 2015).
Today’s business colleges are attempting to meet the demands by developing mar-
ketable Enterprise Resource Planning (ERP) programs and delivering exposure to
the realities of modern Information Systems (IS) into the curricula (Léger 2006;
Seethamraju 2011; Cronan et al. 2011; Monk and Lycett 2014). Situated Learning
theory posits that much of what is learned is specific to the situation in
which it is learned (e.g. Lave 1988; Lave and Wenger 1991). Particularly important
has been situated learning’s emphasis on the mismatch between typical learning
situations and “real world” situations such as the workplace in environments where
complex IS are employed.
The aim of this research is to investigate the application of text mining to under-
stand student learning outcomes gained in an ERP Fundamentals course. Specifically,
this paper combines role play and simulations as forms of experiential learning.
Students take on different business functional roles, interact, and participate in
diverse and complex learning settings where a simulated, real-world ERP system is
engaged. Using the Situated Learning theory as a lens to examine student behaviors
and applying a text mining approach can help to uncover and evaluate learner-centric
outcomes. These results can support faculty and student feedback for improved
ERP teaching and student learning processes.
ERP courses in the Information Systems (IS) curriculum are at the forefront of experi-
ential and active learning (Cronan and Douglas 2012; Cronan et al. 2011; Hepner
and Dickson 2013). ERP courses at major universities provide the IS educational
community with a goldmine of data about students’ learning characteristics, learn-
ing outcomes, behaviors, attitudes, and use patterns (Léger 2006; Cronan and
Douglas 2012; Cronan et al. 2011; Léger et al. 2011). Many IS courses utilize simu-
lation games such as ERPSIM (Léger 2006) where students utilize a real-world SAP
IS as the learning environment. Students learn real-time business decision making,
dynamic functional processes, and Business Intelligence (BI) skills. The continuous
availability of real-time SAP provides the students with large amounts of data where
accelerated business decision experience can be gained.
Role play and simulations are forms of experiential learning (Russell and Shepherd
2010). Learners take on different roles, assuming a profile of a character to interact
and participate in diverse and complex learning settings. Computer simulation of
business in higher education and experiential learning theory accelerate the use of
simulation games in business education (Keys and Wolfe 1990). Good-quality
learning design provides opportunities for situated and authentic learning.
The ERP Simulation (ERPSIM) game and the team interaction provide the envi-
ronment for learning to transpire (Léger 2006). ERPSIM consists of several games
(Distribution, Manufacturing, and Logistics) that are used for learning. Students
learn cross-functional decision making to maximize firm profit and integrate busi-
ness processes to achieve desired outcomes (Boudreau 2003; Kang and Santhanam
2003–2004). The ERPSIM game and the functional role play adaptions provide the
specific learning context. According to Léger “as the learner’s knowledge and skills
increase, the role and status of the learner as a member of a community gradually
evolves from that of novice or apprentice to expert” (Léger 2006, p. 39). Teams
compete against each other during the simulation game. Moreover, students typically
have had no experience with functional integrative ERP software.
The increased importance of ERP and its pedagogical value to demonstrate business
process integration, functional role play, and business decision making already have
started to reengineer curricula (Magal and Word 2009). According to Chen et al.
(2011), ERP system learning involves complex knowledge domains requiring a
holistic curricula perspective to enhance student motivation and interest. Advances
in pedagogical approaches that emphasize learning approaches such as active or
learn-by-doing experience provide greater benefits to students than a solely lecture-
based approach (Chen et al. 2011). Prior research has shown that learning limited to a lecture-
based approach can make students passive learners (Bok 1986).
The business process drives the functional role identity for the activities and tasks to
be performed in the ERPSIM game. Typical business processes within an ERPSIM
learning environment entail the planning, procurement, production, and sales processes.
Each business process area is tied to tasks similar to real-world job responsibilities within a
company. Within each business process, operational transactions, reporting, and BI
inform decision making for the functional role. For example, within the Sales pro-
cess, hands-on basic handling of market expense allocation, price list changes, and
sales order reports are positioned for the Sales Analyst or Product Marketing role.
In a Production process, by contrast, finished goods forecasting, materials requirement
planning, and purchase supplier interactions are performed.
Teams of four students perform operations and tasks which require them to
interact with suppliers and customers by sending and receiving purchase orders,
delivering products, and completing the entire cash-to-cash cycle. The simula-
tion game, ERPSIM, automates (1) the sales process, in which each firm receives
a large number of orders every minute, (2) the procurement process, which purchases
raw materials, and (3) the production process, which utilizes machine capacity, inven-
tory, and warehouse functions. These operational functions are performed
directly in ERPSIM in real time. Many pre-defined SAP reports are available
to help students evaluate their company’s profit and operations of the business.
Several key business decisions which student teams are required to make during
the simulation are:
• Product formulation (raw materials and packaging)
• Target sales by region and market segment
• Product pricing adjusted throughout the business operations
• Sales forecasting to predict sales volumes by product for production planning
• Manufacturing resource planning and production
• Investment in production efficiencies to reduce cost/time delays
• Advertising to regional markets
• Debt management which consists of loan repayment
Once all of the ERP Simulation quarters have been completed, the teams must
provide an overall reflection report of their simulation experience. The role response
guidelines were written by the instructors and developed to emphasize the integra-
tion of the student role adaption and their use of the analytic and BI capabilities
from the ERPSIM game for decision making. There is a section of the report where
a role response is written by each team member. The role response includes the
responsibilities of the role and how the role fits into the overall cash-to-cash cycle.
Figure 19.1 shows the written role response guidelines given to students. Also, the
Analytics and BI section of the report was to be written from a role response point
of view, specifically how the operational reports in the ERP system were used to
help with decision making, how the student’s role utilized the information obtained
through the metrics, and finally how the role contributed to the overall team’s com-
petitive advantage.
Role response
• Role
o What were the responsibilities of your role?
o How did your role fit into the overall cash-to-cash cycle?
• Analytics and Business Intelligence
o How did operational reports in SAP aid in your decision making?
o How did you in your role use the informational data provided in Access and Excel?
o If your role achieved competitive advantage through the use of operational and informational reports, explain.
In retrospect, how could you have used this information more effectively?
Before the simulation game begins, each team makes a decision on what roles
should be represented during the game and who will perform the activities and tasks
accompanying the role. Teams usually follow a similar pattern for role division of
labor across the simulation game. Common roles selected are procurement man-
ager, production manager, business analyst, pricing specialist or analyst, operations
manager, chief financial officer, financial analyst, and marketing director. Figure 19.2
describes the ERP role adaption examples from a previous class.
There are several objectives to be accomplished by functional role play adaption.
Students will have an understanding of:
• How roles collectively work together and are integrated into the business processes per-
formed while using an ERP system.
• How the individual roles they represent contribute to business
processes.
• The importance of selecting and using appropriate BI metrics based on their role
to gain competitive advantage for the team.
There are three main areas of pedagogy in which role play strategies can provide value:
community learning, engagement encouragement, and
motivation and interest improvement. Table 19.1 shows the steps of how the functional
role play is imparted in the students’ learning.
19.4 Results
Teaching for transfer is one of the seldom-specified but most important goals in
education. We want students to gain knowledge and skills that they can apply both
in and outside of the university setting immediately and in the future. Transfer of
learning is often done without conscious thought. The role responses written by the
students demonstrate how contextualized learning in a real-world setting such as
ERPSIM reinforces learning assurance, real industry roles, and business process
knowledge. Table 19.2 presents excerpts from the role responses written by
several students.
The data in this paper reflect student participation during the administration of the
ERPSIM Manufacturing game. The sample (N = 62) was collected from students’
final written projects spanning three semesters. The student teams were randomly
selected and student roles were self-selected. Each team was required to write a
reflection paper as instructed in the ERP Simulation Guidelines. Within the guidelines,
each team member was responsible for writing a role response. The response relates
to each team member’s role and application of the Analytics and BI.
KH Coder (Higuchi 2015) is open-source software for quantitative content
analysis or text mining. Currently, there are almost 500 scholarly publications using
this software (Wikipedia 2015). Several recent studies have used KH Coder to per-
form Text Mining in an education and learning context (Tsubakimoto 2011; Ishii
et al. 2013; Minami and Ohura 2013). Also, KH Coder has been used for computa-
tional linguistics. Three words that did not provide additional meaning for the
analysis were removed from the sample: “and,” “the,” and “of.” A text mining
analysis using the
KH Coder (Higuchi 2015) software was performed across the students’ role
response data to uncover patterns and learning themes. Table 19.3 results show the
word frequency distribution from the student data to describe their role during the
ERP simulation experience. The words with the highest frequency used to describe
the student role were Manager, Officer, and Chief. The business process areas
related to the roles were Marketing, Pricing, Financial, and Production. The areas
and roles align with the ERP system knowledge transfer where students learn
Sales, Procurement, and Planning business processes. The results confirm known
differentiated roles associated with real-world ERP system use.
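For readers who want to reproduce this kind of frequency count outside KH Coder, a minimal Python sketch is shown below. The input file name and its one-response-per-line layout are assumptions, and the sketch only approximates the procedure used in the study.

```python
import re
from collections import Counter

# Assumed input: the students' role responses collected in a plain-text file.
with open("role_responses.txt", encoding="utf-8") as f:
    text = f.read().lower()

stop_words = {"and", "the", "of"}          # the words excluded from the analysis
words = re.findall(r"[a-z']+", text)
frequencies = Counter(w for w in words if w not in stop_words)

# The most frequent terms, analogous to the distribution reported in Table 19.3.
for word, count in frequencies.most_common(15):
    print(f"{word:<15}{count}")
```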
forecast, and financial related tasks. Managers are related to material planning and
financial control. Marketing roles were interconnected with pricing, forecasting,
and process improvements. Lastly, Accounting Manager roles were interconnected
with Supply Chain, Procurement, and Inventory tasks. Stronger interactions exist
mainly in Process Improvements, Procurement, and Accounting Manager roles.
Table 19.4 shows the overall ERP role adaptions that resulted from the student ERP
knowledge transfer. The word cluster analysis results show the ERP roles that emerged
from the situated learning context using the ERP simulation and the related score.
Forty-four word clusters emerged as role adaptions from the data. The highest three
role adaptations were marketing managers, production managers, and pricing man-
agers. Each word cluster is scored, and normally, highly scored clusters are reliable
(Higuchi 2015).
The word cluster analysis uses a TermExtract transformation. The transforma-
tion is an automatic technical term (keyword) extraction system published by the
Nakagawa Laboratory of the Digital Library Division, Information Technology
Center of The University of Tokyo. The Term Extraction transformation can extract
nouns only, noun phrases only, or both nouns and noun phrases. Because the process
in KH Coder is automatic, unintended word combinations may occur. Unintended
words or combinations can be managed as nonexistent in the statistical analysis.
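For illustration only, a rough stand-in for this noun and noun-phrase extraction step can be sketched with spaCy, assuming its small English model has been installed; it is not an implementation of the Nakagawa TermExtract system or of KH Coder's workflow.

```python
from collections import Counter
import spacy

# Requires: pip install spacy, then: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

with open("role_responses.txt", encoding="utf-8") as f:   # same assumed input file
    doc = nlp(f.read())

# Count candidate terms (noun phrases such as "marketing manager" or "production manager").
terms = Counter(chunk.text.lower().strip() for chunk in doc.noun_chunks)
for term, count in terms.most_common(10):
    print(f"{term:<30}{count}")
```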
The findings observed in this study closely align with the transfer of learning in ERP
learning outcomes (Seethamraju 2011; Cronan and Douglas 2013; Monk and Lycett
2014). Role play in the ERP situated learning context enriches the experience for
students to develop attributes and cognitive practices for real-world ERP roles. This
type of experience can help students transition into the workforce after graduation by
anticipating the business acumen, value, and role responsibilities of an ERP profes-
sion. These results support prior literature where the aspects of the knowledge trans-
fer enhancing activities lead to recall and application of what has been learned
(Thayer and Teachout 1995). Teaching for transfer is one of the seldom-specified but
most important goals in education. We want students to gain knowledge and skills
that they can use both in school and outside of school, immediately and in the future.
Transfer of learning is commonplace and often done without conscious thought.
The pedagogical approach as described provides business process learning that
emphasizes the functional role play. The situated learning experience enhances
students’ knowledge capabilities for ERP, BI, and business process concepts beyond
lecture alone. The use of functional role play and BI applications are strategies
to improve learning and provide deeper insight into business process knowledge.
This approach is synergistic and reinforces the students’ learning. This study sup-
ports McLellan’s (1986) research findings, affirming that hands-on experience with
ERP systems indeed helps students understand business processes. Moreover, this
examination demonstrated that students gained valuable knowledge from the ERP
course that they can apply to business decisions in their future career roles
(Cronan et al. 2011).
This study has a few limitations that need to be addressed in future research.
The sample size is small (N = 62). Though a recommended sample size for text min-
ing was not found in the literature, the ROI of text analytics increases exponentially
with the size of the data. Thus, a larger sample may reveal patterns and
relationships not found here. Another limitation is that the teaching practice as described is
specific to a southern university and its curriculum. While this paper asserts the
benefits from the functional role play, other factors may play a direct or indirect role
in the learning outcomes. Further research will utilize additional quantitative and
qualitative methods to examine empirically the functional role play, attitudes, and
team behaviors on learning outcomes.
Biography
Mary M. Dunaway, Ph.D. is the Director of Data Science Programs and Assistant
Professor at the University of Virginia in the College of Continuing and Professional
Studies. As a rising scholar, she has successfully published several journal articles,
book chapters, and conference proceedings. Also, Dr. Dunaway is leading the effort
to develop and launch an Applied Data Analytics graduate certificate program. She
is a dynamic STEM academic who is a sought-out speaker and panelist for numer-
ous conferences/workshops sharing her expertise in Information Systems and Data
Science.
References
Campbell JP, DeBlois PB, Oblinger DG (2007) Academic analytics: a new tool for a new era. Educ
Rev 42(4):41–57
Chen D, Hung D (2002) Personalised knowledge representations: the missing half of online dis-
cussions. Br J Educ Technol 33(3):279–290
Chen K, Razi M, Rienzo T (2011) Intrinsic factors for continued ERP learning: a precursor to
interdisciplinary ERP curriculum design. Decis Sci J Innov Educ 9(2):149–176
Chen H, Chiang RH, Storey VC (2012) Business intelligence and analytics: from big data to big
impact. MIS Q 36(4):1165–1188
Cronan TP, Douglas DE (2012) A student ERP simulation game: a longitudinal study. J Comput
Inf Syst 53(1):3–13
Cronan TP, Douglas DE, Alnuaimi O, Schmidt PJ (2011) Decision making in an integrated business
process context: learning using an ERP simulation game. Decis Sci J Innov Educ 9(2):227–234
Cronan TP, Douglas DE (2013) Assessing ERP learning (management, business process, and
skills) and attitudes. J Organ End User Comput 25(2):59–74
Duffy T, Cunningham D (1996) Constructivism: implications for the design and delivery of
instruction. In: Jonassen DH (ed) Handbook of research for educational communications and
technology. Simon and Schuster, New York, pp 170–198
Edgington TM (2011) Introducing text analytics as a graduate business school course. J Inf
Technol Educ 10:207–234
Ferguson R (2012) Learning analytics: drivers, developments and challenges. Int J Technol Enhanc
Learn 4(5/6):304–317
García E, Romero C, Ventura S, de Castro C (2011) A collaborative educational association rule
mining tool. Internet High Educ 14(2):77–88
Goel L, Johnson N, Junglas I, Ives B (2010) Situated learning: conceptualization and measure-
ment. Decis Sci J Innov Educ 8(1):215–240
Henning PH (1998) Ways of learning: an ethnographic study of the work and situated learning of a
group of refrigeration service technicians. J Contemp Ethnogr 27(1):85–136
Hepner M, Dickson W (2013) The value of ERP curriculum integration: perspectives from the
research. J Inf Syst Educ 24(4):309–326
Higuchi K (2015) KH coder (version 2.0) [software]. http://khc.sourceforge.net/en/
Holsapple C, Lee-Post A, Pakath R (2014) A unified foundation for business analytics. Decis
Support Syst 64:130–141
Hung JL, Crooks SM (2009) Examining online learning patterns with data mining techniques in
peer-moderated and teacher-moderated courses. J Educ Comput Res 40(2):183–210
Ishii N, Suzuki Y, Fujii T, Fujiyoshi H (2013) Development and evaluation of question templates
for text mining. Recent Prog Data Eng Internet Technol 156:469–474
Kang D, Santhanam R (2003–2004) A longitudinal field study of training practices in a collabora-
tive application environment. J Inf Syst Educ 17(4):441–447
Karia M, Bathula H, Abbott M (2014) An experiential learning approach to teaching business
planning: connecting students to the real world. In: Li M, Zhao Y (eds) Exploring learning and
teaching in higher education. Springer, Berlin, pp 123–144
Keys B, Wolfe J (1990) The role of management games and simulation in education and research.
J Manag 16(2):307–336
Land SM, Hannafin MJ (2000) Student-centered learning environments. In: Jonassen DH, Land
SM (eds) Theoretical foundations of learning environments. Lawrence Erlbaum, Mahwah, NJ,
pp 1–23
Lave J (1988) Cognition in practice: mind, mathematics and culture in everyday life. Cambridge
University Press, New York, NY
Lave J, Wenger E (1991) Situated learning: legitimate peripheral participation. Cambridge
University Press, New York, NY
Léger P (2006) Using a simulation game approach to teach enterprise resource planning concepts.
J Inf Syst Educ 17(4):441–447
Léger P, Charland P, Feldstein HD, Robert J, Babin G, Lyle D (2011) Business simulation train-
ing in information technology education: guidelines for new approaches in IT training trends
in higher education: from situational theory to simulation games. J Inf Technol Educ Res
10(1):39–53
Luan J (2004) Data mining applications in higher education, SPSS executive report. SPSS Inc. http://
www.spss.ch/upload/1122641492_Data%20mining%20applications%20in%20higher%20edu-
cation.pdf
Lunce LM (2006) Simulations: bringing the benefits of situated learning to the traditional class-
room. J Appl Educ Technol 3(1):37–45
Magal SR, Word J (2009) Essentials of business processes and information systems. Wiley,
Hoboken, NJ
McLellan H (1986) Situated learning: multiple perspectives. In: McLellan H (ed) Situated learning
perspectives. Educational Technology, Englewood Cliffs, NJ, pp 5–18
Minami T, Ohura Y (2015) How student’s attitude influences on learning achievement? An analy-
sis of attitude-representing words appearing in looking-back evaluation texts. Int J Database
Theory App 8(2):129–144
Monk E, Lycett M (2014) Measuring business process learning with enterprise resource planning
systems to improve the value of education. Educ Inf Technol 21:1–22
Norris D, Baer L, Leonard J, Pugliese L, Lefrere P (2008) Action analytics. Educ Rev 43(1):42–67
Ravishanker R (2011) Doing academic analytics right: intelligent answers to simple questions.
Research bulletin 2, EDUCAUSE review. http://net.educause.edu/ir/library/pdf/ERB1102.pdf
Russell C, Shepherd J (2010) Online role-play environments for higher education. Br J Educ
Technol 41(6):992–1002
Sadler TD (2009) Situated learning in science education: socio-scientific issues as contexts for
practice. Stud Sci Educ 45(1):1–42
Seethamraju R (2011) Enhancing student learning of enterprise integration and business process
orientation through an ERP business simulation game. J Inf Syst Educ 22(1):19–29
Thayer P, Teachout M (1995) A climate for transfer model. Report AL/HR-TP-1995-0035. Air
Force Materiel Command, Brooks Air Force Base, TX. http://www.dtic.mil/cgi-bin/GetTRDo
c?AD=ADA317057andLocation=U2anddoc=GetTRDoc.pdf
Tsubakimoto M (2011) Development and technical evaluation of an interactive environment for a
term paper grading support system in higher education. In: World conference on e-learning in
corporate, government, healthcare, and higher education, vol 1, pp 966–972
Van Barneveld A, Arnold K, Campbell J (2012) Analytics in higher education: establishing a com-
mon language. EDUCAUSE Learn Initiat 1(1):1–11
Wikipedia: The free encyclopedia. (2004, July 22). FL: Wikimedia Foundation, Inc. Retrieved
August 1, 2015, from https://www.wikipedia.org
Zhang Y, Oussena S, Clark T, Kim H (2010) Use data mining to improve student retention in higher
education—a case study. In: ICEIS 2010: proceedings of the 12th international conference on
enterprise information systems, pp 190–197
Chapter 20
Data Science for All: A University-Wide
Course in Data Literacy
David Schuff
20.1 Introduction
Increasing attention has been paid to the demand for data scientists. In fact,
Davenport and Patil (2012) declared the data scientist as the “sexiest job of the 21st
century.” What exactly is meant by the term “data scientist,” however, is unclear.
We often think of a data scientist as a highly quantitative, technically-trained
professional with advanced knowledge of statistics and big data infrastructure
technologies.
However, Davenport and Patil (2012) define a data scientist as “a high-ranking
professional with the training and curiosity to make discoveries in the world of big
data.” Press (2012) defines the data scientist as “an engineer who employs the
scientific method and applies data-discovery tools to find new insights in data.”
D. Schuff (*)
Department of Management Information Systems, Fox School of Business,
Temple University, 210 Speakman Hall, 1810 North 13th Street, Philadelphia,
PA 19122-6083, USA
e-mail: [email protected]
These definitions are broad and do not necessarily imply a data scientist is a statisti-
cian, a computer scientist, or even a business analyst. Davenport and Patil’s defini-
tion specifically mentions “big data,” while Press’ definition does not.
What these definitions have in common is that they underscore the importance of
data literacy (as opposed to statistical and technological proficiency) as a skill for
discovery. Infusing data literacy into a curriculum is an unrealized opportunity for
higher education to truly make an impact on the current generation as they prepare
to move into the workforce. Universities are squarely focused on providing specialized
Master’s degrees in Business Analytics; there are now 117 such programs accord-
ing to the website “Master’s In Data Science” (www.mastersindatascience.org).
However, a data-literate undergraduate population, through their sheer numbers,
has a far greater potential impact on the way organizations operate.
This chapter describes the design and structure of a new, unique undergraduate
elective course introduced last year into the curriculum of Temple University, a
large, public University in the Northeastern United States. In its first year it has gone
from a pilot to a regular, multi-section offering in the University’s “General
Education” curriculum by emphasizing practical data literacy through current
events, readily available analysis tools and the methods of scientific inquiry.
Temple University is a large, public, urban institution with over 37,000 students. Its
primary mission is to educate the regional undergraduate population through 140
bachelor’s degree programs (the University also has 126 master’s degree and 57 doc-
toral programs). There are 17 schools and colleges including liberal arts, business,
education, law, media and communication, music and dance, and engineering.
Like many large universities, Temple has an institution-wide core curriculum that
covers several broad categories. To fulfill the “General Education” or “GenEd”
requirements, students must select from a menu of courses in each category, which
includes analytical reading and writing, humanities, quantitative literacy, arts,
human behavior, race and diversity, science and technology, US society, and world
society.
One of the stated goals of the University’s GenEd program is that, in an environ-
ment where “the amount of information is available … and the speed with which we
can access information … continues to expand,” the University must teach students
“how information is linked and how pieces of information are interrelated” (Temple
University 2015). This is certainly a reasonable and important goal for undergradu-
ate education regardless of a student’s major field of study, with obvious ties to
concepts of data science and data literacy.
Further, Information Systems is a field that is well-positioned to deliver this
material to a broad audience. Key aspects of the IS2010 Model Curriculum include
“understanding and addressing information requirements” and “exploiting
opportunities created by technology innovations” (Topi et al. 2010). Most impor-
tantly, Information Systems is one of the few fields with the orientation and skill set
to teach data literacy to a non-technical audience. Its emphasis on training business
professionals creates an applied focus on the identification and collection of data for
problem-solving and the use of practical analysis tools.
With this in mind, we proposed, developed, and executed a new course for the
University’s GenEd curriculum that would employ this dual focus on data and tech-
nology, targeted at a non-technical audience. The design of the course set out to
inspire an “evidence-based” mindset, encouraging students to identify and use data
relevant to them in their field of study and the larger world around them.
The course was designed to address several of the University GenEd program’s
broad learning goals (Temple University 2015):
1. Information literacy, including the ability to recognize and articulate informa-
tion needs; to locate, critically evaluate, and organize information for a specific
purpose; and to recognize and reflect on the ethical use of information.
2. Development of critical thinking skills, including the evaluation of evidence,
analysis and synthesis of multiple sources, and reflection on varied
perspectives.
3. Communications skills, using spoken and written language to construct a mes-
sage that demonstrates the communicator has established clear goals and has
considered her or his audience.
4. Retrieve, organize, and analyze data associated with a scientific model.
5. Understand and communicate how technology encourages the process of
discovery.
6. Recognize, use, and appreciate scientific or technological thinking for solving
problems that are part of everyday life.
From these broad goals, we developed ten specific learning goals for the course
that could be evaluated through assignments and exams. Because of the course’s
dual focus—literacy and skill-building—the course learning goals span Krathwohl’s
(2002) knowledge dimension, with goals focused on factual (e.g., knowing data sci-
ence terminology), conceptual (e.g., applying data visualization principles to assess
the effectiveness of a graphic), and procedural knowledge (e.g., how to clean a data
set). The learning goals also span the entire range of Krathwohl’s cognitive process
dimension, requiring students to remember, understand, apply, analyze, evaluate,
and create. This is in line with the purpose of the course, which is to impart termi-
nology, teach basic skills, and have students apply those skills to produce original
knowledge. Table 20.1 lists each learning goal, along with where the specific goal
lies on each dimension. These learning goals can involve several components and
therefore may span multiple levels in a dimension.
This module builds skills in support of the learning goals of “information literacy,”
“critical thinking,” “how technology encourages discovery,” and “technological
thinking for everyday problems.” The basics of scientific inquiry are discussed in this
module, including the notions of theory and hypothesis formation. Students also
learn to identify sources of relevant data. They will learn the role of data across
many disciplines, with concrete examples from current events. For example, this
module discusses the National Security Agency’s collection and use of telephony
metadata. This module also covers how citizens and organizations can use
government-published “open data” to understand the world around them.
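As a concrete illustration of what students do with such open data, the hypothetical sketch below (the file and column names are invented) loads a published data set and compares group averages as a first, informal check of a hypothesis.

```python
import pandas as pd

# Hypothetical open-data extract: one row per reported streetlight outage,
# with the responsible district and the number of days until repair.
outages = pd.read_csv("streetlight_outages.csv")

# Hypothesis: repairs take longer, on average, in some districts than in others.
print(outages.groupby("district")["days_to_repair"].mean())

# How many reports sit behind each average (small groups make weak evidence).
print(outages["district"].value_counts())
```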
20.4.3 Overview of Module 3: Working with Data in the Real World
This module builds skills in support of the “information literacy” GenEd learning
goal, and the Science & Technology area goals to “retrieve, organize, and analyze
data” and “technological thinking to solve everyday problems.” Students learn how
to fix problems in data sets. This builds on the course’s first module where they learn
how to identify data quality issues in data. Students learn how to address these prob-
lems through data cleansing and transformation to create a useable, reliable data set
using Microsoft Excel. They will resolve inconsistencies within and across data
sets, and determine when data is in the wrong form. The exercise below introduces
students to the concept of a Key Performance Indicator. The exercise is intended to
apply the SMART criteria (Specific, Measurable, Achievable, Relevant, and Time
Phased) to evaluate candidate KPIs for a scenario. Another exercise requires the
students to create several KPI scorecards using a more business-oriented scenario:
on-time flight data for a set of airports.
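The same cleansing and KPI steps can be expressed in code as well as in Excel; the pandas sketch below uses made-up flight records, so the column names and the 15-minute on-time threshold are illustrative only.

```python
import pandas as pd

# Made-up on-time flight records with typical data-quality problems.
flights = pd.DataFrame({
    "airport":      ["PHL", "phl ", "JFK", "JFK", None],
    "flight_date":  ["2015-03-01", "2015-03-01", "2015-03-02", "2015-03-02", "2015-03-03"],
    "minutes_late": ["5", "5", "n/a", "42", "0"],
})

# Data cleansing: standardize text, fix types, drop unusable rows and duplicates.
flights["airport"] = flights["airport"].str.strip().str.upper()
flights["minutes_late"] = pd.to_numeric(flights["minutes_late"], errors="coerce")
flights = flights.dropna(subset=["airport", "minutes_late"]).drop_duplicates()

# A candidate KPI: share of flights departing within 15 minutes of schedule, per airport.
flights["on_time"] = flights["minutes_late"] <= 15
print(flights.groupby("airport")["on_time"].mean())   # compare against a target, e.g. 0.80
```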
This module builds skills in support of the “critical thinking” GenEd learning goal
and the Science & Technology area goals to “retrieve, organize, and analyze data,”
as well as “how technology encourages discovery” and “technological thinking for every-
day problems.” Students learn how data is stored and organized. Specifically, they
will learn the differences between spreadsheets and databases and why each is used.
They will also learn three analytics techniques to give them a sense of what can be
done with data analytics and hands-on experience with analytics
tools such as Microsoft Excel and Tableau Desktop. For example, students learn
how to use Pivot Tables to summarize large data sets (such as the crime activity
assignment described below), and Association Analysis to discover which products
are likely to be bought together at a store (such as graham crackers and marshmal-
lows). Students also learn to interpret the output from these analyses and make
inferences about underlying patterns in the data.
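For readers who want a preview of these two techniques outside Excel and Tableau, the short pandas sketch below uses made-up crime and shopping-basket data; the simple pair count only gestures at association analysis rather than implementing a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations
import pandas as pd

# Pivot-style summary of a made-up crime-activity extract: counts by district and offense.
crimes = pd.DataFrame({
    "district": ["North", "North", "South", "South", "South"],
    "offense":  ["Theft", "Assault", "Theft", "Theft", "Assault"],
})
print(pd.crosstab(crimes["district"], crimes["offense"]))

# Association analysis, greatly simplified: count which product pairs appear together
# in the same basket (e.g., graham crackers and marshmallows).
baskets = [
    {"graham crackers", "marshmallows", "chocolate"},
    {"graham crackers", "marshmallows"},
    {"milk", "chocolate"},
]
pairs = Counter(pair for basket in baskets for pair in combinations(sorted(basket), 2))
print(pairs.most_common(3))
```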
The final project is a group project that requires students to bring together what
they’ve learned throughout the course. Student teams source an “original” data set
(i.e., not one already used in the course), develop a research question, and then
answer that question by using one or more of the data analysis techniques and tools
covered in the course. Students are encouraged to find a data set and question that is
relevant and interesting to them.
The deliverable is a five-minute presentation with two minutes for questions
from the instructor and the class. The short presentation is a deliberate choice
because (1) it forces students to hone their presentation skills by being direct and to
the point and (2) it allows the course to scale.
The suggested format for the final presentation is:
• Slide 1 should list the group members and the title of the presentation.
• Slide 2 will describe the scenario. What question will be answered and why is it
important?
2. Within each feed, click on a few tweets and read the replies.
3. Find three examples of positive tweets, three examples of negative tweets,
and three examples of neutral tweets (neither positive nor negative). Write
them down in three lists.
4. Make a note of why you classified them as positive, negative, or neutral.
Part 2: Larger Group (5 min)
1. Find another group to form a group of four.
2. Share your lists of positive and negative tweets. See if you agree with each
other’s choices.
3. Come up with rules for determining whether a tweet is positive or nega-
tive. For example:
(a) Are there certain words which increase your certainty of how to clas-
sify the tweet?
(b) Are there certain tweets that sound positive but really are negative?
(c) How do you detect sarcasm?
(d) How would you explain to someone how to classify tweets?
Part 3: Class Discussion (20 min)
We’ll compare notes. Specifically, we will discuss:
• What are some rules for determining positive versus negative sentiment?
• Were some tweets difficult to categorize? Why?
• In what ways would this be a good method of understanding how people
felt about your brand? In what ways could it give you bad information?
• Slide 3 will describe the data. What are the key elements and how was it obtained?
• Slides 4 and 5 will describe the analysis and the results, making good use of data
visualizations.
• Slide 6 will summarize the conclusions. What was learned? Students should sup-
port their conclusions using the results of the analysis, citing specific evidence.
• Slide 7 will list the references.
Some examples of final projects from past classes include:
• Exploring the question of whether the best soccer players are the highest paid.
• An analysis of the national origin of members of terrorist organizations.
• Investigating what time of day students are most likely to answer an online
survey.
• Correlations between stock price and social media sentiment for a major aero-
space firm.
20.6 Conclusions
Data analytics is no longer the purview of data scientists. It is a fundamental skill for
the twenty-first century workforce. This is both a challenge and an opportunity for
higher education, one that the Information Systems discipline is uniquely positioned
to meet. With courses such as the one described here, Information Systems can
increase its reach beyond the business school and provide valuable, marketable
skills to a broad audience across the University.
The key to seizing this opportunity is to recognize that “data literacy” is the true
core skill for undergraduate students, not sophisticated analytics techniques. We
must instill in our students an appreciation of evidence-based decision-making
through an appreciation of what data can do and how even simple analysis can yield
sophisticated insights.
Biography
Course Description
We are all drowning in data, and so is your future employer. Data pours in from
sources as diverse as social media, customer loyalty programs, weather stations,
smartphones, and credit card purchases. How can you make sense of it all? Those
that can turn raw data into insight will be tomorrow’s decision-makers; those that
can solve problems and communicate using data will be tomorrow’s leaders. This
course will teach you how to harness the power of data by mastering the ways it is
stored, organized, and analyzed to enable better decisions. You will get hands-on
experience by solving problems using a variety of powerful, computer-based data
tools virtually every organization uses. You will also learn to make more impactful
and persuasive presentations by learning the key principles of presenting data
visually.
Course Objectives
Assignments
# Assignment description
1 Create a data analysis plan (individual)
Develop a plan for data analysis by forming hypotheses and finding data sets that will allow
you to test those hypotheses. The scenario: Once students graduate, it’s time for them to go
get a job. But is staying in the area the best choice? Evaluate our city as a place to live,
work, and play compared to the rest of the United States
2 Analyze a data set using Tableau (individual)
Use Tableau to analyze and reveal various relationships within a data set. Use the data set
from the Environmental Protection Agency regarding fuel economy for 2015 model year cars.
Answer a series of questions by creating the most visually effective charts and graphs using
the guidelines discussed in class
3 Cleaning a data set (individual)
Correct the errors in a data set for the fictitious company “Vandelay Industries.” The sales
group is suspicious that there might be errors in the data for January. Work with a new data
set of 3296 orders with 5192 line items from January 2014
4 Group data analysis (group term project)
In groups, perform an original analysis on a data set of your choosing. The data set can
come from any source as long as it is something you have not already worked on for this
course. Possible sources of data include: open data from Data.gov, data sets from the Pew
Research Center, sports statistics, a data set from your current employer, or an original
survey conducted by your group
Your analysis should clearly demonstrate the tools and techniques you’ve been exposed to in
this course. This can take any form you’d like (i.e., comparison of averages across
categories, mapping geographic data, sentiment analysis, developing and visualizing KPIs)
Your group will present your work in class through a five-min presentation, with 2 min for
questions
Week/session, topic/key questions, and readings
Module 1: Data in our daily lives
1.1 Introduction
• Course introduction/syllabus
• What is the difference between data, information, and knowledge?
• What makes “big data” big?
1.2 Science and data science
• What is data science?
• What is the difference between a theory and a hypothesis?
• What are the dangers of data analysis without a hypothesis?
Readings: Dhar, V. (2013). Data Science and Prediction. Communications of the ACM. Vol. 56, No. 12. pp. 64–73; Allain, R. (2013). Three Science Words We Should Stop Using. Wired.com. March 27
2.1 A brief introduction to data
• What are the forms data can take?
• Where does data come from?
• What is metadata? A data dictionary?
Readings: Stein, G. (2013). I’m Beating the NSA to the Punch by Spying on Myself. Fastcolabs.com. June 12; Di Justo, P. (2013). What the N.S.A. Wants to Know about Your Phone Calls. The New Yorker. June 7
2.2 Identifying Sources of Data
• What kinds of data are available in different disciplines (arts, sciences, medicine, business, government, etc.)?
• What kinds of problems and issues can data insight address?
Readings: Silver, N. (2014). What the Fox Knows. FiveThirtyEight.com. March 17; Open Data. Wikipedia; Silver, N. (2014). In Search of America’s Best Burrito. FiveThirtyEight.com. June 5
3.1 Learning to (mis)trust data
• How do you spot reliable sources of data?
• How do you assess data quality?
• What is the “filter bubble”?
Readings:
Weisberg, J. (2011). Bubble Trouble: Is Web Personalization Turning Us Into Solipsistic Twits? Slate.com
Crawford, K. (2013). The Hidden Biases in Big Data. Harvard Business Review Blog Network. April 1
Hayes, B. (2013). In Data We Trust. Business Over Broadway. November 4
3.2 Guest speaker
Module 2: Telling stories with data
4.1 Viewing data
• What are different ways of viewing data?
• When do you need to visualize data?
• What are the basic techniques of data visualization?
Readings:
Unwin, A. (2008). Chapter II.2: Good Graphics? Handbook of Data Visualization. Chen, Hardle, and Unwin (Eds.). pp. 57–78
4.2 Introduction to Tableau
• What is Tableau? What can you do with it?
• How is it different from Microsoft Excel?
Readings:
Hoven, N. (n.d.). Stephen Few on Data Visualization: 8 Core Principles. Tableau Software
Acohido, B. (2013). Watch Out, Terrorists: Big Data is on the Case. USAToday.com. July 29
5.1 Communicating using data
• What are the principles of communicating data?
• How do you communicate complex ideas using data?
• How do you construct visualizations that complement a report? That stand on their own?
Readings:
Davenport, T. (2013). Telling a Story with Data. Deloitte University Press
Matlin, C. (2014). Visualizing a Day in the Life of a New York City Cab. FiveThirtyEight.com. July 17
5.2 Storytelling with infographics
• How are infographics different from other types of visualizations?
• How do infographic tools differ from other data tools we’ve used so far?
Readings:
Krum, R. (2014). Cool Infographics: Effective Communication with Data Visualization. (Chapter 1: The Science of Infographics)
Krum, R. (2014). Cool Infographics: Effective Communication with Data Visualization. (Chapter 6: Designing Infographics)
6.1/6.2 Exam review/EXAM 1
Module 3: Working with data in the real world
7.1 Dirty data
• How does data get dirty?
• What are the consequences (e.g., ethical, financial) of dirty data?
• How do you clean it?
Readings:
Redman, T. (2013). Data’s Credibility Problem. Harvard Business Review. Vol. 91, No. 12. pp. 84–88
Gandel, S. (2013). Damn Excel! How the ‘Most Important Software Application of All Time’ Is Ruining the World. Fortune.com. April 17
7.2 Data cleansing
• How do you identify data problems?
• How do you correct data problems?
• When is fixing the data not worth it?
Readings:
Taber, D. (2010). Stupid Data Corruption Tricks: Take our CRM Quiz. CIO.com. November 2
Top Ten Ways to Clean Your Data. Microsoft
8.1 Choosing relevant data
• How do you identify Key Performance Indicators (KPIs)?
• How do you identify the right measure for the selected problem?
Readings:
Performance Indicator. Wikipedia
Schambra, W. (2013). The Tyranny of Success: Nonprofits and Metrics. NonprofitQuarterly.com. December 30
8.2 Evaluating key performance indicators
• How do you categorize and visualize KPIs according to a threshold? (A short sketch follows this session’s entry.)
• How do you use Tableau to evaluate KPIs? How would you use Excel?
Readings:
Olson, P. (2014). Wearable Tech is Plugging into Health Insurance. Forbes.com. June 19
Bialik, C. (2014). Tracking Health One Step (and Clap, and Wave, and Fist Pump) at a Time. FiveThirtyEight.com. March 17
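The threshold categorization behind session 8.2 can be sketched in a few lines of Python; the KPI, the thresholds, and the values below are invented for illustration, and in class the same idea is carried out in Tableau or Excel.

import pandas as pd

# Hypothetical monthly KPI values (e.g., average daily step counts per
# wellness-program member); numbers and thresholds are invented.
kpi = pd.DataFrame({
    "member": ["A", "B", "C", "D"],
    "avg_daily_steps": [11200, 7400, 3900, 8600],
})

def status(steps):
    # Map each value to a traffic-light category for a dashboard.
    if steps >= 10000:
        return "on target"
    if steps >= 6000:
        return "needs attention"
    return "off target"

kpi["status"] = kpi["avg_daily_steps"].apply(status)
print(kpi)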
9.1 Connecting diverse data
• How do you identify data sets that can be combined?
• How do you combine data sets?
• How do you resolve conflicts?
Readings:
Strickland, J. (n.d.). How Data Integration Works. howstuffworks.com
Gallagher, S. (2014). The GOP Arms Itself for the Next “War” in the Analytics Arms Race. arstechnica.com. February 7
9.2 Creating interactive dashboards
• How does a dashboard differ from an infographic? A chart?
• How do dashboards facilitate decision-making?
Readings:
Best Practices for Designing Views and Dashboards. Tableau Software
Farmer, D. (2014). The One Skill You Really Need for Data Analysis
10.1/10.2 Exam review/EXAM 2
Module 4: Analyzing data
11.1 Storing and retrieving data
• What is a database? How are spreadsheets just a type of database?
• How are technology advances changing how we think about storing data?
• What are the core technologies of big data analytics?
Readings:
Rosenblum, M. and Dorsey, P. (n.d.). Knowing Just Enough about Relational Databases. Dummies.com
Bertolucci, J. (2013). How to Explain Hadoop to Non-Geeks. InformationWeek.com. November 19
11.2 Using Tableau to aggregate data
• What can you learn from aggregation? (A short sketch follows this session’s entry.)
• How does thinking of data dimensionally help solve problems?
Readings:
Acampora, J. (2013). How to Structure Source Data for Excel Pivot Tables & Unpivot. July 18
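The aggregation idea in session 11.2 amounts to summarizing a measure across one or more dimensions; a minimal pandas pivot-table sketch with invented sales data shows the same operation students perform in Tableau.

import pandas as pd

# Hypothetical line-item sales records; the dimensions (region, category)
# and the revenue measure are invented for the example.
sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "West"],
    "category": ["Office", "Tech", "Office", "Tech", "Tech"],
    "revenue":  [1200.0, 950.0, 700.0, 1500.0, 400.0],
})

# Aggregate the measure across the two dimensions, the same idea as a
# Tableau view or an Excel pivot table.
summary = sales.pivot_table(index="region", columns="category",
                            values="revenue", aggfunc="sum", fill_value=0)
print(summary)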
12.1 Beyond numbers
• What is the difference between structured and unstructured data?
• What can you learn from text data that you can’t from numeric data?
• What are the tools for text analysis?
Readings:
Hurwitz, J., Nugent, A., Halper, F., and Kaufman, M. (n.d.). Unstructured Data in a Big Data Environment. Dummies.com
Feldman, R. (2013). Techniques and Applications for Sentiment Analysis. Communications of the ACM. Vol. 56, No. 4. pp. 82–89
12.2 Twitter sentiment analysis using Excel and Google Drive
• What are the steps in performing a sentiment analysis? (A short sketch follows this session’s entry.)
• What are the challenges in deriving meaningful information from text?
Readings:
Wohlsen, M. (2014). Don’t Worry, Facebook Still Has No Clue How You Feel. Wired.com. July 2
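The steps behind the session 12.2 exercise can be sketched with a minimal word-list scorer in Python; the lexicon and the tweets are invented for the example, and the in-class version works with Excel and Google Drive rather than Python.

# A tiny lexicon-based scorer: count positive and negative words in each
# tweet and take the difference as the sentiment score. The word lists
# and tweets are invented for the example.
POSITIVE = {"love", "great", "happy", "awesome"}
NEGATIVE = {"hate", "awful", "sad", "terrible"}

tweets = [
    "I love this new phone and the camera is great",
    "awful battery life, I hate charging it twice a day",
]

def sentiment_score(text):
    words = text.lower().replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

for tweet in tweets:
    print(sentiment_score(tweet), tweet)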
13.1 Predicting the future
• What is predictive analytics? What problems does it address?
• What kinds of analysis can be done?
• What kinds of data are needed for an analysis?
Readings:
Paine, N. (2014). What Analytics Can Teach Us About the Beautiful Game. June 12
Bertolucci, J. (2013). Big Data Analytics: Descriptive vs. Predictive vs. Prescriptive. InformationWeek.com. December 31
13.2 Predictive analytics using Tableau
• Perform a forecasting analysis (a short sketch follows the schedule)
• Perform a simple association analysis
Readings:
Peck, D. (2013). They’re Watching You at Work. TheAtlantic.com. November 20
14.1/14.2 Group presentations/FINAL EXAM review
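The forecasting exercise in session 13.2 can be illustrated with a simple linear trend fit; the monthly sales figures below are invented, and in class the forecast is produced with Tableau rather than Python.

import numpy as np

# Hypothetical monthly sales totals; fit a straight-line trend and project
# the next three months, a stand-in for Tableau's built-in forecasting.
sales = np.array([120.0, 132.0, 128.0, 141.0, 150.0, 158.0, 163.0, 171.0])
months = np.arange(len(sales))

slope, intercept = np.polyfit(months, sales, deg=1)
future_months = np.arange(len(sales), len(sales) + 3)
forecast = intercept + slope * future_months
print(forecast.round(1))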