
FUNDAMENTALS OF BIG DATA & BUSINESS ANALYTICS

NMIMS UNIVERSITY
CENTRE FOR DISTANCE AND ONLINE EDUCATION

Edited by:
Dr. Brinda Sampat
NMIMS Centre for Distance and Online Education

ISBN: 978-93-86052-15-5

NMIMS Centre for Distance and Online Education
Address: V. L. Mehta Road, Vile Parle (W), Mumbai - 400 056, India.
C O N T E N T S

CHAPTER NO.  CHAPTER NAME                                              PAGE NO.

1   Business Transformation with Big Data                              1
2   Technologies for Handling Big Data                                 29
3   Basics of Business Analytics                                       69
4   Resource Considerations to Support Business Analytics              87
5   Descriptive Analytics                                              111
6   Predictive Analytics                                               141
7   Prescriptive Analytics                                             163
8   Social Media Analytics and Mobile Analytics                        181
9   Data Visualisation                                                 239
10  Business Analytics in Practice                                     263
11  Case Studies                                                       287
FUNDAMENTALS OF BIG DATA &
BUSINESS ANALYTICS

CURRICULUM

Business Transformation with Big Data: What is Big Data; Structured v/s Unstructured Data; Big Data Skills and Sources of Big Data; Big Data Adoption; Characteristics of Big Data - The Seven V's; Understanding Big Data with Examples; Key Aspects of a Big Data Platform; Governance for Big Data; Text Analytics and Streams; Business Applications of Big Data; Technology Infrastructure Required to Store, Handle, and Manage Big Data

Technologies for Handling Big Data: Distributed and Parallel Computing for Big Data; Introduction to Big Data Technologies (Hadoop, Python, R, etc.); Cloud Computing and Big Data; In-Memory Technology for Big Data; Big Data Techniques (Massive Parallelism; Data Distribution; High-Performance Computing; Task and Thread Management; Data Mining and Analytics; Data Retrieval; Machine Learning; Data Visualization)

Introduction to Business Analytics: What is Business Analytics (BA)?; Types of BA; Business Analytics Model; Importance of Business Analytics Now; What is Business Intelligence (BI)?; Relation between BI and BA; Emerging Trends in BI and BA

Resource Considerations to Support Business Analytics: What is Data, Information and Knowledge; Business Analytics Personnel and their Roles; Required Competencies for an Analyst; Business Analytics Data; Ensuring Data Quality; Technology for Business Analytics; Managing Change

Descriptive Analytics: What is Descriptive Analytics; Visualizing and Exploring Data; Descriptive Statistics; Sampling and Estimation; Introduction to Probability Distributions

Predictive Analytics: What is Predictive Analytics; Introduction to Predictive Modeling: Logic-driven and Data-driven Models; Data Mining; Data Mining Methodologies

Prescriptive Analytics: What is Prescriptive Analytics; Introduction to Prescriptive Modeling; Nonlinear Optimization

Social Media Analytics, Mobile Analytics, and Visualization

□ Social media analytics: What Is Social Media?; Social Analytics, Metrics, and Measurement; Key Elements of Social Media Analytics
□ Mobile analytics: Introducing Mobile Analytics; Mobile Analytics Tools; Performing Mobile Analytics
□ Big Data visualization techniques: What Is Visualization?; Importance of Big Data Visualization; Big Data Visualization Tools
□ Business Analytics in Practice: Financial and Fraud Analytics, HR Analytics, Marketing Analytics, Healthcare Analytics, Supply Chain Analytics, Web Analytics, Sports Analytics and Analytics for Government and NGOs
CONTENTS

1.1  Introduction
1.2  Evolution of Big Data
     Self Assessment Questions
     Activity
1.3  Structured v/s Unstructured Data
     Self Assessment Questions
     Activity
1.4  Big Data Skills and Sources
     1.4.1  The Sources of Big Data
     Self Assessment Questions
     Activity
1.5  Big Data Adoption
     1.5.1  Use of Big Data in Social Networking
     1.5.2  Use of Big Data in Preventing Fraudulent Activities
     1.5.3  Use of Big Data in Retail Industry
     Self Assessment Questions
     Activity
1.6  Characteristics of Big Data - The Seven Vs
     Self Assessment Questions
     Activity
1.7  Big Data Analytics
     1.7.1  Advantages of Big Data Analytics
     Self Assessment Questions
     Activity
1.8  Key Aspects of a Big Data Platform
     Self Assessment Questions
     Activity
1.9  Governance for Big Data
     Self Assessment Questions
     Activity
1.10 Text Analytics
     Self Assessment Questions
     Activity
1.11 Business Applications of Big Data
     Self Assessment Questions
     Activity
1.12 Technology Infrastructure Requirement
     1.12.1  Storing of Big Data
     1.12.2  Handling of Big Data
     1.12.3  Managing Big Data
     Self Assessment Questions
     Activity
1.13 Summary
1.14 Descriptive Questions
1.15 Answers and Hints
1.16 Suggested Readings & References
INTRODUCTORY CASELET

BIG DATA HANDLING IN CGL CORPORATION


CGL Inc., a $10 billion IT corporation, has over 30,000 data centres across the world. With new-age virtualisation support catching up, CGL has already virtualised 82% of its data centres and is now aiming at 95% data centre virtualisation. The customers of CGL belong to multiple domains - industrial, IT, pharmaceuticals, aviation, government, defence, and so on. The data related to these domains, such as customer details, products, services and network activity, is what actually defines the business intelligence.

The data itself is a repository of great informational value for the industries it serves and the customers it deals with. This data alone is of immense value to the corporation and serves as a driving factor for most of its business decisions, its way-forward strategy, trend analysis and internal quality-control policy formulation.

However, at the same time, it also accounts for an immense magnitude of never-ending unstructured data, like videos, images, documents, server configurations, customer set-ups, infrastructure details, and so on. To unleash the actual BI potential lying underneath those mountains of information, the corporation decided to implement Hadoop - an open source framework for distributed data-intensive applications. Overall, the implementation has resulted in a positive yield with lessened disk I/O bottlenecks and has provided linear scalability.

Going forward, CGL expects the consolidation of data scattered across different data centres throughout the world so that basic functions like retrieval and data-fetching operations can be performed faster. The Big Data analytical strategy, while fruitful, needs to be adaptive enough to accommodate further changes in methodologies and the technicalities of businesses, and to become a multi-source dividend-yielding platform.
LEARNING OBJECTIVES

After studying this chapter, you will be able to:

Discuss the evolution of Big Data
Describe the differences between structured and unstructured data
Explain Big Data skills and sources
Describe Big Data adoption
Elucidate the characteristics of Big Data
Explain Big Data analytics
Describe key aspects of a Big Data platform
Elucidate governance for Big Data
Discuss text analytics
Describe business applications of Big Data
Explain technology infrastructure requirements

1.1 INTRODUCTION

The 21st century is characterised by rapid advancement in the field of information technology. IT has become an integral part of daily life as well as of various industries, be it health, education, entertainment, science and technology, genetics, or business operations. In today's competitive and global economy, organisations must possess a number of skills to establish and sustain their place in the market. One of the most crucial of these skills is an understanding of, and the ability to utilise and harness, the immense potential of information technology.

This is truly an information age, where data is being generated at an alarming rate. This huge amount of data is often termed Big Data. Organisations use data generated through various sources to run their businesses. They analyse the data to understand and interpret market trends, study customer behaviour, and take financial decisions. The term 'Big Data' is now widely used, particularly in the IT industry, where it has generated various job opportunities.

Big Data consists of large datasets that cannot be managed efficiently by common database management systems. These datasets range from terabytes to exabytes. Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and social networking platforms create huge amounts of data that may reside unutilised on unknown servers for many years. However, with the evolution of Big Data, this data can be accessed and analysed on a regular basis to generate useful information.

This chapter first discusses the evolution of Big Data. Next, it describes the differences between structured and unstructured data. Further, the chapter explains Big Data skills and sources, and then the adoption of Big Data. The chapter also discusses the characteristics of Big Data and Big Data analytics, followed by the key aspects of a Big Data platform and text analytics. Towards the end, the chapter discusses business applications of Big Data and technology infrastructure requirements.

1.2 EVOLUTION OF BIG DATA

The earliest need for managing large datasets of information originated back in the late nineteenth century, around 1880, when the US census authorities faced a critical problem: they had data on a huge number of citizens, including age, sex and gender, and even records of people classified as 'insane', and so on. The data also included people who had been displaced after the great railroad programme into random habitats or to places far away from their original ones. The authorities felt the need for an efficient system that could hold data of such dynamics.

In 1890, the Hollerith Tabulating System was utilised for the census - it was a mechanical device that worked with punch cards which could hold 80 different variables or attributes. It revolutionised the way the census was conducted and reduced the time taken for compilation of census data from almost seven years to six weeks.

Some years later, in 1919, IBM took up the agricultural census, with over 5,000 federal employees deployed across Washington and over 90,000 enumerators, using more than 100 million IBM punch cards and other processing equipment. After that successful programme, Big Data took yet another leap forward with the development of the Manhattan Project - the atomic bomb developed by the US in World War II - and further in the US space programmes from the 1950s. Later, a synoptic data collection model was adopted, which relied heavily on the allocation of large datasets. This shift in data-collecting techniques, analysis, and subsequent collaboration helped redefine how bigger scientific projects were planned and accomplished. One such ambitious project was the International Biological Program, which studied environmental changes on the species and flora-fauna of a particular place. This programme led to an exponential increase in the amount of data gathered and combined the latest analysis technologies. Although it met with difficulties related to research structures and methodologies, and ultimately ended in 1974, it opened up a host of transformed ways in which data was collected, organised and shared, and redefined the ways existing technology could use data science more efficiently.

The lessons gained from the arrival of Big Data science laid the way for further contemporary Big Data projects, like weather prediction, supercollider data analytics and other physics-based research, astronomical sciences and data collection like planetary image detection, medical research and many others. Big Data has become such a dynamic force that it no longer applies only to the sciences; many businesses have hooked their critical data-based services onto its methodologies, techniques and objectives too, which has allowed them to unleash data value that might have gone unnoticed earlier.

SELF ASSESSMENT QUESTIONS

1. The path towards modern Big Data was actually laid during __________.

2. In 1890, the Hollerith Tabulating System was utilised for the census. (True/False)

ACTIVITY

Where else, apart from existing industries and domains, do you think Big Data can play a crucial role in improving overall operational and organisational efficiency? Make a list of the domains with reasons to back them up.

1.3 STRUCTURED V/S UNSTRUCTURED DATA
Anything that has a well-defined arrangement, an easy-to-understand structure and a comprehensible hierarchy is considered a structurally sound entity. Anything that does not have the above-mentioned attributes is considered an unorganised and structurally weak entity.

For example, imagine a 10 GB Outlook .pst file (an Outlook email data file) with mails from the last two years for a company executive who receives over 100 emails per day. If you open it raw, by means of reverse engineering, all you are going to see is a sea of randomly occurring datasets that point to nothing, with hard-to-decipher meanings, number codes and the occasional sighting of familiar words. But if you open it in the program it is made for, you will see the structure and arrangement in which it is supposed to be presented and aligned.

So, does anything that has a structure fall into place everywhere as structured data?

No. Actually, a Word file may not fit in a database where only text files are supposed to be kept. A Word file may have an internal structure, with all sorts of indentation, grammar, alignment and margins thoroughly worked upon, but in a database with different definitions for the data, the database designer expects a text or Excel file, and a Word file is considered unstructured.

The joys of having structurally sound data are many: it can be seamlessly added to a relational database and is easily searchable by the simplest of search engine operations or even algorithms. Unstructured data is basically the reverse of the above definition. It is a nightmare for designers to connect the random strands of such data with the existing meaningful ones and present it as a structure. Structured data is closer to machine language than unstructured data. So, the battle of finding a fine balance between keeping the machine happy and the user happier is what leads to the ever-refining Big Data sciences and their affiliated technologies.

SELF ASSESSMENT QUESTIONS

3. Anything that has a well-defined arrangement, easy-to-understand structure and comprehensible hierarchy is considered a structurally sound entity. (True/False)

ACTIVITY

In your day-to-day life, note down all the structured data patterns that you have observed for a week and compare them with the unstructured patterns around you. Now, think of ways in which the connection between them can be laid, if required. Please note: relate only logically cohesive things, i.e., things that can co-exist.

EXHIBIT
Semi-structured data

Semi-structured data, also known as having a schema-less or self-describing structure, refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data. Such data does not follow the proper structure of data models as in relational databases. In other words, the data is stored inconsistently in rows and columns of a database. Some sources of semi-structured data include:
□ File systems, such as Web data in the form of cookies
□ Data exchange formats, such as JavaScript Object Notation (JSON) data

Now, consider the following scenario:

Mr. Smith observes the presence of some semi-structured data saved in the database system of the publishing house. This data contains personal details of the authors working for the publishing house, as shown in the following table:

SEMI-STRUCTURED DATA
S.No.  Name                        E-mail
1.     Sam Jacobs                  [email protected]
2.     First Name: David           [email protected]
       Last Name: Brown

As you can notice from the preceding table, semi-structured data indicates that entities belonging to the same class can have different attributes even if they are grouped together. In this case, different name formats and different e-mails are grouped under a common column name.
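To make the idea concrete, here is a minimal sketch of the same kind of semi-structured author data expressed as JSON (an illustrative example only; the records, field names and e-mail addresses are hypothetical), showing how records in one collection can carry different attributes:

import json

# Two self-describing author records in one collection; note the
# inconsistent fields -- that is what makes the data semi-structured.
raw = '''[
  {"name": "Sam Jacobs", "email": "sam@example.com"},
  {"first_name": "David", "last_name": "Brown", "email": "david@example.com"}
]'''

for record in json.loads(raw):
    # The tags (keys) travel with the data, so a consumer can handle
    # missing or extra attributes gracefully.
    name = record.get("name") or f"{record.get('first_name', '')} {record.get('last_name', '')}".strip()
    print(name, "-", record["email"])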

1.4 BIG DATA SKILLS AND SOURCES


Now that we know theoretically what Big Data means - its evolution into data science, the dramatic turnaround that made several industries latch onto it, and the kinds of data that exist in the known data sphere - comes the next stage of reining in the data science, where we will look at the tools of the trade that are frequently used and the skills you need to possess to tame the dataset bulls that may seem almost intimidating at first.

Normally, while dealing with an enormous number of datasets, you need to have a good sense for observing patterns, the frequency of data occurrences and other features that help in narrowing a piece of data down to its correct place. A keen statistical and data mining mind will always take less time in finding the patterns and studying the data. Hence, it is necessary to be hands-on with statistics and to have good mathematical skills - needless to say, you don't need to be a genius.

Big Data science uses concepts of statistics and relational database programming extensively. In forensic data analysis, patterns are often recorded and studied for days before they yield anything, even with the help of sophisticated software and tools.

According to a survey, the technical skills most commonly required for Big Data positions between 2012 and 2017 comprised knowledge of NoSQL, Oracle, Java and SQL. Moreover, the technical process/methodological requirements most often cited by recruiters were in relation to Agile Software Development, Statistical Analysis, Test-Driven Development, Extract, Transform and Load (ETL) development, and Cascading Style Sheets (CSS). Besides, existing technologies such as Hadoop, a Java-based open source framework that actively supports large dataset processing, have been in the game for a long time. Within the Hadoop data framework, multiple technologies like Hive, MapReduce, Pig, HBase and so on are also an efficient medium for transforming large datasets into meaningful bits and pieces for varying degrees of requirements.

Over the next five years, demand for Big Data staff is forecast to increase at an average rate of between 13% p.a. (low growth) and 23% p.a. (high growth). A mid-point average of these two rates would give an expected growth rate of 18% p.a. This would be a favoured situation and should equate to the creation of approximately 28,000 job opportunities p.a. by 2017.

That was an overview of Big Data technologies and methodologies, with a brief look at the job prospects for a potential Big Data candidate. Let us now take a brief look at the sources of the datasets that define Big Data as a science and complete it as a method.

1.4.1 THE SOURCES OF BIG DATA

The philosophy around Big Data science and collection has often been defined around the 3 Vs - the volume, velocity and variety of data flowing into a system. For many years, this used to be enough, but as companies moved more towards online processes, this description has been stretched to take in variability as well - which simply denotes the increase in the range of values of a large dataset - and value, which addresses the valuation of typical enterprise data.

The chunk of Big Data comes from three primary sources: machine data, social data and transactional data. Besides, companies need to distinguish between internally generated data, i.e., data residing behind a corporation's firewall, and externally generated data that is imported into a system.

Whether data is structured or unstructured is also a crucial factor, since unstructured data does not have a definite data model and, hence, requires more resources to make sense out of it.

The three primary sources of data are described as follows:

□ Social data comes from the tweets, likes, comments, retweets, video uploads, and overall media shared on the world's most popular social media platforms. This type of data provides vital understanding of consumer behaviour and perception and can be hugely effective in marketing analytics. The public Web happens to be another major source of social data, and tools such as Google Trends can be used to advantageous effect to increase the Big Data volume.
□ Machine data is the data created from or by sensors installed in machinery and industrial equipment, and even logs that track typical user behaviour. This data type is likely to grow manifold as the Internet of Things (IoT) grows ever more prevalent and expands around the world. Sensors present in devices such as smart meters, medical devices, satellites, road cameras, games and the ever-growing IoT will deliver high value, velocity, variety, and volume of data in the near future.
□ Transactional data is the data generated from online and offline transactions occurring daily. Invoices, storage records, payment orders, delivery receipts - all are considered transactional data.

Despite the immense variety of existing data, these datasets and types alone are almost meaningless, and most organisations struggle to make sense of the data they are generating and how it can be put to effective use.

SELF ASSESSMENT QUESTIONS

5. Data that comes from door-to-door surveys falls in the __________ category.

6. __________ data is the data created from or by sensors installed in machinery and industrial equipment, and even logs that track typical user behaviour.

ACTIVITY

Can there be specific data types that are most reliable and authentic, while others are more prone to errors? Consider metrics such as references, quotes, and sources while creating the visualisation.

1.5 BIG DATA ADOPTION


The adoption of a contemporary technology like Big Data can
enable the altering innovation that can bring a transition in the
structure of a business, either with its services, products, or
organisation. Howev er, managing innovation requires due attention:
too many regulations can throttle the initiative and diminish the
results, and too little omis sions can turn a great project with great
intentions into a science trial that never yielded promised results.
Given the Big Data nature and its analytical prowess, there are many
issues that require consideration and planning at the very start. For
example, with the adoption of any new technology, it becomes
equally important to secure it in a way conforming to current
corporate stan dards. Tracking issues related to the source of a
dataset from its dis covery to its consumption is considered as a new
requirement for the organisations. Managing the privacy of elements
whose data or iden tity is being controlled by analytical processes
must also be planned ahead.

In fact, all the above deliberations require the organisation to identify and set up different decision frameworks and governance processes to ensure that accountable parties know about Big Data's consequences and management requirements.

As explained earlier, there are many things to consider and account for when adopting Big Data.

Big Data frameworks are not push-button answers. For data analysis and analytics to offer value, corporations ought to have data management and governance frameworks for Big Data. Complete, well-defined processes and ample skill sets for those who will be responsible for customising, implementing, populating and using Big Data solutions are also necessary. Additionally, the quality of the data intended for Big Data-powered processing needs to be evaluated as well.

1.5.1 USE OF BIG DATA IN SOCIAL NETWORKING

The magnitude of the datasets present in social media, even on not-so-popular sites, is large enough to warrant the consideration of Big Data as the crucial technology for effectively utilising the barrage of data that is waiting to be comprehended.

For example, Facebook's ad feature is a comprehensive analytical tool that studies users' activities on different e-commerce websites and targets them with contextual ads that may arouse their interest and end in a successful purchase. This may seem simple at first, but in truth it is a clever use of Big Data science, deployed to study user activity and customise user experiences in a way that is mutually rewarding for the corporations - a win-all situation at the end. Big Data is also used to gather friend requests, activity suggestions and pages to be followed - all these are nothing but Big Data behind the scenes, the chief driving force enabling you to reconnect with your long-lost friend and customise your account to your liking and interests. Not only on Facebook: the interconnection of several other social media platforms has opened the potential of a new social media world order that may be brewing with several hidden features, exploiting which can prove beneficial for all.

1.5.2 USE OF BIG DATA IN PREVENTING FRAUDULENT ACTIVITIES

"The accountant for a U.S. company recently received an e-mail from


her chief executive, who was on vacation out of the country, requesting
a transfer of funds on a time-sensitive acquisition that required comple
tion by the end of the day. The CEOsaid a lawyer would contact the ac
countant to provide further details. It was not unusual for me to receive
e-mails requesting a transfer of funds," the accountant later wrote, and
when shewas contacted by the lawyer via e-mail, she noted the
appropri ate letter of authorisation-including herCEO's signature over
the com-
F.UNDAMENTAI::S :

N O T E S

pany's seal.-andfollowed the instructions to wire more than$737,000 to


a bank in China."

The clerk for a U.S. council received an e-mail from her senior, who was out of the country on vacation, requesting a funds transfer for a time-bound acquisition that needed to be closed by the end of the day. The senior said that a lawyer would contact her to provide further details. "It was not uncommon for me to get official e-mails seeking funds transfer," the clerk said. Later, the lawyer contacted her via e-mail with the appropriate authorisation - including her senior's signature with the company's seal - and she simply followed the directions to transfer more than $880,000 to a bank in China.
Clearly, to handle such attacks, you need a unique defence outlook, and here Big Data offers a potential answer, as it allows institutions and corporations to tackle fraud differently and get results accordingly.
Here is how Big Data helps in preventing fraud:

□ Recognising suspicious activities in advance: Banks are always on the lookout for real-time data showing suspicious behaviour. For example, if a credit card owner transacts for the first time from a particular device, the bank gets notified. If multiple transactions occur from different devices in a day, the subsequently generated data is enough to raise the alarm and red-flag the transactions. A few banks also inform the actual card holders instantly and can prohibit the transaction. Big Data is simplifying the detection of unusual transactions: if two transactions take place on a single credit card in different cities within a short period, the bank is going to get alerted, as sketched in the example after this list.
□ Leveraging data to detect suspicious activities: Banks access large amounts of customer data from various sources, such as social media, logs and call centre conversations, and that data can be very helpful in determining abnormal activities. For example, suppose a credit card holder is currently travelling on an airplane and has posted his present status on Facebook. Any transaction on the user's credit card during that period is then considered suspicious and can be blocked at the bank's discretion.
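The following is a minimal, rule-based sketch of the "two cities, short period" check described in the first point above (an illustration only, not any bank's actual system; the two-hour window, field layout and card data are assumptions):

from datetime import datetime, timedelta

# Hypothetical transaction records: (card_id, city, timestamp)
transactions = [
    ("card-42", "Mumbai", datetime(2024, 3, 1, 10, 0)),
    ("card-42", "Delhi",  datetime(2024, 3, 1, 10, 45)),
]

WINDOW = timedelta(hours=2)  # assumed "short period" threshold

def flag_suspicious(txns):
    # Red-flag consecutive transactions on the same card made from
    # different cities within WINDOW of each other.
    alerts = []
    txns = sorted(txns, key=lambda t: t[2])
    for (c1, city1, t1), (c2, city2, t2) in zip(txns, txns[1:]):
        if c1 == c2 and city1 != city2 and (t2 - t1) <= WINDOW:
            alerts.append((c1, city1, city2))
    return alerts

print(flag_suspicious(transactions))  # -> [('card-42', 'Mumbai', 'Delhi')]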
Let us now consider the insurance industry, which receives a lot of deceitful claims, and even accepts some of them and disburses substantial claim amounts. How does Big Data assist in such a case? The industry can access data gained from a variety of sources, such as past claim records, social media, phone records and criminal records. Upon receipt of a claim, the scrutiniser should verify the claimant's information. If any suspicious activity is found in the claimant's record, the claim should be forwarded for additional investigation.

The Chinese e-commerce giant Alibaba utilises Big Data effectively to handle fraud by subjecting any suspected fraudster to five stages of verification: Device Check, Account Check, Risk Strategy, Activity Check, and Manual Review. Each step uses the immense amount of seller-related data and activities. For example, in the first stage, multiple questions may be asked, such as any previous record of suspicious activity, retailing experience, and so on. The second layer inspects the technicalities, such as the IP address and device ID, the number of devices the seller has used or is going to use, and so on.

Using Big Data provides industries involved in critical financial transactions an opportunity to avoid scams to a great extent. However, Big Data usage in such industries is still in its early stages, and a lot has to be done in this regard. Using Big Data requires companies to be conducive to change; they need to learn to be data-driven and data-centric and to solve problems that call for bigger datasets. A cultural change needs to happen for Big Data solutions to become the universal norm across the industry - including solutions that don't work or take you to a dead end but invariably end up educating you.

1.5.3 USE OF BIG DATA IN RETAIL INDUSTRY

Big Data has brought in some remarkable results for retailers across industries, as is evident from their testimonials.

A famous jewellery shop claims a 47% increase in holiday season sales, all thanks to Big Data. Similarly, a prominent hotel chain experienced increased online and over-the-phone reservations and enquiries after implementing a Big Data solution recommended by the consultants it had employed to drive its business. As per analysis undertaken and completed by McKinsey, more than 250 data-centric and data-driven approaches owned by companies over a five-year period of sales and marketing choices improved their overall ROI by 20 to 25 percent.

However, as with any other great bargain, plenty of obstacles and cynicism still remain around using Big Data as the key retail transition expert. Big Data is creating a lot of interest, as confirmed by many senior executives, but most of them struggle with common challenges - like aligning Big Data with the use cases, identifying new (usually unstructured) types of data, and working out how to utilise Big Data for faster and more efficient decision-making.

Some clever usages of Big Data in the retail industry are true examples of the creative thinking of solution architects. Consider the following Big Data example, in which a hotel used Big Data to increase reservations.

Bad weather naturally results in decreased overnight stays at hotels due to fewer travellers. If you're in the hotel business, places with such unpredictable weather are not good. However, Cafe Inn turned this adversity to its advantage. It observed that the travellers of a cancelled flight end up in an urgent situation and need an overnight stay. The company used weather and flight-cancellation information that was readily and freely available, coupled with hotel and airport information, and developed an algorithm which took into account factors like travel conditions, weather severity, time of day and airlines' cancellation rates, among other variables. With Big Data insights, and pattern recognition of travellers using mobiles for this use case, the company effectively used Pay Per Click (PPC) and mobile search campaigns to send targeted mobile ads to stranded travellers, making it easier for them to book a nearby hotel and increasing overall hotel revenue manifold, even at the most unexpected of times.

There are several such case studies and stories where Big Data's effective utilisation resulted in a great turnaround for corporations.

SELF ASSESSMENT QUESTIONS

7. The fashion industry can utilise __________ to predict the next stage of fashion resurgence.

ACTIVITY

Big Data for retail industries can be a hit-and-miss affair. Discuss with your friends.

1.6 CHARACTERISTICS OF BIG DATA - THE SEVEN Vs
The seven Vs of Big Data almost perfectly define its true attributes and sum it up as an effective yet extremely straightforward solution for datasets that involve an incredible amount of information. The key Vs used in Big Data are:

□ Volume: When deliberating Big Data volumes, incredible sizes and numerical terms are required. Each day, data to the tune of 2.5 quintillion bytes is produced. Most companies have, on average, 100 terabytes of data stored, and Facebook users upload that much data on a daily basis.
□ Velocity: The speed at which data is accumulated, generated and analysed is considered vital to having more responsive, accurate and profitable solutions. Knowledge of the rate of data generation results in a faster system, ready to handle that traffic.
□ Variety: Beyond the massive volumes and data velocities lies another challenge - operating on the vast variety of data. Seen as a whole, these datasets are incomprehensible without any finite or defined structure.
□ Variability: A single word can have multiple meanings. Newer trends are created and older ones are discarded over time - the same goes for meanings as well. Big Data's limitless variability poses a unique deciphering challenge if its full potential is to be realised.
□ Veracity: What Big Data tells you and what the data tells you can be two different situations. If the data being analysed is incomplete or inaccurate, the Big Data solution will be erroneous. This situation occurs when data streams from multiple sources arrive in a variety of formats. The veracity of the overall analysis and effort is useless without first cleaning up the data it begins with (a small data-cleaning sketch appears below).
□ Visualisation: Another daunting task for a Big Data system is to represent the immense scale of information it processes in something easily comprehensible and actionable. For human purposes, the best methods are conversion into graphical formats like charts, graphs, diagrams, etc.
□ Value: Big Data offers excellent value to those who can actually play with and tame it at its scale and unlock the true knowledge. It also offers newer and more effective methods of putting new products to their true value, even in formerly unknown markets and demands.

While Velocity, Volume and Variety are inherent to Big Data itself, the other Vs of Variability, Value, Veracity and Visualisation are important properties that reflect the gigantic complexity that Big Data presents to those who would analyse, process and benefit from it.
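As a concrete illustration of the veracity point above, here is a minimal data-cleaning sketch (the records and field names are hypothetical, and the pandas library is assumed to be available):

import pandas as pd

# A hypothetical transaction feed merged from several sources/formats.
raw = pd.DataFrame({
    "customer": ["A101", "A101", "B202", None],
    "amount":   [250.0, 250.0, None, 99.0],
    "city":     ["Mumbai", "Mumbai", "Pune", "Delhi"],
})

clean = (
    raw.drop_duplicates()   # identical rows arriving from overlapping streams
       .dropna()            # rows missing a customer or an amount
       .reset_index(drop=True)
)
print(clean)  # only complete, unique records reach the analysis stage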

SELF ASSESSMENT QUESTIONS

8. If a dataset complies with all the Vs but fails in one, due to incorrect details in the data received, which V does it fail to adhere to?

ACTIVITY

Are seven Vs enough, or too many, for Big Data classification? Critically explain both cases with examples.

1.7 BIG DATA ANALYTICS


Big Data analytics is a set of advanced analytic techniques used against very large, miscellaneous datasets that include unstructured and structured data, batch and streaming data, and sizes ranging from terabytes to zettabytes. Analysis of Big Data allows researchers, analysts, and business users to make better and faster decisions using data that was previously unusable or inaccessible. Using advanced analytics techniques such as machine learning, text analytics, predictive analytics, statistics, data mining, and natural language processing, businesses can examine previously untouched data sources, independently or together with their existing enterprise data, to gain new perceptions resulting in faster and better decisions.

1.7.1 ADVANTAGES OF BIG DATA ANALYTICS


Big Data analytics helps corporations utilise their data to identify new opportunities, which in turn leads to more efficient operations, smarter and well-calculated business moves, happier clients and higher revenues. Companies are actively looking for workable insights from their data. Many Big Data projects are initiated from the need to answer key business requirements and questions. With the selection of a correct Big Data platform, an enterprise can increase efficiency and sales, improve operations, and be better at managing risks and servicing customers.
□ Cost reduction: Big Data technologies like Hadoop bring substantial cost advantages when it comes to storing large amounts of data, and they help recognise more efficient ways of doing business.
□ Better and faster decision making: With evolving new-age technologies and in-memory analytics, coupled with the ability to analyse new data sources, corporations are now able to analyse information immediately - and make decisions based on what they learn.
□ New services and products: The clarity to read customers' needs and analyse their satisfaction enables the power to give consumers what they want - even to the level of tailor-making the solution to the requirements of each customer individually. Such technological prowess has enabled and opened up further potential arenas of customer servicing.

SELF ASSESSMENT QUESTIONS

9. The process of evaluating a situation and analysing it for creating faster and more efficient decision-making systems is called __________.
ACTIVITY

Analyse a real-life situation around you that could use Big Data analytics to increase overall operational and functional efficiency.

1.8 KEY ASPECTS OF A BIG DATA PLATFORM
For most organisations, the answer to many questions is Big Data itself - the massive volumes of structured and unstructured data generated within the organisation. Being able to analyse all this data in a meaningful way can be an intimidating task without the proper infrastructure and ways to process data from diverse sources effectively. And once you have managed that, it is another fight to make it meaningful to the people who need to understand it. So, for organisations to build the correct Big Data policy, here are the five crucial components to consider:
□ A universal data model: Ensure your entire data estate is centralised and unified in a common data model to provide a single accurate view of the business. The conventions for the common data model, such as naming, field relationships and attributes, are created by the data model itself in a way that keeps everything aligned across transactional and other related systems.
□ Exploit the power of external data: Capturing the true meaning of the data means successfully integrating initial data from internal sources with external data from diverse environments (like social media, vendor data and demographics). The platform should be flexible enough to accommodate information in multiple ways from multiple structured or unstructured distributed databases.
□ Focus on open standards and scalability: Organisations can utilise existing systems efficiently by using a platform with scalable standards, simultaneously gaining flexibility and reducing IT-related costs for the business. Open industry-standard-compliant systems are readily available and preferred for many reasons, one being their effortless integration with existing systems from multiple other vendors, legacy systems and future add-on solutions.
□ Platform-independent model: In today's age, information is readily accessible across various platforms, hence organisations must ensure a universal infrastructure for delivering and producing scorecards, dashboards, enterprise reports and ad-hoc analysis, while giving end-users real-time, round-the-clock access to mobile BI and self-service BI, and the capacity to tailor-make their own BI content and customised dashboards using a simple point-and-click interface.
□ Provide users with insights: Users need a single point at which to act on information, rather than switching between tasks or multiple applications. This type of cross-domain, closed-loop analytics ensures that Big Data will have an instant beneficial and informative impact on daily operations.

Establishing a foundation for leveraging Big Data is worth the extra effort. When business users can make decisions and take actions straight from the analytics dashboard, the positive impact on customer experience and operations is almost instantaneous.

SELF ASSESSMENT QUESTIONS

10. An ideal Big Data platform should not have:
a. A singular access mode
b. Platform-independent architecture
c. A well-defined structure
d. An end-user-friendly, easily accessible design

ACTIVITY

How would you unify different datasets if you were given an opportunity to design and develop an architecture?

1.9 GOVERNANCE FOR BIG DATA


Big Data governance is a crucial factor in managing diverse datasets, because such data often poses risks, like unplanned costs and incorrect or misleading input data. Since Big Data is a new model, ever changing with the dynamics of industries, data governance is at a nascent stage and not many know about it. With policies and procedures yet to be developed, many governance companies are offering services to help companies organise their data.
Data modelling tools that aid in governing data have unified the metadata repository, further allowing effective integration and metadata aggregation from different data sources. Big Data governance gives data the required validation and the authority to selectively distribute the data within or outside the organisation. These modelling tools also provide great graphical data representation and advanced research while maintaining accuracy. This gives an organisation the scalability to discover and study different implications of Big Data.
Data governance tools actively deploy data pipelining technology, which enables sequential data processing where the output of one process works as the input for the next (a minimal sketch of such a pipeline follows). With these pipelines being linear or dynamic, the scale of data flexibility becomes high in data governance.
A data governance strategy is a must - you can borrow it from another successful strategy, but make sure to custom-fit it to your unique business needs. The strategy should cover things such as access to information, ownership of different information types, and the purposes for which data is used. While defining the strategy, consider data quality, regulatory requirements, management of the information lifecycle, and privacy and security.

A cross-functional method for data governance is recommended, and is a compliance requirement, since data and information systems often interact with different departments, which can result in an opaque, non-transparent view of the functioning for top management. Hence, there needs to be an all-inclusive approach to Big Data, driven by a team consisting of members from all the required departments, to check on auditable proof, controls, and compliance documentation.

Data usually has a lifecycle beyond which it either becomes obsolete or simply becomes a liability to be looked after. Overlooking such aspects is a common error organisations commit. Hence, a single standard retention schedule is never recommended for all data types, as they may have different retention stages. Data archival is recommended to enhance the overall performance of your applications.

SELF ASSESSMENT QUESTIONS

11. The Big Data governance model utilises the __________ thoroughly to present the information effectively.

ACTIVITY

Which governance model, other than Big Data governance, can you think of for managing low-traffic data centres?

1.10 TEXT ANALYTICS
Text analytics is the conversion of unstructured textual data into comprehensible analytical data, and it often includes processes like checking product reviews, measuring consumer opinions, analysing buyer sentiment and feedback, providing search facilities, and object modelling to ensure factual decision-making. Text analysis requires multiple statistical, linguistic and machine learning techniques and involves the retrieval of information from unstructured data and the restructuring of the input text to create patterns and trends, and to evaluate and interpret the data output. It also involves categorisation, alphabetical analysis, tagging, recognition of recurring or singular patterns, clustering, extraction of vital information, visualisation, link and association analysis, and predictive analytics. Text analytics determines topics, keywords, categories, tags and semantics from the humongous text data stored in different files and formats in a typical organisation. The term 'text analytics' is also referred to as 'text mining'.

Textual analytics software provides servers, algorithm- and tool-based applications, extraction tools and provisions for data mining to turn unstructured data into data with some value. The output is composed of recovered entities and relationships, and is stored in a relational format - typically XML - that is compliant with other analytical applications, such as Big Data analytics, business intelligence tools or predictive analytics tools. Figure 1.1 shows the text analytics process flow:

[Figure 1.1 shows the text analytics process flow: Text Analytics at the centre, connected to Text Identification, Text Mining, Text Categorisation, Text Clustering, Entity/Relation Modeling, Search Access, Summarisation and Visualisation.]

Figure 1.1: Displaying the Text Analytics Process Flow

Source: https://s-media-cache-ak0.pinimg.com/originals/05/3d/e0/053de0478bb02ab7dlb73222059fel82.jpg

The features and processes involving text analytics solutions are as follows:
□ Text parsing, mining, identification, categorisation, extraction and clustering
□ Extraction of entities, concepts, events and relations
□ Indexing, Web crawling, search access and duplicate document identification
□ Link analysis; identifying and analysing people, sentiments and other information from websites, reports, internal files, forms, surveys, claims, underwriting notes, employee surveys, medical records, blogs, emails, news, social media, online forums, customer surveys, market surveys, online reviews, website feedback, review sites, scientific journals, call centre logs, snail mail, transcripts, sales notes, etc.

Textual analytics has wider application dynamics and is often utilised in analysing market sentiment, consumer behavioural patterns and the segments that lead the domain for a given product or company. Besides, it supports a lot of actionable items, such as ad placement on sites customised to the user's browsing experience, enterprise business intelligence and records management, national security and intelligence, scientific discoveries of species (especially in the life sciences domain), etc.

SELF ASSESSMENT QUESTIONS

12. __________ is the conversion of unstructured textual data into comprehensible analytical data.

ACTIVITY

Big Data for retail industries can be a hit-and-miss affair. Explain.

1.11 BUSINESS APPLICATIONS OF BIG DATA


From the examples described earlier in this chapter, Big Data's crucial role in transforming even the most adverse situations for companies and organisations, or even smaller hotel chains, is no uncommon achievement for a model that is meant to failsafe you against the worst of conditions. But Big Data is not limited to that; it comes with much deeper and broader applications.

As discussed earlier, Big Data is used to gain a better understanding of customers' behaviours, needs and preferences. You might remember the example of the hotel chain discussed earlier, which can now almost accurately predict when the weather is going to go bad and customers will come hunting for its tailor-made services. Similarly, a car dealership can predict when the next car is going to be sold, and Walmart can predict the best-selling item at each point of time in a month, in a year or around any holiday season.

Big Data helps in optimising election campaigns as well. Reportedly, in the 2012 presidential election campaign, many believed Obama's win was because of his team's greater ability to use Big Data analytics to their advantage.

Big Data is now seeping into areas that were earlier prone to miscalculations and mispredictions, such as the stock inventory model, where a retailer could not decide whether to stock up for upcoming seasonal sales based on the factors at hand. Now, the same retailer can optimise stock using Web search trends, social media data and weather forecast predictions.

In supply chain and delivery route optimisation, Big Data is helping in a big way as well. Radio sensors, along with route optimisation based on traffic data, road blockages or even live protest detectors, are being actively used by many postal corporations. The power of Big Data analytics is now helping scientists decode entire DNA mutation sets in minutes, allowing them to find new cures, predict disease patterns and better understand genomes. Science and research are currently being transformed by Big Data and its associated techniques. For example, at CERN, the nuclear physics lab whose Large Hadron Collider is the world's most powerful and largest particle accelerator, experiments on the genesis of the universe are under way in search of the elusive God particle. The data centre responsible for managing CERN's datasets has 66,000 processors to analyse the around 30 petabytes of data produced, and it uses the distributed computing power of thousands of systems located across 140 data centres around the world. Such computing power can be utilised to change the way many other areas of science and research function and deliver their results.

SELF ASSESSMENT QUESTIONS

13. __________'s election campaign actively used Big Data analytics to gain traction over the competition, as per a report.

ACTIVITY

Pattern-based recognition and fingerprint recognition systems store their data and keep it unique based on patterns and fingerprints. What do you think facial recognition systems keep as a unique identifier, and how does Big Data help with it?

1.12 TECHNOLOGY INFRASTRUCTURE REQUIREMENT

Big Data is simply a large data repository with the following characteristics:
□ Has distributed, redundant data storage
□ Handles large amounts (a petabyte or more) of data
□ Provides data processing (MapReduce or equivalent) capabilities
□ Processes tasks in parallel
□ Is relatively inexpensive
□ Is centrally managed and orchestrated
□ Is extensible - basic capabilities can be augmented and altered
□ Is accessible - easy to use and readily available

So, the infrastructure that is going to host Big Data as the prime driver of an organisation must be robust, scalable, ductile and fail-safe for unplanned situations. But how do we arrive at such a robust scale of infrastructure? Will merely having super-expensive, high-spec systems and networking gear be enough, or does Big Data require something more than these usual factors?

Another driving force behind the successful implementation of Big Data is the software - both analytics and infrastructure. The primary infrastructure is called Hadoop - open source Big Data management software used to distribute, manage, catalogue, and query data across multiple, horizontally scaled server nodes. Hadoop is basically a framework for storing, processing and analysing massive amounts of unstructured, distributed data. The Hadoop Distributed File System (HDFS), its file storage subsystem, was planned and designed to handle trillions of bytes of data distributed in parallel across multiple nodes.

The most important components of Hadoop are the Hadoop Distributed File System (HDFS), which provides storage, and MapReduce, which provides parallel processing of large datasets. Going forward, we will use Hadoop as the chief example of a Big Data product and infrastructure (a minimal sketch of the MapReduce model follows).

1.12.1 STORING OF BIG DATA


The data, once gathered from your sources, is stored in sophisticated but accessible systems: a traditional data warehouse, a distributed or cloud-based storage system, a data lake, company servers, or even a simple computer's hard disk, depending on the magnitude of the data received. For not-so-large amounts of data, one can consider using clustered network storage as the data-storing option, given that it is well designed and has failsafe measures to withstand unpredictable storage issues. However, for larger data inflows, where a group of interconnected networks alone won't suffice, it is better to consider cloud-based data caches or professionally managed data centres. Following are the characteristics with which a typical HDFS storage system should be compliant:
□ Scalable: Storage should be flexible in throughput, size and access speed.
□ Tiered storage: It is important for the storage system to manage the hierarchy of the data across the range of storage devices present within a system, like fast disk, flash, tape and slower disk.
□ Widely accessible: Storage should be globally distributed to be closer to users for ready access.
□ Backward compatible with analytical and content applications, and legacy systems: A well-built Big Data storage system should be flexible and heterogeneous. It should be composed of interfaces allowing access to the Big Data storage and its inbuilt functionality.
□ Supports integration with cloud ecosystems: A near-perfect Big Data storage system must be built keeping cloud storage in purview, as cloud-based storage has emerged as a great option for most businesses. It is flexible, requires neither your physical presence nor physical systems onsite, and it reduces the data security problem. It is also much cheaper than investing in and maintaining expensive data warehouses and dedicated systems.

1.12.2 HANDLING OF BIG DATA

Handling large datasets is never a one-time job. Hadoop is changing the conventions of Big Data management, especially for unstructured data. Let us see how the Apache Hadoop software library plays a crucial role in managing Big Data. Apache Hadoop streamlines excess data for any distributed processing system across computer clusters using simple programming models. Instead of depending on hardware to provide uptime, the library has inbuilt features at the application layer to detect and handle breakdowns, providing a reliable and always-available service on top of a cluster of computers, each of which may be prone to failure.
The Hadoop Community Package consists of:
□ OS-level and file system abstractions
□ A MapReduce or YARN (Yet Another Resource Negotiator) engine
□ The Hadoop Distributed File System (HDFS)
□ Java ARchive (JAR) files
□ Scripts needed to start Hadoop, documentation and source code, and a contribution section

1.12.3 MANAGING BIG DATA

A lot has been discussed and written about Big Data's functioning, its associated workflows, the technologies used and the traits they need to share in order to perform efficiently. The following key points should be considered while keeping Big Data management in context:
□ Cluster design: Application requirements are evaluated in terms of volume, workload and other associated factors that form the basis of cluster design, which is not a repetitive process. The set-up in the initial stages is validated and verified with an application and data sample before being actuated. Although the cluster design of a typical Big Data structure allows scalability in tuning configuration parameters, the large number of other parameters and their impacts on each other lead to additional complexity.
□ Hardware architecture: A key factor that works in favour of Hadoop clusters is the quality of the equipment used, since most Hadoop users are concerned about cost, and as clusters grow, cost rises significantly. In the current scenario, the hardware requirements for the NameNode are higher RAM and lower or mid-lower levels of HDD. If the JobTracker runs as a separate server, it will need higher CPU speed and more RAM. DataNodes are standard, lower-end server machines.
□ Network architecture: As of now, network architecture is not designed explicitly for Big Data. Inputs from application requirements and cluster design are not always mapped to it. The standard set-up for the network within the existing datacentre is used as the primary set-up. This results in network deployments that are overprovisioned most of the time, with a negative effect on the MapReduce algorithm responsible for data processing. Hence, there lies great scope for creating actual guidelines for network architecture design for Big Data.
□ Storage architecture: Most enterprises are already hugely invested in SAN and NAS devices when they consider Big Data. During implementation, they attempt to reuse the current storage infrastructure, even though DAS is recommended as storage for Big Data clusters.
□ Information security architecture: A usual examination of multiple Big Data implementations illustrates that security features are considered secondary to other pressing requirements of a demanding system, and aftermarket security solutions are not tailor-made for these clusters. These deployments often turn out to be insecure and rely solely on perimeter and network security support.

SELF ASSESSMENT QUESTIONS

14. MapReduce was originally part of a framework developed at _______.

ACTIVITY
Can HDFS be replaced with a much more efficient system? Make a list of technologies that have the potential to do so.

SUMMARY

□ The Big Data sciences use concepts of statistics and relational database programming extensively.
□ Normally, while dealing with enormous numbers of datasets, you need a good sense for observing patterns, the frequency of data occurrences and other features that help in narrowing data down to its correct place.
□ The chunk of Big Data created comes from three primary sources: machine data, social data and transactional data.
□ The adoption of a contemporary technology like Big Data can enable the kind of innovation that brings a transition in the structure of a business, whether in its services, products or organisation.
□ Big Data has brought some remarkable results for retailers across industries, as evident from their testimonials.
□ Analysis of Big Data allows researchers, analysts and business users to make better and faster decisions using data that was previously unusable or inaccessible.

KEYWORDS

□ Big Data analytics: A set of advanced analytic techniques used against very large, miscellaneous data sets.
□ Structured data: Data with a well-defined arrangement, an easy-to-understand structure and a comprehensible hierarchy.
□ Social data: Data that comes from the tweets, likes, comments, retweets, video uploads and overall media shared on the world's most popular social media platforms.
□ Transactional data: Data that is generated from online and offline transactions occurring daily.
□ Unstructured data: Data that is not well organised.

DESCRIPTIVE QUESTIONS


1. Discuss the evolution of Big Data.
2. What are the basic differences between structured and
unstructured data?
3. Enlist and explain different sources of Big Data.
4. Explain various characteristics of Big Data.
5. What are different advantages of Big Data?
6. Explain the concept of text analytics with suitable examples.

ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                       Q. No.   Answers

Evolution of Big Data                       1.       Late Fifties
                                            2.       True
Structured v/s Unstructured Data            3.       True
                                            4.       Machine
Big Data Skills and Sources                 5.       Transactional
                                            6.       Machine
Big Data Adoption                           7.       Analytics
Characteristics of Big Data - The Seven Vs  8.       Veracity
Big Data Analytics                          9.       Big Data analytics
Key Aspects of a Big Data Platform          10.      a. Singular Access Model
Governance for Big Data                     11.      Data Model
Text Analytics                              12.      Text Analytics
Business Applications of Big Data           13.      Presidential
Technology Infrastructure Requirement       14.      Google

ANSWERS FOR DESCRIPTIVE QUESTIONS


1. The earliest need for managing large datasets of information originated back in the late nineteenth century, around 1880. Refer to Section 1.2 Evolution of Big Data.
2. Anything that has a well-defined arrangement, an easy-to-understand structure and a comprehensible hierarchy is considered a structurally sound entity. Refer to Section 1.3 Structured v/s Unstructured Data.
3. Whether data is structured or unstructured is also a crucial factor, since unstructured data does not have a definite data model and, hence, requires more resources to make sense out of it. Refer to Section 1.4 Big Data Skills and Sources.
4. The seven signs of Big Data define the true Big Data attributes and sum it up as an effective yet extremely straightforward solution for those datasets that require dealing with incredibly plumped-up information. Refer to Section 1.6 Characteristics of Big Data - The Seven Vs.
5. Big Data analytics is a set of advanced analytic techniques used against very large, miscellaneous data sets that include unstructured/structured and batch/streaming data of sizes ranging from terabytes to zettabytes. Refer to Section 1.7 Big Data Analytics.
6. Text analysis requires multiple statistical, linguistic and machine learning techniques and involves retrieval of information from unstructured data and restructuring of the input text to create patterns and trends, and to evaluate and interpret the data output. Refer to Section 1.10 Text Analytics.
SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
□ Mayer-Schonberger, V., & Cukier, K. (2014). Big data: A revolution that will transform how we live, work, and think. Boston: Mariner Books, Houghton Mifflin Harcourt.
□ Erl, T., Khattak, W., & Buhler, P. (2016). Big data fundamentals: Concepts, drivers & techniques. Boston: Prentice Hall.

E-REFERENCES
□ What is Big Data and why it matters. (n.d.). Retrieved April 22, 2017, from https://www.sas.com/en_us/insights/big-data/what-is-big-data.html
□ Big Data. (2017, March 17). Retrieved April 22, 2017, from https://www.ibm.com/big-data/us/en/
CONTENTS

2.1 Introduction
2.2 Distributed and Parallel Computing for Big Data
    Self Assessment Questions
    Activity
2.3 Introduction to Big Data Technologies
    2.3.1 Hadoop
    2.3.2 Python
    2.3.3 R
    Self Assessment Questions
    Activity
2.4 Cloud Computing and Big Data
    Self Assessment Questions
    Activity
2.5 In-Memory Technology for Big Data
    Self Assessment Questions
    Activity
2.6 Big Data Techniques
    2.6.1 Massive Parallelism
    2.6.2 Data Distribution
    2.6.3 High-Performance Computing
    2.6.4 Task and Thread Management
    2.6.5 Data Mining and Analytics
    2.6.6 Data Retrieval
    2.6.7 Machine Learning
    2.6.8 Data Visualisation
    Self Assessment Questions
    Activity
2.7 Summary
2.8 Descriptive Questions
2.9 Answers and Hints
2.10 Suggested Readings & References
INTRODUCTORY CASELET
IMPROVED DATA SECURITY WITH CISCO AND MapR TECHNOLOGIES

Solutionary, a company located in Omaha, Nebraska, provides IT security and managed services to its customers. It has more than 310 employees who handle trillions of customer queries per year. The main challenge for the company was to increase its data analytics capabilities to improve data security for its customers. The company also wanted to improve scalability, as the number of clients and the datasets grow remarkably every year. In addition, the company wanted to reduce the costs of expanding its database solution to meet current business demands.
The company formed a partnership with Cisco and MapR Technologies for implementing the Cisco UCS Common Platform Architecture (CPA) for Big Data. MapR Technologies suggested the Apache Hadoop solution, which provides a completely new way of handling Big Data. Unlike traditional databases that store structured data only, Hadoop allows Solutionary to distribute and analyse both types of data, structured and unstructured, smoothly on a single data infrastructure.
This partnership resulted in the following benefits for the company:
□ Less time required to investigate security events for relevance and impact
□ Easy data availability along with new services and enhanced security features
□ Enhanced agility along with on-demand deployment of applications or services
According to Dave Caplinger, Director of Architecture at Solutionary, "By implementing MapR and Cisco UCS, we have achieved performance and flexibility with incredible scalability via Hadoop's clustered infrastructure. This infrastructure allows us to perform real-time analysis on big data in order to help protect and defend against sophisticated, organised, and state-sponsored adversaries."

He also declares, "MapR and Cisco UCS have many of the same values: high performance, efficient management, and ease of use. Using both solutions together enables us to scale our security analysis services while keeping complexity and cost under control."
LEARNING OBJECTIVES

After studying this chapter, you will be able to:
□ Explain distributed and parallel computing for Big Data
□ Recognise Big Data technologies
□ Describe cloud computing in reference to Big Data
□ Discuss in-memory technology for Big Data
□ Elucidate Big Data techniques

2.1 INTRODUCTION

The market is flooded with corporations offering custom-made tools and frameworks for implementing Big Data and analytics. However, behind the branding and beneath the platform, the basic features are common to all. Given below is a list of methods and practices that are usually followed in a typical Big Data implementation:
□ NoSQL database: It offers a provision for the storage and extraction of data modelled in means other than the tabular relations of typical relational databases, to cater efficiently to real-time situations.
□ Data incorporation: Data management tools available as solutions, like Amazon Elastic MapReduce (EMR), that run customised versions of Apache Hive, Pig, Spark, Couchbase, MapReduce, Hadoop, MongoDB, etc. underneath.
□ Data virtualisation: Virtualisation of multiple data sources into one helps in real-time extraction, fetching and storage operations from multiple sources such as Hadoop and distributed data stores, all from a single point.
□ Search and knowledge finding: These tools and applications aid self-serviced processes that extract information and new findings from humongous storage spaces consisting of structured/unstructured data residing in numerous sources such as databases, file systems, APIs, streams, and other platforms and applications.
□ Stream analysis: These tools and applications can enrich, aggregate, filter and analyse a high data influx from multiple incongruent real-time data sources and in any format.
□ Data memory composition: These tools provide faster access to and processing of humongous data by spreading it across the dynamic RAM, SSD or Flash storage of a distributed computer system.
□ Big Data predictive analytics: Predictive analysis is simply the analysis of expected events and pre-planning to manage those events that might have an impact on the overall structural, operational and functional aspects of an organisation. It usually comprises hardware or tool-based solutions that let the organisation discover, evaluate, deploy and optimise predictive models by evaluating Big Data sources, to better business performance and alleviate risks.
□ Quality of data: These products perform data cleansing and improvement on voluminous, high-speed datasets, using simultaneous operations on distributed databases and storage. They consist of software that performs the process of sourcing, cleansing, shaping and sharing different and untidy datasets to make the final data useful for analytics.

In this chapter, you will first learn about distributed and parallel computing for Big Data. Next, you will learn the basics of Big Data technologies. Further, you will study cloud computing in reference to Big Data. Next, you will learn about in-memory technology for Big Data. Towards the end, you will learn about various Big Data techniques.

2.2 DISTRIBUTED AND PARALLEL COMPUTING FOR BIG DATA
In Big Data, computing terminologies have meanings similar to those they have in other fields, although with a different scope of applicability. Let us look at what they mean and what they stand for:
□ Distributed computing: It works on the rules of the divide and conquer approach, performing modules of parent tasks on multiple machines and then combining the results. It is basically multiple processors interconnected by communication links, as opposed to parallel computing models, which usually (but not always) work on shared memory. Distributed systems basically aim at passing messages. If systems are separated by geographically different locations, such a setup is said to be characteristically distributed. Imagine your computer and your friend's computer in the same room, joined by some interconnecting technology into a single system for performing a task. Such a system would be called a parallel system. Now consider the same setup, but with your friend's computer miles away from yours and connected to a node that is common to both systems' processing power. Such a setup would be called distributed.
□ Parallel computing: Parallel computing refers to the utilisation of a single CPU present in a system, or a group of internally coupled systems, by means of efficient and clever multi-threading operations. It aims at finishing a specific computation operation in the lowest time possible by utilising multiple processors. The processor scale may vary from multiple logical units within a single processor, to many memory-sharing processors, to computational process distribution across multiple computers. On computational models, parallelism is simply the execution of internal simultaneous threads of computation to achieve a final result. Parallelism is evident in finite real-time systems consisting of multiple processors with a single master clock used by all. In the context of Big Data, such parallel systems are the ones that execute from multiple dataset throughput points and run in parallel, connected to a master system. Parallel computing is a close-coupled approach and is used in solving the following:
♦ Compute-intensive problems
♦ Bigger problems in the same time
♦ Similar-sized problems in the same time with high precision
Figure 2.1 shows a comparison between distributed and parallel processing techniques:

[Figure 2.1 depicts a distributed computing setup, with grid nodes coordinated by a control server, alongside a parallel computing setup, with multiple CPUs attached to a single computer.]

Figure 2.1: Distributed Computing and Parallel Computing
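To make the distinction concrete, here is a minimal Python sketch of parallel computing on a single machine; the data size and the four-way split are illustrative assumptions. The standard multiprocessing module divides a parent task across several processors and then combines the results, in the divide and conquer spirit described above, whereas a distributed setup would instead pass messages between separate machines.

    # A minimal sketch of parallel computing on one machine:
    # worker processes share the job of summing squares.
    from multiprocessing import Pool

    def sum_of_squares(chunk):
        # CPU-bound work performed independently by each worker
        return sum(n * n for n in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Divide the parent task into four modules
        chunks = [data[i::4] for i in range(4)]
        with Pool(processes=4) as pool:
            partial = pool.map(sum_of_squares, chunks)  # run in parallel
        print(sum(partial))  # combine the results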

Organisations use both parallel and distributed computing techniques to process Big Data. The most important constraint for businesses today is time. If there were no restriction on time, every organisation would simply hire outside (or third-party) sources to perform analysis of its complex data. The direct benefit of adopting this method is that the organisation would not require any resources and data sources of its own to process and analyse complex data. These third parties are usually agencies specialised in the field of data manipulation, processing and analysis. Apart from being effective, hiring third-party agencies also reduces the storage and processing costs of handling large amounts of data.
DIFFERENCE BETWEEN DISTRIBUTED AND PARALLEL COMPUTING SYSTEMS

Table 2.1 differentiates between distributed and parallel computing systems:

TABLE 2.1: DIFFERENCE BETWEEN DISTRIBUTED AND PARALLEL COMPUTING SYSTEMS

Distributed Computing System:
□ An independent, autonomous system connected to a network for accomplishing specific tasks
□ Coordination is possible between connected computers that have their own memory and CPU
□ Loose coupling of computers connected in a network, providing access to data and remotely located resources

Parallel Computing System:
□ A computer system with several processing units attached to it
□ A common shared memory can be directly accessed by every processing unit in the network
□ Tight coupling of processing resources that are used for solving a single, complex problem

Besides these computing models, a commonly occurring model that lies somewhere between the two is called the concurrent computing model. Concurrency in a system is simply the operation of multiple threads that execute on single or multiple processors; it refers to the sharing of multiple resources in real time.

Distributed computing is considered a subset of parallel computing, which in turn is a subset of concurrent computing.
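As a small illustration of concurrency, the Python sketch below (the worker and iteration counts are arbitrary assumptions) runs many threads that share one resource, a counter, in real time, with a lock coordinating the sharing:

    # Concurrency: several threads share a single resource (a counter).
    import threading
    from concurrent.futures import ThreadPoolExecutor

    counter = 0
    lock = threading.Lock()

    def record_event(_):
        global counter
        with lock:  # coordinate real-time sharing of the resource
            counter += 1

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(record_event, range(10_000)))

    print(counter)  # 10000: many threads, one shared resource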

SELF ASSESSMENT QUESTIONS

1. _______ works on the rules of the divide and conquer approach, performing modules of parent tasks on multiple machines and then combining the results.
2. _______ refers to sharing of multiple sources in real-time.
3. Parallel computing is a close-coupled system that is used in solving similar-sized problems in the same time with high precision. (True/False)

ACTIVITY

Supercomputers are multi-threaded, multi-processor and multi-core systems. What kind of systems are they - parallel, distributed, concurrent or some hybrid of these models? Explain.
2.3 INTRODUCTION TO BIG DATA TECHNOLOGIES
A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs described in the previous chapter, namely: Volume, Velocity, Variety, Veracity, Variability, Value and Visualisation. A system that complies with these properties, and happens to be robust enough to withstand unexpected events and scalable enough to accommodate future methodologies, qualifies to be called a Big Data system.

A typical Big Data system consists of a setup that adheres to these seven Vs and provides an infrastructure that can withstand the influx of huge datasets at high velocity, meanwhile providing an effective mechanism to process the datasets by cleansing, shaping, filtering and sorting them into meaningful information, aimed at making the data both user- and machine-friendly. Beneath the complex system of architecture, sophisticated hardware and methodologies working in conjunction with each other lie the interfaces responsible for communicating with the hardware and user simultaneously - the programmable applications or tools that are the prime drivers of the efficiency of a typical Big Data system setup. A few such contemporary interface development programs are described in the next few sections along with their applications.

2.3.1 HADOOP

Hadoop is an open-source platform that provides the analytical technologies and computational power required to work with such large volumes of data.

Earlier, distributed environments were used to process high volumes of data. However, multiple nodes in such an environment may not always cooperate with each other through a communication system, leaving a lot of scope for errors. The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently.

A Hadoop cluster consists of a single master node and multiple worker nodes. The master node contains a NameNode and a JobTracker, while a slave or worker node acts as both a DataNode and a TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version. The standard start-up and shutdown scripts require Secure Shell to be set up between nodes in the cluster. In a larger cluster, the Hadoop Distributed File System (HDFS) is managed through a NameNode server that hosts the file system index, and a secondary NameNode that keeps snapshots of the NameNode; at the time of a NameNode failure, the secondary NameNode replaces the primary NameNode, thus preventing the file system from getting corrupted and reducing data loss. Figure 2.2 shows the Hadoop multinode cluster architecture:

[Figure 2.2 depicts an HDFS client, the NameNode (namespace and metadata operations), the secondary NameNode (namespace backup) and DataNodes exchanging heartbeat, balancing and replication messages, with nodes writing to local disk.]
Figure 2.2: Hadoop Multinode Cluster Architecture

The secondary NameNode takes snapshots of the primary NameNode's directory information at regular intervals of time, which are saved in local or remote directories. These checkpoint images can be used to restart a failed primary NameNode without replaying the entire journal of file-system actions and editing the log to create an up-to-date directory structure. The NameNode is the single point for the storage and management of metadata. To process the data, the JobTracker assigns tasks to the TaskTracker. Let us assume that a DataNode in the cluster goes down while processing is going on; the NameNode should know that some DataNode is down in the cluster, otherwise it cannot continue processing. Each DataNode sends a "Heart Beat Signal" to the NameNode after every few minutes (as per the default time set) to make the NameNode aware of the active/inactive status of DataNodes. This system is called the Heartbeat mechanism.
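The following toy Python simulation (not Hadoop code; the ten-second timeout is an illustrative assumption, not Hadoop's actual default) captures the essence of the Heartbeat mechanism: a NameNode-like registry treats a DataNode as active only while its last heartbeat is recent enough.

    # Toy simulation of the Heartbeat mechanism described above.
    import time

    HEARTBEAT_TIMEOUT = 10.0  # assumed value; Hadoop's default differs

    last_heartbeat = {}  # DataNode id -> time of last heartbeat

    def receive_heartbeat(node_id):
        last_heartbeat[node_id] = time.time()

    def active_nodes():
        now = time.time()
        return [n for n, t in last_heartbeat.items()
                if now - t <= HEARTBEAT_TIMEOUT]

    receive_heartbeat("datanode-1")
    receive_heartbeat("datanode-2")
    print(active_nodes())  # both nodes are currently active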

There are two main components of Apache Hadoop - the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. Both are open-source projects; HDFS is used for storage and MapReduce is used for processing.

HDFS is the fault-tolerant storage system in Hadoop. It stores large files, from terabytes to petabytes, across different terminals and attains reliability by replicating the data over multiple hosts. The default replication value is 3: data is replicated on three nodes, two on the same rack and one on a different rack. A file in HDFS is split into large blocks of 64 MB by default (typically 64 to 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The NameNode actively monitors the number of replicas of a block (by default 3). When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
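A quick worked example in Python shows what these defaults imply for a hypothetical 1 GB file:

    # Worked example: blocks and raw storage for a 1 GB file
    # with the default 64 MB block size and replication factor 3.
    import math

    file_size_mb = 1024   # illustrative file size
    block_size_mb = 64    # HDFS default block size
    replication = 3       # HDFS default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication

    print(blocks)          # 16 blocks, each replicated on 3 DataNodes
    print(raw_storage_mb)  # 3072 MB of raw cluster storage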

Figure 2.3 shows the typical HDFS architecture:

[Figure 2.3 depicts the HDFS architecture: a NameNode holding metadata (file name and replica count, e.g. /home/foo/data, 3), clients performing metadata operations, and DataNodes across Rack 1 and Rack 2 holding replicated blocks.]
Figure 2.3: HDFS Architecture


Source: https://hadoop.apache.org

MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed or standalone architecture, producing results in a useful aggregated form. MapReduce consists of several components; a few important ones are mentioned here:
□ JobTracker: It is the master that looks over the execution of a MapReduce job. It acts as a medium between the application and Hadoop.
□ TaskTracker: It manages individual task execution on each of the slave nodes.
□ JobHistoryServer: It tracks completed jobs.

We can write MapReduce programs in several languages, such as C, C++, Java, Ruby, Perl and Python.
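As an illustration, a classic word count can be written in Python and run through the Hadoop Streaming utility, which pipes data through a mapper and a reducer via standard input and output. The sketch below is a minimal version under the assumption that the two parts are saved as mapper.py and reducer.py (hypothetical file names):

    # mapper.py - emits "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sums counts per word; Hadoop Streaming sorts the
    # mapper output, so lines with the same key arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")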

The following are some important features of Hadoop:
□ Hadoop performs well with several nodes without requiring shared memory or disks among them. Hence, efficiency-related issues in the context of storage and access to data get automatically solved.
□ Hadoop follows the client-server architecture, in which the server works as a master and is responsible for data distribution among clients, which are commodity machines and work as slaves to carry out all computational tasks. The master node also performs the tasks of job control, disk management and work allocation.
□ The data stored across various nodes can be tracked through the Hadoop NameNode. It helps in accessing and retrieving data as and when required.
□ Hadoop improves data processing by running computing tasks on all available processors working in parallel. The performance of Hadoop remains up to the mark both in the case of complex computational questions and in that of large and varied data.
□ Hadoop keeps multiple copies of data (data replicas) to improve resilience, which helps in maintaining consistency, especially in the case of server failure. Usually, three copies of data are maintained, so the usual replication factor in Hadoop is 3. Hadoop also manages hardware failure and smoothens data handling.
The following inbuilt components of Hadoop make it a great platform for performing operations related to larger datasets:
□ Hive: A data warehouse tool created by Facebook on top of Hadoop that converts a query language into MapReduce jobs. It deals with the storage, analysis and querying of large sets of data. An HQL (Hive Query Language) statement, similar to an SQL statement, is used as the query language in Hive (a short query sketch appears after this list).
□ HBase: HBase is a Hadoop application running atop HDFS. It represents sets of relations or tables, but is a column-oriented DBMS, different from conventional row-oriented DBMSs. The conventional databases we usually know are relational database systems, but HBase is not a relational database, nor does it support a query language such as SQL.
□ Pig: Pig is a high-level modular programming tool developed by Yahoo in 2006 for streamlining huge datasets with the use of Hadoop and MapReduce. Pig comprises two components - Pig Latin, the programming language, and the runtime environment where programs are executed, similar to the Java environment.

2.3.2 PYTHON

Python is a popular interpreted, general-purpose, high-level dynamic programming language that aims to improve code readability and the overall ease of use, expressing logic in fewer statements than competing languages such as C++ or Java.
The most acknowledged fact in favour of Python as a language is that it is widely used by developers, analysts and even finance/statistics executives, and by people of all intellectual levels, without getting too syntax-heavy. It retains its simplistic character, with semantics that are not too verbose, and yet it turns out to be one of the most flexible, powerful languages, with plenty of data libraries for data analysis and manipulation. It has the unique distinction of being a well-crafted programming language that is also easy to use for quantitative and analytical computing. Anyone with the slightest prior programming experience can settle down with Python faster than with any other language. This makes it a great choice for many companies that always try to find the best value for their time investment.

Python has been instrumental in building enormously flexible Web applications such as YouTube and has almost single-handedly driven the internal infrastructure of the search giant Google. Numerous corporations like Disney and Sony trust the reliability of Python to manage colossal groups of graphics servers that compile the imagery for chartbuster movies. Python consistently ranks higher than JavaScript, Ruby and Perl in popularity ratings.

Just like the Hadoop ecosystem, Python has a custom implementation of Apache's Spark framework, which is used to handle, manage and analyse large chunks of datasets. Apache Spark is a fast, large-scale data processing framework that can be customised according to the platform it is implemented upon.

However, a key point to note here is that Python is not itself a Big Data system; it is used for implementing multiple things, like machine learning, data processing, visualisation and so on, with the help of the multiple frameworks available for specific tasks. Libraries such as Pydoop and SciPy available with Python make it genuinely easier for an analyst to evaluate and manage datasets.

Python can be used for creating Hadoop MapReduce programs and applications that access the Hadoop HDFS API with the Pydoop package. The Pydoop package offers a MapReduce and HDFS-compatible Python API, letting you connect to an existing HDFS installation, read and write files, and get information on files, directories or global file systems. The MapReduce API also helps you solve many complex problems with nominal programming effort. Advanced MapReduce concepts like Record Readers and Counters can also be implemented using Pydoop.
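A brief, hedged sketch of the Pydoop HDFS API mentioned above (the paths are hypothetical, and a reachable HDFS installation is assumed):

    # Listing a directory and reading a file straight out of HDFS
    # with Pydoop (paths are hypothetical).
    import pydoop.hdfs as hdfs

    print(hdfs.ls("/user/analyst"))
    with hdfs.open("/user/analyst/logs.txt", "rt") as f:
        for line in f:
            print(line.strip())  # stand-in for a real analysis step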

Python provides effective provisions for tackling Big Data problems, some of which are listed as follows:
□ NumPy: Its memory-mapped arrays allow you to access a file saved on disk as if it were an array. Only those array parts you need or are working with are loaded into memory (a short sketch follows this list).
□ PyTables and h5py: Libraries that provide access to HDF5 files, which allow access to just a specific part of the data. Further, many manipulations and mathematical operations on the data can be done without formally loading it into a Python data structure, thanks to the underlying libraries. They also allow lossless and seamless compression.
□ pandas: It allows high-level access to different types of data, such as CSV files, HDF5 data, databases or websites. It offers wrappers around HDF5 store files for Big Data, making it easy to carry out closer scrutiny of big datasets.
□ mpi4py: A tool for executing Python code in a distributed manner across numerous processors or even computers, allowing you to work in a modular fashion on parts of your data concurrently.
□ Blaze: A tool specifically meant to cater to Big Data. Basically, it is a wrapper built around the libraries described above, providing a constant and steady interface to many huge data storage spaces (such as databases or HDFS) and applications, to make it easier to mathematically operate on or manipulate the data, or simply to analyse data that is otherwise too big for memory.
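The sketch below makes the first two points concrete (the file names, array shape and column name are illustrative assumptions): a memory-mapped NumPy array pages in only the slice you touch, and pandas streams a large CSV in chunks instead of loading it whole.

    # Out-of-core access with NumPy and pandas (names are illustrative).
    import numpy as np
    import pandas as pd

    # A memory-mapped array: only the parts you touch are loaded
    big = np.memmap("big_array.dat", dtype="float64",
                    mode="w+", shape=(1_000_000,))
    big[:10] = np.arange(10)
    print(big[:10].sum())

    # Stream a large CSV through pandas without loading it at once
    total = 0
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        total += chunk["amount"].sum()  # hypothetical column name
    print(total)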
There are some limitations to Python in the context of a Big Data implementation. In benchmarking performance, Python fares worse than or at best equal to Java. It is not slow by any measure, but there remains a lot of optimisation to be done. Let us move to another statistical and analytical language called R, study it and summarise the differences between the two languages.

2.3.3 R

R is an open-source programming language and an application environment for statistical computing with graphics, developed by the R Foundation for Statistical Computing. It is an interpreted language like Python and uses a command-line interpreter. It supports procedural programming as well as object-oriented programming with generic functions.
R is extensively used by data miners and statisticians, providing a vast variety of graphical and statistical techniques, with linear and nonlinear modelling, time-series analysis, classical statistical tests, clustering, classification and others. R is easily extendable and implementable through functions and available extensions. Another area of strength where R scores over its competitors is static graphics representation, which can produce publication-quality graphs.
R is an extremely powerful statistical and visualisation analysis tool that is used in Big Data for the following purposes:
□ Visualisation (charts, graphs, etc.): Using ggplot2 packages and/or some inbuilt functions, e.g. plot()
□ Data cleansing: Polishing the data to draw out useful information
□ Cluster/parallel computation: Using Apache Spark (SparkR)
F.UNDAMENTAI::S :

N O T E S

When tackling Big Data with a language like R, there has to be a strategy and a streamlined process to follow. R is a statistical analyst's dream language that can visually enthral the best of data miners, though it may not appeal to coders who prefer outputs as raw as they can get. Here are a few key things one should take care of when dealing with R:
□ Sampling: A dataset that is too big to be analysed as a whole bunch can be sampled down to reduce its size. Now, the problem with sampling down is that the performance of a model can be affected significantly, because in a typical Big Data setup, plenty of data is always preferred over fewer sets of scattered data. However, according to many experts, a sample-based model is fine until the number of data records goes beyond the one-billion threshold. So, if sampling can be avoided to bypass uncalled-for complexities, another Big Data approach is recommended. However, in situations where sampling is a must, it can still lead to substantial models, especially if the sample is:
♦ Still big in total numbers
♦ Not too small proportionally to the size of the entire dataset, and not biased
□ Bigger hardware: Since R retains and keeps all objects in dynamic memory, this can pose a serious problem if the dataset gets exponentially larger. However, given current memory costs, it is easier to upgrade memory, and the current version of R can address up to 8 TB of RAM (on 64-bit machines) of data ready to be traversed or extracted, which is by no means sluggish for even the most demanding of programs.
□ Storing objects on hard disk: As a substitute, there are a few packages that ensure that objects are stored on hard disk and analysed in chunks. Though processing data in chunks leads to parallelisation as a side effect, this is not much of a problem if the algorithms are capable of parallel analysis of data chunks. However, only those algorithms that are designed to analyse data chunks within the R system are supported; any external concoction interfaced from a different platform might result in an error.
□ Integration with other programming languages like Java or C++: The integration of programming languages gives R the great advantage of multi-platform compatibility while remaining performance-oriented. Small modules of a program are moved to another language's (like Java's or C++'s) compilation and execution environments to avoid bottlenecks and expensive performance procedures. The goal of this feature is to balance R's elegant way of dealing with data efficiently on one hand and, at the same time, take advantage of the performance of other advanced programming languages on the other, thus getting the best of both.
N O T E S

The rJava package connecting R and Java is an example that facilitates the above-mentioned operation. Similarly, Rcpp is an example of the integration between R and C++. It is easy to outsource code from R to C++ using Rcpp; a simple understanding of C++ syntax is enough to utilise it.

DIFFERENCE BETWEEN R AND PYTHON FOR BIG DATA

Both Python and R are popular and widely used programming languages for statistics. While R's functionality and usability are built with statisticians at their crux, given its strong data visualisation and charting prowess, Python is considered easier to comprehend, both by machines and by users, due to its simpler syntax.

In this section, we will study some differences between R and Python, and how they both co-exist successfully in the statistics world and in data science.

R and Python: The General Numbers

On the Internet, you can find good comparisons between the languages, adoption numbers and popularity charts of R and Python. While these figures are a good indicator of how the two languages have evolved so far, and still are evolving, in the computer science ecosystem, it is always tough to compare them side by side. The primary reason is that R is found only in data-science-related, statistics-heavy, number-crunching environments, whereas Python is a dynamic language with a wide variety of applications in many fields, such as Web and software development.

When and how to use R?

R is used primarily when a standalone computing task is required for data analysis, or for individual servers. It is a great mode of examining data and figuring out patterns with great use of visualisations. It is always ready for any type of data analysis because of readily available tests that keep you equipped with the necessary tools to get up and running in a shorter time.

When and how to use Python?

When your data analysis errands need to be combined with Web-based apps, or if statistics-heavy code needs to be merged with a production database, Python is a no-brainer. Being a full-fledged dynamic programming language, it is a great tool for implementing algorithms for production use.
N O T E S

Table 2.2 lists the pros and cons of using R and Python for Big Data:

TABLE 2.2: PROS AND CONS OF USING R AND PYTHON FOR BIG DATA

Pros of R:
□ Visualised data is often easier to understand than unreadable numbers lying randomly atop each other, and R effectively utilises visualisation as a major plus point over all other available options.
□ R has a wealthy repository of front-line packages and great community support. All R packages are virtually available in the R documentation.
□ R is meant for statisticians. They can communicate their ideas through R code and packages without necessarily needing a computing background to start. It is also highly adaptive outside its own applicability zone.

Cons of R:
□ Although efficient, visualisation and other related processes can take a toll on computer performance, and R can turn out to be a slow performer due to poorly written and poorly optimised code, though with the help of packages like renjin, pqR and FastR, performance can be improved considerably.
□ R's learning curve is a critical aspect, especially if you are coming from a GUI-based environment for your statistical analysis.

Pros of Python:
□ Ease of use - the in-built IPython notebook makes it easier to work, since you can easily share your notebook with a co-worker or peer without them needing to install anything. This reduces the extra effort of organising code, output and note files considerably.
□ Python is a general-purpose programming language that is easy and intuitive. It comes with almost no learning curve for those having prior programming experience, and it increases the speed at which you can create a program: you need less time to code, leaving more time to test it.
□ The Python testing framework is in-built; it promotes test coverage and guarantees that your code is reusable and dependable.

Cons of Python:
□ Visualisation is an important criterion in ideal data analysis software. While Python has good visualisation libraries, such as Bokeh, Seaborn and Pygal, its visualisations are, compared to R's, nowhere close in terms of comprehensibility and ease on the eye.
□ Python still needs to come up with alternatives to several important R packages.
SELF ASSESSMENT QUESTIONS

4. Pig was developed by Facebook in 2006. (True/False)


5. Which of the following manages hardware failure and
smoothens data handling?
a. Pig b. Hadoop
c. R d. Python

ACTIVITY
Try to find out alternatives to R in Python that resonate equally well with the Hadoop HDFS architecture.

2.4 CLOUD COMPUTING AND BIG DATA


One of the vital issues that organisations face with the storage and management of Big Data is the huge investment required to get the necessary hardware setup and software packages. Some of these resources may be overutilised or underutilised as requirements vary over time. We can overcome these challenges by providing a set of computing resources that can be shared through cloud computing. These shared resources comprise applications, storage solutions, computational units, networking solutions, development and deployment platforms, business processes, etc. The cloud computing environment saves infrastructure-related costs in an organisation by providing a framework that can be optimised and expanded horizontally. In order to operate in the real world, cloud implementation requires common standardised processes and their automation.

Figure 2.4 shows the cloud computing model:

[Figure 2.4 depicts laptops and mobile devices or PDAs connecting to a cloud provider offering SaaS, PaaS and IaaS.]
Figure 2.4: Cloud Computing Model


In cloud-based platforms, applications can easily obtain resources to perform computing tasks, and the costs of these resources are paid according to which resources are acquired and how much they are used. In cloud computing, this feature of acquiring resources in line with requirements and paying accordingly is known as elasticity. Cloud computing makes it possible for organisations to dynamically regulate the use of computing resources and access them as per need, while paying only for the resources that are used. This facility of dynamic resource use provides flexibility; however, an organisation needs to plan, monitor and control its resource utilisation carefully. Careless resource monitoring and control can result in unexpectedly high costs.
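A back-of-the-envelope Python sketch of this pay-per-use arithmetic (the hourly rate and usage figures are purely illustrative assumptions, not any provider's actual prices):

    # Illustrative pay-as-you-use cost estimate (made-up rates).
    rate_per_node_hour = 0.10  # assumed price of one compute node
    nodes, hours = 20, 6       # extra capacity hired for a peak period

    burst_cost = rate_per_node_hour * nodes * hours
    print(f"Cost of the elastic burst: ${burst_cost:.2f}")  # $12.00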

A cloud computing technique uses data centres to collect data and ensures that data backup and recovery are automatically performed to cater to the requirements of businesses. Both cloud computing and Big Data analytics use the distributed computing model in a similar manner and are, hence, complementary to each other.

FEATURES OF CLOUD COMPUTING

The following are some features of cloud computing that can be used to handle Big Data:
□ Scalability: Scalability means the addition of new resources to an existing infrastructure. An increase in the amount of data being collected and analysed requires organisations to improve the processing ability of their hardware components. These organisations may, at times, need to replace existing hardware with a new set of hardware components in order to improve data management and processing activities. New hardware may not provide complete support to software that used to run properly on the earlier set of hardware. We can solve such issues by using cloud services that employ the distributed computing technique to provide scalability to the architecture.
□ Elasticity: Elasticity in the cloud means hiring certain resources as and when required, and paying for the resources that have been used. No extra payment is required for acquiring specific cloud services. For example, a business expecting to use more data during an in-store promotion could hire more resources to provide high processing power. Moreover, a cloud does not require customers to declare their resource requirements in advance.
□ Resource pooling: Resource pooling is an important aspect of cloud services for Big Data analytics. In resource pooling, multiple organisations that use similar kinds of resources to carry out computing practices have no need to individually hire all the resources. The sharing of resources is allowed in a cloud, which facilitates cost cutting through resource pooling.

□ Self-service: Cloud computing involves a simple user interface that helps customers directly access the cloud services they want. The process of selecting the needed services requires no intervention from human beings and can be completed automatically.
□ Low cost: Careful planning, use, management and control of resources help organisations reduce the cost of acquiring hardware significantly. Also, the cloud offers customised solutions, especially to organisations that cannot afford too much initial investment in purchasing the resources used for computation in Big Data analytics. The cloud provides them a pay-as-you-use option in which organisations need to sign up only for those resources that are essential. This also helps the cloud provider harness the benefits of economies of scale and pass a benefit on to customers in terms of cost reduction.
□ Fault tolerance: Cloud computing provides fault tolerance by offering uninterrupted services to customers, especially in cases of component failure; the responsibility of handling the workload is shifted to other components of the cloud.

CLOUD DEPLOYMENT MODELS

Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services are offered in the form of various deployment models. The following are the most commonly used cloud deployment models:
□ Public cloud (end-user level cloud): A cloud that is owned and managed by a company other than the one (either an individual user or a company) using it is known as a public cloud. In this cloud, there is no need for organisations (customers) to control or manage the resources; instead, the resources are administered by a third party. Some examples of public cloud providers are Savvis, Verizon, Amazon Web Services and Rackspace. You should understand that in the case of a public cloud, the resources are owned or hosted by the cloud service providers (a company), and the services are sold to other companies. Companies or individuals can obtain various services in a public cloud. The workload is categorised on the basis of service category, and therefore, in this cloud, hardware customisation is possible to provide optimised performance. The process of computing becomes flexible and scalable through customised hardware resources. For example, a cloud can be used specifically for video storage, so that videos can be streamed live on YouTube or Vimeo. You can also optimise such a cloud for handling large traffic volumes.
Businesses can obtain economical cloud storage solutions in a public cloud, which provides efficient mechanisms for complex data handling. The primary concerns with a public cloud include security and latency, which can be overlooked citing the benefits of this cloud.
Figure 2.5 demonstrates the use of a public cloud:

[Figure 2.5 depicts Companies X, Y and Z accessing cloud services (IaaS/PaaS/SaaS) from a shared public cloud.]
Figure 2.5: Level of Accessibility in a Public Cloud


□ Private cloud (enterprise-level cloud): A cloud that remains entirely in the ownership of the organisation using it is known as a private cloud. In other words, in this cloud, the cloud computing infrastructure is solely designed for a single organisation and cannot be accessed by other organisations. However, the organisation may allow this cloud to be used by its employees, partners and customers. The primary feature of a private cloud is that an organisation installs the cloud for its own requirements. These requirements are specific to the organisation, which plans and manages the resources and their use. A private cloud integrates all processes, systems, rules, policies, compliance checks, etc. of the organisation in one place. In a private cloud, you can automate several processes and operations that require manual handling in a public cloud. Moreover, you can also provide firewall protection to the cloud, thereby solving many latency and security concerns. A private cloud can be either on-premises or hosted externally. In the case of on-premises private clouds, the service is exclusively used and hosted by a single organisation. Private clouds that are hosted externally are also used by a single organisation and are not shared with other organisations; however, the cloud services are hosted by a third party that specialises in cloud infrastructure. Note that on-premises private clouds are costlier than externally hosted private clouds. In the case of a private cloud, security is kept in mind at every level of design. The general objective of a private cloud is not to sell cloud services (IaaS/PaaS/SaaS) to external organisations but to get the advantages of cloud architecture without giving up the privilege of managing your own data centre.

Figure 2.6 demonstrates the use of a private cloud:



[Figure 2.6 depicts a single organisation exclusively accessing cloud services (IaaS/PaaS/SaaS) from its private cloud.]

Figure 2.6: Level of Accessibility in a Private Cloud


□ Community cloud: A community cloud is a type of cloud that is shared among various organisations with a common tie. This type of cloud is generally managed by a third party offering the cloud service and can be made available on or off premises. To make the concept of the community cloud clear and to explain when community clouds can be designed, let us take an example. In any state or country, say England, a community cloud can be provided so that almost all government organisations of that state can share the resources available on the cloud. Because of the sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the government organisations.
Figure 2.7 shows the use of community clouds:

[Figure 2.7 depicts organisations having a common tie sharing resources through cloud services (IaaS/PaaS/SaaS).]
Figure 2.7: Level of Accessibility in Community Clouds


□ Hybrid cloud: The cloud environment in which various internal or external service providers offer services to many organisations is known as a hybrid cloud. Generally, it is observed that an organisation hosts applications that require a high level of security and are critical on the private cloud, while applications that are not so important or confidential are hosted on the public cloud. In hybrid clouds, an organisation can use both types of cloud, i.e. public and private, together. Such a cloud is generally used in situations such as cloud bursting, in which an organisation generally uses its own computing infrastructure but can access the public cloud for high-load requirements. In other words, the organisation using the hybrid cloud can manage an internal private cloud for general use and migrate the entire application, or a part of it, to the public cloud during peak periods.
Figure 2.8 shows a hybrid cloud:

[Figure 2.8 depicts an organisation using cloud services (IaaS/PaaS/SaaS) from a private cloud, with an application migrated to a public cloud.]
Figure 2.8: Implementation of a Hybrid Cloud

The cloud is a multipurpose platform that not only helps in handling Big Data analytics operations but also performs various tasks, including data storage, data backup and customer service. Nowadays, business operations are performed mostly by using laptops, tablets and mobile devices, which are suited for accessing cloud services, because most people today want to access computers even when on the move. In addition to this, many customers use the Internet for purchasing products or services. These online orders are taken from customers by product stores, which send instructions to the warehouse for delivering the product. The entire process of receiving orders, forwarding instructions to warehouses, handling payments and tracking deliveries can be assisted by the cloud, which is not essential but reduces the infrastructure cost and improves scalability in content storage.

CLOUD DELIVERY MODELS

The cloud environment provides computational resources in the form of hardware, software and platforms, which are deployed as services. Therefore, we can categorise these services in the following manner:
□ Infrastructure as a Service (IaaS): It is one of the categories of cloud computing services, which makes virtualised computing resources available over the Internet. It helps in avoiding the expense of buying and managing your own physical resources, as you can use any resource virtually over the Internet, paying rent for as long as you need it. All the responsibility lies with the cloud computing service provider, who manages the infrastructure, its installation and configuration, and the software purchased.
□ Platform as a Service (PaaS): It is built above IaaS and is the layer that interacts with users, allowing them to deploy and use applications created using the programming and run-time environment platforms supported by the provider. This is the stage where DBMSs related to Big Data are implemented.
□ Software as a Service (SaaS): SaaS is one of the most popular cloud-based models and comprises applications provided by the service provider.

EXHIBIT
Difference between SaaS, PaaS and IaaS

The cloud is a broad concept, and it covers just about every possible sort of online service, but when businesses refer to cloud procurement, there are usually three models of cloud service under consideration: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). Each has its own intricacies and hybrid cloud models, but today we're going to help you develop an understanding of the high-level differences between SaaS, PaaS and IaaS.

SOFTWARE AS A SERVICE

In some ways, SaaS is very similar to the old thin-client model of software provision, where clients, in this case usually Web browsers, provide the point of access to software running on servers. SaaS is the most familiar form of cloud service for consumers. SaaS moves the task of managing software and its deployment to third-party services. Among the most familiar SaaS applications for business are customer relationship management applications like Salesforce, productivity software suites like Google Apps, and storage solutions like Box and Dropbox.
The use of SaaS applications tends to reduce the cost of software ownership by removing the need for technical staff to install, manage and upgrade software, as well as by reducing the cost of licensing software. SaaS applications are usually provided on a subscription model.

PLATFORM AS A SERVICE

PaaS functions at a lower level than SaaS, typically providing a platform on which software can be developed and deployed. PaaS providers abstract away much of the work of dealing with servers and give clients an environment in which the operating system and server software, as well as the underlying server hardware and network infrastructure, are taken care of, leaving users free to focus on the business side of scalability and on the application development of their product or service. Businesses can requisition resources as they need them, scaling as demand grows, rather than investing in hardware with redundant resources. Examples of PaaS providers include Heroku, Google App Engine and Red Hat's OpenShift.

INFRASTRUCTURE AS A SERVICE

Moving down the stack, we get to the fundamental building blocks for cloud services. IaaS comprises highly automated and scalable compute resources, complemented by cloud storage and network capability, which can be self-provisioned, metered and available on demand.

IaaS providers offer these cloud servers and their associated resources via a dashboard and/or API. IaaS clients have direct access to their servers and storage, just as they would with traditional servers, but gain access to a much higher order of scalability. Users of IaaS can outsource and build a "virtual data centre" in the cloud and have access to many of the same technologies and resource capabilities of a traditional data centre without having to invest in capacity planning or its physical maintenance and management.

IaaS is the most flexible cloud computing model and allows for automated deployment of servers, processing power, storage and networking. IaaS clients have more control over their infrastructure than users of PaaS or SaaS services. The main uses of IaaS include the actual development and deployment of PaaS, SaaS and Web-scale applications.
Source: https://www.computenext.com/blog/when-to-use-saas-paas-and-iaas/

CLOUD PROVIDERS IN BIG DATA MARKET

Big Data cloud providers have been gearing up to bring the most advanced technologies at competitive prices to the market. Some providers are established, whereas some of them are relatively new to the
the


field of cloud services. Some of these providers render services that are relevant to Big Data analytics only. A few such providers are discussed as follows:
services that are relevant to Big Data analytics only. Some such
providers are discussed as follows:
□ Amazon: Amazon is one of the largest cloud service providers, and it offers its cloud services as Amazon Web Services (AWS). AWS includes some of the most popular cloud services, such as Elastic Compute Cloud (EC2), Elastic MapReduce, Simple Storage Service (S3), etc. Some of these services are discussed as follows:
♦ EC2: It is a Web service that employs a large set of computing resources to perform its business operations. These resources are not always fully utilised by Amazon, and therefore, they are pooled in the form of an IaaS cloud so that other organisations can take benefit of them, ultimately benefitting Amazon through the rental cost. Organisations can use these resources elastically, in a way that hiring of resources is possible on an hourly basis.
♦ Elastic MapReduce: It is a Web service that uses Amazon EC2 computation and Amazon S3 storage for storing and processing large amounts of data so that the cost of processing and storage is reduced significantly.
♦ DynamoDB: It is a NoSQL database system in which data storage is done on Solid State Drives (SSDs). DynamoDB allows data replication for high availability and durability.
♦ Amazon S3: Amazon Simple Storage Service (Amazon S3) is a Web interface that allows data storage over the Internet and makes Web-scale computing possible (a short usage sketch follows at the end of this list).
♦ High Performance Computing (HPC): It is a network that is replete with high bandwidth, low latency and high computational abilities, which are required for processing Big Data, especially for solving issues related to the education and business domains.
♦ RedShift: It is a data warehouse service that is used to analyse data with the help of existing business intelligence tools in an economical manner. You can scale Amazon RedShift to handle data up to a petabyte.
□ Google: Cloud services that are provided by Google for handling Big Data include the following:
♦ Google Compute Engine is a secure and flexible computing environment based on virtual machines.
♦ Google BigQuery is a Data as a Service (DaaS) offering, which is used for searching huge amounts of data at a faster pace on the basis of SQL-format queries.
♦ Google Prediction API is used for identifying patterns in data, storing patterns and improving the patterns with successive utilisation.
□ Windows Azure: Microsoft offers a PaaS cloud that is based on Windows and SQL abstractions and consists of a set of development tools, virtual machine support, management and media services and mobile device services. Windows Azure PaaS is easy to adopt for people who are well versed with the operations of .NET, SQL Server and Windows. In addition, the Windows Azure HDInsight option added to the PaaS cloud makes it possible for cloud users to address emerging requirements for integrating Big Data into Windows Azure solutions.
The platform used for building the Windows Azure PaaS is the Hortonworks Data Platform (HDP), which, as stated by Microsoft, is fully compatible with Apache Hadoop. Moreover, Microsoft Excel and various other Business Intelligence (BI) tools can be connected to Windows Azure with support from HDInsight, which can be deployed on Windows Server as well.
Hadoop is used as a cloud service in Windows Azure PaaS with the help of HDInsight. HDFS and MapReduce related frameworks are thus offered economically, and in a simpler way, by the integration of Hadoop in this PaaS. The efficient management and storage of data are important features of HDInsight, which also uses the Sqoop connector for importing Windows Azure SQL data into HDFS or exporting data from HDFS to a Windows Azure SQL database.

SELF ASSESSMENT QUESTIONS


6. The cloud environment in which various internal or external service providers offer services to many organisations is known as a _________.
a. private cloud
b. public cloud
c. hybrid cloud
d. community cloud
7. The SaaS model of cloud service allows its users to deploy
and use applications on run-time environment platforms,
which are provided on the Internet and supported by the
provider. (True/False)

ACTIVITY
Search the Internet for the names of companies that make the benefits of the different cloud computing services (IaaS, PaaS or SaaS) available to their users.
2.5 IN-MEMORY TECHNOLOGY FOR BIG DATA


Nowadays, there are systems that require data availability to be faster than anything before. Imagine a future real-time stock bidding platform where corporations bid for stocks of their choice in lots. Even if we consider the bidding for a penny stock being auctioned in a lot of two million stocks, a single fluctuation of a few pennies can turn profit into loss and derail the deal. Now imagine the same for blue-chip shares. This is one such example where the utopian requirement of Big Data to be in an always-ready, standby mode and to serve back the data in the quickest possible time is already being catered to by many corporations well ahead in time.

Twitch is a social media gaming platform community that serves 100 million members, supporting over 3 million concurrent visitors watching and chatting about games from over 2 million broadcasters, where the occupancy of a single chat room often goes beyond 500,000. Besides, it also offers target-based advertising, a potential revenue driver, based on chat history. This is one such example where hardware obstructions, limitations and memory lag have to be sidelined and streamlined with something faster, like a cache memory or dynamic access memory, so that the data is readily available for disposal. To deliver such services and capabilities, businesses require the skill to integrate abrupt real-time dynamics with historical breakdown and evaluation of the information. This combination provides direction and context for taking real-time decisions. The in-memory Big Data computing tool supports the processing of high-velocity data in real time and also faster processing of stationary data. Technologies like event streaming platforms, in-memory databases and analytics and high-level messaging structures are witnessing massive growth that resonates with organisational needs.

Cost variations for such setups have now reduced considerably. Figure 2.9 shows the cost of various storage technologies available for a sample 1GB of memory, along with their respective read/write performance:

[Chart comparing DRAM, NV-DIMM/PM, NVMe SSD and SATA SSD on 1GB cost, read latency and write latency]

Figure 2.9: Showing the Cost of Various Storage Technologies


Source: http://flarrio.com/in-memory-big-data-real-time-decisions-technology-2016/
It takes about $9 for 1GB of RAM, $0.40 for SSDs and $1 for PCI-compatible memory cards. The choice of a specific memory technology is subject to its raw performance figures for a real-time scenario rather than benchmarking figures for a given use case. As memory evolution goes on, new dynamic memory substitutes are shortening performance gaps by and large. Database-related technologies are adapting to this evolution, which has struck a goldmine for corporations by giving them the capability to fuse newer and older setups in tandem while delivering radical performance-to-cost ratios.
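As a small illustration of the in-memory idea, the following Python sketch uses Redis, a widely used in-memory key-value store, through the redis-py client. It assumes a Redis server is running locally on the default port; the keys shown are hypothetical.

import redis

# decode_responses=True returns strings instead of raw bytes.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Reads and writes go to RAM, so latencies are in microseconds rather than
# the milliseconds typical of disk-backed stores.
r.set("stock:ACME:last_bid", "12.47")
print(r.get("stock:ACME:last_bid"))

# Atomic counters suit high-velocity streams such as chat-room activity.
r.incr("chatroom:42:messages")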

SELF ASSESSMENT QUESTIONS

8. The _________ tool supports processing of high-velocity data in real time and also faster processing of stationary data.
9. Twitch is a social media gaming platform community.(True/
False)

ACTIVITY
Study the evolution of storage-based flash memories along with their counterpart dynamic memories and try to figure out common points where the difference between them will be the shortest in future, before either one of them takes another lead.

2.6 BIG DATA TECHNIQUES

There are many techniques available to analyse datasets. In this section, we will study some of the techniques that are used to tackle datasets and bring them to a conclusive end. However, this list is not exhaustive, since newer methodologies and techniques keep evolving from time to time.

2.6.1 MASSIVE PARALLELISM

According to the simplest definition available, a parallel system is a system where multiple processors are involved and associated to carry out concurrent computations. Since operations go side by side, the parallelism occurs in the processes, and hence the technique is called parallel computing. Massive parallelism refers to a parallel system where multiple systems interconnected with each other pose as a single mighty conjoint processor and carry out tasks received from the datasets parallelly. However, things don't end here. In terms of Big Data dynamics, the systems can be conjoint not only in processors but also in memory, hardware and even network, scaling up operational efficiency and posing as a massive system that can eat humongous datasets parallelly without breaking a sweat. But this is where the complacency of a hardware owner may produce an error-prone system.

Let's say an organisation can afford a hypothetical 1TB of RAM in a single system. While the system will certainly be more efficient and faster in operations than anything else, the framework or the driving force of that hardware may not be efficient enough to utilise the full potential of those terabytes of dynamic memory, half of which might lie wasted and underutilised. Further factors that can affect a typical setup are many: incompatible processors, latencies, MOSFET-based errors, storage lag, delays in processing and other hardware-related flaws. On the application side, the software may not be properly optimised for concurrent usage or may break down under multiple simultaneous accesses. These are bottlenecks for parallelism which, if duly looked after, can make existing systems work well and spare the need to upgrade to expensive hardware. While hardware specs are crucial, the selection of the application driving the interface is equally important for such a large-scale methodology.
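The following is a minimal Python sketch of this idea, using the standard multiprocessing module: a dataset is split into chunks and several workers process the chunks concurrently before the partial results are combined. The word-counting task is only illustrative.

from multiprocessing import Pool

def count_words(chunk):
    # Each worker handles its own slice of the dataset independently.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data needs big machines"] * 100_000
    chunks = [lines[i::4] for i in range(4)]   # four roughly equal slices

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)

    print(sum(partial_counts))                 # combine the partial results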

2.6.2 DATA DISTRIBUTION

Distribution of data is a highly critical step in a typical Big Data setup. The approaches to data distribution in a Big Data system are described as follows:
□ Centralised approach: A central repository is used to store and download the essential dataset by virtual machines. In the starting script, all virtual machines connect to the central repository and get the required data. A limitation of such an approach is that if multiple transfers are requested in parallel, the server will drop connections due to numerous virtual machines seeking blocks of data, leading to a flash crowd effect.
□ Semi-centralised approach: Given the flash crowd effect in the earlier approach, the semi-centralised approach reduces the stress on the networking infrastructure. It shares the dataset across multiple machines in the data centre at different times. The limitation of such approaches is that when datasets change, they may grow beyond their predefined size, making it difficult to foresee the changes and expect the outcome.
□ Hierarchical approach: If datasets keep adding new data to themselves, the semi-centralised approach becomes hard to track and maintain. In a hierarchical approach, the data is fetched from the parent node, i.e. the virtual machine above in the hierarchy. But this consequently leads us back to the bottleneck of the first approach, and it cannot offer failure-resistance during the transfer: if one virtual machine gets stuck, the deployments of all the VMs fail after the transfers have been initiated.
□ P2P approach: P2P streaming connections are based on hierarchical multi-trees. Each system acts as both a client and a server, and to access virtual machines, the data centre environment offers low-latency, firewall- and NAT-free, unmonitored ISP traffic to deliver a P2P delivery of datasets for the Big Data.
These approaches deal with design challenges for flexible data-heavy systems, which stem from the issues described ahead. First, a highly-distributed system automatically paves the way for high availability and scalability. Data distribution occurs at all levels, from web/cloud server farms to caches to storage at the backend. Second, the single-system-image abstraction, with consistent reads and transactional writes using query languages, is difficult to achieve at the given scale. Applications need to be aware of the data replicas, handle inconsistencies arising from conflicting replica updates, and continue operations even in the event of network, processor or software failures. Third, each Big Data database application, like NoSQL, comes with a set of compromises on quality, especially in terms of scalability, performance, consistency and durability. Solution architects must meticulously evaluate and select the databases that fulfil the application's requirements. This situation often ends up in polyglot persistence, where multiple database technologies are used side by side to store multiple datasets in a single system.
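As a simple illustration of the distribution idea underlying all of these approaches, the following Python sketch routes records to storage nodes by hashing their keys, so no single server has to hold or stream the whole dataset. The node names and record keys are hypothetical.

import hashlib

NODES = ["node-a", "node-b", "node-c"]

def route(key: str) -> str:
    # Hash the key and map the digest onto one of the available nodes.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

for record_key in ["user:1001", "user:1002", "order:77"]:
    print(record_key, "->", route(record_key))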

2.6.3 HIGH-PERFORMANCE COMPUTING

High-performance computing is the simultaneous use of supercomputers and parallel processing techniques for solving intricate computation problems. It emphasises making parallel processing systems and algorithms by joining both parallel and administrative computational methods. The words 'supercomputing' and 'high-performance computing' are often used interchangeably.

High-performance computing is used for performing research activities and cracking advanced problems through computer simulation, modelling and analysis. Sometimes, such computing prowess is also used for special observations, satellite imagery and weather analytics through the concurrency of computing resources. For Big Data, where large datasets are required to be broken down into chunks and then evaluated into meaningful data, high-performance computing comes as an excellent partner to be coupled with.
Hadoop enjoys some excellent libraries, and with the load-sharing capability of MapReduce, a typical Big Data system can use large files and, for analytics processing, can perform huge block-wise sequential read operations. Utilising a parallel file system such as Lustre, a massively parallel, open-source file system long backed by Intel and designed for large-scale data and high-performance computing, comes in handy in such systems. The bandwidth of such a file system often exceeds 700GB/s, with premium users getting up to 1.9TB/s; Lustre easily scales up to thousands of clients and a few hundred petabytes of storage.
Besides that, Hadoop utilises popular accelerators, such as Kepler GPUs. Just as these technologies assist significantly in calculating solutions, they also assist Big Data in the bioinformatics domain, as they do for sequencing and alignment.

2.6.4 TASK AND THREAD MANAGEMENT


Threads are an OS-level feature, with their own kernel and memory resources, that allow application logic to be segregated into multiple concurrent execution paths. They are a useful feature when complex applications have multiple tasks that need to be performed at the same time.
When an OS executes an application instance, it creates a process having an execution thread to manage the instance. This is just the programming instruction being performed by the code. You can suspend or resume a thread, but not a task; a task can only be killed or started. This is where it becomes a problem statement for data-intensive environments, and to deal with such concurrency-related issues in Big Data, we deal with two types of parallelism: task and data.
Task parallelism refers to the execution of computer programs across multiple processors on different or the same machines. It emphasises performing diverse operations in parallel to best utilise the accessible computing resources, like memory and processors. One example of such parallelism would be an application creating multiple threads for parallel processing, with every thread responsible for performing a dissimilar operation.
Data parallelism focuses on the effective distribution of datasets throughout multiple computation programs. The same parallel operations are executed on multiple computing processors on subsets of the distributed data.
This is often dealt with in normal programming languages under the syntax of synchronous and asynchronous programming techniques, which are similarly implemented in Hadoop with the use of Java.
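The following minimal Python sketch contrasts the two parallelisms with the standard concurrent.futures module: task parallelism runs two different operations concurrently, while data parallelism runs the same operation on subsets of the data.

from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 11))

def compute_sum(xs):
    return sum(xs)

def compute_max(xs):
    return max(xs)

with ThreadPoolExecutor() as pool:
    # Task parallelism: two different operations run concurrently on the data.
    total = pool.submit(compute_sum, data)
    peak = pool.submit(compute_max, data)

    # Data parallelism: the same operation runs on two halves of the data.
    halves = list(pool.map(compute_sum, [data[:5], data[5:]]))

print(total.result(), peak.result(), halves)   # 55 10 [15, 40]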

2.6.5 DATA MINING AND ANALYTICS


Data mining is a process of extracting data, evaluating it from multiple perspectives and then producing an information summary in a meaningful form that identifies one or more relationships within the dataset. Descriptive data mining gives information about the existing data and the patterns recorded within it, while predictive data mining makes predictions based on the occurrence of patterns within the dataset.
Data analysis is an experiential activity, where the data sourcing gives out some insight. By looking at the dataset of a premium system vs. a budget-bound system, you can well say that while the initial cost is higher in premium systems, operational faults and failures are less likely to happen than in budget systems.
Data analytics is about applying an algorithmic or logical process to derive insights from a given dataset. For example, looking at the past year's weather and pest data for the current month, we can determine that a particular type of fungus often grows when humidity levels reach a definite point.
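Continuing the fungus example, here is a minimal Python sketch using pandas; the figures are hypothetical. The summary is descriptive (it reports what the recorded data looks like), while the correlation hints at the humidity-fungus relationship a predictive model would exploit.

import pandas as pd

df = pd.DataFrame({
    "humidity":     [40, 55, 60, 72, 80, 85],
    "fungus_cases": [0, 1, 1, 3, 6, 9],
})

print(df.describe())                            # descriptive summary of the data
print(df["humidity"].corr(df["fungus_cases"]))  # strength of the relationship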

2.6.6 DATA RETRIEVAL


Big Data refers to the large amounts of multi-structural data that continuously flow around and within organisations, and includes text, video, transactional records and sensor logs. Big Data systems utilise Hadoop and the HDFS architecture to retrieve the data using MapReduce, a distributed processing framework.
It helps programmers in solving parallel data problems where the dataset can be divided into small chunks and handled autonomously. MapReduce is an important step, as it allows ordinary developers to utilise parallel programming concepts without worrying about cluster communication details, failure handling and task monitoring.
MapReduce simplifies all that by splitting the input dataset into multiple portions, each assigned a map task to process the data parallelly. Each map task takes an input (key, value) pair and creates a transformed (key, value) output.
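The (key, value) flow just described can be simulated in a few lines of plain Python. This is a minimal sketch of the MapReduce pattern, not actual Hadoop code: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group.

from collections import defaultdict

documents = ["big data big systems", "data retrieval systems"]

# Map phase: each input record becomes a list of (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'big': 2, 'data': 2, 'systems': 2, 'retrieval': 1}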
MapReduce uses the TaskTracker and JobTracker mechanisms for task scheduling and monitoring. HDFS keeps bulky data files by cutting them into blocks (64 or 128 MB) and replicating each block on three or more servers. MapReduce applications use APIs provided by HDFS to read and write data in parallel. Performance and capacity requirements can be met by adding DataNodes, while a single NameNode manages data location and monitors server availability.
In addition to MapReduce and HDFS, Apache Hadoop includes many other components, some of which are very useful for data retrieval and extraction:
□ Apache Flume: It is a distributed system for gathering, combining and moving huge amounts of data from various sources into HDFS.
□ Apache Sqoop: It is a tool for moving data between relational databases and Hadoop.
□ Apache Hive and Pig: These are programming languages that streamline application development while retaining the MapReduce framework.

2.6.7 MACHINE LEARNING


Machine learning formally focuses on the performance, theory and
properties of learning algorithms and systems. Machine learning is
considered to be an ideal research field for taking advantage of the
opportunities available in Big Data.
Machine learning delivers on the potential of mining value from huge and diverse data sources with less dependence on human instruction. It is data-driven, runs at machine scale and is well-suited to the complexity of dealing with disparate data sources and the enormous range of variables and quantities of data involved. And in contrast to conventional analysis, machine learning blooms on expanding datasets. The more data a machine learning system gets, the more it learns and applies the results to yield higher-quality insights.

Machine learning systems utilise multiple algorithms to discover and expose the patterns hidden in datasets. Most of them are gradient-based algorithms, which optimise problems of the form f(x) with search directions defined by the gradient of the function at the current point. The following are examples of such algorithms:
□ Logistic Regression
□ Linear Regression
□ Autoencoders
□ Neural Networks

Machine learning comprises a wide collection of algorithms, with some being more efficient than others, making it harder to select an efficient algorithm without knowing the dataset it will work on. For example, a linear regression can be solved iteratively (e.g., by gradient descent) or with the normal equations. The iterative process is much more efficient for datasets with more than about 10,000 variables, because beyond that the normal-equation solution becomes too hard to compute in reasonable time.
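As an illustration of a gradient-based algorithm, the following minimal Python sketch fits a straight line by batch gradient descent on a tiny synthetic dataset; the learning rate and iteration count are arbitrary choices for this example.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.0, 6.2, 7.9])           # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    pred = w * x + b
    error = pred - y
    # Gradients of the mean squared error with respect to w and b.
    w -= lr * (2 * (error * x).mean())
    b -= lr * (2 * error.mean())

print(round(w, 2), round(b, 2))              # w converges near 2, b near 0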

Let's discuss a few machine learning methods that may prove vital for solving Big Data problems. These methods do not focus on algorithm logic only, but rather on the idea of learning:
□ Representation learning: Datasets with multi-dimensional features are becoming gradually more common nowadays, which challenges current learning algorithms to extract and manage the discriminative information in the datasets. Representation learning aims to achieve a reasonably sized learned representation that can capture many likely input configurations and can provide improvements in both statistical efficiency and computational efficiency.
□ Deep learning: Unlike most learning techniques that use shallowly designed learning styles, the deep learning technique uses supervised and/or unsupervised strategies in deep structures to learn hierarchical representations automatically. Deep architectures capture hierarchically composed statistical and complicated input patterns, achieve adaptiveness to newer areas better than traditional learning methods and frequently beat state-of-the-art techniques.
□ Distributed and parallel learning: Learning from massive datasets and figuring out the meaning hidden beneath those data behemoths can be exciting, but a bottleneck occurs in the form of the incapability of algorithms to use all the data present in a dataset to learn within a given time limit. This is where distributed and parallel learning offers a capable solution, since assigning the learning process to several workstations is an obvious way of improving the efficiency of the system as well as of the machine learning algorithms.
□ Active learning: In real-world applications, data may be plenty, but labels are scarce and expensive to obtain instantaneously. Also, learning from enormous quantities of raw data is time-consuming and difficult. Active learning deals with this issue by picking a subgroup of the most critical instances for labelling. In this way, the learner machine looks to achieve high precision using the fewest labelled instances possible, thus curtailing the cost of obtaining labelled data.

All the above forms of learning find supportive library functions in Hadoop and the HDFS file structure. Textual analysis and analytical tools end up deploying a few of the above learning techniques implicitly during regular operations, whose results are further evaluated and later studied to figure out the valuable insights offered by the automated learning. It is a clear case of artificial intelligence coupled with Big Data and associated technologies, and several developments in this field have only supported the overall machine learning narrative for corporations and service providers.

2.6.8 DATA VISUALISATION

Data visualisation is a valuable means through which larger datasets, after being combined, may appear practical, sensible and open to most people. Data visualisation is a trailblazing method that not only keeps you enlightened but also helps others grasp the attributes of a typical statistical and computational result that would otherwise have appeared intimidating to ordinary minds.

Visual representation is often considered to be the most effective medium of information and communication. As the saying goes, a picture is worth a thousand words, and data visualisation is a great example of that saying. When properly aligned, it can convey the critical information of a data analysis in probably the easiest way possible.

To be truly effective, a data visualisation should carry the correct amount of communicating quotient. Visualisations should be easy to use, well designed, meaningful, understandable and approachable.

Typical data visualisation helps in:
□ identifying the areas requiring improvement or attention
□ clarifying the factors that affect consumer decision-making or behaviour
□ making you realise a product's popularity
Many conventional data visualisation methods are still popular for imparting critical information in easier formats, like the histogram, table, line chart, scatter plot, bar chart, area chart, pie chart, flow chart, combination of charts, data flow diagram, Venn diagram and entity relationship diagram. Besides, a few data visualisation approaches are less known compared to the above methods but are still used, like the treemap, parallel coordinates, semantic network and cone tree.

Visualisation in Big Data can be achieved through numerous approaches, such as more than one view per illustrative display and active changes in filtering and factor numbers (star-field display, dynamic query filters and tight coupling). There are also a few standard problems for Big Data visualisation:
□ Visual noise: Most dataset objects are too tightly coupled to each other, making it tougher for users to divide them into distinct objects on the screen.
□ Information loss: Reduction of the visible datasets often leads to information loss.
□ High image change rate: Users simply observe the data and cannot react to the data changes or their intensity in real time on the display.
□ High performance requirements: Good data visualisation requires a highly efficient setup backed by scalable and robust machines that are ready to churn out visualisations in a high-performance environment.

According to the dataset criteria, the following points are considered before planning a dataset evaluation: data volume, variety and dynamics. A few popular forms of data visualisation are the treemap, circle packing, sunburst, parallel coordinates and the circular network diagram.

A number of visualisation tools are available on the Hadoop platform. The common modules in Hadoop, namely the Hadoop Distributed File System (HDFS), Hadoop Common, Hadoop YARN and MapReduce, efficiently analyse Big Data but lack suitable visualisation. Some software packages with visualisation and interactive functions for Big Data have been developed and are given below:
□ Pentaho: Supports BI functions such as dashboards, analysis, data mining and enterprise-class reporting.
□ Flare: An ActionScript library for making data visualisations in Adobe Flash Player.
□ JasperReports: It has a dedicated software layer for producing visual reports from dataset storage.
□ Platfora: It changes the raw Big Data of Hadoop into an interactive data processing engine and has the modular functionality of an in-memory data engine built in.

SELF ASSESSMENT QUESTIONS


10. Each system acts as a client and a server, and to access virtual machines, the data centre offers firewall-free, low-latency ISP traffic. Such an approach is called _________.
11. Parallelism is the execution of multiple threads concurrently
to complete a task in the shortest possible time. (True/False)

ACTIVITY
Research different machine learning methods and find out which methods and their algorithms are vital for solving Big Data problems.

SUMMARY
□ Distributed computing works on the rules of the divide and conquer approach, performing modules of the parent task on multiple machines and then combining the results.
□ Parallel computing refers to the utilisation of a single CPU present in a system, or of a group of internally coupled systems, by means of efficient and clever multi-threading operations.
□ Concurrency of a system is simply the operation of multiple threads that execute on single or multiple processors.
□ Distributed computing is considered to be a subset of parallel computing, which in turn is a subset of concurrent computing.
□ A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs, as described in the previous chapter, namely: Volume, Velocity, Variety, Veracity, Variability, Value and Visualisation.
□ Hadoop is an open-source platform that provides the analytical technologies and computational power required to work with such large volumes of data.
□ MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed or standalone architecture, producing results in a useful aggregated form.
□ Hive is a data warehouse tool created by Facebook based on Hadoop, which converts its query language into MapReduce jobs.
□ HBase is a Hadoop application running atop HDFS.
□ Pig is a high-level modular programming tool developed by Yahoo in 2006 for streamlining huge datasets with the use of Hadoop and MapReduce.
□ Python is a popular interpreted, general-purpose, high-level dynamic programming language that aims to improve code readability and overall ease of use, expressing programs in fewer statements than other competitive languages such as C++ or Java.
□ R is an open-source programming language and an application environment for statistical computing with graphics, developed by the R Foundation for Statistical Computing. It is an interpreted language like Python and uses a command-line interpreter.
□ One of the vital issues that organisations face with the storage and management of Big Data is the huge amount of investment needed to get the required hardware setup and software packages.
□ Cloud computing makes it possible for organisations to dynamically regulate the use of computing resources and access them as per need while paying only for those resources that are used.
□ The in-memory Big Data computing tool supports the processing of high-velocity data in real time and also faster processing of stationary data.
□ Massive parallelism refers to a parallel system where multiple systems interconnected with each other pose as a single mighty conjoint processor and carry out the tasks received from the datasets parallelly.
□ Distribution of data is a highly critical step in a typical Big Data setup.
□ High-performance computing is used for performing research activities and cracking advanced problems through computer simulation, modelling and analysis.
□ Task parallelism refers to the execution of computer programs throughout multiple processors on different or the same machines. It emphasises performing diverse operations in parallel to best utilise accessible computing resources like memory and processors.
□ Data mining is a process of extracting data, evaluating it from multiple perspectives and then producing an information summary in a meaningful form that identifies one or more relationships within the dataset.
□ Machine learning formally focuses on the performance, theory and properties of learning algorithms and systems. Machine learning is considered to be an ideal research field for taking advantage of the opportunities available in Big Data.
□ Data visualisation is a valuable means through which larger datasets, after being combined, may appear practical, sensible and open to most people.
KEYWORDS
□ Hadoop Distributed File System (HDFS): It is a fault-tolerant storage system in Hadoop.
□ Hive: A data warehouse tool created by Facebook based on Hadoop that converts a query language into MapReduce jobs.
□ MapReduce: It is a framework that helps developers write programs to process large volumes of unstructured data over a distributed or standalone architecture, producing results in a useful aggregated form.
□ Object Oriented Programming (OOP): A paradigm where data is encompassed within an object and carries several heuristic properties.
□ Pig: Pig is a high-level modular programming tool developed by Yahoo for streamlining huge datasets with the use of Hadoop and MapReduce.
□ Python: It is a popular interpreted, general-purpose, high-level dynamic programming language that aims to improve code readability and overall ease of use, expressing programs in fewer statements than other competitive languages such as C++ or Java.
□ R: It is an open-source interpreted programming language and an application environment for statistical computing with graphics, developed by the R Foundation for Statistical Computing.
□ Solid State Drives (SSD): Such storage drives have no mechanical components and higher read/write rates, which result in less wear and tear and robust performance.
DESCRIPTIVE QUESTIONS
1. Differentiate between parallel and distributed computing.
2. Explain the concept of Hadoop in Big Data.
3. What do you understand by cloud computing? Also, discuss its
three basic types of services.
4. Describe the concept of in-memory technology for Big Data.
5. Enlist and explain different types of Big Data techniques.

ANSWERS AND HINTS


ANSWERS FOR SELF-ASSESSMENT QUESTIONS

Topic                                      Q. No.   Answers
Distributed and Parallel                   2.       Concurrency
Computing for Big Data                     3.       True
Introduction to Big Data                   4.       False
Technologies                               5.       b. Hadoop
Cloud Computing and Big Data               6.       c. Hybrid cloud
                                           7.       False
In-Memory Technology for                   8.       In-memory Big Data computing
Big Data                                   9.       True
Big Data Techniques                        10.      P2P
                                           11.      False

ANSWERS FOR DESCRIPTIVE QUESTIONS

1. Distributed computing basically involves multiple processors interconnected by communication links, as opposed to parallel computing models, which usually (but not always) work on shared memory. Refer to Section 2.2 Distributed and Parallel Computing for Big Data.
2. Hadoop is an open-source platform that provides the analytical technologies and computational power required to work with such large volumes of data. Refer to Section 2.3 Introduction to Big Data Technologies.
3. Cloud computing makes it possible for organisations to dynamically regulate the use of computing resources and access them as per need while paying only for those resources that are used. Refer to Section 2.4 Cloud Computing and Big Data.
4. The in-memory Big Data computing tool supports the processing of high-velocity data in real time and also faster processing of stationary data. Refer to Section 2.5 In-Memory Technology for Big Data.
5. To analyse datasets, there are many Big Data techniques available. Refer to Section 2.6 Big Data Techniques.

SUGGESTED READINGS & REFERENCES


SUGGESTED READINGS
□ Wadkar, S., Siddalingaiah, M., & Venner, J. (2014). Pro Apache Hadoop. Berkeley, CA: Apress.
□ White, T. (2011). Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly.
E-REFERENCES
□ Welcome to Apache™ Hadoop®! (n.d.). Retrieved April 22, 2017, from http://hadoop.apache.org/
□ What is Hadoop? (n.d.). Retrieved April 22, 2017, from https://www.sas.com/en_us/insights/big-data/hadoop.html
□ Hadoop & Big Data. (n.d.). Retrieved April 22, 2017, from https://mapr.com/products/apache-hadoop/
CONTENTS

3.1 Introduction
3.2 Introduction to Business Analytics
Self Assessment Questions
Activity
3.3 Types of BA
Self Assessment Questions
Activity
3.4 Business Analytics Model
3.4.1 SWOT Analytical Model
3.4.2 PESTLE or PEST Analytical Model
Self Assessment Questions
Activity
3.5 Importance of Business Analytics
Self Assessment Questions
Activity
3.6 What is Business Intelligence (BI)?
Self Assessment Questions
Activity
3.7 Relation between BI and BA
Self Assessment Questions
Activity
3.8 Emerging Trends in BI and BA
Self Assessment Questions
Activity
3.9 Summary
3.10 Descriptive Questions
3.11 Answers and Hints
3.12 Suggested Readings & References
INTRODUCTORY CASELET
AMNESTY INTERNATIONAL

Amnesty International is a worldwide programme that includes over seven million crusaders who fight for a free world with equal human rights for all. Being a non-profit institution, the organisation has to rely on different donors and contributors, who get to know about campaigns through activities such as street fundraising, telephone outreach, petitions and mailers. When donors are involved, it is important to create a long-lasting relationship with them. Like many non-profits, Amnesty International has a Customer Relationship Management (CRM) system to make the relationship life-cycle last longer. The organisation also required performance improvement using contemporary data analytics procedures.

THE CHALLENGE

Around four years back, with the help of its in-house fundraising consultants, Amnesty International started seeking analytics software to work in parallel with the existing CRM systems. The fundraising consultants are responsible for gathering funds and managing various kinds of donors. They are also required to measure donors' sentiments and interests based on multiple inputs, such as various parameters and participatory ratios. For such measurements, they were dependent on programmers for analysing customers and directing specific campaigns at them based on their interactions with and contributions to the campaign and the organisation. It was a tedious exercise and not always accurate. There were regular gaps between the requirements the consultants asked for and what they were delivered.

THE SOLUTION

Based on the inputs gained from the consultants, Amnesty International finalised an analytics tool with an easy drag-and-drop interface to carry out the analytics processes as envisaged by the consultants.

The analytical tool was integrated with the CRM. Thus, using the contemporary analytics software with the CRM database became easier, making the reporting features much more robust. Of course, as a human rights organisation, Amnesty International performs all data analytics in obedience to privacy rules and protective integrity.
LEARNING OBJECTIVES

After studying this chapter, you will be able to:


- Describe business analytics and its types
- Explain business analytics model
- Recognise the importance of business analytics
- Elucidate the concept of Business Intelligence (BI)
- Describe the relation between BI and BA
- Identify the emerging trends in BI and BA

INTRODUCTION

The word 'Analytics' has multiple meanings and is open to interpretation for business and marketing professionals. The term is used differently by different experts and consultants, though in broadly similar fashion. Analytics, as per the definition of the business dictionary, is anything that involves measurement: a quantifiable amount of data that signifies a cause and warrants an analysis that culminates in a resolution.

This chapter discusses Business Analytics (BA) and its types. Next, the chapter discusses the Business Analytics model and the importance of Business Analytics. Further, this chapter discusses the concept of Business Intelligence (BI) and its relation with Business Analytics. In the end, this chapter discusses emerging trends in BI and BA.

INTRODUCTION TO BUSINESS ANALYTICS
Business Analytics is a group of techniques and applications for storing, analysing and making data accessible to help users make better strategic decisions. Business Analytics is a subset of Business Intelligence, which creates competences for companies to compete in the market efficiently and is likely to become one of the main functional areas in most companies (more on BI later in this chapter).

Analytics companies develop the ability to support decisions through analytical perception. Analytics certainly influence a business by acquiring knowledge that can be helpful in making enhancements or bringing change. Business Analytics can be segregated into many branches. Say, for a sales and advertising company, marketing analytics are essential to understand which marketing tactics and strategies clicked with customers and which didn't. With the performance data of the marketing branch in hand, Business Analytics becomes an essential way of measuring the overall impact on the organisation's revenue chart. These understandings direct investments in areas like media, events and digital campaigns. They allow us to understand customer results clearly, such as lifetime value, acquisition, profit and revenue driven by our marketing expenditure.

SELF ASSESSMENT QUESTIONS

1. Business Analytics is a subset of business analysis. (True/False)
2. Analytics companies develop the ability to support decisions through _________ perception.

ACTIVITY
How can business analytics bring a change for a newspaper hawker? Think it out.

TYPES OF BA
Going purely by the linguistic definition, there may be multiple elucidations of the term BA. However, in practical terms, there are four types of BA that help an organisation in gauging customer sentiments and then taking respective actions:
□ Descriptive analysis: It refers to "What is happening?" or "What happened?" type analytics based on incoming data. Such analytics is better studied through dashboards and reports. For example, a coffee shop experiences heavy rush on a day it least expected and is ill-prepared to do anything about it (a short sketch of this type appears after this list).
□ Diagnostic analysis: It refers to the analysis of past figures and facts to derive scenarios about what happened and why it happened. The result of this analysis is often a pre-defined reporting structure, such as a root cause analysis (RCA) report. For example, a root cause analysis may help in finding out the factors which the above coffee shop owners failed to read and comprehend.
□ Predictive analysis: It refers to the analysis of probabilities. Predictive analysis tries to forecast on the basis of previous data and scenarios. For example, a hotel chain owner might ramp down promotional offers during a restive season of rains in a coastal area. This is based on the prediction that there are going to be fewer footfalls due to heavy rain.
□ Prescriptive analysis: This analysis type tells you about the actions you should take. This is the most essential analysis type and typically forms the standards and recommendations for the next phase. For example, a doctor prescribes medicines to the patient after researching, studying, evaluating and diagnosing the cause of the patient's pain or irritation. Similarly, organisations too, after drawing out the statements, results, conclusions and other factors, will take steps to ensure that the factors affecting the growth charts positively continue to exist, whereas the damaging factors stay out of their future prospects.
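As a minimal sketch of the first type above, descriptive analysis, the following Python snippet uses pandas to summarise "what happened" in a hypothetical coffee-shop sales log, including the unexpected rush mentioned in the example.

import pandas as pd

sales = pd.DataFrame({
    "day":       ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "cups_sold": [210, 190, 205, 480, 230],   # Thursday's rush stands out
})

print(sales.describe())                        # what happened, in summary form
print(sales.loc[sales["cups_sold"].idxmax()])  # the day with the heaviest rush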

SELF ASSESSMENT QUESTIONS

3. A software firm has roped in a consultant to study the financial leaks happening in their billing system. This is an example of _________.
4. A company needs to launch their new product, but is on a limited marketing budget, and needs to figure out the best possible market response with a minimum investment. The analytics should help the company with studying _________.

ACTIVITY
Is there any other analysis type you can think of, other than the above four models? What would it be?

BUSINESS ANALYTICS MODEL


BA frequently utilises numerous quantitative tools to convert Big Data into meaningful information for making sound business moves. These tools can be further categorised into tools for data mining, operations research, statistics and simulation. Statistics, for instance, can be helpful in gathering, articulating and understanding Big Data as part of a descriptive analytical model.

A BA model assists organisations in making moves which yield fruitful results.

Here we will discuss the two analytical models most commonly used by analysts across the globe as standard analysis frameworks: SWOT and PESTLE analysis.

3.4.1 SWOT ANALYTICAL MODEL

SWOT analysis is amongst the most popular methods of gauging the organisational and corporate nerve of an organisation. SWOT stands for Strengths, Weaknesses, Opportunities, Threats.

As evident from the abbreviation, an organisation uses SWOT analysis to figure out its greatest extremes: strengths by which it can stand even in the toughest of times, weaknesses that may lead it to certain failure even in greener pastures, opportunities that may help in realising the organisation's full potential and, finally, the threats to the business that may end up exploiting its weaknesses and may turn its strengths into weaknesses. Figure 3.1 shows the SWOT diagram:

Strengths
• What does your organisation do better than others?
• What are your unique selling points?
• What do competitors and customers in your market perceive as your strengths?
• What is your organisation's competitive edge?

Weaknesses
• What do other organisations do better than you?
• What elements of your business add little or no value?
• What do competitors and customers in your market perceive as your weaknesses?

Opportunities
• What political, economic, social-cultural, or technology (PEST) changes are taking place that could be favourable to you?
• Where are there currently gaps in the market or unfulfilled demand?
• What new innovation could your organisation bring to the market?

Threats
• What political, economic, social-cultural, or technology (PEST) changes are taking place that could be unfavourable to you?
• What restraints do you face?
• What is your competition doing that could negatively impact you?

Figure 3.1: The SWOT Diagram

Source: https://s-media-cache-ak0.pinimg.com/736x/88/b0/1a/88b01aa805648a304c0a3bbd954c1a5e.jpg

SWOT is often considered a 360-degree tool to measure the pulse and vitals of an organisation. Businesses that have been in the market for long should conduct SWOT analysis periodically to evaluate the impact of changing market situations, get around newer business models and respond actively.

On the other hand, new starters should include SWOT in their planning process. SWOT is not necessarily a pan-organisation process; rather, each of the organisation's departments can have its own dedicated SWOT, such as a Marketing SWOT, Operational SWOT, Sales SWOT, etc.

A great example of the benefits of SWOT analysis is the turnaround of the modern day's largest company in the world by valuation: Apple Inc. Steve Jobs returned to Apple in 1997 after a long battle with the existing stakeholders who had control over the shares and stocks. On its return to the computing market, facing a mighty challenger in Microsoft, Apple didn't take them head-on as most would have expected. Rather, it focused on the opportunities and laid back on the threats part, since it had 'nothing to lose'. Apple identified opportunities in newer areas of technology while the world was busy hailing computers as the lone torch-bearer of the IT revolution.
3.4.2 PESTLE OR PEST ANALYTICAL MODEL

PESTLE stands for Political, Economic, Social, Technological, Legal and Environmental. PESTLE analysis is a method for figuring out external impacts on a business. In some countries, the legal and environmental parts are combined into the social, political and economic parts; hence they use PEST.

PEST analysis is an examination of the external environment in which an organisation currently exists or is going to enter. It is a handy tool for understanding the economic, socio-cultural, political and technological environment that an organisation functions in. A sample PEST analysis is shown in Figure 3.2:

PEST Analysis

POLITICAL
• Domestic political stability
• New audit and tax law
• Regulatory bodies and processes
• New employment law and HR compliance

ECONOMIC
• State of the domestic and international economies
• Inflation rate
• Competitor fee changes
• New practice opportunities
• Seasonality issues

SOCIAL
• Demographic shifts
• Accountant roles and changes
• Brand and firm image
• Major events and influences
• Marketing, public relations, and advertising

TECHNOLOGICAL
• Technological changes
• Automation
• Recent technological developments
• Intellectual property issues

Figure 3.2: PEST Analysis

Source: https://www.smartdraw.com/pest-analysis/

The following is how PEST can reign as an effective analytical charter for most organisations:
□ Political factors: These are government regulations in different countries related to employment, tax, environment, trade and government stability.
□ Economic factors: These factors affect the purchasing power and cost of capital of a corporation, such as economic growth, inflation, currency exchange and interest rates.
□ Social factors: These influence the consumers' requirements and the possible market size for an organisation's products and services. These factors include age demographics, population growth and healthcare.
□ Technological factors: These influence the barriers to entry and investment decisions related to buying and innovation, such as investment incentives, automation and the adaptability quotient for the technology.
PEST factors can also be categorised as threats or opportunities in a SWOT analysis. It is ideal to complete a PEST analysis before a SWOT. Also, it is worth noticing that the four components of the PEST model vary in meaning based on the business type. For example, social factors matter more to a consumer-oriented business at the customer's side of the supply chain. On the other hand, political factors matter more to an aerospace manufacturer or a defence contracting firm.

SELF ASSESSMENT QUESTIONS

5. SWOT stands for _________.
6. SWOT is often considered as a 360-degree tool to measure the pulse and vitals of an organisation. (True/False)

ACTIVITY

1. Do an honest SWOT of Big Data so far.


2. Can a strength identified in SWOT be a political challenge in
PEST? Support your answer with an example.

IMPORTANCE OF BUSINESS ANALYTICS


The need for analytics arises from our basic day-to-day life. An average person has to analyse the time factor, from getting up from bed to getting ready to leave for office, so as to reach on time in a relaxed manner. Not only that, it also includes analysing the best possible route to avoid the traffic and save more time in order to have an extra cup of coffee for the day! As is evident, even a ballpark analysis of daily life often yields results assuring us that analytics actually is an efficient way of measuring and tracking your results periodically.

This is even more true for a business. BA helps organisations:
□ To understand leads, audience, prospects and visitors
□ To understand, improve and track the method that can be used to impress and convert the first lead or prospect into a valuable customer

Significance of BA:
□ To get insights into customer behaviour: The prime advantage of financing some BI software and experts is the fact that it increases your skill to examine present customer-purchasing trends. Once you know what your customers are ordering, this information can be used to create products matching present consumption trends and thus improve your cost-effectiveness, since you can now attract more valued consumers.
□ To improve visibility: BA helps you in getting to a vantage point within the organisational complexities where you can have better visibility of the processes, making it possible to recognise any parts requiring a fix or improvement.
□ To convert data into worthy information: A BI system is a logical tool that can educate you, enabling you to make successful strategies for your corporation. Since such a system identifies patterns and key trends from your corporation's data, it makes it easier for you to connect the dots between different points of your business that may otherwise seem disconnected. Such a system also helps you better comprehend the inferences drawn from the multiple structural processes and increases your skill to recognise the right and correct opportunities for your organisation.
□ To improve efficiency: One critical reason to consider a BI system is the increase in the efficacy of the organisation, leading to increased productivity. BI helps in sharing information across multiple channels in the organisation, saving time on reporting analytics and processes. This ease of sharing information reduces redundancy of duties or roles within the organisation and improves the precision and practicality of the data produced by different divisions.

Consider a typical website that relies on visitor footfall and subsequent click-based advertising revenues. Such an organisation needs analytics more often than other organisations who have a dedicated business running in brick-and-mortar stores and who use their website only for marketing purposes.

BA is an important area that helps equip you with the correct weapons to make the correct business decisions. For example, if you already expect some turmoil in one of your business sections, you can do a SWOT of the section and impact the overall outcome positively. Here, BA not only helps you in retaining a section full of customers, but also helps you in avoiding a future conflict of a similar nature. BA arms you with a situational arsenal: you get a machine gun in the form of viral marketing campaigns when you are targeting a mass audience for a given product, whereas in the case of customer withdrawal or ramp-up, you can have your sniper ready to target them out specifically.

SELF ASSESSMENT QUESTIONS

7. A BI system is a _________ tool that can educate you to enable you in making successful strategies for your corporation.
8. Business Intelligence helps in sharing information across multiple channels in the organisation, saving time on reporting analytics and processes. (True/False)
ACTIVITY

Prepare a report on a case where a business gained effectively from SWOT analysis.

WHAT IS BUSINESS INTELLIGENCE (BI)?


Business Intelligence (BI) is a set of applications, technologies and ideal practices for the integration, collection and presentation of business information and its analysis. The motto of BI is to facilitate improved decision-making for businesses.

BI utilises computing techniques for the discovery, identification and analysis of business data, like products, sales revenue, earnings and costs.

BI models provide present, past and predictive views of structured internal data for goods and departments. They provide effective strategic and operational insights and help in decision-making through predictive analytics, reporting, benchmarking, data/text mining and business performance administration.

Common applications of BI:
□ Performance and benchmarking measurement and overall progress tracking towards achieving business goals
□ Quantifiable analysis with the help of predictive modelling, analytics, statistical analysis and business process modelling
□ Joint plans allowing internal and external business units to cooperate through data sharing and electronic data interchange
□ Usage of knowledge management programmes to recognise and create insights and skills for regulatory compliance and learning management

BI also includes explicit practices and procedures for applying interactive data-amassing techniques, like:
□ Examining the organisations and institutions
□ Selection and preparation of interview candidates
□ Creation and development of interview questions based on the subject
□ Preparation and lining up of the interviews

BI-based solutions are most apt for industries with a huge customer base, higher competition levels and massive data volumes. Some of the exclusive BI functions include the following:
□ Examining sales trends
□ Following customer-purchasing habits
□ Handling finances
□ Assessing sales and advertising campaign efficiency
□ Forecasting market demand
□ Examining vendor dealings
□ Evaluating staffing requirements and performance

SELF ASSESSMENT QUESTIONS

9. Business Intelligence does not utilise existing computing techniques for the discovery, identification and analysis of business data. (True/False)
10. _________-based solutions are most apt for industries with a huge customer base, higher competition levels and massive data volumes.

ACTIVITY
How can an election campaign benefit from BI? Make a case study
on it.

RELATION BETWEEN BI AND BA


BI is an umbrella term in the broader sense that encompasses everything under it, like data analytics and visualisation, which also includes BA. BA is a subset of BI. BI at the root level is the skill of converting business data into knowledge to aid the decision-making process. The conventional method of doing this involves logging and probing past data and using the overall outcome of the reading as the standard for setting future benchmarks.

BA emphasises data usage to gain new insights, while conventional BI uses a constant, recurring set of metrics to drive strategies for future business on the basis of historical data. If BI is the method of logging the past, BA is the method of dealing with the present and forecasting the future.

THE EVOLUTION OF BI vs. BA

Earlier, BI was used to describe the people, procedures and applications used to access and infer meaning from information, for improving decisions and understanding the effectiveness of targeted decisions. The rapid development of BA originates from this limitation; it is, in a way, the advanced form of the BI solution. In a business world of ever-increasing speed, the user should have the capacity to interact with data at the speed of business. An information-driven organisation sees its information as an asset and uses it to outperform rivals. The more information the client has, the better lead he or she has over a competitor who could pose a threat.

An ever-increasing number of individuals are being asked to interpret information in roles that are not entirely analytical. As the significance of information-driven decisions gains acknowledgement in less analytically inclined departments, the requirement for easier-to-use and faster platforms grows. In addition, diagrams and charts presenting BA conclusions are faster and more effective than written statistics and spreadsheets overloaded with information.

The difference between BI and BA is that BI equips you with the information, whereas BA gives you the knowledge.

With the help of BA, you get to know the pain points of your business, your product's standing in the market, the business strengths that put you ahead of the competition and the opportunities you are yet to explore. BA helps you know your business thoroughly. BI helps in bridging the gap between ground reality and management perspective on a pan-organisational basis.

BI helps you compound your strong points collectively, weed out weaknesses efficiently and manage the organisational business more effectively. It helps you capitalise on the lessons learned from the BA findings about the organisation. Table 3.1 shows the differences between BI and BA:

TABLE 3.1: DIFFERENCES BETWEEN BI AND BA

BI: Uses current and past data to optimise present-day performance for success.
BA: Utilises past data and separately analyses current data with past data as reference to prepare the business for the future.

BI: Informs about what happened.
BA: Tells why it happened.

BI: Tells you the sales numbers for the first quarter of a fiscal year, or the total number of new users signed up on our platform.
BA: Tells you why your sales numbers tanked in the first quarter, or about the effectiveness of the newly launched campaign for making users refer other users to our platform.

BI: Quantifiable in nature; it can help you in measuring your business with visualisations, charting and other data representation techniques.
BA: More subjective, open to interpretation and prone to changes due to ripples in the organisational or strategic structure.

BI: Studies the past of a company and ponders over what could have been done better in order to have more control over the outcomes.
BA: Predicts the future based on the learning gained from the past, present and projected business models for a given term in the near future.

Another new trend is the ability to combine multiple data projects into one, while making the result useful in sales, marketing and customer support. That concept underlies CRM - Customer Relationship Management software - which sources raw data from every division and department and compiles it into a new understanding that otherwise would not have been visible from one point alone.

All this boils down to the interchangeable usage of the terms "business intelligence" and "business analytics" and their importance in managing the relationship between business managers and data. Owners and managers, as a result of such accessibility, now need to be more familiar with what data is capable of doing and how they need to actively produce data to create lucrative future returns. The significance of the data hasn't changed; its availability has.

SELF ASSESSMENT QUESTIONS


11. BI at the ________ level is the skill of converting business data into knowledge to aid the decision-making process.
12. BA emphasises data usage to gain new insights, while conventional BI uses a constant, recurring set of metrics to drive strategies for future business on the basis of historical data. (True/False)

ACTIVITY
Create a case study on an election campaign for a new party using a BA system and compare the outcomes with those of a BI system.

EMERGING TRENDS IN BI AND BA

Following are the contemporary trends in the BI and BA fields:
□ More power and monetary impact for data analysts: Analysts are consistently topping the demand charts across many industries, thanks to the demand-driven analytics bandwagon that has made the industry take cognizance of data analysts and led to a spike in other roles, like Information Research Scientists and Computer Systems Analysts.
□ Location analytics: Another major business driver in 2016 was related to location and geospatial analytical tools that give organisations better market intelligence and placement in terms of effective campaigns, for example, a company aiming geocentric campaigns at specific customers.
□ Data at the edge: Businesses must look beyond the usual data sources inside their data centres, since data flows now initiate outside the data centre from multiple sensor devices and servers, e.g. a spatial satellite or an oil rig at sea.
□ Artificial Intelligence (AI): This is a top trend as per multiple studies, with scientists aiming to make machines that replicate what complex human reflexes and intelligence achieve. The analytical work on such programmes is growing exponentially, with AI and machine learning transforming the way we relate to analytics and data management.
□ BI Centre of Excellence (CoE): Moving to a simpler, secure and effective BI strategy isn't entirely the onus of IT. The difficulty of data management in huge companies is astounding, and the need to strengthen it is becoming important. A growing number of organisations are opting for BI and Analytical CoEs to support the implementation of self-service analytics. These CoE centres will have a great role in applying an information-driven culture and getting the maximum advantage from a BI solution. Through mediums like virtual forums and training, the CoEs will authorise even laymen to include data in their decision-making strategy. It is quite an efficient way of getting skilled people, processes and technology aligned in a structured manner in one place.
□ Predictive analytics and impact on data discovery: By gathering more information, organisations will have the capacity to build more detailed visual models that will help them to act in more accurate ways. For instance, having better information models shows organisations more about what clients are purchasing, and even what they are possibly going to purchase in future. From CRM to sales or marketing deals, predictive analytics and cutting-edge BI are set to bring disruption (a toy model is sketched after this list).
□ Cloud computing: Cloud computing is being absorbed into many systems and will continue to grow. We've witnessed the division of the Cloud into multiple vendor systems, and many companies are utilising Cloud services to host powerful data analytics tools. A lot of customers are already using Microsoft Azure and Amazon Redshift along with Cloud resources that provide flexible handling and scalability for the data.
□ Digitisation: It is a process of turning any analogue image, sound or video into a digital format understandable by electronic devices and computers. This data is usually easier to store, fetch and share than the raw original format (e.g. turning a tape recording into a digital song). The gains from digitising data-intensive processes are great, with up to 90% cost cuts and much faster turnaround times than before. Creating and utilising software over manual processes allows businesses to gather and screen data in real time, which assists managers in tackling issues before they turn critical.

SELF ASSESSMENT QUESTIONS


13. CoE stands for:
a. Centre for Excellence b. Centre of Excellence
c. Centre of Excel d. None of these
14. ________ is a process of turning any analogue image, sound or video into a digital format understandable by electronic devices and computers.

ACTIVITY

Which trend do you think will emerge next in the BI and BA field? Discuss.

SUMMARY

□ Business Analytics is a group of techniques and applications for storing, analysing and making data accessible to help users make better strategic decisions.
□ Analytics certainly influences the business by acquiring knowledge that can be helpful to make enhancements or bring changes.
□ In diagnostic analysis, an analysis of past figures and facts is done to derive scenarios about what happened and why it happened.
□ Business analytics frequently utilises numerous quantitative tools to convert big data into meaningful contexts valuable for making sound business moves.
□ PESTLE stands for Political, Economic, Social, Technological, Legal and Environmental - a method for figuring out numerous external impacts on a business.
□ Business Intelligence (BI) is the set of applications, technologies and ideal practices for the integration, collection and presentation of business information and analysis.

KEYWORDS

□ Business analytics: It is the subset of Business Intelligence which creates competencies for companies to compete in the market efficiently.
□ PEST analysis: It is an examination of the external environment in which an organisation currently exists or is going to enter or start.
□ Predictive analysis: A kind of analysis that is based on probabilities.
□ Prescriptive analysis: A kind of analysis that tells you about what actions you should take.
□ SWOT: It stands for Strengths, Weaknesses, Opportunities and Threats.

DESCRIPTIVE QUESTIONS
1. Discuss the concept of BA.
2. Enlist and explain different types of BA.
3. Explain the different analytical models with the help of real-
time examples.
4. Discuss the importance of BA with suitable examples.
5. Describe the importance of BI.
6. Discuss the evolution and relation between BA and BI.

ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                    Q. No.  Answers
Introduction to Business Analytics       1.      False
                                         2.      Analytical
Types of BA                              3.      Diagnostic
                                         4.      Predictive
Business Analytics Model                 5.      Strengths, Weaknesses, Opportunities, Threats
                                         6.      True
Importance of Business Analytics         7.      Logical
                                         8.      True
What is Business Intelligence (BI)?      9.      False
                                         10.     Business Intelligence (BI)
Relation between BI and BA               11.     Root
                                         12.     True
Emerging Trends in BI and BA             13.     b. Centre of Excellence
                                         14.     Digitisation

HINTS FOR DESCRIPTIVE QUESTIONS

1. Business Analytics is a group of techniques and applications for storing, analysing and making data accessible to help users make better strategic decisions. Refer to Section 3.2 Introduction to Business Analytics.
2. There are four types of BA that help an organisation in gauging customer sentiments and then taking the respective decisive actions. Refer to Section 3.3 Types of BA.
3. The two analytical models most commonly used by analysts across the globe as standard analysis factors are SWOT and PESTLE analysis. Refer to Section 3.4 Business Analytics Model.
4. BA helps you in getting to a vantage point in the organisational complexities where you can have better visibility of the processes, making it likely to recognise any parts requiring a fix or improvement. Refer to Section 3.5 Importance of Business Analytics.
5. Business Intelligence (BI) is the set of applications, technologies and ideal practices for the integration, collection and presentation of business information and analysis. Refer to Section 3.6 What is Business Intelligence (BI)?
6. BA and BI can be two of the most interchangeably used terms, but they are rarely explained in a way that does not put the end user in a much vaguer position than before. Refer to Section 3.7 Relation between BI and BA.

SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
□ Liebowitz, J. (2013). Big data and business analytics. Boca Raton (FL): CRC Press.
□ Laursen, G. H., & Thorlund, J. (2017). Business analytics for managers: Taking business intelligence beyond reporting. Hoboken, NJ: John Wiley & Sons, Inc.

E-REFERENCES
□ What is big data analytics? - Definition from WhatIs.com. (n.d.). Retrieved April 25, 2017, from http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
□ What is business analytics (BA)? - Definition from WhatIs.com. (n.d.). Retrieved April 25, 2017, from http://searchbusinessanalytics.techtarget.com/definition/business-analytics-BA
□ Monnappa, A. (2017, March 24). Data Science vs. Big Data vs. Data Analytics. Retrieved April 25, 2017, from https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article
CONTENTS

4.1 Introduction
4.2 What is Data, Information and Knowledge?
Self Assessment Questions
Activity
4.3 Business Analytics Personnel and their Roles
Self Assessment Questions
Activity
4.4 Required Competencies for an Analyst
Self Assessment Questions
Activity
4.5 Business Analytics Data
Self Assessment Questions
Activity
4.6 Ensuring Data Quality
Self Assessment Questions
Activity
4.7 Technology for Business Analytics
Self Assessment Questions
Activity
4.8 Managing Change
Self Assessment Questions
Activity
4.9 Summary
4.10 Descriptive Questions
4.11 Answers and Hints
4.12 Suggested Readings & References
INTRODUCTORY CASELET

CHALLENGES FACED BY A CLOUD SERVICE PROVIDER

A corporation, XYZ Inc., based outside India, delivers managed IT operations, hosted applications and cloud-based services to business enterprises across the globe. It has earned great ratings for its brilliant service and customer care, thanks to its inclusive Service Level Agreements (SLAs) and a consistent focus on improving the customer service experience.

XYZ Inc. provides its customers a private and tailor-made cloud infrastructure to execute important applications with the help of the latest cutting-edge tools, which support the company in looking after customer needs while reducing management and system complications.

Along with a zero-tolerance policy for downtime, maximum data security is another core focus area of the company, for which it has two network-connected data centres in metro cities working in tandem, with one data centre designated as a backup/failover recovery site for the other, to create a secure and reliable disaster recovery solution.

The organisation faces many challenges that other corporate IT organisations experience nowadays, like availability, reliability, agility, security and shoe-string budget concerns, as do cloud and hosting service providers. Being a managed service provider, XYZ Inc. must abide by tighter SLAs than most organisations deliver to their internal customers.

Being an organisation with products and services of this range, XYZ certainly faces some challenges, as described here:
□ Growing operational productivity: XYZ needs to ensure unified deployment of ongoing operations and customer applications despite ever-augmenting resource requirements from new and prevailing customers.
□ Dropping operational expenses: The company had to reduce costs in order to remain competitive while managing all sorts of non-revenue-linked maintenance sources.
□ Guaranteeing high accessibility: Reliable disaster recovery, high availability and complete security of the data are a few reasons why customers have chosen XYZ as the service provider.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:


- Describe the meaning of the terms-data, information, and
knowledge
- Discuss the role of business analytics personnel
- List the required competencies for an analyst
- Recognise the challenges of business data analytics
- Describe how data quality management framework ensures
data quality
- Explain the technology used for business analytics
- Discuss change management in business analytics

INTRODUCTION

Business analytics is a process to filter and analyse sets of data, which might be small bits of data, a file containing data or a large collection of data generally known as a database. With the growth in data, a need arises to store it at an appropriate location from where it can be easily accessed and modified irrespective of geographical location. Unlike small datasets, which are useful only for individual organisations, Big Data is useful for various organisations. To store Big Data, companies use cloud technology, data warehousing, etc. This data is further retrieved from its storage and analytics is applied to it to derive useful information. The analytics involves the use of various statistical methods, such as measures of central tendency, graphs, etc., to derive significant information from data. This useful information is further used in businesses for decision making, growth, planning, creating action plans and increasing overall profitability. This way of sorting data to derive useful information has given a new purpose to business analytics.

In this chapter, you will first study data, information and knowledge. Next, the chapter discusses business analytics personnel and their roles. Further, the chapter discusses the required competencies for an analyst. Next, the chapter details business analytics data and the importance of ensuring data quality. Towards the end, the chapter discusses technology for business analytics and change management.

WHAT IS DATA, INFORMATION AND KNOWLEDGE?

Data, to put it simply, is raw material that does not make any definite sense unless you process it to some meaningful end. It can be anything from a collection of numbers to text and unrelated symbols. It needs to be processed within a context before being logically viable.

Examples of data:
2, 4, 6, 8
Mercury, Jupiter, Pluto

The above data alone does not present the true picture. Maybe the first sequence is simply the table of two, or a sequence with a difference of two between numbers. The names may just be the names of conference rooms in an organisation rather than planet names. Unless you give it a logic and define the reasoning for its existence, the data does not have a standalone existence by itself.

Information is the result that we achieve after the raw data is processed. This is where data takes shape as per the need and starts making sense. Standalone data has no meaning; it only assumes meaning and transitions into information upon being interpreted. In IT terms, characters, symbols, numbers or images are data. These are the inputs which a system running in a technical environment needs to process in order to produce a meaningful interpretation.

Information can offer answers to questions like which, who, why, when, what and how. Information put into an equation should look like:

Information = Data + Meaning

Examples of information:
2, 4, 6, 8 are the first four multiples of 2.
Mercury, Jupiter, Pluto are the names of planets.

Only when we allocate a context, or meaning, does the data become information.
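As a trivial Python sketch of this equation, the same raw numbers become information only once a context is attached (the interpretation chosen below is just one of the possible meanings discussed above):

# Raw data: meaningless on its own.
data = [2, 4, 6, 8]

# Attaching a meaning (context) turns the data into information.
information = {"meaning": "first four multiples of 2", "values": data}

print(f"{information['values']} are the {information['meaning']}.")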

Data is raw, information is processed and knowledge is gained.

Knowledge is something that is inferred from data and information. Actually, knowledge has a far broader meaning than the typical definition. Knowledge is an assembly of meaningful information whose intent is to be valuable. Knowledge is a deterministic process.

Knowledge can be of two types:

□ Obtaining and memorising the facts
□ Using the information to crack problems

The first type is regularly called explicit knowledge, meaning knowledge that can be simply transferred to others. Explicit knowledge and its offspring can be kept in a certain media format, for example encyclopedias and textbooks.

The second type is termed tacit knowledge, referring to knowledge that is complex and intricate. It is not gained simply by being passed on to others, and requires elevated and advanced skills in order to be comprehended. For example, it will be tough for a foreign tourist to understand the local customs or rituals of a specific community located in a country whose language is different from the tourist's own. In such a case, the tourist needs to be conversant with the language or requires additional resources in order to understand the rituals. Similarly, the ability to speak a language or use a computer requires knowledge that cannot be gained explicitly and is rather learned through experience.
HOW ARE DATA, INFORMATION AND KNOWLEDGE LINKED?

Data signifies an element or statement of events without being related to other things, for example: it is raining. Information symbolises a relationship of some type, perhaps cause and effect, that acts as a bridge between data and knowledge. The topics are hierarchical in the following order, as shown in Figure 4.1:

Data → Information → Knowledge

Figure 4.1: Transforming Data into Knowledge

For example: the temperature fell 15 degrees, followed by rain. Here, the inference based on the data becomes information. Knowledge signifies a pattern that connects these and usually provides a high-level view and the likelihood of what will happen next. For example, if humidity levels are high and the temperature drops considerably, the atmosphere is unlikely to hold the moisture, hence it rains. The pattern is reached on the basis of comparing valid points emanating from data and information, resulting in knowledge, sometimes also referred to as wisdom.
Wisdom exemplifies the understanding of the essential principles embodied within the knowledge that are the foundation for the knowledge in its current form. Wisdom is systematic and includes an understanding of all the interactions that happen between rain, temperature gradients, evaporation and air currents.

SELF ASSESSMENT QUESTIONS

1. Information put into an equation should look like:
Information = ________ + Meaning
2. Explicit knowledge and its offspring can be kept in a certain
media format for example in encyclopedias and textbooks.
(True/False)

ACTIVITY

Suppose you have to explain to a school-going kid the difference between data, information and knowledge. Describe the method and technique you will use.

BUSINESS ANALYTICS PERSONNEL AND THEIR ROLES

A business analyst is anyone who has the key domain experience and knowledge related to the paradigms being followed. He/she often needs to wear multiple hats related to the field he/she is in. A business analyst can be anyone, from an executive to a top-level project director, given that they have a grasp of the system, its techniques and functionality - since all they represent is the business their organisation is offering to customers.

KEY ROLES AND RESPONSIBILITIES OF A BUSINESS ANALYST

Requirements are an essential part of creating successful IT solutions. Defining, documenting and analysing requirements from a business analyst's perspective helps in demonstrating what a system can do. The skills of a business analyst are shown in Figure 4.2:

Figure 4.2: Skills of a Business Analyst

Described below are a few of the key requirements and responsibilities of a business analyst in managing and defining requirements:
□ Gathering the requirements: Requirements are a key part of IT systems. Inadequate or unfitting requirements often lead to a failed project. The business analyst fixes the requirements of a project by mining them from stakeholders and from current and future users, through research and interaction.
□ Expecting requirements: A business analyst who has expertise in his/her field knows that in the dynamic world of IT, things can change quickly, even before the change can be expected. Plans developed at the start are always subject to alteration, and anticipating requirements that might be needed in the future is key to successful results.
□ Constraining requirements: While complete requirements are a must for a successful project, the emphasis should be on essential business needs, and not personal user preferences, functions based on outdated processes or trends, or other unimportant changes.
□ Organising requirements: Requirements often come from multiple sources that may sometimes conflict with one another. A business analyst must segregate requirements into associated categories to efficiently communicate and manage them. Requirements are organised into types as per their source and application. An ideal organisation prevents project requirements from being overlooked, and thus leads to an optimum use of budgets and time.
□ Translating requirements: A business analyst must be skilled at interpreting and converting business requirements effectively into technical requirements. It involves using powerful modelling and analysis tools to match planned business goals with real-world technical solutions.
□ Protecting requirements: At frequent intervals in a project's lifecycle, the business analyst protects the user's and business needs by confirming the functionality, precision and inclusiveness of the requirements developed so far compared with the requirements gathered in the initial documents. Such protection reduces risk and saves considerable time by certifying that the requirements are being fulfilled before devoting further time to development.
□ Simplifying requirements: The main role of a business analyst is to simplify tasks and maintain easier functionality. Completing the business objective is the aim of every project; a business analyst recognises and avoids unimportant activities that are not helpful in resolving the problem or achieving the objective.
□ Verifying requirements: A business analyst is the most informed person in a project about the use cases; hence, they frequently validate the requirements and discard implementations that do not help in bringing the business objective to culmination. Requirement verification is completed through test, analysis, inspection and demonstration.
□ Managing requirements: Usually, an official requirements presentation is followed by a review and approval session, where project deliverables, cost and duration estimates and schedules are decided and the business objectives are rechecked. Post approval, the business analyst shifts to requirement management events and activities for the rest of the project lifecycle.
□ Maintaining system and operations: Once all the requirements are completed and the solution is delivered, the business analyst's role shifts to post-implementation maintenance to ensure that defects, if any, do not occur or are resolved within the agreed SLA timelines; that any enhancements to be made to the project are carried out; and that change activities are performed to make the system yield more value. Similarly, the business analyst is also responsible for many other post-implementation activities, such as operations and maintenance, or providing system authentication procedures, deactivation plans, maintenance reports and other documents like reports and future plans. The business analyst also plays a great role in studying the system to determine when replacement or deactivation may be required.

SELF ASSESSMENT QUESTIONS


3. Inadequate or unfitting requirements often lead to the ________ of a project.
4. A business analyst must ________ requirements into associated categories to efficiently communicate and manage them.

ACTIVITY
As a business analyst, prepare a report on your analytical study of Sony Corporation, currently undergoing turmoil due to serving too many business areas.

REQUIRED COMPETENCIES FOR AN ANALYST

The business analyst role is considered a bridge between business stakeholders and IT. Business analysts need to be great at verbal and written communication, diplomatic, experts with problem-solving acumen, and theorists with the ability to engage with stakeholders to comprehend and respond to their needs in a dynamic business environment. This includes dealing with senior members of management and challenging interrogation sessions to confirm that the time is well spent and value-for-money development can commence.

Business analysts need not necessarily be from an IT background, although it certainly helps to have a basic understanding of IT systems and how they work. Sometimes, business analysts come from a programming or other technical background, often from within the business - carrying a thorough knowledge of the business field, which can likewise be very useful. To be a successful business analyst, you ought to be a multi-skilled person who is adaptable to an ever-changing environment. The following are some of the most common skills that a decent business analyst should have:
□ Understanding the objectives: Being able to understand the path and commands is important. If you cannot understand what and, more significantly, why you are assigned to do something, the chances are high that you cannot deliver what is required. Do not hesitate to ask questions or seek additional information if you have any doubts.
□ Having good communication skills: It sounds obvious, but it is necessary to have good verbal communication skills, preferably in a global environment, where multitudes of stakeholders, management and resources from diverse backgrounds collaborate on a single platform to discuss, debate and finalise the requirements, which would incidentally be captured by you. It is necessary for you to have that comprehension level along with the eloquence to deliver your conceptions or clear any doubts which you have. You should be able to make your point evidently and explicitly. Communicating data and information at the appropriate level is important, as some stakeholders require more detailed information than others due to varying levels of understanding.
□ Manage stakeholder meetings: While email, which also acts as an audit trail, is a fair method to facilitate communication, sometimes it turns out to be not enough. Old-school face-to-face discussions and meetings for detailed deliberation over problems and queries are still a popular way of carrying out effective analysis. Most of the time, you end up discovering more about your project from the physical presence of all stakeholders, where all collaborators tend to be open about debating circumstances.
□ A good listener: You are better off listening more than you speak and jotting down notes and takeaways from the meetings. Having good listening skills requires the patience and virtue to understand and listen to the stakeholder, which gives them a feeling of being heard and not being overlooked or overpowered by a dominating analyst; projects with dominating analysts often end up in a mess sooner than they should. Your listening and information-absorbing skills are important in making you an effective analyst. Do not only listen, but understand the situation, and question only where you think you are being condescended to by stakeholders passing off unnecessary off-business requirements and ignoring the actual requirements that can help in making an efficient system. You can attend personality development training to gain control over voice modulation, dialect and pitch moderation, along with effective body language and business presentation skills.

□ Improving the presentation skills: As a business analyst, you are supposed to be presentable at any time, round the clock. You will often lead workshops or pitch a piece of work to the stakeholders or to the internal project team. It is important to give due consideration to the content of your presentation and ensure that it matches the objectives to be met - since there is no point in presenting implementation methods if the meeting is about gathering requirements. These presentations not only present information but also act as a good way to get more clarity or information from stakeholders in case you are looking for further details on a specific part of the project.
□ A time manager: A business analyst is responsible for maintaining the timeframes of the project as well as the corporate schedules. The BA should ensure that the project meets the pre-agreed project milestones, along with the daily tracking schedules being fulfilled by the development team. A business analyst should prioritise activities, separating critical ones from those that can wait, and focus on them.
□ Literary and documenting skills: Requirements documents, specifications, reports, analyses and plans: being a business analyst, you are supposed to deliver numerous types of documentation that will go on to become project and legal documents later on. So, you need to ensure that your documents are created concisely, and at a comprehensible level for the stakeholders. Avoid jargon specific to a particular field, as it may not be understood by all stakeholders and may later create confusion or other complexities of interpretation. Starting as an inexperienced business analyst, you will gradually learn to write requirements documentation and reports, but having strong writing skills is enough to give you a head start over others, since it will lead to unambiguous requirements documentation.
□ Stakeholder management: It is important that you know how to deal with stakeholders and know how much power and impact they have on your project. Stakeholders can either be your best friends/supporters or your greatest critics. An accomplished business analyst will have the skill to investigate the degree of management every stakeholder needs and how they ought to be individually dealt with.
□ Develop your modelling skills: As the expression goes, a picture paints a thousand words. Techniques such as process modelling are compelling tools to pass on a lot of data without depending on text. A visual portrayal enables you to get an outline of the issue or project so that you can see what functions well and where the loopholes lie.

SELF ASSESSMENT QUESTIONS


5. To be a successful business analyst, you ought to be a multi-skilled person who is adaptable to an ever-changing environment. (True/False)
6. A business analyst is not responsible for maintaining the
timeframes of the project as well as corporate schedules.
(True/False)

ACTIVITY

You are a veteran business analyst, responsible for coaching a new batch of management trainees in an organisation. Lay out the course plans and methods you will utilise to train them on the standards and the knowledge.

BUSINESS ANALYTICS DATA

Any approach to analytics must adjust to changes in the way people work inside their business settings, particularly with the growing size of data volumes. Arranging data in a way that is customised to make sense for every business customer requires infusing content with context before augmenting the value of relevant filtering and representation. Enhancing the enormous amounts of data and presenting significant learnings for every business consumer's needs comes with many difficulties. We can segregate those problems as data analytics challenges: creating algorithms that will gather, analyse, group, channel, categorise and ultimately filter the meaning, and also persistently retrain the machine, slicing and dicing this data in view of individual needs and conveying it in a way that is most useful depending on a person's perspective (area, time, device and so on). Some of the data analytics challenges are as follows:
□ Content variety and quality: Information sources are no longer entirely structured. Business folks depend on a pool of information objects that mix traditionally structured information with various types of artefacts, for example transactional system databases as well as Web-based social networking channels, like Facebook, Twitter, LinkedIn, Web journals, wikis, etc., each of which must be surveyed for logical importance and incorporated within different data models.
For quality, the bits of information that can be mined from an information source like a database or a social networking Web page may have distinctive levels of relevance for various sorts of data consumers in different parts of an organisation. For example, in information gathered for announcing product launches to senior officials, a rolled-up overview of positive or negative sentiment might be adequate, while the product manager may search for insights with respect to potential product defects that can be quickly remediated.
□ Content organisation: Forming the data inputs begins with a set of meanings and semantics, but business requirements change over time, so the models need to be flexible, with the capacity to provide allowances in relation to taxonomic models, tag inputs and match them based on incidental content. However, dissimilar levels of information sparseness, density, freshness and quality affect the capability to unify the data and require increased sophistication.
□ Connectivity: Any information source may have different levels of importance inside a wide range of business settings. For instance, remarks about a bike's drivability might be more important coming from a vehicle-enthusiast blog owner, which can be checked through Twitter. That poses two difficulties: the first is linking information artefacts to various business domains, while the second includes deriving dynamic linkages, connections and relevance beyond fixed ordered models. The last challenge likewise implies striving to advance an understanding of how data sets are utilised by various people and adjusting analytical models accordingly.
□ Personalisation challenges: More important than filtering through substantial volumes of data resources taken from a variety of sources is that a wide range of channels must be set up to recognise different filters of business value relying on who the customers are. For instance, a sales delegate may be informed about a few particular contacts from their client base to help in generating leads. Similar data sources can be refined to provide sales and marketing executives with subjective information about their top clients, help to recognise potential threats from competitors and inform about techniques for continuing with expansion inside vertical markets.
□ Finding correlations in a dynamically changing business world: Pattern detection in data correlations may indicate developing trends. For example, investigating the correlation between Web searches about influenza symptoms, medicines and geographical places over a period can help in forecasting the patterns of influenza infections.

SELF ASSESSMENT QUESTIONS


7. Any information source may have different ________ of importance inside a wide range of business settings.
8. ________ detection in data correlations may indicate developing trends.

ACTIVITY

If raw sample data from a research institute lands at your department, what will be your first reaction in order to polish up the data?

ENSURING DATA QUALITY

Data is formed during the progression of a single business method and flows throughout an organisation as it goes through the multiple phases of one or more business procedures. As data moves from one place to another, it transforms and presents itself in supplementary forms which, unless governed and managed properly, can lose their veracity. Although each data type needs a separate plan and method for supervision, there is a general framework that can be used to efficiently manage all data types. The data quality management framework comprises three mechanisms: control, monitor and improve.

CONTROL

The most ideal approach to deal with the quality of information in a data framework is to guarantee that only information which meets the standard models is permitted to enter the framework. This can be accomplished by setting up solid controls at the front end of every data inflow system, or by putting validation rules in the integration layer which is in charge of moving information from one system to the other. However, this is not always plausible or financially practical - when, for instance, information is captured physically and only afterwards entered into a system, or when changes to applications are excessively costly, especially with commercial off-the-shelf (COTS) software involved. In one specific case, an organisation ruled against executing changes to one of its primary data capture COTS applications that would have enforced stricter information controls. It depended instead on training, observing and reporting on the utilisation of the framework to help enhance its business procedure, and accordingly experienced heightened data quality. In any case, organisations that have solid quality controls at the data entry points have experienced exceptionally effective data quality administration.
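A minimal sketch of such front-end controls in Python, assuming hypothetical field names and rules: only records that pass every validation rule are allowed into the system.

# Entry-point validation: reject records that break any rule.
# The fields and rules below are invented examples.
def validate(record: dict) -> list:
    """Return a list of rule violations; empty means the record passes."""
    errors = []
    if not record.get("customer_id"):
        errors.append("customer_id is mandatory")
    if record.get("amount", 0) <= 0:
        errors.append("amount must be positive")
    if "@" not in record.get("email", ""):
        errors.append("email looks malformed")
    return errors

incoming = [
    {"customer_id": "C001", "amount": 250.0, "email": "a@example.com"},
    {"customer_id": "", "amount": -5, "email": "broken"},
]

accepted = [r for r in incoming if not validate(r)]
print(f"accepted {len(accepted)} of {len(incoming)} records")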

MONITOR

It is natural to assume data to be of higher quality when there are strong data controls installed at the entry gate of the system. However, as processes are developed and enhanced, the people responsible for the data change and the systems age, and the quality controls are not necessarily maintained to keep up with the anticipated data quality levels. This creates a need for intermittent monitoring of data quality by running validation rules against the existing stored data to ensure that the data quality matches the desired levels. Additionally, information copied from one system to another compels the company to monitor the data frequently to confirm consistency across multiple systems. Data quality monitoring enables the organisation to actively discover issues before they affect the decision-making process.

At present, organisations increasingly rely on innovative data visualisation methods and analytics to deliver increased business value. However, when those efforts are hindered by issues related to data quality, the trustworthiness of their whole analytics strategy comes into question. Since, conventionally, analytics is considered a presentation of a wide range of data points, it is falsely presumed that data quality issues can be ignored because they would not influence the broad ranges. The 5Cs for ensuring data quality are shown in Figure 4.3:

Correctness: Measures the degree of data accuracy.
Completeness: Measures the degree to which all required data is present.
Currency: Measures the degree to which data is refreshed or made available at the time it is needed.
Conformity: Measures the degree to which data adheres to standards and how well it is represented in an expected format.
Consistency: Measures the degree to which data is in sync or uniform across the various systems in the enterprise.

Figure 4.3: 5Cs of Data Quality
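As a rough sketch, two of these measures can be computed directly on a table; the toy customer data and column names below are invented:

# Monitoring completeness and conformity on a toy customer table.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "email":       ["a@x.com", None, "c@x.com", None],
    "country":     ["IN", "IN", "in", "US"],
})

# Completeness: share of required values actually present.
completeness = customers["email"].notna().mean() * 100

# Conformity: share of values matching the expected format
# (country codes in upper case).
conformity = customers["country"].str.isupper().mean() * 100

print(f"email completeness: {completeness:.0f}%")
print(f"country conformity: {conformity:.0f}%")

Tracked over time, such metrics are what a monitoring process compares against the desired quality levels.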

IMPROVE

When the data quality checks report a decline in quality, a few corrective measures can be deployed. As described above, training and adjusting processes and system enhancements involve both people and technology. Usually, an improvement plan implemented right after the first instance of a quality dip comprises data cleansing, which can be completed via automation or manually by business users. If the business can define the rules to improve the data, then data purging programs can be easily created to mechanise the data enhancement method (a minimal cleansing pass is sketched below). The next step - business validation - makes sure that the data regains its required quality levels. Habitually, organisations end the data quality enhancement programme after a single round of positive validation, which is a wrong step. An important step that is missed is improving the data quality controls to ensure that the same issues do not recur, by doing a full root cause analysis (RCA) of the issues and quality controls. Applying these steps is more critical when a project involves master or reference data, such as product, client or market data. Besides, organisations implementing an integrative solution will gain from this extra effort, since it aids quality data flow throughout the enterprise in an adaptable solution.
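A minimal sketch of the kind of automated cleansing pass described above, with invented rules and column names:

# Automated data cleansing: standardise, then drop what cannot
# be repaired. Rules and columns are hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "product": [" Widget ", "widget", "GADGET", None],
    "price":   [10.0, 10.0, -3.0, 5.0],
})

# Rule 1: standardise text fields.
df["product"] = df["product"].str.strip().str.title()

# Rule 2: drop records that cannot be repaired automatically.
df = df.dropna(subset=["product"])
df = df[df["price"] > 0]

print(df)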

Besides technical challenges, there are often organisational hindrances that must be dealt with. This is evident in organisations with a huge vastness and diversity of data, which is often kept by dissimilar departments with conflicting priorities. Hence, a mixture of stakeholder management, data governance and careful planning is required, along with the right approach and solution.

SELF ASSESSMENT QUESTIONS

9. ________ the data requires a close look at the data parameters and controlling the overall aspects of the data to ensure its quality.
10. In a closed environment, data quality is achievable and can be achieved without adhering to data metrics. (True/False)


ACTIVITY

Prepare a report on popular tools used for measuring data quality.

TECHNOLOGY FOR BUSINESS ANALYTICS

In a push to make analysis more significant and tangible to the business client, solutions are concentrating on particular vertical applications and customising the outcomes and business audience interfaces. For usability, a less complex and more compelling arrangement, and ideal value, analytics are being embedded in bigger systems. Therefore, issues like information gathering, storage and processing related to analytics are increasingly viewed as critical issues in system design. In an endeavour to expand the capability of analytics in a business procedure, provisions are being developed that go beyond client-facing applications, working in the background of applications in sales, supply chain perceptiveness, advertising, value improvement and workforce analysis. For this purpose, business intelligence (BI) includes tools in the following categories:
□ AQL - Associative Query Logic
□ Business planning
□ Business process re-engineering
□ Competitive analysis
□ Data mining (DM), data farming, data warehouses and so on

Variations in technologies and trends are possibly the most noticeable BI component in the IT industry. We might think that the volume of data needed to make a specific decision has decreased over time, either due to the overall shift in management or due to assumptions associated with terms of higher significance, such as insight, knowledge and ideas.

Keeping the human factor in mind, the change between reactive and proactive decision making is defined by the level of complexity that separates basic BI from advanced analytics. Summary reports, statistics and queries, and low-latency dashboards are built on chronological information. There is a middle ground for simple analytics, e.g. algebraic or trending predictions that give estimated answers about expectations in terms of sales, production, etc. Advanced analytics are much more refined and support techniques such as statistical analysis, forecasting, prediction and correlation, whereas trend analysis simply extrapolates the existing data to project the next quarter. A refined predictive model takes seasonality, correlations between strong and weak quarters, and historical sales patterns into account.
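The contrast can be made concrete with a small sketch; the quarterly sales figures below are invented:

# Trend-only projection vs. a seasonality-aware view.
import numpy as np

sales = np.array([100, 120, 90, 130, 110, 135, 95, 150])  # 8 quarters

# Simple trend analysis: fit a straight line, project quarter 9.
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales, 1)
print("trend-only forecast:", round(slope * len(sales) + intercept, 1))

# A seasonality-aware refinement would also look at the same
# quarter in previous years (quarter 9 is a Q1: positions 0 and 4).
print("same-quarter average:", sales[[0, 4]].mean())

A refined model would blend both signals (and more), which is exactly what separates advanced predictive analytics from naive trend lines.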

Let's take a look at decision making from another point of view. Say we want to examine our brain while taking a decision. From a logical viewpoint, when our brain encounters a task it has no idea about, it attempts to create rational assumptions, guessing the input and the likely outcomes versus the actions to be taken, and attempts to find the best answer. When the brain encounters the same kind of problem again, it recalls the outcomes and methods deployed in the old task before trying to figure out the right answer to the current problem, and assesses what worked earlier and what did not. After being subjected to a certain number of similar or varying tasks, the brain becomes accustomed to cracking a specific type of task. Consequently, the time spent re-examining older solutions and finding the right solution for a new task reduces significantly.

Alongside the issues of supporting human policymaking patterns, the structural setup of a BI system should be prudently measured. A number of published studies specify that intelligence works best when planned as a joint effort involving people. This effort needs to be correctly synchronised in terms of urgencies, responsibilities and procedures and, at the same time, the intelligence setup should back up and inspire an effective flat exchange of data among contributors.

There are multiple cases where a state-of-the-art business intelligence technology failed to deliver on expectations because of the unwillingness of people to take care of the data-hungry system and accomplish the additional actions required from time to time. Still, taking the learning curve into account, the intricate capabilities and patterns of human capacity and learning still exceed machine learning in countless areas. People have never been more able to understand and use specific technologies. The next generations find technologies less intimidating and regard the techno-human connection as regular and indisputable.

Business analysis now functions as per the standards prescribed by the BABOK (Business Analysis Body of Knowledge). It is a compilation of the most commonly utilised efficient practices in business analysis across the globe. These standards keep evolving and incorporate new changes dynamically in the form of versions. It is a framework that describes the knowledge, skills and capabilities required to accomplish business analysis efficiently. Software development methodologies like Agile and SCRUM are commonly occurring standards that help in creating an iterative, informative solution for the system, composed of several layered steps for dealing with the SDLC and its associated phases. Coming to application tools, business analysts across the world utilise applications like MS Word, Excel, Visio, PowerPoint and Project, and many such tools, in order to put their best foot forward. These tools are effective and clear in presenting information closest to the depiction wanted by the analyst, and hence elevate the overall level of analytical operational standards.

SELF ASSESSMENT QUESTIONS

11. AQL stands for


a. Associative Query Logic
b. Associated Query Logic
c. Association of Query Logic
d. Associative Query in Logic

ACTIVITY

Create a presentation on data mining tools and show it in your class.

MANAGING CHANGE

There are numerous reasons why change is fraught with resistance - our characteristic need for a sense of security around existing processes and comfort zones is often tough to break, which further decreases a contemporary change's probability of accomplishment. For example, many Windows XP users, most of them being elderly bank employees in India, were intimidated on hearing of Microsoft discontinuing support for XP, since they had to learn a new OS from scratch, and that could have taken them considerable time. Instead, they found ways of doing existing work efficiently with available resources and with the help of consultants hired to drain the fear factor that had them on their toes. Change management in the field of business analytics often interrelates with and precedes/succeeds other phases, as shown in Figure 4.4:

[Figure: a cycle of change management phases - register and study corporate data; implement a system; compare planned and actual indicators; follow the budget; evaluate the efficiency of achieved targets; make exact decisions]

Figure 4.4: Change Management Phases

There should be multiple phase auditors to ensure that the roles and responsibilities of one phase assigned to a business analyst do not seep into the other phases, affecting the outcomes and disrupting the overall project execution.

As a business analyst, you often come across initiatives or projects that act as defining watershed moments and lead to massive changes within an organisation. In a few cases, you have to stay at the front lines, be it gathering requirements from cynical investors or reviewing a solution that was put in place a bit too early and is now facing strong resistance. To get your job done efficiently in such circumstances, you need to comprehend how well a change is received by the susceptible individuals and how to lead people through the change. Let us discuss a few of the topics related to change management that a business analyst should abide by.

ALL CHANGE IS PERSONAL

An organisational change always occurs at the root individual level - this is the first thing you need to learn and understand. Each employee or person will respond to the change in a dissimilar way based on their worldview, culture, understanding and the relevance of the change to their responsibilities, their current lifestyle and other factors.

CHANGE IS NOT TEAM BOUND

Independent studies have found that visible and active leadership from the managerial team is the principal contributor to a positive change. Individuals in an organisation naturally look to their senior management for direction on the importance of activities and the need for taking action. If the change leaders are not seen to be frequently involved in and supportive of the process, the change faces a high probability of failure, since the indifference of senior people will lead people at lower levels to believe that the change is not worthwhile or much required.

When business analysts are involved in transition and operational activities, they need to assess the method being used to support the individual's viewpoint and make suggestions to ensure all applicable investors can successfully bring about and implement the change.

THERE IS MORE TO CHANGE MANAGEMENT THAN ONLY COMMUNICATION AND TRAINING

When most people consider helping individuals adjust to a change, the two most commonly used methods are training and communication. Both are important tools that are expected to help individuals work through the change procedure, and help address the awareness and ability/knowledge areas. Nonetheless, they are not adequate to fully back the implementation of a change.

Managers, throughout the entire period of association with the regions affected by a change, should be adequately supported so that they themselves can get on board and play a part in the change before they are requested to help their staff. These people do not simply require training on the solution; they also need to understand what to do in order to help their staff conquer any issues they confront.

Business analysts frequently perform stakeholder analysis to track every group involved in a project. We frequently evaluate things like state of mind, impact and engagement. These qualities can be utilised as part of a larger context to evaluate how people are managing the change, and what to do if some of them are resistant to it. An official roadmap will concentrate on support by engaging key partners routinely to ensure that they stay on board and involved with the change.

CHANGE MANAGERS AS BUSINESS ANALYSTS

Change management is a different field from business analysis; however,
the two are highly complementary. While some organisations now have
dedicated change management resources, business analysts will regularly
be involved in the planning and execution of change management, given
their frontline engagement with stakeholders throughout the project. If
there are no dedicated change management resources or pre-defined
change management responsibilities, the business analyst has a chance
to help the project successfully meet its goals by knowing the basics
of change management and applying them in their activities. When your
organisation is implementing a new BI initiative, the chances of
success are greatly improved when change management is an integral
piece of the initiative.

Change management encourages communication from the start of the
initiative, moving resistant users towards acceptance and even
enthusiasm, which significantly improves the effective adoption of the
new functionality. Also, it does not end when the technology goes live;
change management activities continue to help with adoption and user
capability until the technology is completely incorporated into the
business.
Another use of change management is to ensure that partners and
stakeholders, such as BI teams and business lines, are cooperating to
guarantee that the correct data is captured for future business
requirements. Change management encourages the understanding of the
business needs of the organisation by uniting leaders from various
offices and departments, which enables BI and Development to
concentrate on upcoming trends and anticipate the restatement of
business data needs.

Project management and change management are distinct and complementary
activities that use different skill sets. Project management drives the
technical side of a project, concentrating on guaranteeing that the
solution is appropriately designed and works as required. Change
management is centred on the people side, preparing users for the
change and working to ensure that the new procedures are adaptable and
usable. According to a study carried out by an independent research
group, emerging best practices call for change management to be
integrated with project management. They satisfy many diverse
functions, collaborating as a single entity for successful
implementation.
Developing change managers and leaders across the organisation can be
extraordinarily helpful in improving change management endeavours.
These could be leaders from many functional areas and various
departments - leaders who have managerial and operational aptitude, are
knowledgeable about organisational processes, and know how to chart the
course to the effective and enthusiastic adoption of new procedures and
practices. Having business analysts as change leaders enables improved
BI implementation in the following ways:
□ Ensuring that business ranks have a reliable group or person who
shares data within the business domain and throughout the organisation,
and adopts the understanding of the development and BI teams.
□ Increasing BI and Development's understanding of business
requirements across the organisation, resulting in better data finding
and capturing the right data for the corporation.

SELF ASSESSMENT QUESTIONS

12. __________ management encourages communication from the start of
the activity.

ACTIVITY
Your existing medical project requires some sudden changes due to a
large influx of disorganised sample data. Not only that, it also
requires a change in the system dynamics used so far to manage the
existing volumes of data. How will you proceed to ensure that effective
change management is carried out without affecting operations?

SUMMARY
□ Data, to put it simply, is the raw material that does not make any
definite sense unless you process it to some meaningful end.
□ Information is the result we achieve after the raw data is processed.
□ Standalone data has no meaning; it only assumes meaning and
transitions into information upon being interpreted.
□ Knowledge is something that is inferred from data and information.
□ A business analyst is anyone who has the key domain experience and
knowledge related to the paradigms being followed.
□ Business analysts need not necessarily be from an IT background,
although a basic understanding of IT systems and how they work
certainly helps.
□ When data quality checks report a decline in quality, a few
corrective measures can be deployed.
□ Change management is a different field from business analysis;
however, the two are highly complementary.

KEYWORDS
□ Business analyst: Anyone who has the key domain experience and
knowledge related to the paradigms being followed.
□ Explicit knowledge: A type of knowledge that can be simply
transferred to others.
□ Information: The result that we achieve after the raw data is
processed.
□ Stakeholder management: The process of dealing with stakeholders and
understanding how much power and impact they have on your project.
□ Tacit knowledge: A type of knowledge that is complex and intricate,
cannot simply be passed on to others, and requires elevated and
advanced skills in order to be comprehended.
DESCRIPTIVE QUESTIONS


1. Discuss the relation between data, information and knowledge.
2. Explain the role and responsibilities of a business analyst.
3. Enlist and describe the skills required to be a good business
analyst.
4. Discuss the ways of ensuring data quality.

ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                          Q. No.  Answer
What is Data, Information and Knowledge?       1.      Data
                                               2.      True
Business Analytics Personnel and their Roles   3.      Failure
                                               4.      Segregate
Required Competencies for an Analyst           5.      True
                                               6.      False
Business Analytics Data                        7.      Levels
                                               8.      Pattern
Ensuring Data Quality                          9.      Monitoring, improvement
                                               10.     False
Technology for Business Analytics              11.     a. Associative Query Logic
Managing Change                                12.     Change

HINTS FOR DESCRIPTIVE QUESTIONS

1. Data, to put it simply, is the raw material that does not make any
definite sense unless you process it to some meaningful end. Refer to
Section 4.2 What is Data, Information and Knowledge?
2. A business analyst is anyone who has the key domain experience and
knowledge related to the paradigms being followed. Refer to Section 4.3
Business Analytics Personnel and their Roles.
3. Business analysts need to be great at verbal and written
communication, diplomatic, experts with problem-solving acumen, and
theorists with the ability to engage with stakeholders to comprehend
and respond to their needs in a dynamic business environment. Refer to
Section 4.4 Required Competencies for an Analyst.
4. As data moves from one place to another, it converts and presents
itself in supplementary forms, which, unless governed and managed
properly, can lose veracity. Refer to Section 4.6 Ensuring Data
Quality.

SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
□ Laursen, G. H., & Thorlund, J. (2017). Business analytics for
managers: Taking business intelligence beyond reporting. Hoboken, NJ:
Wiley.
□ Isson, J. P. (2013). Win with advanced business analytics: Creating
business value from your data. Hoboken, NJ: John Wiley & Sons.

E-REFERENCES
□ Risk, S. (n.d.). Business Analytics less Data Quality equals Bad
Decisions. Retrieved April 26, 2017, from
https://www.bluegranite.com/blog/business-analytics-less-data-quality-equals-bad-decisions
□ Data Quality for Business Analytics by David Loshin - BeyeNETWORK.
(n.d.). Retrieved April 26, 2017, from
http://www.b-eye-network.com/view/15539
CONTENTS

5.1 Introduction
5.2 Visualising and Exploring Data
    5.2.1 Dashboards
    5.2.2 Column and Bar Charts
    5.2.3 Data Labels and Data Tables Chart Options
    5.2.4 Line Charts
    5.2.5 Pie Charts
    5.2.6 Scatter Chart
    5.2.7 Bubble Charts
    5.2.8 Miscellaneous Excel Charts
    5.2.9 Pareto Analysis
    Self-Assessment Questions
    Activity
5.3 Descriptive Statistics
    5.3.1 Central Tendency (Mean, Median and Mode)
    5.3.2 Variability
    5.3.3 Standard Deviation
    Self-Assessment Questions
    Activity
5.4 Sampling and Estimation
    5.4.1 Sampling Methods
    5.4.2 Estimation Methods
    Self-Assessment Questions
    Activity
5.5 Introduction to Probability Distributions
    Self-Assessment Questions
    Activity
5.6 Summary
5.7 Descriptive Questions
5.8 Answers and Hints
5.9 Suggested Readings & References

INTRODUCTORY CASELET

CAB SERVICE COMPANY USING DESCRIPTIVE ANALYTICS FOR BETTER CUSTOMER
SATISFACTION

To reap the maximum benefits of social media marketing, a newly
launched cab service company deploys the analytical expertise of a
consultancy firm. The firm has recommended an extended social media
campaign followed by a series of introductory offers and joining gifts
in the form of free travel and exclusive cashback offers for the first
few customers. The firm has offered to help with the social media
operations along with reputation management, in case some disgruntled
customers throng the social forums to voice their opinions, or other
cab companies try to bog it down by targeting a malicious false-review
campaign against the company.

The cab company is on a strict marketing and advertising budget and
needs the analytics to stay true to its potential. A misfired campaign
may result in a detrimental image as well as revenue loss for the
company. The statistics and analysis of the consultancy firm need to be
spot on in order to create a niche in a market where there are already
several players. They need to make sure that the customers are taken
into confidence alongside the existing players and retained for a long
time. The consultancy will study the current market and statistics
around the area where the company is planning to deploy its cabs. Based
on the data gathered, the consultancy will go into technical detailing
such as occurrences of low-travel days, weather-dependent phases, and
predicting traffic, movements and random happenings, along with
workarounds to deal with them.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:
□ Explain visualising and exploring data
□ Describe descriptive statistics
□ Define sampling and estimation
□ Elucidate probability distributions

INTRODUCTION

Descriptive analytics is the most essential type of analytics and
establishes the framework for more advanced types of analytics. This
sort of analysis answers "What has occurred in the corporation?" and
"What is going on now?" Let us consider the case of Facebook. Facebook
users produce content through comments, posts and picture uploads.
This information is unstructured and is produced at an extensive rate.
Facebook stats reveal that 2.4 million posts, equivalent to around 500
TB of information, are produced every minute. These jaw-dropping
figures have fuelled the popularity of another term which we know as
Big Data.

Comprehending the information in its raw configuration is troublesome.
This information must be abridged, categorised and displayed in an
easy-to-understand way to let managers comprehend it. Business
intelligence and data mining instruments/methods have been the accepted
means of doing so for bigger organisations. Practically every
organisation does some type of summary and MIS reporting using a
database or simply spreadsheets.

There are three crucial approaches to abridging and describing raw
data:
□ Dashboards and MIS reporting: This technique gives condensed data
answering "What has happened?", "What's been going on?" and "How does
it stand against the plan?"
□ Ad hoc reporting: This technique supplements the previous strategy by
helping the administration extract information as required.
□ Drill-down reporting: This is the most complex piece of descriptive
analysis and gives the capacity to delve further into any report to
comprehend the information better.

This chapter first discusses the processes of visualising and exploring
data. Next, the chapter discusses descriptive statistics. Further, the
chapter discusses sampling and estimation. Towards the end, the chapter
discusses probability distributions.


VISUALISING AND EXPLORING DATA

Data visualisation is the method of depicting data (typically in larger
quantities) in graphical or visual form. Researchers have observed that
data visualisation improves decision-making, provides managers with
better analytic capabilities that reduce the dependence on IT
professionals, and improves collaboration and information sharing.

Raw data is important, particularly when one needs to identify accurate
values or compare individual numbers. However, it is quite difficult to
identify trends and patterns, find exceptions, or compare groups of
data in tabular form. The human brain does a surprisingly good job of
processing visual information - if it is presented in an effective way.

Data visualisation provides a way of collaborating on data at all
business levels and can disclose surprising relationships and patterns.

Data visualisation is also important both for building decision models
and for interpreting their results. To identify the appropriate model
to use, we would normally have to collect and analyse data to determine
the type of relationship (linear or non-linear, for example) and
estimate the values of the parameters in the model. Visualising the
data will help to identify the proper relationship and use the
appropriate data analysis tool. Furthermore, complex analytical models
often yield complex results. Visualising the results helps in
understanding and gaining insight about model output and solutions.

5.2.1 DASHBOARDS

Making data visible and accessible to employees at all levels is a
hallmark of effective modern organisations. A dashboard is a visual
picture of a group of specific business measures. It is similar to the
dashboard of an automobile, such as a car, which displays fuel level,
speed, seat-belt indicators, temperature, and so on. Dashboards deliver
important key synopses of valuable business data to efficiently manage
a business function or process. Dashboards might include tabular as
well as visual data to allow managers to quickly locate the key data.

5.2.2 COLUMN AND BAR CHARTS

MS Excel refers to vertical bar charts as column charts and horizontal
bar charts as bar charts. Column and bar charts are valuable for
comparing categorical or series-specific data, for demonstrating
differences between value sets, and for displaying percentages or
proportions of a whole. Figure 5.1 shows column and bar charts:

Figure 5.1: Column and Bar Chart (yearly values for 2012-2014)

Source: https://www.aploris.com/support/documentation/bar-and-line-charts
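Although this chapter works mainly in MS Excel, the same charts can be
produced programmatically. The following is a minimal, illustrative
Python sketch using the matplotlib library; the yearly values are
invented for demonstration, and the same library's plot, pie and
scatter functions produce the line, pie and scatter charts discussed in
the subsections that follow.

    # A minimal matplotlib sketch of a column (vertical bar) chart and
    # a bar (horizontal bar) chart; the yearly values are illustrative.
    import matplotlib.pyplot as plt

    years = ["2012", "2013", "2014"]
    values = [4.5, 5.8, 6.3]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(years, values)    # column chart: categories on the x-axis
    ax1.set_title("Column chart")
    ax2.barh(years, values)   # bar chart: categories on the y-axis
    ax2.set_title("Bar chart")
    plt.tight_layout()
    plt.show()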

5.2.3 DATA LABELS AND DATA TABLES CHART OPTIONS

MS Excel provides options for including the numerical data on which
charts are based within the charts. Data labels can be added to chart
elements to show the actual value of bars. Data tables can also be
added; these are usually better than data labels, which can get quite
messy. Both can be added from the Add Chart Element button in the Chart
Tools Design tab, or from the Quick Layout button, which provides
standard design options. Figure 5.2 shows a data labels and data tables
chart:

Figure 5.2: Data Labels and Data Tables Chart (monthly people count and
% labour cost)

Source: http://datapigtechnologies.com/blog/index.php/the-trouble-with-chart-data-tables/

5.2.4 LINE CHARTS

Line charts are a useful way of displaying data over a given period.
You may enter multiple series of data in line charts; however, they can
become difficult to interpret if the sizes of the data values differ
exponentially. In such a case, it would be advisable to create
individual charts for the different data series. Figure 5.3 shows line
charts:

Figure 5.3: Line Charts (product line global revenue by month)

Source: http://www.advsofteng.com/gallery_line.html

5.2.5 PIE CHARTS

For many types of data, we are interested in understanding the relative
proportion of each data source to the total. A pie chart shows this by
dividing a circle into pie-shaped areas displaying the relative parts.
New-age 3D pie charts can get confusing at times because of their
narrow representation when there are many data values; the third
dimension also appears to represent something, especially on a
coordinate graph. Hence, pie charts are preferred in two-dimensional
form for effective and simpler data representation. Figure 5.4 displays
a pie chart:

Figure 5.4: Pie Charts (website visits by source)

Source: http://www.f1f9.com
5.2.6 SCATTER CHART

Scatter charts demonstrate the connection between two variables. To
create a scatter chart, we require paired observations of the two
variables. For example, students in a class might have grades for both
a midterm and a final exam. Figure 5.5 shows a scatter chart:

Figure 5.5: Scatter Chart (annotated with plot area, scatter markers,
legend, scales, guides, tick marks and scale labels)

Source: https://www.zingchart.com/docs/chart-types/scatter-plots/

5.2.7 BUBBLE CHARTS

A bubble chart is related to a scatter chart, but the data marker size
corresponds to a third variable; thus, it is a method to display three
variables in 2D space. Figure 5.6 shows a bubble chart:

Figure 5.6: Displaying Bubble Charts (top 10 films by worldwide
grosses, 1996-2009)

Source: https://community.devexpress.com/blogs/ctodx/archive/2008/10/28/dxperience-v2008-vol-3-bubble-charts-for-winforms-and-asp-net.aspx

5.2.8 MISCELLANEOUS EXCEL CHARTS

Excel provides several additional charts for special applications.
These additional types of charts (including bubble charts) can be
selected and created from the Other Charts button in the Excel ribbon.
These include the following:
□ A stock chart allows you to plot stock prices, such as the daily
high, low, and close. It may also be used for scientific data such as
temperature changes.
□ A surface chart shows 3-D data.
□ A doughnut chart is similar to a pie chart but can contain more than
one data series.
□ A radar chart allows you to plot multiple dimensions of several data
series.

5.2.9 PARETO ANALYSIS

Pareto analysis is named after Vilfredo Pareto, an Italian economist.
In 1906, he realised that a large portion of the total wealth in Italy
was held by a comparatively small number of people. The Pareto
principle is often seen in many business situations. For example, a
high percentage of sales usually comes from a small percentage of
customers, a high percentage of defects originates from relatively few
batches of the product, or a high percentage of stock value belongs to
a small percentage of items. As a result, the Pareto principle is also
often called the "80-20 rule", referring to such generic situations. A
short sketch of the calculation follows.

SELF ASSESSMENT QUESTIONS

1. Data __________ gives a way of data collaboration at all business
levels and can disclose surprising relationships and patterns.
2. Dashboards might include __________ as well as __________ data to
allow managers to quickly locate key data.
3. Bubble chart is a method to display three variables in 2D space.
(True/False)

ACTIVITY
Prepare a report on data visualisation tools available on the Web
other than the tools discussed in the chapter.

DESCRIPTIVE STATISTICS

Statistics, as defined by David Hand, past president of the Royal
Statistical Society in the UK, is both the science of uncertainty and
the technology of extracting information from data. Statistics involves
collecting, organising, analysing, interpreting and presenting data.
You are familiar with the concept of statistics in daily life as
reported in newspapers and the media: for example, baseball batting
averages, airline on-time arrival performance, and economic statistics
such as the Consumer Price Index.

Statistical methods are essential to business analytics and are used
throughout this book. Microsoft Excel supports statistical analysis in
two ways:
1. With statistical functions that are entered in worksheet cells
directly or embedded in formulas.
2. With the Excel Analysis Toolpak add-in to perform more complex
statistical computations. Note that Excel for the Mac does not support
the Analysis Toolpak.

A population consists of all items of interest for a particular
decision or investigation - for example, all individuals in the United
States who do not own cell phones, all subscribers to Netflix, or all
stockholders of Google. A company like Netflix keeps extensive records
on its customers, making it easy to retrieve data about the entire
population of customers. However, it would probably be impossible to
identify all individuals who do not own cell phones.

A sample is a subset of a population. For example, a list of
individuals who rented a comedy from Netflix in the past year would be
a sample from the population of all customers. Whether this sample is
representative of the population of customers - which depends on how
the sample data is intended to be used - may be debatable;
nevertheless, it is a sample. Most populations, even finite ones, are
usually too large to deal with practically or effectively. For example,
it would be unreasonable as well as costly to survey the population of
TV viewers in the United States. Sampling is also necessary when data
must be obtained from destructive testing or from a continuous
production process. Thus, the process of sampling aims to obtain enough
information to draw a valid inference about a population. Market
researchers, for example, use sampling to gauge consumer perceptions of
new or existing goods and services; auditors use sampling to verify the
accuracy of financial statements; and quality control analysts sample
production output to verify quality levels and identify opportunities
for improvement.

UNDERSTANDING STATISTICAL NOTATION

We typically label the elements of a dataset using subscripted
variables, x1, x2, ... and so on. In general, xi represents the ith
observation. In statistics, it is common to use Greek letters, such as
σ (sigma), μ (mu) and π (pi), to represent population measures, and
italic letters such as x̄ (x-bar), s and p for sample statistics. We
will use N to represent the number of items in a population and n to
represent the number of observations in a sample. Statistical formulas
often contain a summation operator, Σ (Greek capital sigma), which
means that the terms that follow it are added together. Thus,

Σ xi = x1 + x2 + ... + xn (summing over i = 1 to n)

Understanding these conventions and mathematical notations will help
you interpret and apply statistical formulas.

5.3.1 CENTRAL TENDENCY (MEAN, MEDIAN AND MODE)

Central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data.
Measures of central tendency are also called measures of central
location. Some common measures of central tendency, each described
below (with a short code sketch at the end of this subsection), are as
follows:
□ Mean
□ Median
□ Mode
□ Midrange

MEAN

The mathematical average is called the mean (or the arithmetic mean),
which is the sum of the observations divided by the total number of
observations. The mean of a population is denoted by μ, and the sample
mean by x̄. If the population contains N observations x1, x2, ..., xN,
the population mean is calculated as:

μ = (x1 + x2 + ... + xN) / N = Σ xi / N

The mean of a sample of n observations x1, x2, ..., xn is calculated
as:

x̄ = (x1 + x2 + ... + xn) / n = Σ xi / n

Note that the calculations for the mean are the same whether we are
dealing with a population or a sample; only the notation differs. We
may also calculate the mean in Excel using the function AVERAGE
(data range).
One property of the mean is that the sum of the deviations of each
observation from the mean is zero:

Σ (xi − x̄) = 0

This simply means that the sum of the deviations above the mean is the
same as the sum of the deviations below the mean. Thus, the mean
"balances" the values on either side of it. However, it does not
suggest that half the data lie above or below the mean.
MEDIAN

The measure of location that specifies the middle value when the data
are arranged from least to greatest is the median. If the number of
observations is odd, the median is the exact middle of the sorted
numbers; for example, with 7 observations it is the 4th. If the number
of observations is even, say 8, the median is the mean of the two
middle numbers, i.e. the mean of the 4th and 5th observations. We can
use the Sort option of MS Excel to order the data and then find the
median, or use the Excel function MEDIAN (data range). The median is
meaningful for ratio, interval and ordinal data. As opposed to the
mean, the median is not affected by outliers.

MODE

A third measure of location is the mode: the observation that occurs
the greatest number of times. The mode is most valuable for datasets
containing a relatively small number of unique values. You can easily
identify the mode from a frequency distribution by identifying the
value having the largest frequency, or from a histogram by identifying
the highest bar. You may also use the Excel function MODE.SNGL
(data range). For frequency distributions or grouped data, the modal
group is the group with the greatest frequency.

MIDRANGE

A fourth measure of location that is used occasionally is the midrange.
This is simply the average of the greatest and least values in the data
set.
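As a quick check on these definitions, the following Python sketch
computes all four measures of location for a small, invented data set
using the standard statistics module (Python's counterparts to Excel's
AVERAGE, MEDIAN and MODE.SNGL functions).

    # Computing the four measures of location for an illustrative set.
    import statistics

    data = [4, 8, 15, 16, 23, 42, 8]

    print(statistics.mean(data))        # arithmetic mean
    print(statistics.median(data))      # middle of the sorted values
    print(statistics.mode(data))        # most frequent value: 8
    print((min(data) + max(data)) / 2)  # midrange: average of extremes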

5.3.2 VARIABILITY

A commonly used measure of dispersion is the variance. Basically, the
variance is the average of the squared deviations of the observations
from the mean. The bigger the variance, the more the observations are
spread out from the mean, which indicates more variability in the
observations. The formula used for calculating the variance differs for
populations and samples.

The formula for the variance of a population is:

σ² = Σ (xi − μ)² / N

where xi is the value of the ith item, N is the number of items in the
population, and μ is the population mean.

The variance of a sample is calculated by using the formula:

s² = Σ (xi − x̄)² / (n − 1)

where n is the number of items in the sample and x̄ is the sample mean.

Example: A population has four observations: {1, 3, 5, 7}. Find the
variance.

(A) 2 (B) 4 (C) 5 (D) 6 (E) None

Solution: The answer is (C). First, we compute the population mean:

μ = (1 + 3 + 5 + 7) / 4 = 4

Then we insert all known values into the formula for the population
variance, as shown below:

σ² = Σ (xi − μ)² / N
σ² = [(1 − 4)² + (3 − 4)² + (5 − 4)² + (7 − 4)²] / 4
σ² = [(−3)² + (−1)² + (1)² + (3)²] / 4
σ² = [9 + 1 + 1 + 9] / 4 = 20 / 4 = 5

5.3.3 STANDARD DEVIATION

The square root of the variance is the standard deviation. For a
population, the standard deviation is computed as:

σ = √[Σ (xi − μ)² / N]

and for samples, it is:

s = √[Σ (xi − x̄)² / (n − 1)]

The standard deviation is usually easier to understand than the
variance because its units of measure are the same as the units of the
data. Thus, it can be more easily related to the mean or other
statistics measured in the same units.

The standard deviation is a popular measure of risk, particularly in
financial analysis, because many people associate risk with volatility
in stock prices. The standard deviation measures the tendency of a
fund's monthly returns to vary from their long-term average (as Fortune
stated in one of its issues, "... standard deviation tells you what to
expect in the way of dips and rolls. It tells you how scared you'll
be."). For example, a mutual fund's return might have averaged 11% with
a standard deviation of 10%. Thus, about two-thirds of the time the
annualised monthly return was between 1% and 21%. By contrast, another
fund's average return might be 14% but have a standard deviation of
20%. Its returns would have fallen in a range of −6% to 34% and it is,
therefore, riskier.

Example: A random sample consists of four observations: {1, 3, 5, 7}.
Based on these sample observations, what is the best estimate of the
standard deviation of the population?

(A) 2 (B) 2.58 (C) 6 (D) 6.67 (E) None

Solution: The answer is (B). First, compute the sample mean:

x̄ = (1 + 3 + 5 + 7) / 4 = 4

Then, we insert all the known values into the formula for calculating
the standard deviation of a sample, as shown below:

s = √[Σ (xi − x̄)² / (n − 1)]
s = √{[(1 − 4)² + (3 − 4)² + (5 − 4)² + (7 − 4)²] / (4 − 1)}
s = √{[(−3)² + (−1)² + (1)² + (3)²] / 3}
s = √{[9 + 1 + 1 + 9] / 3} = √(20 / 3) = √6.67 = 2.58
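Both worked examples can be verified in a few lines of Python; the
standard statistics module distinguishes the population formulas
(divide by N) from the sample formulas (divide by n − 1).

    # Verifying the variance and standard deviation examples above.
    import statistics

    data = [1, 3, 5, 7]

    print(statistics.pvariance(data))  # population variance: 20/4 = 5.0
    print(statistics.variance(data))   # sample variance: 20/3 ≈ 6.67
    print(statistics.stdev(data))      # sample standard deviation ≈ 2.58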
STANDARDISED VALUES

A z-score, or standardised value, provides a measure of the distance of
an observation from the mean, irrespective of the units of measurement.
In a data set, the z-score for the ith observation is calculated as
follows:

zi = (xi − x̄) / s

We subtract the sample mean from the ith observation, xi, and divide
the result by the sample standard deviation. The numerator denotes the
distance that xi is away from the sample mean; a negative value
indicates that xi lies to the left of the mean, and a positive value
means it lies to the right. By dividing by the standard deviation, s,
we scale the distance from the mean to express it in units of standard
deviations.

Thus, a z-score of 1.0 means that the observation is one standard
deviation to the right of the mean; a z-score of −1.5 means that the
observation is 1.5 standard deviations to the left of the mean. Thus,
even though two data sets may have different means and standard
deviations, the same z-score means that the observations have the same
relative distance from their respective means.

Z-scores can be computed easily on a spreadsheet; Excel also has a
function that calculates them directly, STANDARDIZE (x, mean,
standard_dev).
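The following short sketch standardises an invented data set in Python,
mirroring what Excel's STANDARDIZE function does for a single
observation.

    # Computing z-scores for every observation in an illustrative set.
    import statistics

    data = [10, 12, 9, 14, 11, 16]
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)

    z_scores = [(x - x_bar) / s for x in data]
    print([round(z, 2) for z in z_scores])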

COEFFICIENT OF VARIATION

The coefficient of variation (CV) provides a relative measure of the
dispersion in data relative to the mean and is defined as:

CV = Standard Deviation / Mean

Often, the coefficient of variation is multiplied by 100 to be
expressed as a percentage. This statistic is useful when comparing the
variability of two or more data sets whose scales differ.

The coefficient of variation offers a relative risk-to-return measure.
The smaller the coefficient of variation, the smaller the relative risk
is for the return provided. The reciprocal of the coefficient of
variation, called return to risk, is often used because it is easier to
interpret. That is, if the objective is to maximise return, a higher
return-to-risk ratio is often considered better. A related measure in
finance is the Sharpe ratio, which is the ratio of a fund's excess
returns (annualised total returns minus Treasury bill returns) to its
standard deviation. If several investment opportunities have the same
mean but different variances, a rational (risk-averse) investor will
select the one that has the smallest variance. This approach to
formalising risk is the basis for modern portfolio theory, which seeks
to construct minimum-variance portfolios.

SELF ASSESSMENT QUESTIONS

4. __________ involves collecting, organising, analysing, interpreting
and presenting data.
5. Sampling is also clearly necessary when data must be obtained from
destructive testing or from a continuous production process.
(True/False)
6. The __________ is meaningful for ratio, interval and ordinal data.

ACTIVITY

Prepare a report on the relationship between statistical analytical
concepts and their usage in analytical sciences in the simplest manner
possible.

SAMPLING AND ESTIMATION

The first step in sampling is to design an effective sampling plan that
will produce representative samples of the population under scrutiny. A
sampling plan is a description of the approach that is used to obtain
samples from a population prior to any data collection activity.

A sampling plan states the:
□ Objectives of the sampling activity
□ Target population
□ Population frame (the list from which the sample is selected)
□ Method of sampling
□ Operational procedures for collecting the data
□ Statistical tools that will be used to analyse the data

Example: A sampling plan for a market research study

Suppose that a company in America wants to understand how golfers might
respond to a membership programme that provides discounts at golf
courses in the golfers' locality as well as across the country. The
objective of a sampling study might be to estimate the proportion of
golfers who would likely subscribe to this programme. The target
population might be all golfers over 25 years old. However, identifying
all golfers in America might be impossible. A practical population
frame might be a list of golfers who have purchased equipment from
national golf or sporting goods companies through which the discount
card will be sold. The operational procedure for collecting the data
might be an e-mail link to a survey site or a direct-mail
questionnaire. The data might be stored in an Excel database;
statistical tools such as PivotTables and simple descriptive statistics
would be used to segment the respondents into different demographic
groups and estimate their likelihood of responding positively.

5.4.1 SAMPLING METHODS

Many types of sampling methods exist. Sampling methods can be
subjective or probabilistic. Subjective methods include judgment
sampling, in which expert judgment is used to select the sample, and
convenience sampling, in which samples that are easier to collect are
selected (e.g., surveying all customers who visited this month).
Probabilistic sampling involves selecting items using a random
procedure and is necessary for drawing valid statistical conclusions.

The most common probabilistic sampling approach is simple random
sampling. Simple random sampling requires choosing items from a
population such that every subset of a given sample size has an equal
opportunity of being selected. Simple random samples can be easily
obtained if the population data is kept in a database. A minimal code
sketch of several of these methods appears after the list below. Other
methods of sampling include the following:
□ Systematic (periodic) sampling: Systematic (or periodic) sampling is
a sampling plan that selects every nth item from the population. For
example, to sample 200 names from a list of 400,000, the first name can
be randomly selected from the first 2,000, and then every 2,000th name
selected thereafter. This approach can be used for telephone sampling
supported by an automated dialler that dials numbers in an orderly
manner. However, systematic sampling is not equivalent to simple random
sampling, because not every possible sample of a given size has an
equal chance of being selected. In some situations, this method can
introduce significant bias if the population has an underlying pattern.
For example, sampling the orders received on each Sunday may not
produce a representative sample if consumers tend to order more or less
on other days.
□ Stratified sampling: It applies to populations divided into natural
subsets (strata). For example, a large city may be divided into
political districts called wards. Each ward has a different number of
citizens. A stratified sample would choose a sample of individuals in
each ward proportionate to its size. This approach ensures that each
stratum is weighted by its size relative to the population and can
provide better results than simple random sampling if the items in each
stratum are relatively homogeneous. However, issues of cost or the
significance of certain strata might make a disproportionate sample
more useful. For example, the ethnic or racial mix of each ward might
be significantly different, making it difficult for a stratified sample
to obtain the desired information.
□ Cluster sampling: It refers to dividing a population into clusters
(subgroups), sampling a set of clusters, and conducting a complete
survey within the sampled clusters. For instance, a company might
segment its customers into small geographical regions. A cluster sample
would consist of a random sample of the geographical regions, and all
customers within these regions would be surveyed (which might be easier
because regional lists might be easier to produce and mail).
□ Sampling from a continuous process: Selecting a sample from a
continuous manufacturing process can be accomplished in two main ways.
First, select a time at random, then select the next n items produced
after that time. Second, randomly select n times and select the next
item produced after each of these times. The first approach generally
ensures that the observations will come from a homogeneous population;
however, the second approach might include items from different
populations if the characteristics of the process change over time, so
caution should be used.
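The following is a minimal Python sketch of three of these plans using
only the standard library; the customer list and ward sizes are
invented for illustration.

    # Simple random, systematic and stratified sampling sketches.
    import random

    population = [f"customer_{i}" for i in range(1, 401)]

    # Simple random sampling: every subset of size 20 is equally likely.
    simple = random.sample(population, 20)

    # Systematic sampling: a random start, then every k-th item.
    k = len(population) // 20
    start = random.randrange(k)
    systematic = population[start::k]

    # Stratified sampling: sample each stratum in proportion to its size.
    strata = {"ward_1": population[:100], "ward_2": population[100:]}
    stratified = [person for members in strata.values()
                  for person in random.sample(members, len(members) // 20)]

    print(len(simple), len(systematic), len(stratified))  # 20 20 20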

5.4.2 ESTIMATION METHODS

Sample data provide the basis for many useful analyses to support
decision making. Estimation involves assessing the value of an unknown
population parameter - such as a population proportion, population
mean, or population variance - using sample data. Estimators are
measures used to approximate population parameters; e.g., we use the
sample mean x̄ to estimate a population mean μ. The sample variance s²
estimates a population variance σ², and the sample proportion p
estimates a population proportion π. A point estimate is a single
number, derived from sample data, that is used to estimate the value of
a population parameter.

UNBIASED ESTIMATORS

It seems quite intuitive that the sample mean should provide a good
point estimate for the population mean. However, it may not be clear
why the formula for the sample variance we saw previously has a
denominator of n − 1, particularly because it differs from the formula
for the population variance. In these formulas, the population variance
is computed by:

σ² = Σ (xi − μ)² / N

whereas the sample variance is computed by the formula:

s² = Σ (xi − x̄)² / (n − 1)

Why so? Statisticians develop many types of estimators, and from a
theoretical as well as a practical perspective, it is important that
they truly estimate the population parameters they are expected to
estimate. Suppose we perform an experiment in which we repeatedly
sample from a population and calculate a point estimate for a
population parameter. Each individual point estimate will vary from the
population parameter; however, we would hope that the long-term average
(expected value) of all the possible point estimates equals the
population parameter. If the expected value of an estimator equals the
population parameter it is intended to estimate, the estimator is said
to be unbiased; otherwise, the estimator is called biased and will
yield incorrect results.

Fortunately, all the estimators we have discussed are unbiased and are
therefore meaningful for making decisions involving the population
parameter. Statisticians have shown that the denominator n − 1 used in
computing s² is necessary to provide an unbiased estimator of σ². If we
simply divided by the number of observations, the estimator would tend
to underestimate the true variance.

ERRORS IN POINT ESTIMATION

One of the drawbacks of using point estimates is that they do not
provide any indication of the magnitude of the potential error in the
estimate. A newspaper reported that college professors were the
best-paid workers in the area, with an average pay of $150,004.
However, it was found that the average pay at two local universities
was less than $70,000. How did this happen? It was revealed that the
sample size taken was very small and included a large number of highly
paid medical school faculty; as a result, there was a significant error
in the point estimate that was used.

When we sample, the estimators we use - such as a sample mean, sample
proportion or sample variance - are actually random variables that are
characterised by some distribution. By knowing what this distribution
is, we can use probability theory to quantify the uncertainty
associated with the estimator. To understand this, we first need to
discuss sampling error and sampling distributions.

Different samples from the same population have different
characteristics - for example, variations in the mean, standard
deviation, frequency distribution and so on. Sampling error occurs
because samples are only a subset of the total population. Sampling
error can be lessened but not completely avoided. Another kind of
error, non-sampling error, happens when the sample does not represent
the target population effectively. This is generally a result of poor
sample design, such as using a convenience sample when a simple random
sample would have been more appropriate, or choosing the wrong
population frame. To draw good conclusions from samples, analysts need
to eliminate non-sampling error and understand the nature of sampling
error.
error.

Sampling error depends on the size of the sample relative to the


pop ulation. Thus, determination of sample size to be taken is
basically a statistical issue based on the precision of the estimates
required to infer a valuable assumption. Also, from a rational point
of view, one should also deliberate the sampling price and create a
trade-off be tween cost and information obtained.

UNDERSTANDING SAMPLING ERROR

Suppose that we estimate the mean of a population using the sample
mean. How can we determine how accurate we are? In other words, can we
make an informed statement about how far the sample mean might be from
the true population mean? We can gain some insight into this question
by performing a sampling experiment.

SAMPLING DISTRIBUTIONS

We can quantify the sampling error in estimating the mean for any
unknown population. To do this, we need to characterise the sampling
distribution of the mean.

SAMPLING DISTRIBUTION OF THE MEAN

The means of all possible samples of a fixed size n from some
population form a distribution that we call the sampling distribution
of the mean. Histograms based on, say, 25 samples are approximations to
this sampling distribution. Statisticians have shown two key results
about the sampling distribution of the mean. First, the standard
deviation of the sampling distribution (called the standard error of
the mean) is computed as:

Standard Error of the Mean = σ / √n

where σ is the standard deviation of the population from which the
individual observations are drawn and n is the sample size. From this
formula, we see that as n increases, the standard error decreases, just
as a sampling experiment would demonstrate. This suggests that the
estimates of the mean that we obtain from larger sample sizes provide
greater accuracy in estimating the true population mean. In other
words, larger sample sizes have less sampling error.
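The sampling experiment suggested above is easy to simulate. The sketch
below draws repeated samples of increasing size n from a synthetic
population with σ = 20 and shows that the spread of the sample means
shrinks roughly as σ / √n.

    # Simulating the sampling distribution of the mean.
    import random
    import statistics

    random.seed(1)
    population = [random.gauss(100, 20) for _ in range(100_000)]

    for n in (10, 100, 1000):
        means = [statistics.mean(random.sample(population, n))
                 for _ in range(200)]
        # The spread of the sample means approximates sigma / sqrt(n).
        print(n, round(statistics.stdev(means), 2))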
CONFIDENCE INTERVALS

Confidence interval estimates provide a way of assessing the accuracy
of a point estimate. A confidence interval is a range of values between
which the true (unknown) population parameter is believed to lie, along
with a stated probability. This probability is called the level of
confidence, denoted by 1 − α, where α is a number between 0 and 1.

The level of confidence is usually expressed as a percentage. (Note
that if the level of confidence is 90%, then α = 0.1.) The margin of
error depends on the level of confidence and the sample size. For
example, suppose that the margin of error for some sample size and a
level of confidence of 95% is calculated to be 2.0. One sample might
yield a point estimate of 10. Then, a 95% confidence interval would be
[8, 12]. This means that if the sample mean is 10, we can be 95%
confident that the population mean lies between 8 and 12. However, any
particular interval may or may not include the true population mean.
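A confidence interval for a mean can be computed in a few lines; the
sketch below uses SciPy's t distribution and an invented sample, since
the population standard deviation is assumed unknown.

    # A t-based 95% confidence interval for a population mean.
    import statistics
    from scipy import stats

    sample = [23, 19, 25, 30, 21, 26, 24, 28, 22, 27]
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)

    t = stats.t.ppf(0.975, df=n - 1)  # critical value for 1 - alpha = 0.95
    margin = t * s / n ** 0.5         # margin of error
    print(f"95% CI: [{x_bar - margin:.2f}, {x_bar + margin:.2f}]")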

ADDITIONAL TYPES OF CONFIDENCE INTERVALS

Confidence intervals may be computed for other population parameters,
such as the standard deviation or variance, and for differences in the
means or proportions of two populations. The concepts are similar to
the types of confidence intervals we have discussed, but many of the
formulas are rather complex and more difficult to implement on a
spreadsheet.

PREDICTION INTERVALS

Another type of interval used in estimation is a prediction interval. A
prediction interval is one that provides a range for predicting the
value of a new observation from the same population. This is different
from a confidence interval, which provides an interval estimate of a
population parameter. A confidence interval is associated with the
sampling distribution of a statistic, whereas a prediction interval is
associated with the distribution of the random variable itself.

When the population standard deviation is unknown, a 100(1 − α)%
prediction interval for a new observation is:

x̄ ± t(α/2, n−1) · s · √(1 + 1/n)

Note that this interval is wider than the confidence interval by the
additional value of 1 under the square root. This is because, in
addition to estimating the population mean, we must also account for
the variability of the new observation around the mean.

SELF ASSESSMENT QUESTIONS

7. A sampling plan states:
   a. The objectives of the sampling activity
   b. The population frame
   c. The method of sampling
   d. All of these
8. The most common probabilistic sampling approach is simple __________
sampling.
9. A __________ sample would choose a sample of individuals in each
ward proportionate to its size.

ACTIVITY

Create a PowerPoint presentation on quantitative methods and show it in
your class.

INTRODUCTION TO PROBABILITY DISTRIBUTIONS

The concept of probability is prevalent everywhere, from stock market
predictions and market research to weather forecasts. In a business,
managers need to know the likelihood that a new product will be
profitable or the chances that a project will be completed on time.
Probability quantifies the uncertainty that we encounter all around us
and is an important building block for business analytics applications.
Probability is the likelihood that an outcome occurs. Probabilities are
expressed as values between 0 and 1, although many people convert them
to percentages. The statement that there is a 10% chance that oil
prices will rise next quarter is another way of stating that the
probability of a rise in oil prices is 0.1.

The closer the probability is to 1, the more likely it is that the
outcome will occur. Before we discuss probability further, let us get
familiar with its terminology.

Experiment: An experiment is a process that results in an outcome. An
experiment can be as straightforward as tossing a coin or as complex as
conducting a market research study, observing weather conditions or
tracking the stock market.

Outcome: The outcome of an experiment is the result that we observe.
The group of all possible outcomes of an experiment is the sample
space. For instance, if we roll two fair dice, the possible outcomes
are the numbers 2 through 12.

A sample space may consist of a small number of separate outcomes or an
infinite number of outcomes.

Probability may be defined from one of the following three
perspectives:
□ First, if the process that generates the outcomes is known,
probabilities can be deduced from theoretical arguments; this is the
classical definition of probability.
□ The second approach to probability, called the relative frequency
definition, is based on empirical data. The probability that an outcome
will occur is simply the relative frequency associated with that
outcome.
□ Finally, the subjective definition of probability is based on
judgment and experience, as financial analysts might use in predicting
a 75% chance that the DJIA will increase 10% over the next year, or as
sports experts might predict, at the start of the football season, a
1-in-5 chance (0.20 probability) of a certain team making it to the
final.

The definition to use depends on the specific application and the
available information. We will see various examples that draw upon each
of these perspectives.

PROBABILITY RULES AND FORMULAS

Suppose we label the n outcomes in a sample space as O1, O2, ..., On,
where Oi represents the ith outcome in the sample space, and let P(Oi)
be the probability associated with the outcome Oi.

Two elementary facts:
□ The probability associated with any outcome must be between 0 and 1;
that is, 0 ≤ P(Oi) ≤ 1 for each outcome Oi.
□ The sum of the probabilities over all possible outcomes must be 1;
that is, P(O1) + P(O2) + ... + P(On) = 1.
An event is a collection of one or more outcomes from a sample space.
An example of an event would be rolling a 7 or an 11 with two dice.
This leads to the following rules:
□ Rule 1: The probability of any event is the sum of the probabilities
of the outcomes that comprise that event.
□ Rule 2: If A is any event, the complement of A, denoted Ac, consists
of all outcomes in the sample space not in A. The probability of the
complement of any event A is P(Ac) = 1 − P(A).
□ Rule 3: The union of two events contains all outcomes that belong to
either of the two events. To illustrate this with the rolling of two
dice, let A be the event {7, 11} and B be the event {2, 3, 12}.

The union of A and B is the event {2, 3, 7, 11, 12}. The probability
that some outcome in either A or B (i.e., the union of A and B) occurs
is denoted as P(A or B). Finding this probability depends on whether
the events are mutually exclusive or not. Two events are mutually
exclusive if they have no outcomes in common. The events A and B in
this example are mutually exclusive. When events are mutually
exclusive, the following rules apply (see the code sketch after this
list):
□ If events A and B are mutually exclusive, then P(A or B) = P(A) +
P(B).
□ If two events A and B are not mutually exclusive, then P(A or B) =
P(A) + P(B) − P(A and B). Here, (A and B) represents the intersection
of events A and B, that is, all outcomes belonging to both A and B.

CONDITIONAL PROBABILITY

Conditional probability is the probability of occurrence of one event
A, given that another event B is known to be true or has already
occurred. Conditional probabilities are useful in analysing data in
cross-tabulations, as well as in other types of applications. Many
companies save the purchase histories of customers to predict future
sales; conditional probabilities can help to predict future purchases
based on past purchases.

The conditional probability of an event A, given that event B is known
to have occurred, is:

P(A | B) = P(A and B) / P(B)

We read the notation P(A | B) as "the probability of A given B".
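As a small numerical sketch, suppose a purchase-history table for 1,000
customers (the counts are invented) shows that 300 bought product B and
120 bought both A and B; the conditional probability follows directly
from the formula.

    # Conditional probability from illustrative purchase-history counts.
    p_b = 300 / 1000          # P(B)
    p_a_and_b = 120 / 1000    # P(A and B)

    p_a_given_b = p_a_and_b / p_b   # P(A | B) = P(A and B) / P(B)
    print(p_a_given_b)              # 0.4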

RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS

Some experiments naturally have numerical outcomes, such as a roll of
the dice, the time it takes to repair computers, or the weekly change
in a stock market index. For other experiments, such as obtaining
consumer responses to a new product, the sample space is categorical.
To have a consistent mathematical basis for dealing with probability,
we would like the outcomes of all experiments to be numerical. A random
variable is a numerical description of the outcome of an experiment. If
we have categorical outcomes, we can associate an arbitrary numerical
value with them. For example, if a consumer likes a product in a market
research study, we might assign this outcome a value of 1; if the
consumer dislikes the product, we might assign this outcome a value of
0. Random variables are usually denoted by capital italic letters, such
as X or Y.

Random variables may be discrete or continuous. A discrete random
variable is one for which the number of possible outcomes can be
counted. A continuous random variable has outcomes over one or more
continuous intervals of real numbers.

A probability distribution lists the possible values that a random
variable may take, along with the probabilities of those values. A
probability distribution can be continuous or discrete, depending on
the nature of the random variable it represents.

We may develop a probability distribution using any one of the three
perspectives of probability:
□ First, if we can quantify the probabilities associated with the
values of a random variable from theoretical arguments, we can easily
define the probability distribution.
□ Second, we can calculate the relative frequencies from a sample of
empirical data to develop a probability distribution.
□ Finally, we could simply specify a probability distribution using
subjective values and expert judgment. This is often done in creating
decision models for phenomena for which we have no historical data.

Researchers have identified many common types of probability
distributions that are useful in a variety of business analytics
applications. A working knowledge of the common families of probability
distributions is important for several reasons. First, it can help you
to understand the underlying process that generates sample data. We
will investigate the relationship between distributions and samples
later. Second, many phenomena in business and nature follow some
theoretical distribution and are, therefore, useful in building
decision models. Finally, working with distributions is essential in
computing probabilities of occurrence of outcomes to assess risk and
make decisions.

DISCRETE PROBABILITY DISTRIBUTIONS

For a discrete random variable X, the probability distribution of the
discrete outcomes is called a probability mass function and is denoted
by a mathematical function, f(x). The symbol xi represents the ith
value of the random variable X, and f(xi) is the corresponding
probability.

BERNOULLI DISTRIBUTION

The Bernoulli distribution describes a random variable with two
possible outcomes, each having a constant probability of occurrence. A
success can be any outcome you define. For example, in attempting to
boot a new computer just off the assembly line, we might define a
success as "does not boot up" when defining a Bernoulli random variable
to characterise the probability distribution of a defective product.
Thus, a success need not be a favourable result in the traditional
sense.

BINOMIAL DISTRIBUTION

The binomial distribution models n independent replications of a
Bernoulli experiment, each with a probability p of success. The random
variable X represents the number of successes in these n experiments.
Consider a telemarketing example: suppose we call n = 10 customers,
each of whom has a probability p = 0.2 of making a purchase. Then the
probability distribution of the number of positive responses obtained
from the 10 customers is binomial. Using the binomial distribution, we
can calculate the probability that exactly x customers out of the 10
will make a purchase, where x is always between 0 and 10. A binomial
distribution might also be used to model the results of sampling
inspection in a production operation or the effects of drug research on
a sample of patients.
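The telemarketing example translates directly into code; the sketch
below uses SciPy's binom distribution with the n = 10 and p = 0.2 from
the text.

    # Binomial probabilities for the telemarketing example (n=10, p=0.2).
    from scipy import stats

    n, p = 10, 0.2
    for x in range(4):
        print(x, round(stats.binom.pmf(x, n, p), 4))   # P(X = x)

    print(round(stats.binom.cdf(2, n, p), 4))          # P(X <= 2)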

POISSON DISTRIBUTION

The Poisson distribution is a discrete distribution used to model the number of occurrences in some unit of measure, for example, the number of customers arriving at a Subway store during a weekday lunch hour, the number of failures of a machine during a month, the number of visits to a Web page during 1 minute, or the number of errors per line of software code. The Poisson distribution assumes no limit on the number of occurrences (meaning that the random variable X may assume any non-negative integer value), that occurrences are independent and that the average number of occurrences per unit is a constant, λ (Greek lowercase lambda). The expected value of the Poisson distribution is λ, and the variance is also equal to λ.
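As a quick illustration, the sketch below queries a Poisson model with scipy; the arrival rate λ = 12 per lunch hour is an assumed figure, not one from the text.

# A minimal sketch of Poisson probabilities; lambda = 12 arrivals
# per lunch hour is an assumed rate. Assumes scipy is installed.
from scipy.stats import poisson

lam = 12
print(poisson.pmf(10, lam))      # P(exactly 10 arrivals)
print(1 - poisson.cdf(15, lam))  # P(more than 15 arrivals)
print(poisson.mean(lam), poisson.var(lam))  # both equal lambda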

UNIFORM DISTRIBUTION

The uniform distribution depicts a continuous random variable for which all outcomes between some minimum and maximum value are equally likely. The uniform distribution is often assumed in business analytics applications when little is known about a random variable other than reasonable estimates for its minimum and maximum values. The parameters are chosen judgmentally to reflect a modeller's best guess about the range of the random variable.

NORMAL DISTRIBUTION

The normal distribution is a continuous distribution that is described by the familiar bell-shaped curve and is perhaps the most important distribution used in statistics. The normal distribution is observed in many natural phenomena. Test scores such as the SAT, deviations from specifications of machined items, human height and weight and many other measurements are often normally distributed.

The normal distribution is characterised by two parameters: the mean, µ, and the standard deviation, σ. Thus, as µ changes, the location of the distribution on the x-axis also changes, and as σ is decreased or increased, the distribution becomes narrower or wider, respectively.
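The sketch below illustrates these two parameters numerically, assuming scipy; the SAT-like figures µ = 500 and σ = 100 are illustrative assumptions, not values from the text.

# A minimal sketch of normal-distribution queries; mu = 500 and
# sigma = 100 are illustrative SAT-like values. Assumes scipy.
from scipy.stats import norm

mu, sigma = 500, 100
print(norm.cdf(600, loc=mu, scale=sigma))        # P(score < 600) ~ 0.841
print(norm.interval(0.95, loc=mu, scale=sigma))  # middle 95% ~ (304, 696)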

DATA MODELLING AND DISTRIBUTION FITTING

In many applications of business analytics, we need to collect sample data of important variables, such as customer demand, purchase behaviour, machine failure times, service activity times, etc., to gain an understanding of the distributions of these variables. We can also construct frequency distributions and histograms and compute basic descriptive statistical measures to better understand the nature of the data. However, sample data are just that: samples.

Using sample data may limit our ability to predict uncertain events that may occur because potential values outside the range of the sample data are not included. A better method is to identify the probability distribution of the sample data by fitting a theoretical distribution to the data and verifying the fit.

To select an appropriate theoretical distribution that fits the sample data, we might begin by examining a histogram of the data to look for distinctive shapes. If the histogram is symmetric with a peak in the middle, the distribution may be normal. If the histogram is very positively skewed with no negative values, the distribution may be exponential. Similarly, a very positively skewed histogram with the density dropping to zero at the edge indicates a lognormal distribution.

Various forms of the gamma, Weibull, or beta distributions could be used for distributions that do not seem to fit one of the other common forms. This approach is not, of course, always accurate or valid, and sometimes it can be difficult to apply, especially if sample sizes are small. However, it may narrow the search down to a few potential distributions.

Summary statistics can also provide clues about the nature of a distribution. The mean, median, standard deviation and coefficient of variation often provide information about the nature of the distribution. For instance, normally distributed data tend to have a fairly low coefficient of variation (however, this may not be true if the mean is small).

For normally distributed data, we would also expect the median and mean to be approximately the same. For exponentially distributed data, however, the median will be less than the mean. Also, we would expect the mean to be about equal to the standard deviation, or, equivalently, the coefficient of variation would be close to 1. We could also look at the skewness index. Normal data are not skewed, whereas lognormal and exponential data are positively skewed. The following example of analysing airline passenger data will help in better understanding the distribution of normal data.

An airline operates a daily route between two medium-sized cities using a 70-seat regional jet. The flight is rarely booked to capacity but often accommodates business travellers who book at the last minute at a high price. The histogram shows a relatively symmetric distribution. The mean, median, and mode are all similar, although there is some degree of positive skewness. It is important to recognise that this is a relatively small sample that can exhibit a lot of variability compared with the population from which it is drawn. Thus, based on these characteristics, it would not be unreasonable to assume a normal distribution for developing a predictive or prescriptive analytics model.
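The sketch below shows how such diagnostic checks might be computed, assuming numpy and scipy are available; the passenger counts are made-up illustrative data, not the airline's actual sample.

# A minimal sketch of the summary-statistic checks described above,
# applied to made-up passenger counts. Assumes numpy and scipy.
import numpy as np
from scipy.stats import skew

sample = np.array([58, 62, 55, 64, 60, 59, 67, 61, 70, 57])
mean, median = sample.mean(), np.median(sample)
sd = sample.std(ddof=1)
print(mean, median, sd / mean, skew(sample))
# mean ~ median, a low coefficient of variation and skewness near 0
# together suggest that a normal distribution is a reasonable fit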

SELF ASSESSMENT QUESTIONS

10. Probabilities are expressed as values between 0 and 10. (True/False)
11. __________ probabilities can help to predict future purchases based on past purchases.
12. A __________ variable is a numerical description of the outcome of an experiment.

ACTIVITY

Outline your plans if you are assigned an opportunity to study, evaluate and come out with an execution plan for a newly launched store chain that is planning to maximise its sales.

SUMMARY

□ Descriptive analytics is the most essential type of analytics and establishes the framework for more advanced types of analytics.
□ Data visualisation is the method of showing data in a graphical manner to provide insights that help take better decisions.

□ Raw data is important, particularly when one needs to identify accurate values or compare individual numbers.
□ Dashboards deliver important key synopses of valuable business data to efficiently manage a business function or process.
□ Excel refers to vertical bar charts as column charts and horizontal bar charts as bar charts.
□ Data labels can be added to chart elements to show the actual value of bars.
□ Pie charts are preferred only in two-dimensional form for effective and simpler data representation.
□ The measure of location that specifies the middle value when the data sets are arranged from the least to the greatest is the median.
□ Conditional probability is the probability of occurrence of one event A, given that another event B is known to be true or has already occurred.

KEYWORDS

□ Cluster sampling: It refers to dividing a population into clusters (subgroups), sampling a cluster set, and conducting a complete survey within the sampled clusters.
□ Dashboard: It is a visual picture of a group of specific business measures.
□ Line chart: A type of chart that is used to display data pertaining to a given period.
□ Mean: It is the sum of the observations divided by the total observations.
□ Scatter chart: A type of chart that is used to demonstrate the connection between two variables.

DESCRIPTIVE QUESTIONS

1. Discuss the importance of data visualisation with the help of suitable examples.
2. What do you understand by descriptive statistics? How are mean, median and mode calculated in statistics?
3. Describe sampling and estimation with suitable examples.
4. Explain the concept of probability distribution. Also, enlist the rules and formulas used in probability.

ANSWERS AND HINTS

ANSWERS FOR SELF-ASSESSMENT QUESTIONS

Visualising and Exploring Data: 1. Visualising  2. Tabular and visual  3. True
Descriptive Statistics: 4. Statistics  5. True  6. Median
Sampling and Estimation: 7. d. All of these  8. Random  9. Stratified
Introduction to Probability Distributions: 10. False  11. Conditional  12. Random

HINTS FOR DESCRIPTIVE ANSWERS

1. Data visualisation is the method of showing data (typically in larger quantities) in an expressive manner to provide understandings that will help in taking better decisions. Refer to Section 5.2 Visualizing and Exploring Data.
2. Statistics involves collecting, organising, analysing, interpreting, and presenting data. Refer to Section 5.3 Descriptive Statistics.
3. A sampling plan is a description of the approach that is used to obtain samples from a population prior to any data collection activity. Refer to Section 5.4 Sampling and Estimation.
4. Probability quantifies the uncertainty that we encounter all around us and is an important building block for business analytics applications. Refer to Section 5.5 Introduction to Probability Distributions.

SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
□ Sheikh, N. M. (2013). Implementing Analytics: A Blueprint for Design, Development, and Adoption. Amsterdam: Elsevier.
□ Atzmuller, M., & Roth-Berghofer, T. R. (2016). Enterprise Big Data Engineering, Analytics, and Management. Hershey: IGI Global.

E-REFERENCES
□ Descriptive, Predictive, and Prescriptive Analytics Explained. (2016, August 05). Retrieved May 01, 2017, from https://halobi.com/2016/07/descriptive-predictive-and-prescriptive-analytics-explained/
□ Big Data Analytics: Descriptive Vs. Predictive Vs. Prescriptive. (n.d.). Retrieved May 01, 2017, from http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279
□ What is descriptive analytics? - Definition from WhatIs.com. (n.d.). Retrieved May 01, 2017, from http://whatis.techtarget.com/definition/descriptive-analytics
CONTENTS

6.1 Introduction
6.2 Predictive Modelling
6.2.1 Logic Driven Models
6.2.2 Data Driven Models
Self Assessment Questions
Activity
6.3 Introduction to Data Mining
Self Assessment Questions
Activity
6.4 Data Mining Methodologies
6.4.1 Classification
6.4.2 Regression
6.4.3 Clustering (K-means)
6.4.4 Artificial Neural Networks
Self Assessment Questions
Activity
6.5 Summary
6.6 Descriptive Questions
6.7 Answers and Hints
6.8 Suggested Readings & References
INTRODUCTORY CASELET

SAMSUNG WON OVER THE MARKET SENTIMENTS USING PREDICTIVE ANALYTICS

Global mobile major Samsung Electronics introduced a phone called the Note 7 in 2016. Although futuristic in specifications with class-leading performance, this phone turned out to be the darkest blot in the otherwise clean bowl of the Samsung smartphone assembly lines. The phone had critical battery failure issues which even resulted in a few phone explosions across the world. Airlines across the world banned passengers from boarding the flight if they were found to be carrying a Note 7 with them. Samsung restricted the charging to 60% with a firmware upgrade, but in the end, it became a matter of so much ridicule for the company, with decreased levels of brand confidence and customers fleeing, that the company ultimately recalled all the phones it had sold and put the lid on the Note 7 project forever - a total loss of $18 billion.

However, rather than taking it as an incident to beat around the bush with and pinning the blame on quality control, vendors and everyone else, Samsung took it in a positive stride. They figured out the real issue with the battery, fixed the gaps and exploited the existing market sentiments cleverly by openly emphasising their battery issues and the steps they took to fix that goof-up, not staying behind in accepting and recalling the defective brand like a true, professional, consumer-driven company. The results? Their competitors too had to follow suit and declare the safety features of their devices along with other specifications, and the next phone launch of Samsung - the Galaxy S8 - got rave reviews and accolades across the technical diaspora and forums. Samsung achieved this by applying predictive analytics to the data collected relating to the issues of the Note 7 phone. The company predicted the existing anger and expected the scornful views of its loyal base of consumers - and gave them quite a few industry-first reasons to believe in its consumer-friendly image again - from issuing credit notes, to exchanging devices with the S7 Edge device with extra offers, to issuing apology notes and leading the only complete recall in the history of mobiles. Samsung won over the market sentiments simply by predicting the outpour and anger of the customers way before it could get worse.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:
- Explain predictive modelling
- Describe the concept of data mining
- Explore different data mining methodologies

6.1 INTRODUCTION

In the previous chapter, you learned that descriptive analytics analyses a database to provide information on the trends of past or current business events that can help managers, planners, leaders, etc., to develop a road map for future actions. Descriptive analytics performs an in-depth analysis of data to reveal details such as frequency of events, operation costs, and the underlying reasons for failures. It helps in identifying the root cause of a problem. Predictive analytics, on the other hand, is about understanding and predicting the future and answers the question 'What could happen?' by using statistical models and different forecast techniques. It predicts near-future probabilities and trends and helps in what-if analysis. In predictive analytics, we use statistics, data mining techniques, and machine learning to analyse the future. Figure 6.1 shows the levels of insight involved in predictive analytics:

[Figure: a maturity curve plotting Level of Insight against BI Maturity, progressing from "What Happened?" through "Why Did It Happen?" to "What Will Happen?"]

Figure 6.1: Predictive Analytics


Source: http://www.witinc.com/predictive-analytics.id.355.htm

In this chapter, you will first learn about predictive modelling. Further, the chapter discusses the concept of data mining. Towards the end, the chapter discusses different data mining methodologies such as classification, regression, clustering (K-means) and artificial neural networks.

6.2 PREDICTIVE MODELLING


Predictive modelling is the method of making, testing and authenticating a model to best predict the likelihood of a conclusion. Several modelling procedures from artificial intelligence, machine learning and statistics are present in predictive analytics software solutions. The model is selected on the basis of testing, authentication and assessment, using detection theory to predict the likelihood of an outcome for a given amount of input data. Models can utilise one or more classifiers to decide the probability of a set of data being related to another set. The different models available in predictive analytics software enable the system to develop new data information and predictive models. Each model has its own strengths and weaknesses and is best suited to particular types of problems.

Predictive analysis and models are characteristically used to predict future probabilities. Predictive models, in a business context, are used to analyse historical facts and current data to better comprehend customer habits, partners and products and to classify possible risks and prospects for a company. This practice uses many procedures, including statistical modelling, data mining and machine learning, to aid analysts in making better future business predictions.

□ Predictive modelling is at the heart of business decision making.
□ Building decision models is more of an art than a science.
□ Creating an ideal decision model demands:
♦ Good understanding of functional business areas
♦ Knowledge of conventional and in-trend business practices and research
♦ Logical skillset
□ It is always recommended to start simple and keep adding to the models as required.
The greatest set of changes and advances in predictive modelling is coming to fruition due to the increase in unstructured information (content archives, video, voice and pictures) joined with quickly improving analytical methods. Basically, predictive modelling requires organised data, the kind which is found in relational databases. To make unstructured data sets valuable for this sort of examination, organised data must be extracted from them first. One case is sentiment analysis from Web posts. Data can be found in client posts on forums, online journals and other sources that foresee consumer loyalty and sales trends for new items. It would be all but impossible, in any case, to attempt to assemble a predictive model directly from the text in the posts themselves. An extraction step is required to get usable data as keywords, expressions and meaning from the content in the posts. At that point, it is conceivable to search for the connection between instances of phrases such as "issues with the item", for example, and an increase in customer service calls.

Predictive models are representations of the relationship between how a member of a sample performs and some of the known characteristics of the sample. The aim is to assess how likely a similar member from another sample is to behave in the same manner. This model is used a lot in marketing. It helps identify implied patterns which indicate customers' preferences. This model can even perform calculations at the exact time that a customer performs a transaction.

Predictive analytics methods depend on quantifiable variables and controlling metrics to forecast future performance or outputs.

A predictive analytics model combines many predictors or quantifiable variables. This method allows for the collection of data and the preparation of a statistical model, to which extra data can be added as and when available.

The accumulation of higher data volumes creates a sharper predictive model, since larger data sets produce more dependable forecasts based on the volume of data examined. Moreover, basing predictive analytics models on actual data improves the accuracy of the predicting process.

The various business process steps in predictive modelling are as follows:
1. Creating the model: A software-based solution allows you to build a model by applying multiple algorithms to the dataset.
2. Testing the model: Test the predictive model on the dataset. In some situations, the testing is done on historical data to gauge the effectiveness of a model's prediction.
3. Authenticating the model: Authenticate the model results by means of business data understanding and visualisation tools.
4. Assessing the model: Assess the best suited model from the models used and select the appropriate model tailored for the data.

The predictive modelling process includes executing one or more algorithms on the dataset subjected to prediction. This is a recurring process and often includes training the model, using several models on the same dataset and lastly selecting the appropriate model based on the business data.

6.2.1 LOGIC DRIVEN MODELS

Logic driven models are created on the basis of inferences and postulations which the sample space and existing conditions provide. Creating logical models requires a solid understanding of business functional areas, logical skills to evaluate the propositions better and knowledge of business practices and research.

To understand this better, let's take the example of a customer who visits a restaurant around six times a year and spends around ₹5,000 per visit. The restaurant earns around a 40% margin on the billing amount of each visit.

The annual gross profit on that customer turns out to be 5000 × 6 × 0.40 = ₹12,000/-.

30% of the customers do not return each year, while 70% do return to provide more business to the restaurant.

This gives an average lifetime of a customer (the time for which a consumer remains a customer) of 1/0.3 = 3.33 years. So, the average gross profit for a typical customer turns out to be 12,000 × 3.33 = ₹39,960.

Armed with all the above details, we can logically arrive at a conclusion and derive the following model for the above problem statement:

Economic value of each customer (V) = (R × F × M)/D

where,

R = Revenue generated per customer per visit

F = Frequency of visits per year

M = Profit margin

D = Defection rate (non-returning customers each year)

So, as you can see, logic driven predictive models can be derived for a number of situations, conditions, problem statements and many other scenarios where predictive analytical models provide a futuristic view on the basis of validation, testing and evaluation to estimate the likelihood of an outcome for a given amount of input data.

6.2.2 DATA DRIVEN MODELS

A data-driven model is based on the data analysis of a specific system. The main concept of a data-driven model is to find links between the state system variables (input and output) without explicit knowledge of the physical attributes and behaviour of the system. Data driven predictive modelling derives the modelling method from a set of existing data and entails a predictive methodology to forecast future outcomes. A company expecting losses in the current quarter due to poor market performance and sentiments is a classic example of data driven predictive modelling. You have the data and you know about the data inferences. You need not predict anything related to the data, unlike logic driven models. You are simply predicting the outcomes based on the data. Refer to the caselet in this chapter - Samsung's case with their product and their ensuing actions - as a good example of data driven predictive modelling.

SELF ASSESSMENT QUESTIONS

1. Predictive analysis is all about predicting outcomes. (True/False)
2. __________ is at the heart of business decision making.
3. Logical models differ from data driven models based on the size and type of input variables available. (True/False)
4. Economic value of each customer (V) = __________

ACTIVITY

Create a data driven model using MS Excel to denote the variation in a product's sales for the last 3 years.

6.3 INTRODUCTION TO DATA MINING

Data mining is a growing business analytics field focused on better understanding the features and patterns among variables in huge databases using a variety of analytical and statistical tools. Most of the tools discussed in earlier chapters, such as data visualisation, data summarisation, pivot tables, correlation, regression analysis, etc., can be used in data mining extensively. However, as the amount of data has grown exponentially, many other statistical and analytical methods have been developed to identify relationships among variables in large data sets and understand the hidden patterns that they may contain. Figure 6.2 shows the four stages of data mining:

[Figure: the four stages of data mining - (1) Data sources, which range from databases to news wires and are considered against a problem definition; (2) Gathering, which involves the sampling and transformation of data; (3) Data exploration/modeling, where users create a model, test it, and then evaluate models; (4) Deploying models, where action is taken based on the results.]
Figure 6.2: Four Stages of Data Mining


Source: http://searchsqlserver.techtarget.com/definition/data-mining

Data mining can be considered part descriptive and part prescriptive analytics. In descriptive analytics, data-mining tools help analysts to identify patterns in data. Excel charts and PivotTables, for example, are useful tools for describing patterns and analysing data sets; however, they require manual intervention. Regression analysis and forecasting models help us to predict relationships or future values of variables of interest.

In most business applications, the purpose of descriptive analytics is to help managers predict the future or make better decisions that will impact future performance, so we can generally state that data mining is primarily a predictive analytic approach. Some core ideas in data mining are as follows:
□ Classification: Classification is the most essential type of data analysis. The beneficiary of an offer can respond or not respond. A candidate for a loan can repay on time, repay late, or opt for non-payment. A credit card charge can be normal or fraudulent. A data packet travelling on a network can be good or bad. A bus in a fleet can be available for service or unavailable. A patient can recover, still be sick, or expire. A typical assignment in data mining is to analyse information where the classification is unknown or will happen later. Similar data where the classification is known are utilised to create rules, which are then applied to the data with the unknown classification. We will study classification in more detail further in the chapter.
□ Prediction: Prediction resembles classification, except that we are attempting to foresee the value of a numerical variable (e.g., amount of purchase) as opposed to a class (e.g., buyer or non-buyer). Of course, in classification we are also attempting to foresee a class, but the term prediction in this book refers to the prediction of the value of a continuous variable. Sometimes, in data mining terms, estimation and regression are used to refer to the prediction of the value of a continuous variable, and prediction may be used for both continuous and categorical data.
D Affiliation rules and recommendation systems: Huge databases
of client transactions advance themselves to the relationship anal
ysis among things acquired, or "what runs with what."
Association rules are intended to discover such broad association
designs among things in large databases. The principles can
then be utilised as a part of an assortment of ways. For
instance, su permarkets can utilise such data for item
arrangement. They can utilise the rules for week by week
special offers or for packaging items.
Association rules contracted from a medical facility database on
patients' manifestations amid successive hospitalisations can help
discover "which side effect is trailed by what other side effect"
and help anticipate future indications for returning patients. On
line suggestion frameworks, for example, those utilised on Ama
zon.com and Netflix.com, utilise Collaborative Filtering, a strate
gy that uses individual clients' inclinations and tastes given their
past purchases, rating, browsing, or whatever other quantifiable
conduct characteristic of inclination, and other clients' histories.
As opposed to classification that creates rules general to an entire
populace, collaborative filtering creates "what runs with what" at
•• • ' .
N O T E S

the individual client level. Henceforth, collaborative filtering is


uti lised as a part of numerous suggestion frameworks that intend
to convey customised proposals to users with an extensive variety
of preferences.

SELF ASSESSMENT QUESTIONS

5. Data mining is a practice of scrubbing out the data from various sources for further evaluation and analytical purposes. (True/False)
6. Which of the following is/are the tools used in data mining?
a. Data visualisation
b. Data summarisation
c. Correlation
d. All of these
7. Predictive analysis deals with data mining in the same way business analytics deals with raw data. (True/False)
8. The third stage in data mining is __________.
9. Data mining is solely a predictive analytical strategy, since descriptive and prescriptive analytics deal with data only after receiving it and predictive analysis forecasts the data outcomes. (True/False)

ACTIVITY

Create a PowerPoint presentation on techniques used in data mining and show it in your class.

6.4 DATA MINING METHODOLOGIES

Databases can accommodate vast quantities of data that aid in decision making. As discussed earlier, data mining is a set of tools and techniques that help organisations to perform this task. Some common approaches used in data mining include the following:
□ Data exploration and reduction: This often involves identifying groups in which the elements of the groups are in some way similar. This approach is often used to understand differences among customers and segment them into homogenous groups. For example, a department store recognised four lifestyles of its customers:
♦ "Kacy," an old-style, classic dresser who loves quality and takes few risks;
♦ "Brenda," a hybrid of the traditional and the contemporary, classic but with a modern touch;
♦ "Victoria," a modern, contemporary brand-loving customer; and finally
♦ "Alex," the fashion-oriented customer who seeks the newest and best.
Such segmentation is useful in design and marketing activities to better target product offerings. These techniques have also been used to identify characteristics of successful employees and improve recruiting and hiring practices.
□ Association: Association is the analysis of databases to recognise natural variable associations and create buying recommendations or target marketing rules. For example, Netflix uses association to understand what types of movies a customer likes and provides recommendations based on the data. Amazon.com also makes recommendations based on past purchases.
□ Cause-and-effect modelling: Cause-and-effect modelling is the process of developing analytic models to describe the relationship between metrics that drive business performance, for instance, profitability, customer satisfaction, or employee satisfaction. Understanding the drivers of performance can lead to better decisions to improve performance. For example, the controls group of CGL Inc. evaluated the relationship between contract-renewal rates and overall satisfaction. They concluded that 91% of contract renewals came from customers who were either very satisfied or satisfied, with a higher defection rate among unsatisfied customers. Their model foretold that a one-percentage-point rise in the general satisfaction score was worth $12 million in renewals of yearly service contracts. As a result, they identified decisions that would improve customer satisfaction. Regression and correlation analysis are key tools for cause-and-effect modelling.
6.4.1 CLASSIFICATION

Classification is the process of analysing data to predict how to classify a new data element. An example of classification is spam filtering in an e-mail client. By examining textual characteristics of a message (subject header, key words, and so on), the message is classified as junk or not. Classification methods can help predict whether a credit-card charge may be fraudulent, assess the risk of a loan applicant, or anticipate a consumer's response to an advertisement.

Classification is about predicting a particular outcome based on a given input and algorithm. The algorithm attempts to determine the relationships between the attributes that make it feasible to forecast the outcome. Next, an unseen data set, called the prediction set, is given to the algorithm, containing the same set of attributes but excluding the prediction attribute. The algorithm examines the input and yields a prediction. The accuracy of the prediction describes the efficiency of the algorithm. For example, the training set in a medical database would have applicable patient information captured earlier, in which the prediction attribute is the patient's heart problem.

Figure 6.3 demonstrates the training set and prediction set of such a database:

Training set
Age  Heart rate  Blood pressure  Heart problem
65   78          150/70          Yes
37   83          112/76          No
71   67          108/65          No

Prediction set
Age  Heart rate  Blood pressure  Heart problem
43   98          147/89          ?
65   58          106/63          ?
84   77          150/65          ?

Figure 6.3: Training Set and Prediction Set for a Medical Database
Among the several known types of data representation, classification normally uses prediction rules to express learned knowledge. Prediction rules are expressed as IF-THEN rules, where the antecedent (IF part) comprises a conjunction of conditions and the rule consequent (THEN part) predicts a specific value of the prediction attribute for an item that fulfils the antecedent. Using the above example, a rule covering the first row in the training set might be represented as:

IF (Age=65 AND Heart rate>70) OR (Age>60 AND Blood pressure>140/70) THEN Heart problem=yes

Most of the time, a prediction rule is far larger than the case specified above. In such a rule, each conjunction of conditions separated by the OR keyword defines a smaller rule capturing relationships between attributes. Fulfilling any of these smaller rules implies the consequent as the prediction. Each smaller rule is formed with ANDs, which helps narrow down the relations between attributes.

How well forecasts are made is measured as the rate of prediction hits against the total number of forecasts issued. A good rule should have a hit rate greater than the base occurrence of the prediction attribute. In other words, if the algorithm is attempting to foresee rain in Seattle and it rains 80% of the time, any algorithm could easily hit a rate of 80%. Consequently, 80% is the minimum prediction rate that any algorithm ought to achieve in this situation. The ideal solution is a rule with a 100% prediction hit rate, which is hard, if not impossible, to accomplish. In this manner, apart from certain special cases, classification by definition must be solved by approximation-based algorithms.

6.4.2 REGRESSION

Regression analysis is an instrument for creating statistical and mathematical models that define relations between a dependent variable (which should be a ratio variable, not categorical) and one or more descriptive or independent numerical (ratio or categorical) variables.

Two broad categories of regression models often used in business settings are (1) regression models of cross-sectional data and (2) regression models of time-series data, in which the independent variables are time or some function of time and the focus is on predicting the future. Time-series regression is an important tool in forecasting.

A regression model involving a single independent variable is called simple regression, while a regression model involving two or more independent variables is called multiple regression.

Simple linear regression involves finding a linear relationship between one independent variable, X, and one dependent variable, Y. The relationship between two variables can assume many forms. The relationship may be linear or nonlinear, or there may be no relationship at all. Because we are focusing our discussion on linear regression models, the first thing to do is to verify that the relationship is linear. We would not expect to see the data line up perfectly along a straight line; we simply want to verify that the general relationship is linear. If the relationship is clearly nonlinear, then alternative approaches must be used, and if no relationship is evident, then it is pointless to even consider developing a linear regression model.

To determine if a linear relationship exists between the variables, we recommend that you create a scatter chart that can display the relationship between the variables visually, as shown in Figure 6.4:

[Figure: three scatter plots labelled (a) Linear, (b) Nonlinear and (c) No relationship]

Figure 6.4: Displaying Relationship Between Variables

Linear regression models are not appropriate for every situation. A scatter chart of the data might show a nonlinear relationship, or the residuals for a linear fit might result in a nonlinear pattern. In such cases, we might propose a nonlinear model to explain the relationship. For instance, a second-order polynomial model would be:

Y = β₀ + β₁X + β₂X² + ε

Sometimes, this is called a curvilinear regression model. In this model, β₁ represents the linear effect of X on Y, and β₂ represents the curvilinear effect. However, although this model appears to be quite different from ordinary linear regression models, it is still linear in the parameters (the betas, which are the unknowns that we are trying to estimate). In other words, all terms are a product of a beta coefficient and some function of the data, which are simply numerical values. In such cases, we can still apply least squares to estimate the regression coefficients. Curvilinear regression models are also often used in forecasting when the independent variable is time.

6.4.3 CLUSTERING (K-MEANS)

Cluster analysis (data segmentation) is a set of techniques that aim to group or categorise a collection of objects (i.e., observations or records) into clusters or subsets in such a way that those inside each cluster are more closely related than objects of different clusters. The objects inside a cluster should display high similarity, while objects of different clusters stay dissimilar. Cluster analysis reduces data overhead, since it can take a large number of observations, such as questionnaires or customer surveys, and reduce the information into smaller, easier to interpret, similar groups. The segmentation of customers into smaller groups, for example, can be used to customise advertising or promotions. As opposed to many other data-mining techniques, cluster analysis is primarily descriptive, and we cannot draw statistical inferences about a sample using it. In addition, the clusters identified are not unique and depend on the specific procedure used; therefore, it does not result in a definitive answer but only provides new ways of looking at data. Nevertheless, it is a widely used technique. There are two major methods of clustering: hierarchical clustering and k-means clustering.

In hierarchical clustering, the data is not divided into a specific cluster in one step. Instead, a series of partitions takes place, running from a single cluster covering all n objects to n clusters, each having a lone object. Hierarchical clustering is further divided into agglomerative clustering methods, which proceed by a series of fusions of the n objects into groups, and divisive clustering methods, which separate the n objects successively into finer groupings. Figure 6.5 shows the concept of agglomerative and divisive clustering methods:

[Figure: hierarchical clustering illustrated in two directions - agglomerative (AGNES), merging the n objects upward into fewer clusters, and divisive, splitting downward into finer ones]

Figure 6.5: Concept of Agglomerative Grouping Methods and Divisive Clustering Methods

Source: http://hbanaszak.mjr.uw.edu.pl/TempTxt/ClusterAnalysis/Hierarchical%20Clustering-Introduction.htm

K-means is one of the simplest, most intuitive learning algorithms that can solve the well-known clustering problem. The procedure offers a simple and easy way to categorise a dataset into a fixed number of clusters (say, k clusters). The primary idea is to define k centroids, one for each cluster.

Clustering is a process of partitioning data points into smaller groups. For example, the items in a supermarket are grouped into categories (cheese, butter and milk are dairy products). Naturally, this is a qualitative partitioning. A quantitative approach would be to measure certain product features, such as milk percentage, and group products with a high milk percentage together. In general, we have n data points x_i, i = 1...n, to be partitioned into k clusters. The aim is to allocate a cluster to each data point. K-means is a clustering method whose purpose is to find the positions µ_i, i = 1...k, of the clusters that minimise the distance of the data points from the cluster. K-means clustering solves:

argmin_c Σ_{i=1..k} Σ_{x ∈ c_i} d(x, µ_i) = argmin_c Σ_{i=1..k} Σ_{x ∈ c_i} ||x − µ_i||²

where c_i = the set of points belonging to cluster i.

K-means clustering uses the squared Euclidean distance d(x, µ_i) = ||x − µ_i||². This problem is in fact NP-hard, so the K-means algorithm aims at the global minimum but may get trapped in a different (local) solution.

The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)

As opposed to defining groups before studying the data, clustering enables you to discover and analyse the groups that have formed naturally. Each centroid of a group is a collection of the feature values which characterise the resulting group. Evaluating the centroid feature weights can be used to qualitatively interpret what sort of group each cluster represents. These centroids ought to be placed skilfully, as different locations cause different outcomes. Therefore, the better choice is to place them as far from each other as reasonably possible. The next step is to take each point belonging to the dataset and associate it with the nearest centroid. When no point is pending, the initial step and an early grouping is finished. Now we have to re-compute k new centroids as barycenters of the groups resulting from the previous step. After we have these k new centroids, a new binding must be done between the same data points and the closest new centroid. A loop is thus produced, through which we may see that the k centroids change their location step by step until no more changes occur. Simply put, the centroids do not move any more. Lastly, this algorithm focuses on minimising the objective function, which in this case is a squared error function.

BUSINESS USES

The K-means clustering algorithm is employed to discover groups which have not been explicitly labelled in the data. This can be used to confirm business assumptions about the group types that exist or to recognise unknown groups in complex datasets. Once the algorithm has been executed and the groups characterised, any new data can be effortlessly allotted to the right group.

This is a flexible algorithm that can be utilised for grouping. A few types of use cases are:
□ Behavioural segmentation:
♦ Segment purchase history and activities on an application, website, or platform
♦ Define interest-based roles
♦ Profiling based on activity monitoring
□ Inventory categorisation:
♦ Group inventory by sales activity and manufacturing metrics
□ Sorting sensor measurements:
♦ Detect activity in motion sensors
♦ Group images and separate audio
♦ Identify health monitoring groups
□ Detecting bots or irregularities:
♦ Separating valid activity groups from bots
♦ Grouping valid activity to clean up outlier detection

The K-means clustering algorithm uses iterative refinement to yield a final result. The inputs to the algorithm are the number of clusters K and the dataset, which is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly selected or generated from the dataset.

Example: Consider the following data set containing the scores of seven individuals on two variables:

Subject  A    B
1        1.0  1.0
2        1.5  2.0
3        3.0  4.0
4        5.0  7.0
5        3.5  5.0
6        4.5  5.0
7        3.5  4.5

This data set is to be clustered into two groups. Let the A and B values of the two individuals farthest apart (using the Euclidean distance calculation) define the initial cluster means:

         Individual  Mean Vector (Centroid)
Group 1  1           (1.0, 1.0)
Group 2  4           (5.0, 7.0)

The remaining individuals are now inspected serially and assigned to their closest cluster, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added. This leads to the following steps:

Step  Cluster 1 Individuals  Cluster 1 Mean Vector (Centroid)  Cluster 2 Individuals  Cluster 2 Mean Vector (Centroid)
1     1                      (1.0, 1.0)                        4                      (5.0, 7.0)
2     1, 2                   (1.2, 1.5)                        4                      (5.0, 7.0)
3     1, 2, 3                (1.8, 2.3)                        4                      (5.0, 7.0)
4     1, 2, 3                (1.8, 2.3)                        4, 5                   (4.2, 6.0)
5     1, 2, 3                (1.8, 2.3)                        4, 5, 6                (4.3, 5.7)
6     1, 2, 3                (1.8, 2.3)                        4, 5, 6, 7             (4.1, 5.4)

Now the initial partition has changed, and the two clusters currently have the following features:

           Individuals  Mean Vector (Centroid)
Cluster 1  1, 2, 3      (1.8, 2.3)
Cluster 2  4, 5, 6, 7   (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster:

Individual  Distance to mean (centroid) of Cluster 1  Distance to mean (centroid) of Cluster 2
1           1.5                                       5.4
2           0.4                                       4.3
3           2.1                                       1.8
4           5.7                                       1.8
5           3.2                                       0.7
6           3.8                                       0.6
7           2.8                                       1.1

Person 3 is closer to the mean of the opposite cluster (Cluster 2) than to that of its own cluster (Cluster 1). Put simply, each person's distance to its own cluster mean should be lower than the distance to the other cluster's mean, which is not the case for person 3. Thus, person 3 is moved to Cluster 2, resulting in a new partition:

           Individuals     Mean Vector (Centroid)
Cluster 1  1, 2            (1.3, 1.5)
Cluster 2  3, 4, 5, 6, 7   (3.9, 5.1)

The iterative relocation would continue from this new partition until no more relocations occur. However, in this example, each person is now nearer to its own cluster mean than to the other cluster's, so the iteration stops, and the latest partitioning is taken as the final cluster solution.
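As a sketch, the same partition can be reproduced with scikit-learn's KMeans, assuming scikit-learn is installed; note that cluster label order may vary between runs.

# A minimal sketch reproducing the final partition above with
# scikit-learn's KMeans; label order may vary.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                   [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # e.g. [0 0 1 1 1 1 1] -> {1,2} and {3,4,5,6,7}
print(km.cluster_centers_)  # ~ (1.25, 1.5) and (3.9, 5.1)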

6.4.4 ARTIFICIAL NEURAL NETWORKS

Neural systems were designed according to the cognitive processes of the brain. They can anticipate new observations from existing ones. A neural system comprises interconnected processing elements, also called units, nodes (hubs), or neurons. The neurons within the system cooperate, in parallel, to create an output function. Since the calculation is performed by the neurons collectively, a neural system can deliver the output function even if some of the individual neurons are malfunctioning (the system is robust and fault tolerant).

As a rule, every neuron within a neural system has an associated activation number. Additionally, every connection between neurons has a weight associated with it. These quantities imitate their counterparts in the biological brain: the firing rate of a neuron and the strength of a synapse. The activation of a neuron depends upon the activation of the other neurons and the weights of the edges that are connected to it. The neurons within a neural system are typically arranged in layers. The number of layers within the neural system, and the number of neurons within each layer, normally matches the nature of the phenomenon being examined. After the size has been resolved, the system is generally subjected to training. Here, the system receives sample training input with its associated classes. It then applies an iterative procedure to the input to adjust the weights of the system so that its future forecasts are ideal. After the training stage,
the training stage,
the system is prepared to perform predictions on new groups of data. Neural systems can frequently deliver extremely accurate predictions. However, one of their most prominent criticisms is the fact that they represent a black-box approach to research, as they do not divulge information about the fundamental nature of the phenomena.

Neural frameworks can be used to predict time-series data, for instance, climate data. A neural framework can be designed to recognise patterns in the data and produce an output free of noise.

As a complex algorithm, the neural system is naturally inspired by the structure of the human brain. A neural system provides a very basic model compared to the human brain.

Generally utilised for data classification, neural systems process past and current information to estimate future values, finding any complex relationships hidden in the data, in a manner closely resembling that of the human brain.

Figure 6.6 demonstrates the neural-network structure of the algorithm and its three layers:

[Figure: past data flows through an input layer (training data), hidden layers (prediction functions) and an output layer (predictions) to produce a predictor.]

Figure 6.6: The Neural-network Structure of the Algorithm and its Three Layers

Source: http://ecee.colorado.edu/~ecen4831/lectures/MLPnet.gif

The input layer feeds past data values into the next (hidden) layer. The black circles denote the nodes of the neural system. The hidden layer stores several complex functions that create predictors; those functions are hidden from the user. The arrangement of nodes (black circles) at the hidden layer represents mathematical functions, called neurons, that transform the input data. The output layer gathers the predictions from the hidden layer and delivers the outcome.

Here's a more rigorous look at how a neural system can deliver a predicted output from the input information. The hidden layer is the key part of a neural system on account of the neurons it contains; they work in coordination to do the significant calculations and create the output. Every neuron takes a group of input values; each is associated with a weight (more about that in a minute) and a numerical value called a bias. The output of every neuron is a function of the weighted sum of the inputs plus the bias.

Most neural systems use mathematical functions to activate the neurons. A function in math is a connection between an input set and an output set, with each input corresponding to an output. (For example, consider the negation function, where a whole number is the input and the outcome is its negative equivalent.) Basically, a function in math works like a black box: it takes an input and produces an output.

Neurons in a neural system can utilise sigmoid functions to map inputs to outputs. A sigmoid function used in this manner is known as a logistic function, and its equation resembles the following:

f(input) = 1 / (1 + e^(-input))

Here f refers to the activation function which activates the neuron, and e denotes a mathematical constant with an approximate value of 2.718. Sigmoid functions are used in neurons because these functions have positive derivatives and are easy to compute. Moreover, they are continuous, bounded functions that can act as a type of smoothing. This combination of characteristics unique to sigmoid functions is important for the workings of a neural network algorithm, mainly when a derivative calculation (such as for the weight related to each input to a neuron) is required. Neural networks achieve high accuracy irrespective of significant amounts of noise in the data. This is a major advantage, as the hidden layer can still determine associations in the data despite the presence of noise.
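The sketch below computes a single neuron's activation with this logistic function; the weights, bias and inputs are illustrative values, not trained ones.

# A minimal sketch of one neuron: weighted sum of inputs plus bias,
# passed through the sigmoid; all numbers are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs  = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8,  0.1, 0.4])
bias = -0.5

print(sigmoid(np.dot(weights, inputs) + bias))  # activation in (0, 1)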

SELF ASSESSMENT QUESTIONS

10. Classification is a predictive analytical strategy aimed at forecasting the data. (True/False)
11. __________ modelling is the process of developing analytic models to describe the relationship between metrics that drive business performance.
12. A simple linear regression analysis differs from a nonlinear analysis in that nonlinear (or curvilinear) regression is used more to predict outcomes when one of the independent variables happens to be time. (True/False)
13. A regression model involving a single independent variable is called __________.
14. K-means clustering is similar to neural networks, the only difference being the approach and method involved in devising the solution. (True/False)

ACTIVITY
A consumer products company has collected some data relating to
the advertising expenditure and sales of one of its products:
Advertising cost Sales
$300 $7000
$350 $9000
$400 $10000
$450 $10600
Figure out the model that would best depict the above data in
the least number of steps.

SUMMARY

□ Predictive modelling is the method of making, testing and authenticating a model to best predict the likelihood of a conclusion.
□ Predictive analysis and models are characteristically used to predict future probabilities.
□ Predictive models are representations of the relationship between how a member of a sample performs and some of the known characteristics of the sample.
□ Predictive analytics methods depend on quantifiable variables and controlling metrics to forecast future performance or outputs.
□ Logic driven models are created on the basis of inferences and postulations provided by the sample space and existing conditions.
□ A data-driven model is based on the data analysis of a specific system.
□ Regression analysis and forecasting models help us to predict relationships or future values of variables of interest.
□ Association rules are intended to discover broad association patterns among data in large databases.
□ Regression analysis is an instrument for creating statistical and mathematical models that define relations between a dependent variable (which should be a ratio variable, not categorical) and one or more descriptive or independent numerical (ratio or categorical) variables.

KEYWORDS

□ Association rules: These rules are used to discover broad association patterns in large databases.
□ Cause-and-effect modelling: It is the process of developing analytic models to describe the relationship between metrics that drive business performance.
□ Descriptive analytics: In this type of analytics, analysts help to identify patterns in data with the help of data-mining tools.
□ Logic driven models: These are created on the basis of inferences and postulations provided by the sample space and existing conditions.
□ Predictive models: These are used to analyse historical facts and current data to better comprehend customer habits, partners and products and to classify possible risks and prospects for a company.
□ Simple linear regression: It involves finding a linear relationship between one independent variable and one dependent variable.

DESCRIPTIVE QUESTIONS

1. Explain the concept of predictive modelling.
2. What are logic driven models? Discuss with appropriate examples.
3. Describe the concept of data mining. Enlist its four stages.
4. Discuss the differences between classification and prediction.
5. Explain some approaches in data mining.
6. Describe the concept of regression analysis.

ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Predictive Modelling: 1. True  2. Predictive modelling  3. True  4. (R × F × M)/D
Introduction to Data Mining: 5. True  6. d. All of these  7. True  8. Modeling  9. False
Data Mining Methodologies: 10. False  11. Cause-and-effect  12. True  13. Simple  14. False

HINTS FOR DESCRIPTIVE QUESTIONS

1. Predictive modelling is the method of making, testing and authenticating a model to best predict the likelihood of a conclusion. Refer to Section 6.2 Predictive Modelling.
2. Logic driven models are created on the basis of inferences and postulations which the sample space and existing conditions provide. Refer to Section 6.2 Predictive Modelling.
3. Data mining is a growing business analytics field focused on better understanding the features and patterns among variables in huge databases using a variety of analytical and statistical tools. Refer to Section 6.3 Introduction to Data Mining.
4. Classification is the most essential type of data analysis. Prediction resembles classification, aside from that we are attempting to foresee the value of a numerical variable. Refer to Section 6.3 Introduction to Data Mining.
5. Some common approaches in data mining include data exploration and reduction, association and cause-and-effect modelling. Refer to Section 6.4 Data Mining Methodologies.
6. Regression analysis is an instrument for creating statistical and mathematical models that define relations between a dependent variable (which should be a ratio variable, not categorical) and one or more descriptive or independent numerical (ratio or categorical) variables. Refer to Section 6.4 Data Mining Methodologies.

SUGGESTED READINGS & REFERENCES

SUGGESTED READINGS
□ Bari, A., Chaouchi, M., & Jung, T. (2014). Predictive Analytics for Dummies. Hoboken, NJ: John Wiley & Sons, Inc.
□ Finlay, S. (2014). Predictive Analytics, Data Mining and Big Data: Myths, Misconceptions and Methods. Basingstoke: Palgrave Macmillan.
□ Larose, D. T., & Larose, C. D. (2015). Data Mining and Predictive Analytics. Wiley.

E-REFERENCES
□ Predictive analytics. (2017, May 09). Retrieved May 16, 2017, from https://en.wikipedia.org/wiki/Predictive_analytics
□ What is predictive analytics? - Definition from WhatIs.com. (n.d.). Retrieved May 16, 2017, from http://searchbusinessanalytics.techtarget.com/definition/predictive-analytics
□ Impact, I. P., & World, P. A. (n.d.). Predictive Analytics World. Retrieved May 16, 2017, from http://www.predictiveanalyticsworld.com/predictive_analytics.php
CONTENTS

7.1 Introduction
7.2 Overview of Prescriptive Analytics
7.2.1 Prescriptive Analytics brings a lot of Input into the Mix
7.2.2 Prescriptive Analytics Comes of Age
7.2.3 How Prescriptive Analytics Functions
7.2.4 Commercial Operations and Viability
7.2.5 Research and Innovation
7.2.6 Business Development
7.2.7 Consumer Excellence
7.2.8 Corporate Accounts
7.2.9 Supply Chain
7.2.10 Governance, Risk and Compliance
Self Assessment Questions
Activity
7.3 Introduction to Prescriptive Modeling
7.3.1 The Waterfall Model
7.3.2 Incremental Process Model
7.3.3 Rapid Application Development (RAD) Model
Self Assessment Questions
Activity
7.4 Non-linear Optimisation
Self Assessment Questions
Activity
7.5 Summary
7.6 Descriptive Questions
7.7 Answers and Hints
7.8 Suggested Readings & References
INTRODUCTORY CASELET

CREDIT CARD COMPANY USING PRESCRIPTIVE ANALYTICS TO SERVE ITS CUSTOMERS IN A BETTER WAY

This caselet illustrates the use of prescriptive analytics in our day-to-day life. The following incident happened to a person named Bill, whose credit card company started offering electronic coupons from retailers which could be downloaded to the customer's card. The respective customers would then automatically receive discounts from the retailer whenever a purchase was made using the card. Although not a regular eater of fast food items, Bill added a fast food coupon to his card so that, if he were ever running late for office, he could purchase a quick meal and save time and money. With that, he signed off from his account and forgot about the entire process.

After many weeks had passed, one day Bill's cell phone received a notification while he was driving the car. Upon opening the notification, Bill was surprised to see an alert message from the fast food vendor, which notified him that there was a restaurant near the place where he was currently travelling where he could use his coupon. Though initially shocked, Bill had always heard about the existence of this cutting-edge technology, but had not known that one day he might benefit from it. This technology is a sheer example of what retailers can do in the future by combining the geo-location ability of the phone with any other information they have acquired from their customers.

Bill was excited to be rewarded positively for sharing his data with the credit card company. Although a bit uncomfortable initially with the possibility of sharing all his details with the credit card company, Bill is now very pleased that the company is using innovative methods like prescriptive analytics to serve its customers better.

In this caselet, you can see how the credit card company uses prescriptive analytics to link customers with their requirements. Once an individual's information is shared with the company, it can use various mathematical modeling and statistical methods to find actionable insights, which can in turn be used to help customers get better results.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:


- Describe the meaning of prescriptive analytics
- Explain the prescriptive modeling
- Discuss non-linear optimisation

7.1 INTRODUCTION
After studying the predictive and descriptive analytics steps of the business analytics process in the previous chapters, one should be in a good position to take the final step, i.e., prescriptive analytics. This analysis will provide a prediction or a forecast of what future trends in the business may look like.

For example, there can be significant statistical measures of higher or lower sales; profitability trends accurately measured in dollars for new market prospects; or measured cost savings from a future joint venture. In the event that an organisation knows where the future lies by foreseeing the patterns, it can best arrange to exploit conceivable plans that the patterns may offer. The third step of the business analytics process is prescriptive analytics, which involves the application of decision science, operations research methodologies and management science to make optimal utilisation of available resources.

Prescriptive analytics methods and techniques are mathematically based algorithms designed to take variables and other parameters into a quantitative framework and generate an optimal or real-time solution for complex problems. Such methods can be utilised to ideally distribute a company's limited assets to take the best advantage of the opportunities it has found in the anticipated future patterns. The limitations on human and financial assets prevent organisations from pursuing every opportunity. Utilising prescriptive analytics allows an organisation to allocate limited assets to accomplish goals as ideally as possible. Prescriptive analytics is simply a computerised method for applying calculation and interpretation and providing valuable insights from various data sources.

By the end of this chapter, readers will understand how various classes of analytics (predictive and descriptive) can lead to prescriptive analysis. This chapter will first discuss the meaning of prescriptive analytics. Next, the chapter discusses prescriptive modeling. In the end, the chapter discusses non-linear optimisation.

7.2 OVERVIEW OF PRESCRIPTIVE ANALYTICS
Prescriptive analysis answers 'What should we do?' on the basis of complex data obtained from descriptive and predictive analyses. By using optimisation techniques, prescriptive analytics determines the best alternative to minimise or maximise some objective in finance, marketing and many other areas. For example, if we have to find the best way of shipping goods from a factory to a destination so as to minimise costs, we will use prescriptive analytics. Figure 7.1 shows a diagrammatic representation of the stages involved in prescriptive analytics:

Figure 7.1: Prescriptive Analytics
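To make the shipping example concrete, the following is a minimal sketch of how such a decision can be computed in Python. The factories, destinations, costs, supplies and demands are all made-up illustrative numbers, and SciPy's linear programming routine is used for this simple linear variant of the problem:

from scipy.optimize import linprog

# Shipping cost per unit from each factory to each destination
# (made-up numbers; variables are flattened row by row).
costs = [4, 6, 9,    # factory 1 -> destinations 1, 2, 3
         5, 3, 7]    # factory 2 -> destinations 1, 2, 3

# Each factory cannot ship more than its supply.
A_ub = [[1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1]]
b_ub = [80, 70]                      # factory capacities

# Each destination must receive exactly its demand.
A_eq = [[1, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1]]
b_eq = [40, 50, 30]                  # destination demands

res = linprog(costs, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
print(res.x.reshape(2, 3))           # optimal shipment plan
print(res.fun)                       # minimum total shipping cost

The solver simply picks the cheapest feasible routing; later in this chapter, the same idea is extended to nonlinear objectives, where revenue or cost curves bend rather than stay straight.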

Data, which is available in abundance, can be streamlined for growth and expansion in technology as well as business. When data is analysed successfully, it can become the answer to one of the most important questions: how can businesses acquire more customers and gain business insight? The key to this problem lies in being able to source, link, understand and analyse data.

All companies need to address their data challenges to support their decision-making capabilities, or they risk falling behind in this highly competitive landscape. Today, businesses are collecting, storing, analysing and interpreting more data as compared to previous years, and this trend continues to gain momentum at a remarkable rate. According to many leading professors and researchers, this is the era of a Big Data revolution. In any case, it is not the sheer amount of information that is revolutionary; rather, the revolution lies in what organisations can now do with that data.

Since a lot has been written on Big Data, we will focus on analytics, which will help companies transform the finance function by offering forward-looking insights, help them devise a solution appropriate for the optimal course of action, and improve the ability to communicate and collaborate with other companies at a lower cost of ownership. These transformative characteristics will lead to performance improvements across business sectors.

Prescriptive analytics goes beyond predictions, workforce optimisations and decision options. It is usually used to analyse huge, complex data to forecast outcomes, offer decision options and show alternative business impacts. This method also consists of many scientific and mathematical methods used for understanding how alternative learning investments impact the bottom line. Moreover, this analytics can also help enterprises take decisions on how to take advantage of a future scenario or reduce a future risk, and represent the implication of each decision option.

In real life, prescriptive analytics can automatically and continuously process new data to improve forecast accuracy and offer better decision options. For instance, prescriptive analytics can be utilised to benefit strategic planning in healthcare. By utilising data analytics, one can harness operational information, including population demographic patterns, financial information and population health patterns, to plan more precisely and invest future capital in, for example, equipment usage and new facilities.

7.2.1 PRESCRIPTIVE ANALYTICS BRINGS A LOT OF INPUT INTO THE MIX

Prescriptive analytics tools articulate improvements in business outcomes by combining business rules, historical data, variables, mathematical models, constraints and machine-learning algorithms. Prescriptive analytics, much the same as predictive analytics, is especially utilised in circumstances where there are too many variables, choices, constraints and data points to effectively assess without help from technology. Experimenting in the real world in such scenarios would be overly risky, expensive or time-consuming.

Sophisticated analytical models and simulations can be run with well-known and randomised factors to suggest next steps, show any if/then situations and gain a superior understanding of the scope of conceivable results.

Some examples of business processes where we can apply prescriptive analytics include pricing, operational resource allocation, inventory management, supply chain optimisation, production planning, utility management, sales lead assignment, transportation and distribution planning, marketing mix optimisation and financial planning. For instance, in the airline ticket pricing framework, prescriptive analytics is utilised to gain insight into complex demand levels, travel patterns and booking timings to attract more potential travelers, with prices computed to improve profits without demoralising sales. Another noticeable case study for this analysis is the utilisation of prescriptive analytics at UPS for enhancing package delivery routes. Prescriptive analytics applications have been in operation for a long while.

7.2.2 PRESCRIPTIVE ANALYTICS COMES OF AGE


Prescriptive analytics is an absolute necessity for any company to execute key marketing strategies. It highlights ideal choices and the effect of those choices, bringing about a strategic and plainly defined path ahead. The uplifting news for organisations looking for an upper hand is that the stunning amount of information now accessible is what really powers prescriptive analytics.

Prescriptive analytics takes subjective choices in the target region, utilising the abundance of information to structure the basic decision-making process. The approach dissects potential choices, interactions amongst choices and influences on these decisions. The approach then uses this data to help chart the best activities/choices. It is conceivable in view of advancements in processing speed and the subsequent development of complex scientific algorithms applied to varied data sets (big data).

7.2.3 HOW PRESCRIPTIVE ANALYTICS FUNCTIONS


Utilising prescriptive analytics is a complex and time-consuming process that investigates all viewpoints to support the decision-making process, including:
□ Identifying and analysing every single potential choice
□ Defining potential connections and associations between these choices
□ Identifying variables that could affect each of these choices (positively or negatively)

Prescriptive analytics processes each of these viewpoints and maps out one or more potential results for each of the choices, bringing about a customised model. Elements feeding the model, including information volume and quality, could affect the exactness of the model (as they would in descriptive and predictive analytics).

Prescriptive analytics utilises procedures like optimisation, game theory, simulation and decision-analysis techniques. A procedure as opposed to a defined event, prescriptive analytics can constantly and automatically process new information to enhance predictive precision and give better decision choices.

7.2.4 COMMERCIAL OPERATIONS AND VIABILITY


Enhanced operations are an essential utilisation of prescriptive analytics. Most organisations have concentrated intensely on finding the correct cost levels and operating model to adequately empower them to develop. Prescriptive analytics adds another dimension to operational and business effectiveness by giving directors a chance to foresee what structures, messages and targets will yield ideal outcomes given the organisation's unique parameters, and then decide which path will give the biggest returns. There are numerous other business applications of prescriptive analytics, such as:
□ Optimising spend and return on investment (ROI) through exact customer profiling
□ Providing important data for brand planning and go-to-market procedures
□ Maximising campaign productivity, sales force alignment and promotional activities
□ Predicting and proactively overseeing market events
□ Providing significant data for territory analysis, customer sales and medical data

A one-size-fits-all business model is no longer reasonable; the future of a competitive sales model is focused on customised messaging.

7.2.5 RESEARCH AND INNOVATION


Research and innovation are frequently guessing games; however, prescriptive analytics can be a noteworthy differentiator for any organisation engaged in R&D exercises in a competitive industry, including:
□ Demonstrating, anticipating and enhancing results from item utility
□ Understanding sickness (or other zones of interest) patterns/movement
□ Establishing ideal trial conditions through focused patient cohorts
□ Increasing customer adherence to the item and diminishing non-compliance
□ Understanding necessities for customised drugs and other advancements
□ Determining and setting up focused items and interventions
□ Determining and setting up ideal trial conditions through focused patient cohorts

7.2.6 BUSINESS DEVELOPMENT


Understanding what new items are required, what differentiating components will make one item sell better than another, or which markets are demanding which items are key zones for prescriptive analytics, including:
□ Identifying and settling on choices about circumstances/emerging areas of unmet need
□ Predicting the potential advantage
□ Proactively following industry trends and actualising techniques to get an advantage
□ Exploiting data analytics to distinguish particular buyer populations and regions that ought to be focused on
□ Leveraging data analytics to distinguish key advancements for item improvement that will produce the biggest return for the investment
□ Identifying likely purchasers to cut business development costs altogether; what-if scenarios for items, markets and purchasers could be an unmistakable differentiator for developing organisations

7.2.7 CONSUMER EXCELLENCE


Understanding buyer needs and having the capacity to tailor offerings (items or services) are basic factors in business development. Prescriptive analytics can be utilised to improve consumer excellence in a huge number of ways, including:
□ Predicting what purchasers will need and settling on key choices that address those necessities
□ Segmenting purchasers and recognising and focusing custom-fitted messages on them
□ Staying on top of the competition and making decisions (e.g., marketing, branding) about items that will prompt more desirable items and higher sales

7.2.8 CORPORATE ACCOUNTS


Corporate account functions can immensely use prescriptive analytics to improve their capacity to make choices that help drive internal excellence and external strategy:
□ Internal excellence
♦ Viability and direction for non-item related activities: what choices ought to be made and what is the effect
♦ Viability and direction for item related activities: what choices ought to be made and what is the effect
□ External-facing key direction
♦ Utilising important data to demonstrate item value and establish market pricing
♦ Utilising analytics to establish a targeted coupon strategy
♦ Recognising ideal price point alternatives and the effect of those choices on the income model for the item
♦ Better understanding the whole price cycle from list price to reimbursement (counting all rebates and refunds) to inform the ideal pricing system
♦ Utilising important competitor data to establish pricing and get market access

7.2.9 SUPPLY CHAIN


Prescriptive analytics can likewise furnish supply chain functions with an upper hand through the capacity to predict and make decisions in a few basic areas, including:
□ Forecasting future demand and pricing (e.g., supplies, material, fuel and other components affecting cost) to guarantee proper supply
□ Utilising prescriptive analytics to inform stock levels, schedule plants, route trucks and other components in the supply chain cycle
□ Mitigating supplier risk by mining unstructured information alongside transactional information
□ Better understanding historical demand patterns and product flow through supply chain channels, anticipating future patterns and making choices on future state procedures

7.2.10 GOVERNANCE, RISK AND COMPLIANCE


Governance, risk and compliance are functions of increasing importance across almost every industry. Prescriptive analytics can help organisations achieve compliance through the ability to anticipate forthcoming risks and make proper mitigation decisions. Utilisation of prescriptive analytics in the region of governance, risk and compliance includes:
□ Improving internal review effectiveness
□ Informing third-party arrangement and management
□ Classifying patterns associated with unreasonable spend (e.g., total spend in the case of pharma)
□ Applying well-learned compliance controls

SELF ASSESSMENT QUESTIONS

1. Prescriptive analytics can be utilised in improving services of the healthcare industry. (True/False)
2. Prescriptive analytics takes _______ choices in the target region, utilising the abundance of information to structure the basic _______ process.
3. _______ analytics can be a noteworthy differentiator for any organisation occupied with R&D exercises in a competitive industry.
4. Prescriptive analytics can help associations to remain consistent in anticipating upcoming dangers and settling on proper mitigation choices. (True/False)

ACTIVITY

Assign a group of students the task of collecting information about the money spent by the residents of a town to keep their area pollution-free and clean. All the data needs to be collected and documented cleanly in a spreadsheet. The students need to find out the probable amount of money the residents would spend for the same purpose in a future time frame.

7.3 INTRODUCTION TO PRESCRIPTIVE MODELING
Prescriptive analytics methods do not just concentrate on Why, How, When and What; they also prescribe acceptable courses of action for taking advantage of the situation. Prescriptive analytics has every now and then proved itself as a benchmark for an organisation's analytics development. Segments of prescriptive analytics are:
a. Evaluate and choose better ways to deal with work
b. Target business goals and conform to all restrictions

Prescriptive models guide everybody precisely and have a tendency to be substantial. These models require a great deal of documentation and are costly. Prescriptive methodologies are basically "project insurance". Prescriptive decision models help leaders recognise the best arrangement. There are three kinds of prescriptive process models in business. They are:
□ The Waterfall Model

□ Incremental Process Model
□ RAD Model

7.3.1 THE WATERFALL MODEL

The waterfall model is also called the 'Linear sequential model' or 'Classic life cycle model'. In this model, each stage is completely finished before the start of the following stage. This model is utilised for small projects. In this model, feedback is taken after each stage to ensure that the project is on the correct path. The testing phase begins only after the development gets finished.

The advantages of using the waterfall model are as follows:
□ The waterfall model is easy to implement and simple to use.
□ It avoids overlapping of phases.
□ This model works for small projects, as the prerequisites are understood extremely well.
□ This model is preferred for those projects where quality is more important when compared with the cost of the project.

The disadvantages of using the waterfall model are as follows:
□ This model is not suitable for complex and object-oriented projects.
□ The issues with this model are not uncovered until the product testing stage.
□ The amount of risk is quite high.

7.3.2 INCREMENTAL PROCESS MODEL

The incremental model is an evolutionary model in which a product is implemented and tested incrementally. The process sequence used is build, implement, integrate and test. Successive builds continue until the product is complete. The product remains in operation mode and the model provides stepwise development. It retains the discipline introduced by the waterfall model at each build. The model can be used at all stages of the life cycle.

The advantages of using the incremental model are as follows:
□ This model delivers items quicker and is cost-effective.
□ Testing and debugging are easier in this model.
□ It generates software rapidly and early during the product life cycle.
□ The handling of risk is easier, as risky items can be identified and managed during each iteration.

The disadvantages of using the incremental model are as follows:
□ The cost of the final item may exceed the cost estimated in the beginning.
□ This model requires good design and planning.
□ This model requires a precise definition of the whole system before it gets broken down and built incrementally.
□ The cost involved in this model is more than that of the waterfall model.
□ The requests of clients for extra functionalities after each increment cause an issue in the framework design.

7.3.3 RAPID APPLICATION DEVELOPMENT (RAD) MODEL

RAD is a Rapid Application Development model, which is based upon prototyping and iterative development without any specific planning. It emphasises gathering the requirements of customers with the help of workshops or focus groups. The RAD model comprises the following stages:
□ Business modeling: It describes the flow of information among business functions and is modeled in a manner that answers the following questions:
♦ What information is generated?
♦ Who generates the information?
♦ Where does the information flow?
♦ Who processes the information?
□ Data modeling: It describes the flow of information defined as part of the business modeling phase, refined into a group of data objects required for supporting the business. The characteristics of each object are ascertained and the relationships between these objects are defined.
□ Process modeling: The data objects described in the data modeling phase are transformed to achieve the flow of information required to implement a business function. Processing descriptions are built to add, modify, delete, or retrieve a data object.
□ Application generation: It assumes the use of 4GT (Fourth Generation Techniques), which comprise a wide range of software tools. Each tool allows the software engineer to specify software at a high level. Automated tools are used for facilitating the construction of software.
□ Testing and turnover: It helps in reducing the overall testing time, as many of the program components have already been tested. New components must be tested and all interfaces must be fully exercised.

SELF ASSESSMENT QUESTIONS


5. Segments of prescriptive analytics are:
a. Evaluate and choose better ways to deal with work
b. _______
6. Identify the disadvantage of the waterfall model from the following statements:
a. The waterfall model is easy.
b. It avoids overlapping of each phase.
c. This model works for small projects.
d. It is a poor model for long activities.
7. RAD is a Rapid _______ Development model.
8. The waterfall model is also called the _______.

ACTIVITY
Create a datasheet related to the allocation of budgets for constructing a building in your locality. Use prescriptive analytics to find out the annual budget allocation for the maintenance of the building over the next 5 years.

7.4 NON-LINEAR OPTIMISATION
You already know that there are numerous mathematically programmed, nonlinear techniques and methodologies intended to produce optimal business performance solutions. The greater part of them require careful estimation of parameters that may or may not be exact, especially given the accuracy required of a solution that can be so delicately dependent upon parameter precision. This accuracy is further confounded in business analytics by the vast data records that ought to be figured into the model-building effort. To overcome these impediments and be more comprehensive in the utilisation of substantial data, regression software can be used. Curve-fitting software can be utilised to create predictive analytical models that can likewise be used to help in making prescriptive analytical decisions.

Prescriptive analysis gives exact choices on the action plan for future achievement. One of the conspicuous utilisations of prescriptive analytics in advertising is the optimisation problem of marketing spend allocation. The business issue is to work out the ideal amount of spending that should be allocated from the aggregate advertising budget to each of the promotional media, like TV, press, web video and so forth, to maximise the income. The spend optimisation problem is solved through either Linear or Nonlinear Programming (NLP), which depends upon whether:
□ The objective function is linear/nonlinear
□ The feasible region is defined by linear/nonlinear constraints
Nonetheless, in the real world, TV ad data, as plotted in Figure 7.2, challenges such an assumption, as the diagram demonstrates a concave function. The constraint that might be considered for such an optimisation problem is the maximum amount that ought to be spent on a specific medium; beyond that point, any further expenditure may still prompt an expansion in income, but at a diminishing rate. Hence, it is essential to discover the diminishing point of return for each of the promotional media. Figure 7.2 demonstrates the income generated against the cost incurred for a TV commercial; both the cost and the income, stated in present-dollar values, are in thousands:

[Plot: Diminishing Point of Return for TV Advertising; revenue from TV ad plotted against TV advertising cost (both in thousands), with a fitted polynomial trend line forming a concave function; R² = 0.9871]

Figure 7.2: Diminishing Point of Return for TV Advertisement


Source: https://www.blueoceanmi.com/blueblog/application-derivatives-nonlinear-programming-prescriptive-analytics

The curve that best fits the plotted revenue and cost of TV promotion is cubic and is plotted in Figure 7.2. The R-square achieved through the cubic equation is an astounding 98.7%. The first and second order derivatives of the cubic equation are computed as follows:

Polynomial Equation: y = -3E-06x³ + 0.016x² + 1.601x + 177.52 ... (Equation 1)
First Derivative of Equation 1: dy/dx = -9.00E-06x² + 0.032x + 1.601 ... (Equation 2)
Second Order Derivative of Equation 1: d²y/dx² = -1.80E-05x + 0.032 ... (Equation 3)

The inflection point is recognised where the second derivative changes from positive to negative. Subsequently, numerically, it is the point where the second derivative is 0.

In this manner, solving Equation 3, the cost at the diminishing point of return is 1777.78, and the corresponding income at the diminishing point of return, after plugging the value of x into Equation 1, is 36735.68. For further reading on the utilisation of second derivatives in nonlinear optimisation, you may refer to the Newton-Raphson algorithm and conjugate direction algorithms.
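The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that plugs the coefficients of Equation 1 (as read off the fitted curve, so only as exact as the fit itself) into Equation 3 and solves for the inflection point:

# Coefficients of the fitted cubic revenue curve (Equation 1):
# y = a*x**3 + b*x**2 + c*x + d
a, b, c, d = -3e-06, 0.016, 1.601, 177.52

# Equation 3 (second derivative) is 6*a*x + 2*b; the diminishing
# point of return is where it equals zero.
x_star = -2 * b / (6 * a)                              # = 1777.78
revenue = a * x_star**3 + b * x_star**2 + c * x_star + d

print(f"Diminishing point of return: {x_star:.2f}")   # 1777.78
print(f"Revenue at that point: {revenue:.2f}")        # about 36735.68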
Partial derivatives are the other noticeable utilisation of calculus in optimisation problems. A partial derivative of a function with several variables is computed when a specific variable's derivative is taken while keeping the other variables constant. A standout amongst the most general uses of partial derivatives is the least squares model, where the goal is to discover the best-fitting line by minimising the distance of the line from the data points.

This is accomplished by setting the first order partial derivatives with respect to the intercept and the slope equal to zero. The second order partial derivative is utilised as a part of an optimisation problem to make sense of whether a given critical point is a relative maximum, a relative minimum, or a saddle point.
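As a brief worked illustration of this least squares idea (the data points here are made up), setting both first order partial derivatives of the squared-error sum to zero yields the so-called normal equations, which can be solved directly:

# Least squares by hand: set the partial derivatives of the squared
# error with respect to intercept b0 and slope b1 to zero, then solve.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations (from d/db0 = 0 and d/db1 = 0):
#   n*b0  + sx*b1  = sy
#   sx*b0 + sxx*b1 = sxy
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
print(f"best-fit line: y = {b0:.3f} + {b1:.3f}x")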

The methodologies discussed earlier show how calculus can be incorporated with nonlinear programming while delivering an enhanced solution. In a similar way, Lagrangean-based procedures can likewise be incorporated with Mixed Integer Non-Linear Programming (MINLP) to provide the marketing budget optimisation solution. At present, data researchers cannot bet on a solitary strategy to provide the analytics solution. The genuine test is to make sense of how numerous techniques can be inventively combined to give an answer as interesting as the business issue.
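To tie this back to the budget allocation problem described earlier, the following is a minimal nonlinear programming sketch using SciPy's general-purpose solver. The responsiveness factors, the total budget and the logarithmic diminishing-returns revenue curves are all illustrative assumptions, not the specific techniques named above:

import numpy as np
from scipy.optimize import minimize

# Hypothetical diminishing-returns revenue per medium:
# revenue_i(spend) = a_i * log(1 + spend); a_i are made-up factors.
a = np.array([60.0, 45.0, 30.0])       # TV, press, web video
budget = 1000.0                        # total marketing budget

def neg_total_revenue(x):
    return -np.sum(a * np.log1p(x))    # minimise negative = maximise

constraints = {"type": "eq", "fun": lambda x: np.sum(x) - budget}
bounds = [(0, budget)] * 3
x0 = np.full(3, budget / 3)            # start from an even split

res = minimize(neg_total_revenue, x0, bounds=bounds,
               constraints=constraints, method="SLSQP")
print("Optimal allocation:", np.round(res.x, 2))
print("Expected revenue:", round(-res.fun, 2))

A quick Lagrangean check confirms the pattern: at the optimum, each medium's marginal revenue is equal, which is exactly the diminishing-point-of-return logic described earlier in this section.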

SELF ASSESSMENT QUESTIONS

9. Curve fitting programming can be utilised to create _______ analytical models.
10. NLP stands for
a. Nonlinear Programming
b. New Language for programming
c. New linear programming
d. None of these
11. The _______ point is recognised where the second derivative changes from positive to negative.
12. The full form of MINLP is _______.

ACTIVITY
Create some teams in your class, each having four students, and visit the nearest truck dealer. Use the non-linear optimisation method to calculate how to minimise the cost of transport as the dealer's trucks ship goods to a large network of markets or stores.

7.5 SUMMARY
□ By using optimisation techniques, prescriptive analytics determines the best alternative to minimise or maximise some objective in finance, marketing and many other areas.
□ Data, which is available in abundance, can be streamlined for growth and expansion in technology as well as business.
□ In real life, prescriptive analytics can automatically and continuously process new data to improve forecast accuracy and offer better decision options.
□ Prescriptive analytics is an absolute necessity for any company to execute key marketing strategies.
□ Corporate account functions can immensely use prescriptive analytics to improve their capacity to make choices that help drive internal excellence and external strategy.
□ Prescriptive analytics can likewise furnish supply chain functions with an upper hand through the capacity to predict and make decisions in a few basic areas.
□ RAD is a Rapid Application Development model. Using the RAD model, a software product is produced in a brief timeframe.
□ The inflection point is recognised where the second derivative changes from positive to negative.

KEYWORDS
□ Analytics: It refers to the discovery, interpretation and communication of meaningful patterns in data.
□ Descriptive analytics: It is a preliminary stage of data processing that creates a summary of historical data to yield useful information and possibly prepare the data for further analysis.
□ Prescriptive analytics: It is the area of business analytics (BA) dedicated to finding the best course of action for a given situation.
□ Predictive modelling: It is a process that uses data mining and probability to forecast outcomes.
□ Waterfall model: The model in which each stage is completely finished before the start of the following stage.

7.6 DESCRIPTIVE QUESTIONS
1. Explain the concept of prescriptive analytics along with its functions.
2. What do you understand by prescriptive modeling? Discuss the three kinds of prescriptive process models in business.
3. Describe non-linear optimisation in analytics.
4. Discuss the importance of prescriptive analytics in commercial operations, research and innovation, business development and consumer excellence.

7.7 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Overview of Prescriptive Analytics
1. True
2. subjective, decision-making
3. prescriptive
4. True

Introduction to Prescriptive Modeling
5. b. Target business goals and conform all restrictions
6. d. It is a poor model for long activities
7. Application
8. Linear sequential model

Non-linear Optimisation
9. predictive
10. a. Nonlinear Programming
11. inflection
12. Mixed Integer Non-Linear Programming

HINTS FOR DESCRIPTIVE QUESTIONS


1. Prescriptive analytics goes beyond predictions, workforce optimisation and decision options. Refer to Section 7.2 Overview of Prescriptive Analytics.
2. Prescriptive analytics frequently serves as a benchmark for an organisation's analytics development. Refer to Section 7.3 Introduction to Prescriptive Modeling.
3. The spend optimisation problem is solved through either Linear or Nonlinear Programming (NLP). Refer to Section 7.4 Non-linear Optimisation.


4. Enhanced operations are an essential utilisation of prescriptive analytics. Refer to Section 7.2 Overview of Prescriptive Analytics.

7.8 SUGGESTED READINGS & REFERENCES


SUGGESTED READINGS
□ Liebowitz, J. (2014). Business Analytics: An Introduction. Boca Raton: CRC Press.
□ Williams, S. (2016). Business Intelligence Strategy and Big Data Analytics: A General Management Perspective. Cambridge, MA: Morgan Kaufmann.
□ Bruce, P. C. (2015). Introductory Statistics and Analytics: A Resampling Perspective. Hoboken, NJ: Wiley.

E-REFERENCES
□ Chattopadhyay, T. (2016, August 23). Application of Derivatives to Nonlinear Programming for Prescriptive Analytics. Retrieved May 02, 2017, from https://www.blueoceanmi.com/blueblog/application-derivatives-nonlinear-programming-prescriptive-analytics/
□ Underwood, J. (n.d.). Beginning Prescriptive Analytics with Optimization Modeling - BeyeNETWORK. Retrieved May 02, 2017, from https://www.b-eye-network.com/view/17152
□ Prescriptive Analytics. (n.d.). Retrieved May 02, 2017, from https://www.mathworks.com/discovery/prescriptive-analytics.html
CONTENTS

8.1 Introduction
8.2 Social Media Analytics
Self Assessment Questions
Activity
8.3 Key Elements of Social Media
Self Assessment Questions
Activity
8.4 Overview of Text Mining
8.4.1 Understanding Text Mining Process
8.4.2 Sentiment Analysis
Self Assessment Questions
Activity
8.5 Performing Social Media Analytics and Opinion Mining on Tweets
Self Assessment Questions
Activity
8.6 Online Social Media Analysis
Self Assessment Questions
Activity
8.7 Mobile Analytics
8.7.1 Define Mobile Analytics
8.7.2 Mobile Analytics and WebAnalytics
8.7.3 Types of Results from Mobile Analytics
8.7.4 Types of Applications for Mobile Analytics
Self Assessment Questions
Activity
8.8 Mobile Analytics Tools
8.8.1 Location-based Tracking Tools
8.8.2 Real-Time Analytics Tools
8.8.3 User Behavior Tracking Tools

Self Assessment Questions


Activity
8.9 Performing Mobile Analytics
8.9.1 Data Collection Through Mobile Device
8.9.2 Data Collection on Server
Self Assessment Questions
Activity
8.10 Challenges of Mobile Analytics
Self Assessment Questions
Activity
8.11 Summary
8.12 Descriptive Questions
8.13 Answers and Hints
8.14 Suggested Readings & References
INTRODUCTORY CASELET

TRACKING CUSTOMER SENTIMENT THROUGH WIPRO'S SOCIAL MEDIA ANALYTICS (SMA)

Wipro Ltd. is a famous Information Technology, Consulting and Outsourcing company that provides business solutions to its client companies to do better business. One of Wipro's clients, providing paid applications for the Entertainment and Media industry to its customers, had recently launched its application services. The company needed a way to know its customers' feedback, issues, demands and their overall experience with this new launch. It was struggling to improve promotional activities to engage its customers and was also taking significant time to respond to and resolve customer issues.

The company took the help of Wipro to reduce the response time in resolving customer issues by 65% by tracking customer experience through social media analytics. This analytics, using sentiment analysis, finds business insights into the client's strategies related to marketing and its customer relationships. Sentiment analysis also helps in improving promotional activities and engaging customers' attention with improved services.
Wipro provided its Social Media Analytics (SMA) solution, which accurately understands customers' core sentiments and translates them into key business insights. This SMA solution has been built over time by considering customers' feelings about products and services, collected as social media data from Twitter, Facebook, blogs, forums, etc. The solution is based on Naïve Bayesian and Association Mining techniques, and can also handle data having noise. The SMA solution also allows reports containing data based on generated insights to be sent weekly/fortnightly to clients.

Some main features of the SMA solution are as follows:

□ Taxonomy generation: It helps in categorising the social media data into various categories, like functionality, issues, network, environment, competition and content.
□ Insights generation: Insights are generated on the basis of competitor trends, geography/demography/topic-based sentiment, product launches, product/service performance, etc.
□ Data collection: Data is collected from social media in real time, on the basis of preset rules and configurations defined by the client, with the help of a social listening tool.


□ Data preparation by categorisation: Data is prepared on the basis of different categories, like recent trending topics or customer sentiments about the product.
□ Text Analytics Engine: Prepared data is fed into an in-house built Text Analytics Engine that transforms social media data into a structured format, which can be easily analysed quantitatively.

The business impact of using Wipro's Social Media Analytics solution was that promotional activities for customers regarding improved services increased through buzz analysis, launch analysis and campaign analysis. The insights generated from the SMA solution through these analyses helped in resource channelisation and market expansion on the basis of customers' sentiments. Moreover, the SMA solution also helped in identifying key influencers on social media by performing Social Node Network analysis.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:


- Explain the concept of social media
- Describe the key elements of social media
- Explain the concept of text mining
- Understand the text mining process
- Describe the sentiment analysis
- Explain how to perform social media analytics and opinion mining on tweets
- Describe the concept of mobile analytics
- Describe the mobile analytics tools
- Explain how to perform mobile analytics
- Describe the challenges of mobile analytics

8.1 INTRODUCTION
In a world where information is readily available via the Internet at the click of a button, organisations need to remain abreast of ongoing events and the latest happenings in order to gain a competitive edge over business markets. Apart from that, organisations also need to interact with their consumers more effectively in order to gain an insight about ongoing business trends and the market position of particular products. Social media provides an opportunity to business organisations and individuals to connect and interact with each other worldwide. With the evolution of social media as a tool to connect with existing and potential customers, business organisations have begun to recognise the requirement of employing social media analytics for gaining crucial business insights and taking timely decisions.

This chapter discusses the role of social media and the importance of conducting social media analytics in business organisations. These analyses help organisations to evaluate feedback from consumers and gauge their current and future position in the market. Further, you will learn about text mining and sentiment analysis. The chapter ends with a presentation on how to perform social media analytics and opinion mining on tweets.

8.2 SOCIAL MEDIA ANALYTICS


Simply put, social media refers to a computer-mediated, interactive,
and Internet-based platform that allows people to create, distribute,
and share a wide range of content and information, such as text and
images. Social media technologies unleash and leverage the power of
social networks to enable interaction and information exchange.
Jesse Farmer, cofounder of Dev Bootcamp, describes a social network as a collection of people bound together via a specific set of social relations. Social media, in turn, denotes a group of Internet-based applications built over the foundations of Web 2.0 that support the creation and exchange of user-generated content. In other words, social media relies on Web-based technologies to generate interactive platforms where people and organisations can create, co-create, recreate, share, discuss, and modify user-generated content.
Prior to the advent of social media as an open-system approach to exchange content effectively, business organisations and public relations practitioners rarely focused on business dynamics to manage brand images. With the changing business environment due to the evolution of social media, business organisations also adopted the open-system approach based on reciprocal feedback. This, in turn, has completely transformed the way information is communicated and the manner in which public relations are developed. The new approach encourages active participation in the development and distribution of information by merging innovative technologies and sociology. Social media provides a collaborative environment which can be employed for:
□ Building relationships
□ Distributing content
□ Rating products and services
□ Engaging target audience

Social media provides an equally open platform for novices as well as experts to express and share their viewpoints and feedback on various events and issues. This information can, in turn, be employed by business organisations to gain insights about customers' perspectives on their products and services. In this manner, social media enables business organisations to receive feedback and promote a dialog between customers, potential customers, and the organisation. In other words, social media allows business organisations to promote participation, conversation, sharing, and publishing of content. Social media, however, can take different forms, which can be categorised as follows:
□ Social networking websites: These provide a Web-based platform to users where they can create a personalised profile, summarising and showcasing their interests, define other members as connections or contacts, and communicate and share content with their contacts. Examples of social networking websites include Facebook, LinkedIn, MySpace, Hi5, and Bebo.
□ Blogs: Short for 'Web logs', blogs represent online journals to showcase content organised in reverse chronological order. Examples of blogging sites include Blogger, WordPress, and Tumblr.
□ Microblogs: These allow people to share and showcase small posts and are suitable for quick sharing of content in a few lines of text or an individual photo or video. Twitter is a well-known microblogging website.
□ Content communities and media sharing sites: These allow users to organise and share different types of media content, such as videos and images. The members can also comment on the shared content. Examples include YouTube, Pinterest, Flickr, and Instagram.
□ Wiki: It represents a collective website in which the members can create and modify content in a community-based database. In other words, the users can modify the content of any hosted page and can also create new pages in the website based on the wiki technology. One of the most popular examples of wiki websites is Wikipedia, which is an online encyclopedia.
□ Social bookmarking websites: These websites allow users to organise and manage tags and links to other websites. Well-known examples include Reddit, StumbleUpon, and Digg.

Apart from the listed ones, social media may include websites that showcase reviews and ratings, such as Yelp; forums and discussion boards, such as Yahoo!; and websites that showcase virtual social worlds creating a virtual environment where people can interact, such as SecondLife. Figure 8.1 depicts the forms of conversations possible via social media:

Figure 8.1: Possible Forms of Conversation via Social Media



Social media analytics is the practice of collecting data from social media websites or blogs and then analysing the data to take crucial business decisions. Generally, the data obtained from social media is mined to identify customer sentiments and opinions regarding particular products and services. Such an analysis helps organisations to enhance their products and services, improve marketing strategies, provide better customer services, reduce costs, and gain a competitive edge in the market.

SELF ASSESSMENT QUESTIONS

1. _______ websites allow users to organise and manage tags and links to other websites.
2. WordPress is an example of a _______ site while Twitter is an example of a _______ site.

ACTIVITY

Search and prepare a report on the Social Media Analytics Cycle.

8.3 KEY ELEMENTS OF SOCIAL MEDIA


Incorporating social media into everyday sales and marketing rou
tines of an organisation is not easy and requires gaining a command
over certain set of tactics and tools related to the efficient manage
ment and utilisation of social media. In order to effectively leverage
the possibilities provided by social media for the growth of business,
the organisations need to focus on certain key elements of social me
dia along with the corresponding techniques.

Social media participation involves a focus on the following key elements:
□ Collect: In order to effectively incorporate social media, business organisations first need to understand how to collect and leverage useful information and market artifacts. This involves critical and careful analysis of the information coming from various sources, such as customers, competitors, journalists, and other market influencers. Various tools, such as feed readers, blog subscriptions, and email newsletters, can be employed to collect information from various sources.
□ Curate: Once the information is collected from various sources, the next step is to effectively curate the important information to be sent to clients and internal stakeholders. This involves intelligent filtering and aggregation of the information collected from various resources. This not only provides an effective insight to the customers but also helps in having a clear vision about current industry standards and market trends. Various curation tools, such as Newsle, LinkedIn, and RSS readers, can be employed for this task.
□ Create: After the collection and curation of information, organisations need to create valuable content objects that can provide a focus and industry buzz to them. This is an effective marketing strategy to create a leader position in the industry. This can be accomplished by employing various publishing programs and sharing routines.
□ Share: A key element for implementing effective social media is sharing of information. This involves sharing your content, information, and ideas with others, which helps in expanding the social media network. Various tools, such as Feedly and Hootsuite, help in sharing information and content over the social media.
□ Engage: The basic idea behind social media is to engage existing and prospective customers. The tools and routines of social media and the regular practice of listening, curation, and sharing help executives and sales personnel of an organisation in engaging more and more customers, stakeholders, prospective customers, journalists, and industry influencers. Tools such as Salesforce help to connect people from different categories. Apart from that, various mobile apps also help in expanding the realm of reaching more and more people.

SELF ASSESSMENT QUESTIONS

3. Which of the following elements involves intelligent filtering and aggregation of the information collected from various resources?
a. Collect
b. Curate
c. Create
d. Share
4. The Feedly and Hootsuite tools help in _______ information and content over the social media.

ACTIVITY
Enlist and discuss the elements of social media marketing strategy
in your class.

8.4 OVERVIEW OF TEXT MINING


We all know that social networks are a rich source of information. A lot of valuable content can be extracted and analysed from this information to serve the knowledge requirements of various business organisations, political parties, scientific research departments, social science fraternities, and other interested domains. Social networks generally support the exchange of information and data in various formats, such as text, videos, and photos. However, the most common form of information and content exchange on social networking sites is text.

Online marketers and business analysts examine and interpret online content using social media analytics. This analysis helps them to amend and mould their business objectives as per customer behavior. For example, the reviews posted by customers on websites or social marketing media in the form of text or rating scores enable organisations to understand and analyse customers' perspectives and expectations.

The insight obtained from such reviews can help organisations to identify their key areas of improvement and enhance their performance. However, certain tools and methodologies are required to read, interpret, and analyse the large number of reviews received on a daily basis. This is accomplished by text mining.

It is pretty difficult for any database administrator, marketing professional, or researcher to explore and extract the desired information from the huge amount of data and information generated and exchanged online on a daily basis. The problem is multiplied manifold by the text-based social networking communications and documents exchanged during business operations. Although keyword searching provides some scope to search for the desired information, it cannot always relate to the exact terms in the document.

Text mining or text analytics comes as a handy tool to quantitatively examine the text generated by social media and filter it in the form of different clusters, patterns, and trends. In other words, text mining represents the set of tools, techniques, and methods applied for automatically processing natural language textual data provided in huge amounts in the form of computer files. The extracted and structured content and themes are used for rapid analysis, identification of hidden data and information, and automatic decision making. Text mining tools are often based on the principles of information retrieval and natural language processing.

Complex linkage structures make text mining in social networks a challenging job, requiring the help of automated tools and sorting techniques. A number of text mining tools and algorithms have been developed to enable easy extraction of information from different textual resources. The recent developments in statistical and data processing tools have added to the evolution in the domain of text mining.

Text mining employs concepts obtained from various fields, ranging from linguistics and statistics to Information and Communication Technologies (ICT). Statistical pattern learning is applied to create patterns from the extracted text, which are further examined to obtain valuable information. The overall process of text mining comprises retrieval of information, lexical analysis, creation and recognition of patterns, tagging, extraction of information, application of data mining techniques, and predictive analytics. This can be summarised as follows:

Text mining = Lexicometry + Data mining

NOTE
Lexicometry or lexical statistics refers to the study of identifying
the frequency of occurrence of words in textual data.
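As a quick illustration of lexicometry, word frequencies can be counted in a few lines of Python (the review text here is invented for the example):

from collections import Counter
import re

# Minimal lexicometry sketch: frequency of occurrence of words.
text = "Great product. The product works great and the battery lasts."
words = re.findall(r"[a-z']+", text.lower())
print(Counter(words).most_common(3))   # three most frequent words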

The process is initiated with the retrieval of information, which involves the collection and identification of information from a set of textual material. The information can come from various sources, such as websites, databases, documents, or a content management system. The textual information is processed by parsers and other linguistic analysis tools to examine and recognise textual features, such as people, organisations, names of places, stock ticker symbols, and abbreviations.

Figure 8.2 depicts the text mining process:

Figure 8.2: Text Mining Process

NOTE

The process of analysing a string of symbols, either in natural or computer language, on the basis of formal grammar rules is termed parsing or syntactic analysis.

On the basis of certain identified patterns, other quantities such as entities, emails, and telephone numbers are identified. Further, sentiment analysis is applied to identify the underlying attitude. Finally, the psychological profiling is determined by conducting quantitative text analysis.
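As a minimal illustration of sentiment analysis on short texts, the sketch below uses NLTK's VADER lexicon. It assumes NLTK is installed and the vader_lexicon resource has been downloaded once, and the sample reviews are invented:

# Requires: pip install nltk, then nltk.download('vader_lexicon') once.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for review in ["The new app is fantastic and easy to use!",
               "Support never replied. Very disappointed."]:
    scores = sia.polarity_scores(review)  # neg/neu/pos plus a compound
    print(scores["compound"], review)     # compound lies in [-1, 1]

VADER is tuned for short, social-media-style English, which makes it a convenient first tool for the kind of review and tweet data discussed in this chapter.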

The overall purpose of text mining analytics is to transform unstructured text into valuable structured data, which can be further analysed and applied in various domains, such as research, investigation, exploratory data analysis, biomedical applications, and business intelligence.

Statistical analysis tools, such as R and word count, aid in the assessment of the overall review. Further, positive and negative relationships can be explored using various plotting techniques, such as the scatter plot. Apart from the listed application areas, text mining techniques can further be applied for the analysis of demographics, financial status, and buying tendencies of customers.

To sum up, text mining can be applied in the following areas:
□ Competitive intelligence: In order to succeed, business organisations need to know not only the key players in the industry but also the strengths and weaknesses of their competitors. Text mining provides factual data to organisations that can be applied for strategic decision making.
□ Community leveraging: Text mining facilitates the identification and extraction of the information embedded in community interaction. This information can be applied for amending marketing strategies.
□ Law enforcement: Text mining can be applied in the domain of government intelligence for countering anti-terrorist activities.
□ Life sciences: Text mining can also be effectively applied in the area of research and development of drugs. Bioinformatics companies, such as PubGene, are applying biomedical text mining combined with network visualisation as an Internet service.

8.4.1 UNDERSTANDING TEXT MINING PROCESS


The enormous amount of unstructured data collected from social media makes text mining a very challenging process. The key steps for any text mining process can be summed up as follows:
1. Extracting the keyword: Any text analysis process begins by
identification of relevant and precise keyword(s) that can be
applied for specific queries. Next, the content and the linkage
patterns are considered for applying keyword searches as the
content related to similar keywords is often linked. The selected
keywords act as social network nodes and play an important
role while clustering the text.
2. Classifying and clustering the text: Various algorithms are applied for classifying text from the source content. For this process, the nodes are associated with labels prior to classification. After that, the classified text is clustered on the basis of similarity. The classification and clustering of the text are greatly influenced by the linkage structure of the data. Accurate results can be obtained by applying node labeling and content-based classification techniques.
3. Identifying patterns: Trend analysis applies the principle that
even for the same content, the clusters collected at different
nodes can have different concept distributions. For this reason,
the concepts at various nodes are compared and classified
accordingly in the same or different subcollections.

Obtaining the desired results for a specific query involves careful processing of the relevant documents. For effective text mining, several stages of processing need to be applied to a document, such as:
□ Text preprocessing: This involves the identification of all the unique words in a document. Non-informative words, such as the, and, or, and when, are filtered out from the document text before applying word stemming. Word stemming refers to the process of reducing inflected or derived words to their stem base. For example, words such as cat, cats, catlike, and catty will all be mapped to the same stem base 'cat'. Terms such as stemmers or stemming algorithms are used interchangeably for stemming programs. Affix stemmers trim down both suffixes and prefixes, such as ed, ly, and ing, from a given word. Popular stemmers include the Brute Force algorithm and the Suffix Stripping algorithm. (A short preprocessing sketch in R is given after this list.)
□ Document representation: A document is basically represented in terms of its words and terms.
□ Document retrieval: This involves the retrieval of a document based on some query. Accurate results are ensured using text indexing and accuracy measures. Text indexing and searching capabilities can be incorporated in an application using Lucene, which is a Java library.
□ Document clustering: This involves the grouping of conceptually related documents to ensure fast retrieval. A term for a given query can be searched faster from well-clustered documents.
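The following is a minimal sketch of the text preprocessing stage in R, assuming the tm and SnowballC packages are installed (the sample documents are invented for illustration):

library(tm)
library(SnowballC)
docs <- c("Cats are catlike.", "The cat sleeps when it is catty.")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
# Filter out non-informative stop words such as 'the', 'and', and 'when'
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Word stemming: e.g., 'cats' is reduced to the stem 'cat'
corpus <- tm_map(corpus, stemDocument)
inspect(corpus[[1]])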

Document clustering can be implemented using the following techniques (a small hierarchical clustering sketch in R follows this list):
□ Hierarchical clustering
□ One-pass clustering
□ Buckshot clustering
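The following is a minimal sketch of hierarchical document clustering in R, assuming the tm package is installed (the toy documents are invented for illustration):

library(tm)
docs <- c("big data analytics", "data mining and analytics",
          "mobile phone review", "phone camera review")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
d <- dist(as.matrix(dtm))            # pairwise distances between documents
hc <- hclust(d, method = "ward.D2")  # build the cluster hierarchy
cutree(hc, k = 2)                    # two clusters: analytics vs. review documents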
Once clustered, the documents are then organised into user-defined categories or taxonomies. Figure 8.3 depicts the stages of document processing:

[Figure 8.3 groups the stages of document processing: text preprocessing (identification of unique words, removal of the stop words, word stemming), document representation (vector space model, dimension reduction), document retrieval (text indexing, measure of accuracy), and document clustering (document similarity, document categorization).]

Figure 8.3: Stages of Document Processing in Text Mining

Both structured and unstructured data are involved in text mining. Unstructured data comes from reviews and summaries, while structured data is obtained from organised spreadsheets. Text mining tools identify themes, patterns, and insights hidden in structured as well as unstructured data. Various text mining software packages are employed by organisations for different data mining applications. The following are some commonly used text mining software:
□ R: Used for statistical data analysis, text processing, and sentiment analysis
□ ActivePoint: Applied for natural language processing and online catalog-based contextual search
□ Attensity: Used for extraction of facts, including who, what, where, and why, and then identifying people, places, and events and how they are related
□ Crossminder: Applied for cross-lingual text analytics
□ Compare Suite: Used for comparing texts by keywords and highlighting common and unique keywords
□ IBM SPSS Predictive Analytics Suite: Applied for data and text mining
□ Monarch: Applied for analysis and transformation of reports into live data
□ SAS Text Miner: Provides a rich suite of text processing and analysis tools
□ Textalyzer: Used for online text analysis
Apart from these, some other text mining tools include AeroText, Angoss, Autonomy, Clarabridge, IBM LanguageWare, IBM SPSS, WordStat, and Lexalytics.

Now, let's explore an important component of text mining, i.e., sentiment analysis.

8.4.2 SENTIMENT ANALYSIS


Sentiment analysis is one of the most important components of text mining. Also termed opinion mining, it involves careful analysis of people's opinions, sentiments, attitudes, appraisals, and evaluations. This is accomplished by examining large amounts of unstructured data obtained from the Internet on the basis of the positive, negative, or neutral view of the end user. Sentiment analysis involves the analysis of sentences such as the following:
□ Facts: Product A is better than product B.
□ Opinions: I don't like A. I think B is better in terms of durability.

Similar to Web analysis, specific queries are applied in sentiment analysis to retrieve and rank relevant content. However, sentiment analysis also differs from Web analysis in certain aspects. It is possible to determine from sentiment analysis whether the content expresses an opinion on the topic and also whether the opinion is positive or negative. Ranking in Web analysis is done on the basis of the frequency of keywords. On the other hand, ranking in sentiment analysis is done on the basis of the polarity of the attitude.
With the widespread use of Web 2.0 technologies, a huge volume of opinionated data is available on social media. People using social media post their reviews and comments about products used and also share their feedback, opinions, and experiences with others in their network. These reviews and feedback are utilised by organisations to improve and upgrade their products and services and enhance their brand equity. Sentiment analysis applies other domains, such as linguistics, digital technologies, text analysis tools, artificial intelligence, and Natural Language Processing (NLP), for the identification and extraction of useful information. This greatly influences various domains, ranging from politics and science to social science.
NOTE
Artificial intelligence is a technology and a branch of science that deals with the study and development of intelligent machines and software. Natural language processing is a domain of computer science, artificial intelligence, and linguistics that deals with the interactions between computers and human (natural) languages.
The most common application of sentiment analysis is in the field of consumer products and services. It also provides valuable information to competing organisations and candidates. Sentiment analysis can effectively track voters' expectations, perspectives, and feedback. Apart from that, sentiment analysis can be applied in automated scoring systems and rating applications to provide scores and ratings to public companies. An example of a rating application is Stock Sonar, which generates automatic ratings by analysing articles, blogs, and tweets.

The process of sentiment analysis begins by tagging words using Parts of Speech (POS), such as subject, verb phrase, verb, noun phrase, determiner, and preposition. Defined patterns are filtered to identify their sentiment orientation. For example, 'beautiful room' has an adjective followed by a noun. The adjective 'beautiful' indicates a positive perspective about the noun 'room'. At this stage, the emotional factor in the phrase is also examined and analysed. After that, an average sentiment orientation of all the phrases is computed and analysed to conclude whether a product is recommended by a user.

The following parameters may be applied to classify the given text in the process of sentiment analysis:
□ Polarity, which can be positive, negative, or neutral
□ Emotional states, which can be sad, angry, or happy
□ Subjectivity or objectivity
□ Features of key entities, like the screen size of a cell phone, the durability of furniture, and the lens quality of a camera
□ Scaling system or numeric values

Automated sentiment analysis is still evolving, as it is difficult to interpret the conditional phrases used by people to express their sentiments on social media. For example, consider 'if you don't like A, try B'. In this sentence, the user clearly shows his/her positivity towards B but doesn't indicate clear views about A. After removing 'if', the first clause clearly indicates negativity towards A.

However, sentiment analysis employs various online tools to effectively interpret consumer sentiments. Some of these online tools are listed as follows:
□ Topsy: It is used to measure the success of a website on Twitter. It tracks the occurrence of given and related keywords, the website name, and the website URL in tweets.
□ BackTweets: This tool is applied to improve the search engine ranking of a website. It tracks tweets that link back to a website.
□ Twitterfall: It locates tweets that are important for a website. It can be used to stay in touch with customers and consumers and respond to their queries and suggestions in real time.
□ TweetBeep: This is used to send timely updates or alerts for the topics of interest.
□ Reachli: Designed especially for Pinterest, a content-sharing website, this tool helps in tracking data and in scheduling and organising pins (which denote the updates in Pinterest) in advance.
Apart from these, some other sentiment analysis tools include Social Mention, AlertRank Sentiment Analysis, and Twitter Sentiment Analysis. Business organisations can apply specific tools as per their requirements and sentiment analysis needs.

SELF ASSESSMENT QUESTIONS

5. Social networks generally support the exchange of information and data in various formats, such as text, videos, and photos. (True/False)
6. Text mining tools are often based on the principles of __________ and __________ processes.

ACTIVITY
Search and prepare a report on the various applications of text
mining.

8.5 PERFORMING SOCIAL MEDIA ANALYTICS AND OPINION MINING ON TWEETS
In today's IT-driven world, social media has emerged as the most popular means of communicating and sharing views and information across the world. Some examples of social media are social networking websites, such as Facebook and Twitter. These websites act as global platforms that allow people to share their likes, dislikes, and opinions on various topics.

In this section, you will practice deriving useful information from the data obtained from social networking sites.

Every day, millions of people use social platforms to express their opinions about almost everything under the sun. This is why most organisations prefer to collect data from these websites to know the public opinion about their products and services. This information helps organisations to take timely and crucial business decisions. R is a statistical programming tool that implements modern statistical algorithms to perform various types of analytical activities.

This section takes you through the step-by-step process of gathering, segregating, and analysing text data using the R tool. A mobile phone manufacturing company hires a data analyst to review the opinions given by people on its products. This information will help the company to know about the current market trends and further enhance the quality of its products based on the insights. The data analyst decides to collect data from the tweets of people, and then examine it under three categories: positive, negative, and neutral. Here, you are going to help him download tweets and analyse them to derive valuable information. Before performing the social media analytics, you need to load some library utilities into the current R environment and verify the Twitter authentication information to work with the tweets.

Enter the following commands to load the required packages to work with online tweets:

install.packages("twitteR")
install.packages("bitops")
install.packages("digest")
install.packages("RCurl")
# If there is any error while installing RCurl, run the following
# command in a terminal:
# sudo apt-get install libcurl4-openssl-dev
install.packages("ROAuth")
install.packages("tm")
install.packages("stringr")
install.packages("plyr")
library(twitteR)
library(ROAuth)
library(RCurl)
library(plyr)
library(stringr)
library(tm)

If you are working on the Windows Operating System, you may face Secured Socket Layer (SSL) certificate issues.

You can avoid that by providing certificate authentication information in the options() function through the following command:

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

After loading the required R utilities and providing the SSL certificate authentication information, load the Twitter authentication information. This information will be used to download tweets later.

Enter the following commands to load the Twitter authentication information using your own Twitter credentials:
load("/Datasets/twitter_cred.RData")
registerTwitterOAuth(cred)

Figure 8.4 shows the use of Twitter credentials for Twitter authentication in R:

Figure 8.4: Using Twitter Credentials for Twitter Authentication in R

To analyse tweets, you first need to segregate and download them on the basis of some specific keywords.

NOTE

R provides automatic downloading of tweets by using the searchTwitter() function, which takes as its arguments the language in which the tweets need to be searched, the keyword (the term to be searched on the Internet), and the number of tweets that need to be extracted containing the keyword.

Now, enter the following command to download 1000 English-language tweets (specified by the lang="en" argument to the searchTwitter() function) containing the word "nokia":

input_tweets = searchTwitter("nokia", n=1000, lang="en")

We can take a few tweets from the list at a time to analyse different opinions, as shown by the following command:

input_tweets[1:3]
Figure 8.5 shows the commands along with their outputs:

[Figure 8.5 shows the R console output of the searchTwitter() call, including a warning that the API returned fewer tweets than the 1000 requested.]
Figure 8.5: Searching a Keyword in a Specified Number of Tweets

Some tweets containing the search word may be insignificant for our analysis. Therefore, we need to extract only the relevant texts from the tweets. Enter the following command to extract the text of the tweets as text strings:
tweet = sapply(input_tweets, function(x) x$getText())

The strings can be viewed as vectors by entering the following command:

tweet[1:4]

The next task is to segregate tweets on the basis of the nature of the feedback they provide. The feedback can be positive, negative, or neutral. In our case, we are using only positive and negative word lists. The function for sentiment analysis is given as follows:

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  scores = laply(sentences, function(sentence, pos.words, neg.words)
  {
    # Remove punctuation, control characters, and digits
    sentence = gsub("[[:punct:]]", "", sentence)
    sentence = gsub("[[:cntrl:]]", "", sentence)
    sentence = gsub('\\d+', '', sentence)
    # Convert to lower case, tolerating any conversion errors
    tryTolower = function(x)
    {
      y = NA
      try_error = tryCatch(tolower(x), error = function(e) e)
      if (!inherits(try_error, "error"))
        y = tolower(x)
      return(y)
    }
    sentence = sapply(sentence, tryTolower)
    # Split the sentence into individual words
    word.list = str_split(sentence, "\\s+")
    words = unlist(word.list)
    # Score = number of positive word matches minus number of negative ones
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(text = sentences, score = scores)
  return(scores.df)
}
After writing the preceding function, the files containing positive and negative words are loaded to run the sentiment function. Enter the following commands to load the data files containing positive and negative words, respectively:

pos = readLines("/Datasets/positive-words.txt")  # file containing positive words
neg = readLines("/Datasets/negative-words.txt")  # file containing negative words

Categorise each tweet as positive, negative, or neutral by using the following code:

scores = score.sentiment(tweet, pos, neg, .progress='text')
scores$very.pos = as.numeric(scores$score > 0)
scores$very.neg = as.numeric(scores$score < 0)
scores$very.neu = as.numeric(scores$score == 0)

Figure 8.6 shows the commands along with their outputs:

Figure 8.6: Assigning Sentiment Scores to Tweets

Enter the following commands to find out the number of positive, negative, and neutral tweets:

# Number of positive, negative, and neutral tweets
numpos = sum(scores$very.pos)
numneg = sum(scores$very.neg)
numneu = sum(scores$very.neu)

Now, aggregate the final results by using the following commands:

# Final results aggregation
s <- c(numpos, numneg, numneu)
lbls <- c("POSITIVE", "NEGATIVE", "NEUTRAL")
pct <- round(s/sum(s)*100)
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep="")
After the sentiments are categorised and the number of positive, negative, and neutral tweets is found out, plot the results by using the following command:

# Plot the results
pie(s, labels = lbls, col = rainbow(length(lbls)), main = "OPINION")

The pie chart for the analysed sentiment scores is shown in Figure 8.7:
Figure 8.7: Pie Chart for the Sentiment Score

SELF ASSESSMENT QUESTIONS

7. These websites act as global platforms that allow people to share their likes, dislikes, and opinions on various topics. (True/False)
8. R is a statistical programming tool that implements modern statistical __________ to perform various types of analytical activities.

ACTIVITY

Search and enlist various text mining packages available in R.

8.6 ONLINE SOCIAL MEDIA ANALYSIS


We can use online tools for the analysis of text generated from social media. One of the major online social media analysis tools is Social Mention. The analysis of text to find out positive, negative, and neutral views can also be performed online by performing the following steps (you will need an Internet connection to perform this):
1. Open the following link in your browser: http://socialmention.com/
The Social Mention website will appear, as shown in Figure 8.8:

Figure 8.8: Showing the Social Mention Website


2. Type the name of the product about which you need to gather information in the Search box and press the Search button, as shown in Figure 8.9:

Figure 8.9: Searching a Product for Online Text Analysis

A Web page appears with the sentiment score for the product (Sony, in our case) displayed on the left-hand side, as shown in Figure 8.10:

[Figure 8.10 shows the results page, with the sentiment score displayed on the left-hand side and recent mentions of the product on the right.]

Figure 8.10: Showing the Web Page with Sentiment Score for the Product
The extended view of the sentiment score, marked in Figure 8.10, is shown in Figure 8.11:

[Figure 8.11 lists the top users, top hashtags, and sources (such as photobucket, delicious, flickr, answers_wiki, and youtube) for the searched product.]

Figure 8.11: Showing the Sentiment Score for the Product and Related Information
You can get more information on the feedback of the product by clicking any link on the Web page shown in Figure 8.10. Apart from R, you can also use the Sentiment140 tool to analyse data on the basis of the feedback given by users. Sentiment140 uses Twitter's data for the analysis purpose. Figure 8.12 shows the Web page of the Sentiment140 text analysis tool:

Figure 8.12: Showing the Web Page of Sentiment140


Figure 8.13 shows the online analysis of views for Toshiba products:

Figure 8.13: Showing the Online Analysis for Toshiba Products Using the Sentiment140 Tool

SELF ASSESSMENT QUESTIONS

9. We cannot use online tools for the analysis of text generated from social media. (True/False)
10. Apart from R, you can also use the __________ tool to analyse data on the basis of the feedback given by users.

ACTIVITY

Search and find some tools that are used by organisations to analyse their social media competitors.

8.7 MOBILE ANALYTICS


We are now using fourth-generation (4G) wireless mobile technologies. When you look at the past, you will see that wireless mobile technologies have shown steady growth, evolving from 1G to 4G. With every major shift in technology, there has been a corresponding improvement in both the speed and efficiency of mobile devices.

First generation (lG) mobile devices provided only a "mobile voice",


but in second-generation (2G) devices, larger coverage and improved
digital quality were provided. Third-generation (3G) technology fo
cused on multimedia applications like videoconferencing through
mobile phones. 3G opened the gates for the mobile broadband, which
was seen in fourth-generation (4G) devices. 4G provides wide range
access, multiservice capacity, integration of all older mobile
technolo gies, and low bit cost to the use1:
Figure 8.14 shows the evolution of different generations of mobile technologies:

[Figure 8.14 shows a timeline of mobile generations: analog 1G (1980), 2G (1991, 14.4 kbps), 3G (2001, 2 Mbps), and 4G (2010), along with underlying technologies such as GSM and CDMA.]

Figure 8.14: Evolution of Mobile Technologies

Source: 3GPP Alliance, UMTS Forum, Informa Telecoms, Motorola, ZTE.

NOTE

The full forms of the terms used in Figure 8.14 are as follows:
□ GSM: Global System for Mobile Communications
□ CDMA: Code Division Multiple Access
□ GPRS: General Packet Radio Service
□ EDGE: Enhanced Data rates for GSM Evolution
□ WCDMA: Wideband Code Division Multiple Access
□ LTE: Long Term Evolution

"Forget what we have taken for granted on how consumers use the Internet," said Karsten Weide, research vice president, Media and Entertainment. "Soon, more users will access the Web using mobile devices than using PCs, and it's going to make the Internet a very different place."

According to the International Telecommunication Union (ITU), an agency of the United Nations (UN) responsible for information and communication technologies-related issues, the number of mobile users in 2008 was 548 million. This number increased to 6835 million in 2012. According to Informa, a research firm in the US, there were over 3.3 billion active cell phone subscriptions in 2007 all over the world. This effectively means that around half of the total population of the earth is using mobile devices.
Given the above statistics, it is imperative for organisations to find ways of analysing data related to mobile devices and use the data to market and sell their products and services through these devices. Mobile analytics is a tool that allows organisations to do this.

8.7.1 DEFINE MOBILE ANALYTICS


Marketers want to know what their customers want to see and do on
their mobile device so that they can target the customer.

Similar to the process of analytics used to study the behavior of users on the Web or social media, mobile analytics is the process of analysing the behavior of mobile users. The primary goal of mobile analytics is to understand the following:
□ New users: These are users who have just started using a mobile service. Users are identified by unique device IDs. The growth and popularity of a service greatly depend on the number of new users it is able to attract.
□ Active users: These are users who use mobile services at least once in a specified period. If the period is one day, for example, an active user will use the service several times during the day. The number of active users in any specific period of time shows the popularity of a service during that period.
□ Percentage of new users: This is the percentage of new users over the total active users of a mobile service. This figure is always less than 100%, but a very low value means that the particular service or app is not doing very well.
□ Sessions: When a user opens an app, it is counted as one session. In other words, the session starts with the launching of the app and finishes with the app's termination. Note that a session is not related to how long the app has been used by the user.
□ Average usage duration: This is the average duration for which a mobile user uses the service.
□ Accumulated users: This refers to the total number of users (old as well as new) who have used an app before a specific time.
□ Bounce rate: The bounce rate is calculated as a percentage (%); a small worked example follows this list. It can be calculated as follows:
♦ Bounce rate = (Number of terminated sessions on any specific page of an app / Total number of sessions of the app) * 100
♦ The bounce rate can be used by service providers to help them monitor and improve their service so that customers remain satisfied and do not leave the service.
□ User retention: After a certain period of time, the total number of new users still using an app is known as the user retention of that app.
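The following is a small worked example of the bounce-rate formula in R (the session counts are invented for illustration):

terminated_on_page <- 150   # sessions that ended on one specific page of the app
total_sessions <- 2000      # all sessions of the app
bounce_rate <- terminated_on_page / total_sessions * 100
bounce_rate                 # 7.5, i.e., a bounce rate of 7.5%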
Commercially, mobile analytics can be defined as the study of data that is collected to achieve the following purposes:
□ Track sales: Mobile analytics can track the sale of products.
□ Analyse screen flow: Mobile analytics can track how and where a user touches the screen. This information can be used to make interactive GUIs and also to decide the placement of mobile ads.
□ Keep customers engaged: Mobile analytics studies the behavior of users or customers and displays ads and other screens to keep them engaged.
□ Analyse the preferences of visitors: On the basis of the user's touches, taps, and other behavior on the screen, mobile analytics can analyse their preferences.
□ Convert potential buyers into buyers: According to users' likes and dislikes, mobile analytics offers different products and services to them. The purpose of this exercise is to convert a visitor into a buyer.
□ Analyse m-commerce activities of visitors: Mobile analytics can analyse the m-commerce activities of visitors and find out a lot of useful information, such as a user's frequency of making a purchase and the amount he is willing to spend. Mobile commerce (or m-commerce) refers to the delivery of electronic commerce capabilities directly into the consumer's hand anywhere, via wireless technology.
□ Track Web links that users visit on their mobile phones: Mobile analytics can be used to analyse the Web links visited by users and know their preferences.

8.7.2 MOBILE ANALYTICS AND WEB ANALYTICS


Mobile analytics has several similarities with Web and social analytics; for example, both can analyse the behavior of the user with regard to an application and send this information to the service provider. However, there are also several important differences between Web analytics and mobile analytics.

Some of the main differences between Web analytics and mobile analytics are as follows:
D Analytics segmentation: Mobile analytics works on the basis of
location of the mobile devices. For example, suppose a company
is offering cab service in a city like NewYork. In this case, the
compa ny can use mobile analytics to identify the target people
travelling in New York. Mobile analytics works for location-based
segments, while Web analytics works globally.
• • I'

N O T E S

□ Complexity of code: Mobile analytics requires more complex code


and programming languages to implement than Web analytics,
which is easier to code.
□ Network service providers: Mobile analytics is totally dependent
on Network Service Providers (NSPs), while Web analytics is inde
pendent of this factor.
□ Measure: Sometimes, it is difficult to measure information from
the mobile analytics apps because they can run offline. Web ana
lytics always runs online. So, we can easily measure vital
informa tion with it.
□ Tools: To do the ultimate analysis on data, we require some
other tools of Web analytics with mobile analytics tools. Web
analytics, on the other hand, does not require any other tool for
analysis.

8.7.3 TYPES OF RESULTS FROM MOBILE ANALYTICS


The study of consumer behavior helps business firms and other organisations to improve their marketing strategies. Nowadays, every organisation is making extra efforts to understand and know the behavior of its consumers.

Mobile analytics provides an effective way of measuring large amounts of mobile data for organisations. It also shows how well marketing tools, such as ads, are converting potential buyers into actual purchasers. It also offers deep insight into what makes people buy a product or service and what makes them quit a service.

The technologies behind mobile analytics, like the Global Positioning System (GPS), are more sophisticated than those used in Web analytics; hence, compared to Web analytics, users can be tracked and targeted more accurately with mobile analytics.

Mobile analytics can easily and effectively collect data from various data sources and manipulate it into useful information. Mobile analytics keeps track of the following information:
□ Total time spent: This shows the total time spent by the user with an application.
□ Visitors' location: This shows the location of the user using any particular application.
□ Number of total visitors: This is the total number of users using any particular application, which is useful in knowing the application's popularity.
□ Click paths of the visitors: Mobile analytics keeps track of the activities of a user visiting the pages of any application.
□ Pages viewed by the visitor: Mobile analytics tracks the pages of any application visited by the user, which again reflects the popular sections of the application.
□ Downloading choices of users: Mobile analytics keeps track of files downloaded by the user. This helps app owners to understand the type of data users like to download.
□ Type of mobile device and network used: Mobile analytics tracks the type of mobile device and network used by the user. This information helps mobile service providers and mobile phone sellers understand the popularity of mobile devices and networks and make further improvements as required.
□ Screen resolution of the mobile phone used: Any information or content that appears on mobile devices is rendered according to the screen size of these devices. This important aspect of ensuring that the content fits a particular device screen is handled through mobile analytics.
□ Performance of advertising campaigns: Mobile analytics is used to keep track of the performance of advertising campaigns and other activities by analysing the number of visitors and the time spent by them, as well as through other methods.
8.7.4 TYPES OF APPLICATIONS FOR MOBILE ANALYTICS


There are two types of applications made for mobile analytics. They are:
□ Mobile Web analytics
□ Mobile application analytics

Let's learn about these types of applications in detail in the following sections.

MOBILE WEB ANALYTICS

Mobile Web refers to the use of mobile phones or other devices, like tablets, to view online content via a lightweight browser. The name of a mobile-specific site can take the form m.example.com. The mobile Web sometimes depends on the size of the screen of the devices. For example, if you design an application for a small screen, its images would appear blurred on a big screen; similarly, if you make your site for the big screen, it can be heavy for a small-screen device. Some organisations are starting to build sites specifically for tablets because they have found that neither their mobile-specific site nor their main website ideally serves the tablet segment. To solve this problem, the mobile Web should have a responsive design. In other words, it should have the property of adapting the content to the screen size of the user's device.

Figure 8.15 shows the difference between a website, a mobile site, and a responsive-design site:

Figure 8.15: Difference among a Website, Mobile Site, and Responsive-design Site

In Figure 8.15, you can see that a website can be opened on both computers and mobile phones, while a mobile site can be opened only on mobile phones; responsive-design sites, on the other hand, can be opened on any device, like a computer, tablet, or mobile phone.

MOBILE APPLICATION ANALYTICS

The term mobile app is short for mobile application software. It is an application program designed to run on smartphones and other mobile devices.

Mobile apps are usually available through application distribution platforms like the Apple App Store and Google Play. These application distribution platforms are generally operated by the owners of the mobile operating systems. Examples of such platforms include the Apple App Store, Google Play, Windows Phone Store, and BlackBerry App World. Some mobile apps are freely available, while others must be bought.

Depending on the objective of analytics, an organisation should decide whether it needs a mobile application or a mobile website. If the organisation wants to create an interactive engagement with users on mobile devices, a mobile app is a good option; however, for business purposes, mobile websites are more suitable than mobile apps.
Table 8.1 lists the main differences between mobile app analytics and mobile Web analytics:

□ Screen and page: Mobile app analytics does not have pages; the user interacts with various screens. Mobile Web analytics has pages like normal websites, and users interact with various pages.
□ Use of built-in features of mobile devices: Mobile app analytics can access built-in features, such as the gyroscope, GPS, accelerometer, and storage. Mobile Web analytics does not use built-in features like the gyroscope, GPS, accelerometer, etc.
□ Session time: Mobile app analytics has shorter session timeouts (around 30 seconds). Mobile Web analytics has longer session timeouts; in general, a session will end after 30 minutes of inactivity for websites.
□ Online/Offline: Depending on how it was developed, mobile app analytics may not require a connection to a mobile network. Mobile Web analytics requires an Internet connection and can run online only.
□ Updates: For mobile apps, app owners provide frequent updates and new versions of the apps. For the mobile Web, updates are not frequent.

EXHIBIT
Enhancement in Search Query on Mobile Phones

According to Google, the percentage of search queries on mobile phones has increased manifold in the last few years. Some of the most searched industries on mobiles are as follows: restaurants (29.6%), auto (16.8%), electronics (15.5%), and insurance (15.4%).
SELF ASSESSMENT QUESTIONS

11. Which of the following technologies is used to focus on multimedia applications like videoconferencing through mobile phones?
a. 3G b. 2G
c. 1G d. None of these
12. CDMA stands for Code __________ Multiple Access.
ACTIVITY

Prepare a report on Amazon Mobile Analytics.

8.8 MOBILE ANALYTICS TOOLS


"Advances in analytic technologies and business intelligence are allowing CIOs to go big, go fast, go deep, go cheap and go mobile with business data." - www.CIO.com

The fundamental task of mobile analytics tools is similar to that of other digital analytics tools, like Web analytics tools: they capture data, collect it, and help to generate reports that can be used meaningfully after processing.
The selection of analytics tools is not an easy process because these tools are new and undergo rapid enhancements as compared to traditional Web analytics tools. Companies frequently upgrade their existing analytical tools as well as launch new tools with new features. Mobile analytics tools have some technical limitations; not all mobile analytics tools perform all the services, so you must find out which tool can be beneficial for you. The following are some points to be considered while selecting mobile analytics tools:
□ What is your analytical goal?: No single mobile analytics tool can fulfill all your needs; therefore, it is essential to set your analytical goal so that you can select the right tools. You can take the help of experts to frame your goal in terms of the capabilities of the tools.
□ Analysis techniques: Various analysis techniques exist to analyse information about mobile users' behavior. For example, through 'packet sniffing' an intruder can obtain important personal information of users from data packets; 'image-based tagging' uses a query string to infer the activities of the user; and 'data collection scripts' analyse users' requests.
□ Way of presentation: Information flows between a mobile device and the server according to the client-server architecture. Mobile analytics captures data as it passes through the mobile network. The data capture is done on a server that is different from the actual server; this server is placed between the mobile devices and their actual server. This type of tracking generates a large amount of data in the form of log files, and reading these log files is again a daunting task.

Now, the question is: how is the information presented to you? You must choose an analytical tool that best suits your requirements.
There are two classes of mobile analytics tools:
□ Internal mobile analytics tools: These refer to software provided by SaaS (Software as a Service) vendors. It can also be software installed and maintained by the IT department of any data center. Some examples of internal mobile analytics tools are Localytics and WebTrends.
□ External mobile analytics tools: These are services provided by third-party data vendors, which are responsible for collecting, manipulating, and analysing data and generating reports for customers from their proprietary systems. Some examples of external mobile analytics tools are comScore and Groundtruth.

According to Aconitum Mobile (a software development company in the US), the top four mobile analytics tools (or packages) are as follows:
1. Localytics: This is a big marketing and analytics platform for mobile and Web apps. Its developer is Localytics, based in Boston. It supports cross-platform and Web-based applications. For more details, you can check out the website at www.localytics.com. Localytics supports push messaging, business analytics, and acquisition campaign management. Localytics has a list of big customers like Microsoft, New York Times, ESPN, SoundCloud, and eBay.
2. Appsee: Appsee was founded by Zahi Boussiba and Yoni Douek in 2012 and is based in Tel Aviv, Israel. It provides analytical services with features like conversion funnel analysis, heatmaps, and much more. For more details, you can check out the website at http://www.appsee.com.
3. Google Analytics: Google Analytics is a great free service provided by Google. It is a cross-platform, Web-based application. It offers analytics services to mobile apps and mobile developers. You can check out the tool at www.google.com/analytics.
4. Mixpanel: Mixpanel is a 'business analytics service'. It is also a cross-platform, Web-based application. It can continually follow user and mobile Web interactions and provide attractive offers to the user. For more information, you can visit the website at www.mixpanel.com.
Mobile analytics tools can be categorised as follows:
□ Location-based tracking tools
□ Real-time analytics tools
□ User behavior tracking tools
8.8.1 LOCATION-BASED TRACKING TOOLS

A location-based tracking tool stores information about the location of mobile devices (or the location of the user). These tools are software applications for mobile devices, and they continuously monitor the location of the devices and manipulate the information thus obtained in various ways. For example, such a tool can display the location of friends (persons in the contact list of your mobile phone), ATMs, cafes, hotels, and nearby police stations. The following are some location-based tracking tools:
□ Geoloqi: This tool is a platform for location-based services, and it was launched in 2010 by its founder, Amber Case, in the United States. It supports the creation of location-based notes and time-limited private location sharing.
□ Placed: Placed provides a 'ratings service' by measuring various types of information, such as the places visited and the duration of visits. This is an efficient tool for explaining offline consumer behavior.

8.8.2 REAL-TIME ANALYTICS TOOLS

"The 6.8 billion subscribers are approaching the 7.1 billion world population" (ITU). This is illustrated in Figure 8.16, which shows the growth of mobile phone subscriptions (in billions) with respect to years:

[Figure 8.16 plots mobile-cellular subscriptions (in billions) against the world population from 2005 to 2013.]

Figure 8.16: Growth of Mobile Subscribers

With the increase in the popularity of mobile phones and the mobile Web, business organisations want to know more about the behavior of users. Real-time analytics tools refer to software tools that analyse and report data in real time.

The following are some real-time analytics tools:
□ Geckoboard: Geckoboard is a real-time dashboard that can collect, display, report, and share data that is important for you and your business in real time. According to Damian Kimmelman, the founder and CEO at Duedil, "Geckoboard simplifies the decision-making process. It is hard to dispute something that's right in front of you." Figure 8.17 shows the Geckoboard application:

Figure 8.17: Geckoboard Application

□ Mixpanel: Mixpanel is a Web-based, cross-platform service. It is a business analytics service that tracks user interactions with mobile Web applications. It offers services to users on the basis of user behavior. Mixpanel does all its activities in real time. Suhail Doshi and Tim Trefren founded Mixpanel in 2009 in San Francisco, California. Figure 8.18 shows the Mixpanel application:

Figure 8.18: Mixpanel Application

8.8.3 USER BEHAVIOR TRACKING TOOLS

A user behavior tracking tool is a software tool that tracks user behavior within any particular mobile application.

These behavior reports can help organisations to improve their applications and services. You can get a lot of information about your users, such as:
□ Screens viewed in a session
□ Number of screens viewed in a session
□ Technical errors faced by the user
□ The frequency with which the user uses any particular app
□ Session life
□ Time taken in the loading of any app elements
□ Any specific action performed with the content, such as the clicking of ads

These reports provide an excellent way for an organisation to know how users use specific applications on their mobile phones. By customising the tracking settings, an organisation can observe the behavior of the user on any particular page (such as the product listing page) and use the information to fulfill its business objectives. The following are some popular behavior-tracking tools:
□ TestFlight: Suppose a company has many testers spread over different countries (or locations), and the company has a new app that it wants to test. How is it going to test the app? The solution is TestFlight. TestFlight is a free software platform. Using TestFlight, a team of developers can use beta and internal iOS applications to manage testing and feedback using the TestFlight dashboard. The TestFlight SDK has a wide range of useful APIs to test applications from various dimensions. Figure 8.19 shows the TestFlight application:

Figure 8.19: TestFlight Application


□ Mobile App Tracking: This tool tracks and analyses mobile app installations. It can report user engagement and Lifetime Value (LTV) beyond the installations. It is efficient in scaling mobile advertising campaigns. It can collect a customer's behavior related to an ad through tapping. Lifetime Value (LTV) is related to the net profit gained from the entire future relationship with the customer.
Figure 8.20 shows the Mobile App Tracking application:

Figure 8.20: Mobile App Tracking Application

SELF ASSESSMENT QUESTIONS


13. Through __________, an intruder can obtain important personal information of users from data packets.
14. Geckoboard is a __________ that can collect, display, report, and share data that is important for you and your business.

ACTIVITY
Search and enlist at least 10 mobile analytics tools used by organisations these days.

8.9 PERFORMING MOBILE ANALYTICS


In this section, you will first understand the basic steps to integrate mobile analytics within your business processes. Then, you will do a practical exercise where you will analyse datasets on mobile phones by using mobile applications.

To integrate mobile analytics with business processes, you must perform the following basic steps:
□ Select the appropriate mobile device, like a smartphone or tablet.
□ List the objectives of mobile analytics for your business process.
□ Identify the target audience and create a dataset for it.
□ Modify the dataset by adding missing values and removing unnecessary data.
□ Use a dimension-reduction technique and transform the data (if possible).
□ Perform the required data mining techniques, like text mining.
□ Select data mining algorithms and do some analysis for mobile mining.
□ Apply the data mining algorithm and verify the relationships among the variables.
□ Evaluate the result/interpretation.

As we have discussed in the earlier section, the whole process of mobile communication is done through the client-server method. So, the mobile analytics process can be performed either on a mobile device or on the server providing services to the mobile device. According to the location of data collection, mobile analytics is categorised as follows:
□ Data collection through a mobile device
□ Data collection on the server

8.9.1 DATA COLLECTION THROUGH MOBILE DEVICE


Data for analysis is collected on mobile devices and sent back to the server for further manipulation. This collection process may be done online as well as offline. This simply means that some data collection processes require an Internet connection to send collected data to the server; on the other hand, several applications collect data into spreadsheets and do not require an Internet connection. Collected data can be stored in various formats.

Data collection through mobile devices has the following benefits:
□ Allows data collection in real time
□ Proves beneficial for field executives
□ Allows collected data to be stored in various formats like text, graphs, images, etc.
□ Allows the location of field executives to be tracked and the management to assign tasks to them according to their location
□ Reduces unnecessary paperwork
□ Prevents data redundancy because the data is collected and shared with the entire team within seconds

The following are some commonly used data collection applications:
□ Numbers is a very simple-to-use tool. It can be used to organise data, perform calculations, and manage lists with a few taps. It provides various templates for making graphs and charts; moreover, you can create your own templates too. It has around 250 functions that can perform simple-to-complex calculations. You can create your own formulas through the built-in functions and the help feature.

Figure 8.21 shows the GUI of Numbers:

Figure 8.21: GUI of Numbers


□ HanDBase is a relational database management system. It was initially designed to run on Palm PDAs but can run on almost any handheld platform. HanDBase is not as full-featured as Oracle, Sybase, and DB2, but it still has various other features that make it important for computing. It is simple enough, supports multiple handheld platforms, and provides high security. The company offers a few apps to download free from its website.

Figure 8.22 shows the GUI of HanDBase:

Figure 8.22: GUI of HanDBase


□ Statistics Visualiser is a tool for iPads, nicknamed StatViz. It is a perfect statistical data tool for students and researchers. It performs statistical calculations and provides results with detailed explanations. The main feature of this tool is the dynamic graph, which helps you to quickly understand difficult statistical concepts.

Figure 8.23 shows the GUI of StatViz:

Figure 8.23: GUI of StatViz

8.9.2 DATA COLLECTION ON SERVER


As we studied in the previous section, data collected by a mobile device is ultimately transferred to its server for analysis. The server stores the received data, performs analysis over it, and creates reports.

The following are some popular applications that collect data onto the server:
□ DataWinners is a data collection service designed for experts. This application converts paper forms into digital questionnaires. Team members can submit their data through any service, like SMS, the Web, etc. DataWinners provides an efficient data collection facility that can reduce users' decision-making time. The home page of the DataWinners application is shown in Figure 8.24:
Figure 8.24: Home Page of DataWinners


□ COMMANDmobile is an application that can perform mobile data collection and workforce management services. COMMANDmobile is more accurate and efficient than traditional paper-and-pencil approaches. The application uploads the data as soon as it is collected. Its response time is very small; thus, it can perform well in emergencies. It gives a very good ROI (Return on Investment).
The home page of the COMMANDmobile application is shown in Figure 8.25:

Figure 8.25: Home Page of COMMANDmobile

Till now, you have studied various fundamental concepts related to mobile analytics. Now, let's do a practical activity with mobile analytics.

Get ready to analyse datasets on mobile phones using a mobile application. We have already discussed the key points for analysing data by using mobile phones and tablets. Now, through this hands-on practice, you will analyse data stored in a given dataset available on a mobile device. You must have an Android mobile phone and an Internet connection to do this practical work.

The objective of the activity is as follows: The marketing manager of a company performed an analysis of the listed prices of items that his/her company needs to purchase from the company's peers. He/she wants to present the results of the analysis to the top management of the company during a meeting. He/she wants to use graphical techniques for presenting the analysis results on a tablet that runs on the Android Operating System (OS). The marketing manager will use a mobile application, which he/she needs to download and install on his/her tablet for demonstrating the analysis results. You are going to help him/her download and install the application and create graphs using the application.

Perform the following steps to download and install the Graph Trial app, and then create a graph to present the results of the data analysis:
1. Open the Google Play Store by tapping the Play Store icon on the screen of any Android phone or tablet. A window appears, showing the contents of the Play Store.
2. Type Graph Trial in the search box, and tap the search button to start the search operation, as shown in Figure 8.26:

Figure 8.26: Showing the Google Play Store Window

A window appears, containing the list of available apps for


the particular search item.
3. Select the first app, named Graph trial, from the window by
tapping it, as shown in Figure 8.27:

" n :.-. ,ti ., 12 17 pm

( II graph trail X
-

Apps

Graph trial

•••
Gaia GPS: Topo Maps an,

Books

ill Day and Secuon Hikes Pac,roc

Day & Section Hikes Pac,roc C

Figure 8.27: Selecting the Graph Trial App from the List

The next window appears, asking for permission to install the app.

4. Tap the ACCEPT button to install the app on your device, as shown in Figure 8.28:

Figure 8.28: Showing the Installation Permission Window

A new window appears with the INSTALL button.
5. Tap the INSTALL button to install the app, as shown in Figure 8.29:

Figure 8.29: Showing the Install App Window

The downloading and installation process of the Graph trial app begins, as shown in Figure 8.30:

Figure 8.30: Showing the Downloading Window



After the installation is completed, the Graph trial icon appears on the tablet's screen, as shown in Figure 8.31:

Figure 8.31: Showing the Tablet Screen Containing the Installed App Icon
6. Tap the Graph trial icon to get to the home screen, as shown in Figure 8.32:

Figure 8.32: Showing the App Home Screen

7. Tap the + button to open the graph type selection window, as shown in Figure 8.33:

Figure 8.33: Showing the Graph Type Selection Window



8. Select the type of graph you want to create by tapping its icon. In our case, we have selected the simple graph. The Create simple graph window appears.
9. Select the type of simple graph from the Graph type tab. In our case, we have selected the Bar graph, as shown in Figure 8.34:
Figure 8.34: Showing the Create Simple Graph Window

10. Input the details in the Y axis title, Min, and Max fields, as shown in Figure 8.35:
Figure 8.35: Showing the Bar Graph Window

11. Scroll down the window to input details in other fields, as shown in Figure 8.36:

"'J n: (I ., I lp111

ll.::ic-• Cre,1lf' s1mr1,. g1aph '.'i.:,

► itrii ,■:ii Bb

A.c.r"I 1S00

Son 2000

Lmo1ra 5500

- Apple 10000

HP 3000

Figure 8.36: Showing the Data Input Window



12. Tap the Save button to get the Barchart of list items icon, as shown in Figure 8.37:

Figure 8.37: Showing the Window with the Barchart of list items Icon

13. Long tap the Barchart of list items icon to see the graph, as shown in Figure 8.38:

Figure 8.38: Showing the Bar Graph for a Given Dataset



14. Select the Pie tab by tapping it to get a pie chart, as shown in
Figure 8.39:

Figure 8.39: Showing a Pie Chart for the Given Dataset


15. Select the Line tab by tapping it to get a line chart, as shown
in Figure 8.40:

Figure 8.40: Showing a Line Chart for the Given Dataset


Till now, we have created different types of simple graphs for a
sample dataset provided in our virtual lab.

Let's now create a bar graph by loading a dataset stored as a Comma Separated Value (CSV) file in the tablet's memory.
16. Tap the Settings button to open a Settings window, as shown in
Figure 8.41:

Figure 8.41: Showing the Settings Window

17. Tap the CSV import option to open the window listing the CSV
files in the selected location, as shown in Figure 8.42:

t"A " 0 'f .,,I " 12 24 pm

Select CSV file


/mnt/sdcard/graph/
Sales_data_vl .csv
/mnt/sdcard/graph/
brand.csv
/mnt/sdcard/graph/
sample.csv

Figure 8.42: Showing the List of CSVFile in a Memory Location



18. Select the particular file name to get a graph for the dataset.
19. Select the size of the graph by tapping an option from the Select image size window, as shown in Figure 8.43:

Figure 8.43: Selecting the Graph Size

20. Tap the OK button to save the graph in the form of an image,
as shown in Figure 8.44:

Figure 8.44: Saving a Graph as an Image

After saving the graph as an image, you can share it via e-mail.
To do this, proceed to the next step.

21. Tap the Share option to open the Share graph image window,
as shown in Figure 8.45:

Figure 8.45: Showing the Share graph image Window

22. Tap the option through which you want to share the image.
In our case, we have selected Gmail, as shown in Figure
8.46:

Figure 8.46: Selecting the Particular Option for Sharing the Graph Image

23. Enter the details in the required fields, as shown in Figure 8.47:

Figure 8.47: Entering Details

24. Tap the Share button and exit the application by pressing
the OK button on the pop-up box that appears, as shown in
Figure 8.48:

Figure 8.48: Sharing the Graph Image

By using the above procedure, we can quickly create some dashboards to present the analysis on a mobile phone or a tablet. The charts included in the presentation can also be shared quickly through e-mail.
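The same workflow can be reproduced on a desktop in a few lines of Python. The following is only a minimal sketch, not the app's own code: it assumes a hypothetical file brand.csv (laid out like the file listed in Figure 8.42) with two columns, brand and price, holding the sample values from Figure 8.36, and uses the pandas and matplotlib libraries to draw and save the same bar chart:

import pandas as pd
import matplotlib.pyplot as plt

# Expected file layout (hypothetical), mirroring Figure 8.36:
# brand,price
# Acer,1500
# Sony,2000
# Lenovo,5500
# Apple,10000
# HP,3000
data = pd.read_csv("brand.csv")

fig, ax = plt.subplots()
ax.bar(data["brand"], data["price"])   # one bar per brand
ax.set_ylabel("price")
ax.set_title("Barchart of list items")

# Save the chart as an image so it can be attached to an e-mail,
# just as the mobile app does in step 20.
fig.savefig("barchart_of_list_items.png")

Saving the figure to a PNG file plays the same role as step 20 above; the image can then be attached to an e-mail exactly as in steps 21 to 24.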

EXHIBIT

Premier Inn Generates £1m in Revenues through Mobile Analytics

Premier Inn is a chain of hotels in the UK. It is the largest hotel brand in the UK, having around 650 hotels. Premier Inn launched a mobile app in January 2011 to make bookings online. Through this mobile app, Premier Inn was able to generate revenues of over £1m in just three months of launching the app. Since then, the app has achieved more than two million downloads. Around 77% of the total bookings are made through the mobile app.

How was the hotel able to achieve such big revenues? Actually, the magic behind the success of the Premier Inn mobile app was mobile data analytics provided by Grapple, a mobile-innovation agency. Grapple collected data from its clients' 300 branded applications. Branded applications are those which either offer a utility or make the life of the customer easier when he or she is on the move. Grapple analysed this data to enable companies, such as Premier Inn, to better understand customer behavior and make required changes to improve sales, customer retention, and loyalty. Premier Inn used Grapple's analysis to improve the features and functionality of its mobile application, increase sales conversion rates from 3% to 5.9%, and generate revenues of £1m in a short period of three months.

SELF ASSESSMENT QUESTIONS

15. The mobile analytics process can be done either on a mobile device or on the __________ providing services to the mobile device.
16. DataWinners is a data collection service designed for experts. (True/False)

ACTIVITY
Prepare a report on mobile app analytics.


CHALLENGES OF MOBILE ANALYTICS


Mobile analytics has its own challenges. Some of the main ones can be listed as follows:
□ Unavailability of uniform technology: Different mobile phones support different technologies. For example, some mobile phones support images, JavaScript, HTML, and cookies, while others do not.
□ Random change in subscriber identity: TMSI (Temporary Mobile Subscriber Identity) is the identity of a mobile device and can be known by the mobile network being used. This identity is randomly assigned by the VLR (Visitor Location Register) to every mobile device (after it is switched on) located in the area. This random change in the subscriber ID makes it difficult to gather important information, such as the location of the user.
□ Redirect: Some mobile devices do not support redirects. The term 'redirect' is used to describe the process in which the system automatically opens another page.
□ Special characters in the URL: In some mobile devices, some special characters in the URL are not supported.
□ Interrupted connections: The mobile connection with the tower is not always dedicated. It can be interrupted when the user is moving from one tower to another. This interruption in the connection breaks the requests sent by the devices.

Together with the generalised issues mentioned above, mobile analysts also face the following critical issues, which discourage mobile analytics marketing:
□ Limited understanding of the network operators: Network operators are unable to understand the business processes happening outside the carrier's firewall.
□ True real-time analysis: True real-time data analysis is not always possible with mobile analytics due to various reasons such as signal interruption, variation in the technology used in mobiles, random change in subscriber ID, etc.
□ Security issues: Mobile technology has various important features, but some of these features, such as GPS, cookies, Wi-Fi, and beacons, can disclose important information of the user. Information like details of credit cards, bank accounts, medical history, or other personal content can be easily misused. Some techniques like Deep Packet Inspection (DPI), Deep Packet Capture (DPC), and application logs can increase security threats.

To cope with such security threats, business organisations must intelligently monitor all communications in real time and make sure that personal data is not accessible to everyone.

SELF ASSESSMENT QUESTIONS


17. TMSI stands for __________.
a. Temporary Mobile Subscribed Identity
b. Temporary Mobile Subscriber Identification
c. Temporary Mobile Subscribed Identification
d. Temporary Mobile Subscriber Identity
18. The term __________ is used to describe the process in which the system automatically opens another page.

ACTIVITY
Determine the ways to overcome the challenges in the field of mobile marketing and mobile advertising.

SUMMARY
□ Social media refers to a computer-mediated, interactive, and internet-based platform that allows people to create, distribute, and share a wide range of content and information, such as text and images.
□ Social media analytics is the practice of collecting data from social media, websites or blogs and analysing the data to take crucial business decisions.
□ Text mining or text analytics comes as a handy tool to quantitatively examine the text generated by social media and filtered in the form of different clusters, patterns, and trends.
□ Sentiment analysis involves careful analysis of people's opinions, sentiments, attitudes, appraisals, and evaluations.
□ Automated sentiment analysis is still evolving as it is difficult to interpret the conditional phrases used by people to express their sentiments on social media.
□ 4G provides wide range access, multiservice capacity, integration of all older mobile technologies, and low bit cost to the user.
□ Mobile analytics has several similarities with web and social analytics; for example, both can analyse the behavior of the user with regard to an application and send this information to the service provider.
□ Mobile web refers to the use of mobile phones or other devices like tablets to view online content via a light-weight browser.
□ Mobile apps are usually available through application distribution platforms like the Apple App Store and Google Play.

□ Mobile analytics tools have some technical limitations; not all mobile analytics tools perform all the services, so you must find out which tool can be beneficial for you.

KEYWORDS
□ Blog: It represents an online journal to showcase the content organised in the reverse chronological order.
□ Microblogs: The types of blogs that allow people to share and showcase small posts and are suitable for quick sharing of content in a few lines of text or an individual photo or video.
□ Wiki: It represents a collective website in which the members can create and modify content in a community-based database.
□ Social networks: It is a network that generally supports the exchange of information and data in various formats, such as text, videos, and photos.
□ Text mining tools: The tools used to identify themes, patterns, and insights hidden in the structured as well as unstructured data.

DESCRIPTIVE QUESTIONS


1. Discuss the concept of social media analytics with a suitable example.
2. Enlist and explain the key elements of social media analytics.
3. What do you understand by text mining? Discuss the key steps
for any text mining process.
4. Explain the concept of mobile analytics with appropriate
examples.
5. Enlist the differences between Web analytics and mobile
analytics.
6. Describe the tasks of mobile analytics tools.

ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                     Q. No.  Answers
Social media analytics                    1.      Social Bookmarking
                                          2.      Blogging, microblogging
Key elements of social media analytics    3.      b. Curate
                                          4.      Sharing
Overview of text mining                   5.      True
                                          6.      Information retrieval, natural learning
Performing social media analytics and     7.      True
opinion mining on tweets                  8.      Algorithms
Online social media analysis              9.      False
                                          10.     Sentiment140
Mobile analytics                          11.     a. 30
                                          12.     Division
Mobile analytics tools                    13.     Packet sniffing
                                          14.     Real-time dashboard
Performing mobile analytics               15.     Server
                                          16.     True
Challenges of mobile analytics            17.     d. Temporary Mobile Subscriber Identity
                                          18.     Redirect

HINTS FOR DESCRIPTIVE QUESTIONS


1. Social media refers to a computer-mediated, interactive, and Internet-based platform that allows people to create, distribute, and share a wide range of content and information, such as text and images. Refer to Section 8.2 Social Media Analytics.
2. Incorporating social media into everyday sales and marketing routines of an organisation is not easy and requires gaining a command over a certain set of tactics and tools related to the efficient management and utilisation of social media. Refer to Section 8.3 Key Elements of Social Media Analytics.
3. Text mining or text analytics comes as a handy tool to quantitatively examine the text generated by social media and filtered in the form of different clusters, patterns, and trends. Refer to Section 8.4 Overview of Text Mining.
4. Similar to the process of analytics used to study the behavior of users on the Web or social media, mobile analytics is the process of analysing the behavior of mobile users. Refer to Section 8.7 Mobile Analytics.
5. Mobile analytics has several similarities with Web and social analytics; for example, both can analyse the behavior of the user with regard to an application and send this information to the service provider. However, there are also several important differences between Web analytics and mobile analytics. Refer to Section 8.7 Mobile Analytics.

6. The fundamental task of the mobile analytics tool is similar to other digital analytical tools like Web analytics. Refer to Sections 8.7 Mobile Analytics and 8.8 Mobile Analytics Tools.

SUGGESTED READINGS & REFERENCES


SUGGESTED READINGS
□ Ganis, M., & Kohirkar, A. (2016). Social media analytics: techniques and insights for extracting business value out of social media. New York: IBM Press.
□ Rowles, D. (2017). Mobile marketing: how mobile technology is revolutionizing marketing, communications and advertising. London: Kogan Page.

E-REFERENCES
□ Top 25 social media analytics tools for marketers - Keyhole. (2017, March 09). Retrieved April 28, 2017, from http://keyhole.co/blog/list-of-the-top-25-social-media-analytics-tools/
□ Social media analytics. (2017, April 13). Retrieved April 28, 2017, from https://en.wikipedia.org/wiki/Social_media_analytics
□ What is social media analytics? - Definition from WhatIs.com. (n.d.). Retrieved April 28, 2017, from http://searchbusinessanalytics.techtarget.com/definition/social-media-analytics
□ Mobile Analytics Key Benefits | Mobile Marketing. (n.d.). Retrieved April 28, 2017, from https://www.webtrends.com/products-solutions/digital-analytics/mobile-analytics-use-cases/
CONTENTS

9.1 Introduction
9.2 What is Visualisation?
9.2.1 Ways of Representing Visual Data
9.2.2 Techniques Used for Visual Data Representation
9.2.3 Types of Data Visualisation
9.2.4 Applications of Data Visualisation
Self Assessment Questions
Activity
9.3 Importance of Big Data Visualisation
9.3.1 Deriving Business Solutions
9.3.2 Turning Data into Information
Self Assessment Questions
Activity
9.4 Tools Used in Data Visualisation
9.4.1 Open-Source Data Visualisation Tools
9.4.2 Analytical Techniques Used in Big Data Visualisation
Self Assessment Questions
Activity
9.5 Summary
9.6 Descriptive Questions
9.7 Answers and Hints
9.8 Suggested Readings & References
INTRODUCTORY CASELET

HOW A COMPANY USED THE POWER OF DATA VISUALISATION FOR BETTER ANALYTICS

Knowledgent, an industry information consultancy company, helps organisations in transforming their information into business results by using innovations in data and analytics. The company's expertise integrates industry experiences, capabilities of data analysts and scientists, and data architecture and engineering skills to discover areas that require actions to be taken.
One of the client companies of Knowledgent, actually a commercial distribution company, had grown rapidly with a regular series of achievements. However, the company was facing the problem of using critical business information extracted from Enterprise Resource Planning (ERP) using a variety of different data architectures and source systems for data-driven decision making. Main business stakeholders were making their decisions on the basis of manually assembled reports, which were lacking measurable consistency, data reliability, and metric transparency.
As a result, the company realised that they required a way to visualise key performance areas across the organisation. They had the requirement of creating real-time dashboards with a consistent user interface across Sales, Finance, and Operations.
Knowledgent provided an Enterprise Data Warehouse and data visualisation solution to the client company, which was implemented by using the Agile methodology. They started the project by dividing it into three phases, where each business unit was started from Sales data. At each phase of the project, Knowledgent's team conducted an assessment of the ERP system with the focus on dimensions and measures required at each phase. They also asked the clients to define key performance indicators (KPIs), dashboards, and required end-user reporting capabilities. After that, Knowledgent designed the Enterprise Data Warehouse to support visualisations, and then implemented it. The ETL process was developed to integrate data from different sources and normalise and harmonise it. Finally, a commercial visualisation tool was used to manage visualisation development in combination with stakeholders.
The result of the entire project at the end was that the client company of Knowledgent gained robust analytics and reporting efficiencies across organisation, customer, product, sales and supplier data. The analytic dashboards of the client company can now provide key business drivers, trends, and issues. In fact, they consider their analytics and data visualisation capabilities to be different from those of their competitors.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:


Describe the meaning of visualisation
Discuss the importance of Big Data visualisation
Explain the tools used in data visualisation

INTRODUCTION
In the previous chapter, you have learned about prescriptive analytics. It is the final phase of Business Analytics, which uses fundamentals of the mathematical and computational sciences to provide different decision options for taking the benefit of the results of descriptive and predictive analytics.
Data visualisation is a pictorial or visual representation of data with the help of visual aids such as graphs, bars, histograms, tables, pie charts, mind maps, etc. Depending upon the complexity of data and the aspects from which it is analysed, visuals can vary in terms of their dimensions (one-/two-/multi-dimensional) or types, such as temporal, hierarchical, network, etc. All these visuals are used for presenting different types of datasets. Different types of tools are available in the market for visualising data. But what is the use of data visualisation in Big Data? Is it necessary to use it? To answer these questions, we need to track down the real meaning of visualisation in the context of Big Data analytics.
This chapter familiarises you with the concept of data visualisation and the need to visualise data in Big Data analytics. You also learn about different types of data visualisations. Next, you learn about various types of tools using which data or information can be presented in a visual format.

WHAT IS VISUALISATION?


Visualisation is a pictorial or visual representation technique. Anything that is represented in pictorial or graphical form, with the help of diagrams, charts, pictures, flowcharts, etc. is known as visualisation. Data presented in the form of graphics can be analysed better than data presented in words.

9.2.1 WAYS OF REPRESENTING VISUAL DATA


The data is first analysed and then the result of that analysis is visu
alised in different ways as discussed above. There are two ways to
visualise a data-infographics and data visualisation:
□ Infographics are the visual representations of information or data
The use of colorful graphics in drawing charts and graphs helps
in
N O T E S

improving the interpretation of a given data. Figure 9.1 shows an


example of infographics:

Figure 9.1: An Example of Infographics


Source: http://www.jackhagley.com/What-s-the-difference-between-an-Infographic-and-a-Data-Visualisation

□ Data visualisation approach is different from infographics. It is the study of representing data or information in a visual form. With the advancement of digital technologies, the scope of multimedia has increased manifold. Visuals in the form of graphs, images, diagrams, or animations have completely proliferated the media industry and the Internet. It is an established fact that the human mind can comprehend information more easily if it is presented in the form of visuals. Instructional designers focus on abstract and model-based scientific visualisations to make the learning content more interesting and easy to understand. Nowadays, scientific data is also presented through digitally constructed images. These images are generally created with the help of computer software. Visualisation is an excellent medium to analyse, comprehend, and share information. Let's see why:
♦ Visual images help in transmitting a huge amount of information to the human brain at a glance.
♦ Visual images help in establishing relationships and distinctions between different patterns or processes easily.
♦ Visual interpretations help in exploring data from different angles, which helps in gaining insights.
♦ Visualisation helps in identifying problems and understanding trends and outliers.
♦ Visualisations point out key or interesting breakthroughs in a large dataset.

Data can be classified on the basis of the following three criteria, irrespective of whether it is presented as data visualisation or infographics:
□ Method of creation: It refers to the type of content used while creating any graphical representation.
□ Quantity of data displayed: It refers to the amount of data which is represented.
□ Degree of creativity applied: It refers to the extent to which the data is created graphically, and whether it is designed in a colorful way or in black and white diagrams.
On the basis of the above evaluation, we can understand which is the correct form of representation for a given data type. Let's discuss the various content types:
□ Graph: A representation in which X and Y axes are used to depict the meaning of the information
□ Diagram: A two-dimensional representation of information to show how something works
□ Timeline: A representation of important events in a sequence with the help of self-explanatory visual material
□ Template: A layout design for presenting information
□ Checklist: A list of items for comparison and verification
□ Flowchart: A representation of instructions which shows how something works or a step-by-step procedure to perform a task
□ Mind Map: A type of diagram which is used to visually organise information

9.2.2 TECHNIQUES USED FOR VISUAL DATA REPRESENTATION
Data can be presented in various visual forms, which include simple line diagrams, bar graphs, tables, matrices, etc. Some techniques used for a visual presentation of data are as follows:
□ Isoline: It is a 2D representation of a curved line that moves constantly on the surface of a graph. The plotting of an isoline is based on data arrangement rather than data visualisation. Figure 9.2 shows a set of isolines:

Figure 9.2: Isolines


□ Isosurface: It is a 3D representation of an isoline. Isosurfaces are created to represent points that are bounded in a volume of space by a constant value, that is, in a domain that covers 3D space. Figure 9.3 shows how isosurfaces look:

Figure 9.3: Isosurfaces


□ Direct Volume Rendering (DVR): It is a method used for obtaining a 2D projection of a 3D dataset. A 3D record is projected in a 2D form through DVR for a clearer and more transparent visualisation. Figure 9.4 shows a 2D DVR of a 3D image:

Figure 9.4: 2D Image DVR


□ Streamline: It is a field line that results from the velocity vector field description of the data flow. Figure 9.5 shows a set of streamlines:

Figure 9.5: Streamlines
□ Map: It is a visual representation of locations within a specific area. It is depicted on a planar surface. Figure 9.6 shows an instance of Google Map:

Figure 9.6: Google Map


□ Parallel Coordinate Plot: It is a visualisation technique for representing multidimensional data. Figure 9.7 shows a parallel coordinate plot:

Figure 9.7: Parallel Coordinate Plot


□ Venn Diagram: It is used to represent logical relations between finite collections of sets. Figure 9.8 shows a Venn diagram for a set of relations:

Figure 9.8: Venn Diagrams


□ Timeline: It is used to represent a chronological display of events. Figure 9.9 shows an example of a timeline for some critical events:

Figure 9.9: Timeline for Some Critical Events



□ Euler Diagram: It is a representation of the relationships between sets. Figure 9.10 shows an example of an Euler diagram:

Figure 9.10: Euler Diagram


□ Hyperbolic Trees: They represent graphs that are drawn using hyperbolic geometry. Figure 9.11 shows a hyperbolic tree:

Figure 9.11: Hyperbolic Tree


□ Cluster Diagram: It represents a cluster, such as a cluster of astronomic entities. Figure 9.12 shows a cluster diagram:

Figure 9.12: Cluster Diagram



□ Ordinogram: It is used to analyse various sets of multivariate objects. Figure 9.13 shows an ordinogram:

Figure 9.13: Ordinogram
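Most of the techniques above can be generated with standard plotting libraries. The following is a minimal Python sketch (all data is synthetic and chosen only for illustration) that draws a set of isolines using matplotlib's contour function; each contour line connects the points of the grid where the scalar field takes the same value:

import numpy as np
import matplotlib.pyplot as plt

# Build a grid and a smooth scalar field over it
x = np.linspace(-3, 3, 200)
y = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))        # scalar value at each grid point

# Each contour line is an isoline: a curve of constant Z
cs = plt.contour(X, Y, Z, levels=8)
plt.clabel(cs, inline=True)       # label each isoline with its value
plt.title("Isolines of a 2D scalar field")
plt.show()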

9.2.3 TYPES OF DATA VISUALISATION

You already know that data can be visualised in many ways, such as in the form of 1D, 2D, or 3D structures. Table 9.1 briefly describes the different types of data visualisation:
TABLE 9.1: DATA VISUALISATION TYPES

Name               Description                              Tool
1D/Linear          A list of items organised in a           Generally, no tool is used for
                   predefined manner                        1D visualisation
2D/Planar          Choropleth, cartogram, dot               GeoCommons, Google Fusion Tables,
                   distribution map, and proportional       Google Maps API, Polymaps, Many
                   symbol map                               Eyes, Google Charts, and Tableau
                                                            Public
3D/Volumetric      3D computer models, surface              AC3D, AutoQ3D, TrueSpace
                   rendering, volume rendering, and
                   computer simulations
Temporal           Timeline, time series, Gantt chart,      TimeFlow, Timeline JS, Excel,
                   sankey diagram, alluvial diagram,        Timeplot, TimeSearcher, Google
                   and connected scatter plot               Charts, Tableau Public, and Google
                                                            Fusion Tables
Multidimensional   Pie chart, histogram, tag cloud,         Many Eyes, Google Charts, Tableau
                   bubble cloud, bar chart, scatter         Public, and Google Fusion Tables
                   plot, heat map, etc.
Tree/Hierarchical  Dendrogram, radial tree, hyperbolic      d3, Google Charts, and Network
                   tree, and wedge stack graph              Workbench/Sci2
Network            Matrix, node link diagram, hive          Pajek, Gephi, NodeXL, VOSviewer,
                   plot, and tube map                       UCINET, GUESS, Network Workbench/
                                                            Sci2, sigma.js, d3/Protovis, Many
                                                            Eyes, and Google Fusion Tables

As shown in Table 9.1, the simplest type of data visualisation is the 1D representation and the most complex data visualisation is the network representation. The following is a brief description of each of these data visualisations:
□ 1D (Linear) data visualisation: In the linear data visualisation, data is presented in the form of lists. Hence, we cannot term it as visualisation. It is rather a data organisation technique. Therefore, no tool is required to visualise data in a linear manner.
□ 2D (Planar) data visualisation: This technique presents data in the form of images, diagrams, or charts on a plane surface. Cartogram and dot distribution map are examples of 2D data visualisation. Some tools used to create 2D data visualisation patterns are GeoCommons, Google Fusion Tables, Google Maps API, Polymaps, Tableau Public, etc.
□ 3D (Volumetric) data visualisation: In this method, data presentation involves exactly three dimensions to show simulations, surface and volume rendering, etc. Generally, it is used in scientific studies. Today, many organisations use 3D computer modelling and volume rendering in advertisements to provide users a better feel of their products. To create 3D visualisations, we use visualisation tools such as AC3D, AutoQ3D, TrueSpace, etc.
□ Temporal data visualisation: Sometimes, visualisations are time dependent. To visualise the dependence of analyses on time, temporal data visualisation is used, which includes the Gantt chart, time series, sankey diagram, etc. TimeFlow, Timeline JS, Excel, Timeplot, TimeSearcher, Google Charts, Tableau Public, Google Fusion Tables, etc. are some tools used to create temporal data visualisations.
□ Multidimensional data visualisation: In this type of data visualisation, numerous dimensions are used to present data. We have pie charts, histograms, bar charts, etc. to exemplify multidimensional data visualisation. Many Eyes, Google Charts, Tableau Public, etc. are some tools used to create multidimensional data visualisations. A short sketch of this idea follows this list.
□ Tree/Hierarchical data visualisation: Sometimes, data relationships need to be shown in the form of hierarchies. To represent such kinds of relationships, we use tree or hierarchical data visualisations. Examples of tree/hierarchical data visualisation include the hyperbolic tree, wedge-stack graph, etc. Some tools to create hierarchical data visualisations are D3, Google Charts, and Network Workbench/Sci2.
□ Network data visualisation: It is used to represent data relations that are too complex to be represented in the form of hierarchies. Some examples of network data visualisations and tools include matrix, node link diagram, hive plot, Pajek, Gephi, NodeXL, VOSviewer, UCINET, GUESS, Network Workbench/Sci2, sigma.js, d3/Protovis, Many Eyes, Google Fusion Tables, etc.

9.2.4 APPLICATIONS OF DATA VISUALISATION


Data visualisation tools and techniques are used in various applica
tions. Some of the areas in which we apply data visualisation are as
follows:
D Education: Visualisation is applied to teach a topic that requires
simulation or modelling of any object or process. Have you ever
wondered how difficult it would be to explain any organ or organ
system without any visuals? Organ system, structure of an atom,
etc. are best described with the help of diagrams or animations.
D Information: Visualisation is applied to transform abstract
data into visual forms for easy interpretation and further
exploration.
D Production: Various applications are used to create 3D models
of products for better viewing and manipulation. Real estate,
com munication, and automobile industry extensively use 3D
adver tisements to provide a better look and feel to their products.
D Science: Every field of science including fluid dynamics, astro
physics, and medicine use visual representation of information.
Isosurfaces and direct volume rendering are typically used to
explain scientific concepts.
D Systems visualisation: Systems visualisation is a relatively new
concept that integrates visual techniques to better describe com
plex systems.
□ Visual communication: Multimedia and entertainment industry
use visuals to communicate their ideas and information.
D Visual analytics: It refers to the science of analytical reasoning
supported by the interactive visual interface. The data generat ed
by social media interaction is interpreted using visual analytics
techniques.

SELF ASSESSMENT QUESTIONS


1. Which of the following visual aids is/are used for
representing data?
a. Graphs b. Bar
c. Histograms d. All of these
2. The use of colorful graphics in drawing charts and graphs
helps in improving the interpretation of a given data. (True/
False)
3. Scientific data is also presented through __________ constructed images.
4. Visual images do not help in transmitting huge amount of
information to the human brain at a glance. (True/False)


5. Which of the following types of diagrams refers to a representation of instructions that shows how something works or a step-by-step procedure to perform a task?
a. Graph
b. Diagram
c. Flowchart
d. Mind Map
6. DVR stands for __________.
7. __________ diagram is used to represent logical relations between finite collections of sets.
8. Ordinogram is used to analyse various sets of multivariate objects. (True/False)

ACTIVITY
Search and enlist the symbols used in a flowchart. Also, create a flowchart which represents a sequence of instructions for resolving a problem using its symbols.

IMPORTANCE OF BIG DATA VISUALISATION
Visual analysis of data is not a new thing. For years, statisticians and analysts have been using visualisation tools and techniques to interpret and present the outcomes of their analyses.

Almost every organisation today is struggling to tackle the huge amount of data pouring in every day. Data visualisation is a great way to reduce the turn-around time consumed in interpreting Big Data. Traditional visualisation techniques are not efficient enough to capture or interpret the information that Big Data possesses. For example, such techniques are not able to interpret videos, audios, and complex sentences. Apart from the type of data, the volume and speed with which data is generated pose a great challenge. Most of the traditional analytics techniques are unable to cater to any of these problems.

Big Data comprises both structured as well as unstructured forms of data collected from various sources. Because of the heterogeneity of data sources, data streaming, and real-time data, it becomes difficult to handle Big Data by using traditional tools. Traditional tools are developed by using relational models that work best on static interaction. Big Data is highly dynamic in function and therefore, most traditional tools are not able to generate quality results. The response time of traditional tools is quite high, making them unfit for quality interaction.

9.3.1 DERIVING BUSINESS SOLUTIONS


The most common notation used for Big Data is 3Vs-volume,
veloci ty, and variety. But, the most exciting feature is the way in
which val ue is filtered from the haystack of data. Big Data
generated through social media sites is a valuable source of
information to understand consumer sentiments and demographics.
Almost every company now adays is working with Big Data and
facing the following challenges:
□ Most data is in unstructured form
□ Data is not analysed in real time
□ The amount of data generated is huge
□ There is a lack of efficient tools and techniques
Considering all these factors, IT companies are focusing more on re
search and development of robust algorithms, software, and tools
to analyse the data that is scattered in the Internet space. Tools
such as Hadoop provide state-of-the-art technology to store and
process Big Data. Analytical tools are now able to produce
interpretations on smartphones and tablets. It is possible because of
the advanced visual analytics that is enabling business owners and
researchers to explore data for finding out trends and patterns.

9.3.2 TURNING DATA INTO INFORMATION

The most exciting part of any analytical study is to find useful information from a plethora of data. Visualisation facilitates the identification of patterns in the form of graphs or charts, which in turn helps to derive useful information. Data reduction and abstraction are generally followed during data mining to get valuable information.
Visual data mining also works on the same principle as simple data mining; however, it involves the integration of information visualisation and human-computer interaction. Visualisation of data produces cluttered images that are filtered with the help of clutter-reduction techniques. Uniform sampling and dimension reduction are two commonly used clutter-reduction techniques; a short sketch of both follows the list of metrics below.
The visual data reduction process involves automated data analysis to measure density, outliers, and their differences. These measures are then used as quality metrics to evaluate the data-reduction activity. Visual quality metrics can be categorised as:
□ Size metrics (e.g. number of data points)
□ Visual effectiveness metrics (e.g. data density, collisions)
□ Feature preservation metrics (e.g. discovering and preserving data density differences)
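The following is a minimal Python sketch of the two clutter-reduction techniques named above, run on a synthetic dataset. Uniform sampling keeps a random fraction of the points; dimension reduction (here PCA, one common choice, via scikit-learn) projects many columns down to two so that the result can be plotted without clutter:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.normal(size=(100_000, 10))   # 100,000 points, 10 dimensions

# Uniform sampling: keep 1% of the rows, chosen without replacement
idx = rng.choice(len(data), size=len(data) // 100, replace=False)
sample = data[idx]

# Dimension reduction: project the sampled points onto 2 components
reduced = PCA(n_components=2).fit_transform(sample)
print(reduced.shape)   # (1000, 2) -- small enough to plot cleanly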

In general, we can conclude that a visual analytics tool should be:
□ Simple enough so that even non-technical users can operate it
□ Interactive to connect with different sources of data
□ Competent to create appropriate visuals for interpretations
□ Able to interpret Big Data and share information
Apart from representing data, a visualisation tool must be able to establish links between different data values, restore the missing data, and polish data for further analysis.

SELF ASSESSMENT QUESTIONS


9. Analytical tools are now able to produce interpretations on smartphones and tablets. (True/False)
10. Big Data generated through __________ sites is a valuable source of information to understand consumer sentiments and demographics.
11. Which of the following is/are the challenges with Big Data?
a. Most data is in unstructured form.
b. Data is not analysed in real time.
c. The amount of data generated is huge.
d. All of these
12. __________ and __________ are two commonly used clutter-reduction techniques.
13. Data reduction and __________ are generally followed during data mining to get valuable information.
14. Which of the following is/are a visual quality metric?
a. Size metric
b. Visual effectiveness metric
c. Feature preservation metric
d. All of these

ACTIVITY
Prepare a report on Big Data visualisation tools that are widely
used by the organisations nowadays.

TOOLS USED IN DATA VISUALISATION


Some useful visualisation tools are listed as follows:
□ Excel: It is a tool that is widely used for data analysis. It helps you to track and visualise data for deriving better insights. This tool provides various ways to share data and analytical conclusions within and across organisations. Figure 9.14 shows an example of an Excel sheet:

Figure 9.14: Excel Sheet


□ Last.Forward: It is open-source software provided by Last.fm for analysing and visualising a social music network. Figure 9.15 shows an example of a Last.Forward visual:

Figure 9.15: Last.Forward



□ Digg.com: Digg.com provides some of the best Web-based visualisation tools.
□ Pics: This tool is used to track the activity of images on a website.
□ Arc: It is used to display the topics and stories in a spherical form. Here, a sphere is used to display stories and topics, and bunches of stories are aligned at the outer circumference of the sphere. Figure 9.16 shows Digg Arc:

Figure 9.16: Digg Arc


Larger stories have more diggs, as shown in Figure 9.16. The arc becomes thicker with the number of times users digg the story.
□ Google Charts API: This tool allows a user to create dynamic charts to be embedded in a Web page. A chart obtained from the data and formatting parameters supplied in a HyperText Transfer Protocol (HTTP) request is converted into a Portable Network Graphics (PNG) image by Google to simplify the embedding process. Figure 9.17 shows some charts created by using Google Charts API:

Figure 9.17: Charts Obtained from Google Charts API (Column Chart, Area Chart, Candlestick Chart, Timeline, Bubble Chart, Donut Chart)

□ TwittEarth: This tool is capable of showing live tweets from all over the world on a 3D globe. It is an effort to improve social media visualisation and provide global image mapping of tweets. Figure 9.18 shows an example of a TwittEarth visual:

Figure 9.18: TwittEarth


Source: http://cybergyaan.com/2010/01/10-supercool-ways-to-visualise-internet.html

□ Tag Galaxy: Tag Galaxy provides a stunning way of finding a collection of Flickr images. It is an unusual site that provides a search tool which makes the online combing process a memorable visual experience. If you want to search for a picture, you have to enter a tag of your choice and it will find the picture. The central (core) star contains all the images directly relating to the initial tag, and the revolving planets consist of similar or corresponding tags. Click on a planet, and additional sub-categories will appear. Click on the central star, and Flickr images gather and land on a gigantic 3D sphere. Figure 9.19 shows a visual created by Tag Galaxy:

Figure 9.19: Tag Galaxy


Source: Taggalaxy.de

□ D3: D3 enables you to bind arbitrary data to a Document Object Model (DOM) and then apply data-driven transformations to the document. For example, you can utilise D3 for creating an HTML table from a sequence of numbers. Or, you can use the same data to develop an interactive SVG bar chart having smooth transitions and interactions. Figure 9.20 shows some complex visuals created through D3:

Figure 9.20: Some Visuals Obtained from D3


Source: http://d3js.org/

□ Rootzmap Mapping the Internet: It is a tool to generate a series of maps on the basis of the datasets provided by the National Aeronautics and Space Administration (NASA). Figure 9.21 shows an example of Internet mapping through Rootzmap:

Figure 9.21: Internet Mapping


Source: http://www.sysctl.org/rootzmap/e-map.jpg

9.4.1 OPEN-SOURCE DATA VISUALISATION TOOLS


We already know that Big Data analytics requires the implementation of advanced tools and technologies. Due to economic and infrastructural limitations, every organisation cannot purchase all the applications required for analysing data. Therefore, to fulfil their requirement of advanced tools and technologies, organisations often turn to open-source libraries. These libraries can be defined as pools of freely available applications and analytical tools. Some examples of open-source tools available for data visualisation are VTK, Cave5D, ELKI, Tulip, Gephi, IBM OpenDX, Tableau Public, and Vis5D.

Open-source tools are easy to use, consistent, and reusable. They deliver high-quality performance and are compliant with Web as well as mobile Web security. In addition, they provide multichannel analytics for modelling as well as customised business solutions that can be altered with changing business demands.

9.4.2 ANALYTICAL TECHNIQUES USED IN BIG DATA VISUALISATION
Analytical techniques are used to analyse complex relationships among variables. The following are some commonly used analytical techniques for Big Data solutions:
□ Regression analysis: It is a statistical tool used for prediction. Regression analysis is used to predict continuous dependent variables from independent variables. In this, we try to find the effect of one variable on another variable. For example, sales increase when prices decrease. A short sketch of this technique follows this list. Types of regression analysis are as follows:
♦ Ordinary least squares regression: It is used when the dependent variable is continuous and there exists some relationship between the dependent variable and the independent variable.
♦ Logistic regression: It is used when the dependent variable has only two potential outcomes.
♦ Hierarchical linear modelling: It is used when data is in nested form.
♦ Duration models: They are used to measure the length of a process.
□ Grouping methods: The technique of categorising observations into significant or purposeful blocks is called grouping. The recognition of features to create a distinction between groups is called discriminant analysis.
□ Multiple equation models: They are used to analyse causal pathways from independent variables to dependent variables. Types of multiple equation models are as follows:
♦ Path analysis
♦ Structural equation modelling
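The following is a minimal Python sketch of ordinary least squares regression for the price-and-sales example above. All the numbers are invented for illustration; the fit itself uses NumPy's polyfit, one standard way to compute a least squares line:

import numpy as np

price = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
sales = np.array([200.0, 180.0, 150.0, 130.0, 110.0, 80.0])

# Fit sales = a * price + b by least squares;
# polyfit returns coefficients from highest degree to lowest
a, b = np.polyfit(price, sales, deg=1)
print(f"sales = {a:.1f} * price + {b:.1f}")

# The slope a comes out negative: sales fall as prices rise,
# which is the dependence described in the example above.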

SELF ASSESSMENT QUESTIONS


15. __________ is open-source software provided by Last.fm for analysing and visualising a social music network.
16. Google Charts API tool allows a user to create dynamic charts to be embedded in a Web page. (True/False)
17. Tag Galaxy provides a stunning way of finding a collection of __________ images.

ACTIVITY
Collect information about the pivot table used in Excel for representing data.

SUMMARY
□ Visualisation is a pictorial or visual representation technique.
□ Anything which is represented in pictorial or graphical form, with the help of diagrams, charts, pictures, flowcharts, etc. is known as visualisation.
□ Data presented in the form of graphics can be analysed better than the data presented in words.
□ Infographics are the visual representation of information or data.
□ Data visualisation approach is different from infographics. It is the study of representing data or information in a visual form.
□ Data can be presented in various visual forms, which include simple line diagrams, bar graphs, tables, matrices, etc.
□ The multimedia and entertainment industry uses visuals to communicate ideas and information.
□ The data generated by social media interaction is interpreted using visual analytics techniques.
□ Apart from the type of data, the volume and speed with which data is generated pose a great challenge.
□ Because of the heterogeneity of data sources, data streaming, and real-time data, it becomes difficult to handle Big Data by using traditional tools.
□ The visual data reduction process involves automated data analysis to measure density, outliers, and their differences.

KEYWORDS
□ Graph: It is a representation in which X and Y axes are used to depict the meaning of the information.
□ Diagram: It is a two-dimensional representation of information to show how something works.
□ Timeline: It is a representation of important events in a sequence with the help of self-explanatory visual material.
□ Flowchart: It is a representation of instructions which shows how something works or a step-by-step procedure to perform a task.
□ Isosurfaces: These are designed to represent points that are bound by a constant value in a volume of space.

DESCRIPTIVE QUESTIONS


1. What do you understand by data visualisation? List the different ways of data visualisation.
2. Describe the different techniques used for visual data
representation.
3. Discuss the types and applications of data visualisation.
4. Describe the importance of Big Data visualisation.
5. Elucidate the transformation process of data into information.
6. Enlist and explain the tools used in data visualisation.
7. Describe the analytical techniques used in data visualisation.

ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                   Q. No.  Answers
What is Visualisation?                  1.      d. All of these
                                        2.      True
                                        3.      digitally
                                        4.      False
                                        5.      c. Flowchart
                                        6.      Direct Volume Rendering
                                        7.      Venn
                                        8.      True
Importance of Big Data Visualisation    9.      True
                                        10.     social media
                                        11.     d. All of these
                                        12.     Uniform sampling; dimension reduction
                                        13.     abstraction
                                        14.     d. All of these
Tools Used in Data Visualisation        15.     Last.Forward
                                        16.     True
                                        17.     Flickr

HINTS FOR DESCRIPTIVE QUESTIONS


1. Visualisation is a pictorial or visual representation technique. Anything which is represented in pictorial or graphical form, with the help of diagrams, charts, pictures, flowcharts, etc. is known as visualisation. Refer to Section 9.2 What is Visualisation?
2. Data can be presented in various visual forms, which include simple line diagrams, bar graphs, tables, matrices, etc. Refer to Section 9.2 What is Visualisation?
3. Data can be visualised in many ways, such as in the form of 1D, 2D, or 3D structures. Refer to Section 9.2 What is Visualisation?
4. Visual analysis of data is not a new thing. For years, statisticians and analysts have been using visualisation tools and techniques to interpret and present the outcomes of their analyses. Refer to Section 9.3 Importance of Big Data Visualisation.
5. The most exciting part of any analytical study is to find useful information from a plethora of data. Refer to Section 9.3 Importance of Big Data Visualisation.
6. Excel is a tool that is widely used for data analysis. It helps you to track and visualise data for deriving better insights. Refer to Section 9.4 Tools Used in Data Visualisation.
7. Analytical techniques are used to analyse complex relationships among variables. Refer to Section 9.4 Tools Used in Data Visualisation.

SUGGESTED READINGS & REFERENCES


SUGGESTED READINGS
□ Kirk, A. (2016). Data visualisation: a handbook for data driven design. Los Angeles: Sage Publications.
□ Evergreen, S. (2017). Effective data visualization: the right chart for the right data. Los Angeles: Sage.
□ Kirk, A. (2012). Data visualization: a successful design process. S.l.: Packt Publ.

E-REFERENCES
□ Data visualization. (2017, April 26). Retrieved May 02, 2017, from https://en.wikipedia.org/wiki/Data_visualization
□ Suda, B., & Hampton-Smith, S. (2017, February 07). The 38 best tools for data visualization. Retrieved May 02, 2017, from http://www.creativebloq.com/design-tools/data-visualization-712402
□ 50 Great Examples of Data Visualization. (2009, June 01). Retrieved May 02, 2017, from https://www.webdesignerdepot.com/2009/06/50-great-examples-of-data-visualization/
CONTENTS

10.1 Introduction
10.2 Financial and Fraud Analytics
Self Assessment Questions
Activity
10.3 HR Analytics
Self Assessment Questions
Activity
10.4 Marketing Analytics
Self Assessment Questions
Activity
10.5 Healthcare Analytics
Self Assessment Questions
Activity
10.6 Supply Chain Analytics
Self Assessment Questions
Activity
10.7 Web Analytics
Self Assessment Questions
Activity
10.8 Sports Analytics
Self Assessment Questions
Activity
10.9 Analytics for Government and NGO's
Self Assessment Questions
Activity
10.10 Summary
10.11 Descriptive Questions
10.12 Answers and Hints
10.13 Suggested Readings & References
INTRODUCTORY CASELET

MIAMI BASEBALL TEAM USED SPORTS ANALYTICS TO PERFORM BETTER

The Miami Red Hawks is a National Collegiate Athletic Association (NCAA) Division I American baseball team which belongs to Miami University in Oxford, Ohio. Dan Hayden is the present coach of the Red Hawks. This team is also a member of the Mid-American Conference East Division. Miami's first baseball team played in 1915.
The Red Hawks were trying to get a competitive advantage over their competitors with the help of baseball statistics and analytics. This need was felt because of the frequent changes in this sport to make it more interesting and competitive. Data analytics is widely used, and its use is very popular at the professional level in comparison to amateur sports. Miami baseball wanted to be updated and competitive in this sport with the use of analytics. The team wanted to use analytics for analysing pitching, which includes the type, speed and location of the pitch.
After searching various available tools in the market for sports analytics, the team decided to use Vizion360 Impact Analytics. This analytics involves the use of a visualisation tool, Microsoft's Power BI, at the front end. This tool provides deep insight into the collected data. It provides a detailed summary of the performance of the team and the player at the individual level. The summary also includes statistics related to pitching and batting. After studying this data, the team worked on improving their performances in intrinsic situations of the game.
The tool also helped in improving the productivity of the players and helped coaches in refraining from making lame assumptions about the performance of players by providing deep statistics about the earlier performance of players, which was not possible by using simple statistics. This tool also helped in doing custom visualisation instead of using spreadsheets for cut/paste reporting. The results obtained after applying sports analytics using Vizion360 are as follows:
□ The team became capable of analysing a pitch's type, location and speed.
□ The team can access the statistics, as and when required, for analysing the current situation.
□ Before the implementation of Vizion360, the team coaches used to select players on their gut feeling. Now, they can check performance data before selecting a player.
□ Vizion360 helped coaches in better decision making against a particular team.
□ Coaches can save and analyse the data even on their mobile devices.

LEARNING OBJECTIVES

After studying this chapter, you will be able to:


- Describe the concept of financial and fraud analytics
- Explain the importance of HR analytics
- Discuss marketing analytics
- Define healthcare analytics
- State the significance of supply chain analytics
- Describe the functions of Web analytics
- Explain the functions of sports analytics
- Discuss how analytics is used by the government and NGOs

10.1 INTRODUCTION
Business analytics has emerged as a growth driver for most new-era organisations. Gone are the days when managers made decisions on the basis of gut feeling or relied only on large-scale financial indicators and their likely effect on individual organisations. Decisions made without data and information have proved unfortunate for many organisations. With the advent of information technology and the increased data-handling ability of computers, managers are using numerous methods to anticipate the future of the business and enhance the profitability of the enterprise. The application of descriptive and predictive analytics, customer relationship management tools and various process improvement tools brings benefits to the organisation. The entire business world is looking at big data as an opportunity and a source of competitive advantage.

Business analytics has expanded consistently over the previous decade, as confirmed by the constantly growing business analytics software market. It is targeting more organisations and reaching out to more users, from administrators and line-of-business managers to analysts and other information specialists, within organisations.

This chapter first discusses financial and fraud analytics. Next, the chapter explains HR analytics, marketing analytics and healthcare analytics. The chapter also explains supply chain analytics and Web analytics. Towards the end, the chapter discusses sports analytics and how analytics is used by the government and NGOs for providing various beneficial services to people.

10.2 FINANCIAL AND FRAUD ANALYTICS


Fraud impacts organisations in several ways, which might be financial, operational or psychological. While the financial loss owing to fraud is significant, the full effect of fraud on an organisation can be even more damaging. As fraud can be committed by any worker inside an organisation or by an external party, it is essential for an organisation to have a successful fraud management or fraud analytics program to defend its reputation against fraud and prevent financial loss. Many vendors, such as Simility, MATLAB and Actimize, provide fraud detection software or suites to detect fraud at an early stage so that appropriate measures can be taken to prevent it. Numerous organisations remain vulnerable to fraud and financial crime because they are not exploiting new capabilities to battle today's threats. These capabilities depend heavily on big data and the analytic innovations that are now accessible.

With these advancements, organisations can manage and analyse terabytes of historical and third-party data. The capacity to analyse enormous data volumes enables organisations to build accurate and precise models for recognising and preventing future fraud.

By utilising the most recent advancements in robust analytics, organisations can confidently protect themselves and their clients with regard to the privacy and security of data while doing business with them or offering services that require their personal data to be used.

Advanced analytics can also be applied to all key fraud information to foresee whether an activity is potentially fraudulent before losses happen. Looking at only small sets of security information, for example event logs, reduces a bank's capacity to anticipate or identify sophisticated crime. The greater the volume and variety of information an organisation can analyse, and the greater the velocity at which it can do so, the better the organisation can guard against internal and external threats.

Intelligent investigation of suspicious activity requires performing and managing queries that are supported by careful analysis and data availability. With these tools, organisations can rapidly confirm fraud, after which further activities such as prosecution and recovery can be undertaken.

HARNESS EXISTING HISTORICAL INFORMATION

Organisations can use already recorded information and analyse it to detect and prevent fraud in the future. This information also helps in tracing the past and likely future footprints of fraud. Recorded information related to fraud can help organisations prevent huge losses of money and of data related to the business or its clients.

Data management software empowers auditors and fraud analysts to examine an organisation's business information, gain knowledge of how well internal controls are working and distinguish transactions that appear to be fraudulent. Generally, data analysis can be performed wherever in an organisation electronic transactions are recorded and stored.

There is no doubt that data analysis gives a powerful approach for becoming more proactive in the battle against fraud. Companies also use whistleblower hotlines, which help individuals report suspected fraudulent or unsafe conduct and violations of law and policy. However, hotlines alone are insufficient. Why be merely reactive and wait for a whistleblower to come forward as a last resort? Why not search out indicators of fraud in the information itself? To successfully test for fraud, every important transaction must be analysed across all pertinent business systems and applications. Analysing business transactions at the source level provides auditors with better knowledge and a more complete view of the probability of fraud occurring. The analysis involves investigating those activities that are suspicious and identifying control weaknesses that could be exploited by fraudsters.

SELF ASSESSMENT QUESTIONS

1. Companies also use __________ hotlines to help individuals report suspected fraudulent or unsafe conduct and violations of law and policy.
2. __________ can also be applied to all key fraud information to foresee whether an activity is potentially fraudulent before losses happen.
3. It is essential for an organisation to have a successful fraud management or fraud analytics program to defend its reputation against fraud. (True/False)

ACTIVITY

Collect information from a nearby local bank related to the impact of fraud on the financial system and the measures taken by the banking institution to reduce fraud. Prepare a report on this topic.

10.3 HR ANALYTICS
Human Resource (HR) analytics, additionally called talent analytics, is the application of sophisticated data mining and business analytics (BA) techniques to HR data. HR analytics is an area in the field of analytics that refers to applying analytic processes to the human resource department of a company in the expectation of enhancing employee performance and, with it, the degree of profitability. Organisations generally move to HR analytics and data-led solutions when there exist problems that cannot be resolved with current management practices.

HR analytics does not simply involve gathering information on employee performance and efficiency; it also provides deeper insight into each process by accumulating data and then using it to make important decisions about improving those processes.

HR analytics establishes a relationship between business data and people data, which further helps in building important connections between them. The main aspect of HR analytics is to show the impact of the HR department on the organisation as a whole. HR analytics also helps in establishing a cause-and-effect relationship between the tasks of HR and business outcomes, and in framing strategies on the basis of that information.

The core functionalities of HR can be improved by applying various analytic processes, which include acquiring, optimising, paying and developing the organisation's workforce. HR analytics can also help in uncovering problems and challenges using analytical workflows and guide managers in answering questions. It also helps managers gain deeper insight from the information at hand, make important decisions and take proper actions.

The field of HR analytics can be further divided into the following segments:
□ Capability analytics: It is a talent management process that enables you to identify the capabilities or core competencies that you require in your business. It helps in identifying the capabilities of your workforce, including skills, levels and expertise.
□ Competency acquisition analytics: It refers to the process of assessing how well, or otherwise, your business can attain the required competencies. Acquiring and managing talent is critical for the growth of a business.
□ Capacity analytics: It helps in identifying how many operationally efficient people are in the business. For example, it identifies whether people are spending time on profitable work or not.
□ Employee churn analytics: Hiring employees and training them involve time and money. Employee churn analytics refers to the process of estimating staff turnover rates in order to predict the future and reduce employee churn (a minimal churn-model sketch follows this list).
□ Corporate culture analytics: It refers to the process of assessing and understanding the corporate culture, or the different cultures that are followed, across an organisation.

□ Recruitment channel analytics: It refers to the process of finding the sources of the best employees and the most efficient recruitment channels.
□ Employee performance analytics: Every organisation requires capable, well-performing employees to survive and thrive. Employee performance analytics is used to assess the performance of an individual employee. The resulting information can be used to determine which employees are performing efficiently and which may require some extra support or training to improve their performance.

SELF ASSESSMENT QUESTIONS

4. HR analytics is also known as __________ analytics.
5. HR analytics helps managers in gaining deeper details from the information at hand, then making important decisions and taking proper actions. (True/False)
6. __________ analytics helps in identifying how many operationally efficient people are in the business.

ACTIVITY

Visit an organisation and meet its HR executives to learn how HR analytics has helped them motivate their employees and reduce employee turnover over the last five years.

10.4 MARKETING ANALYTICS
Every organisation strives to gain an edge over its competitors. This is possible if an organisation develops an effective industry-level strategy. For this, an organisation needs to analyse various forces, such as the level of competition in the market, the entry of new organisations and the availability of substitute products. For this purpose, organisations use marketing analytics.

Marketing analytics is the practice of measuring, managing and analysing marketing performance to maximise its effectiveness and improve return on investment (ROI). Marketing analytics helps in providing deeper insight into customer preferences and trends. Despite these benefits, a majority of organisations fail to realise the benefits of marketing analytics. With the advancement of search engines, paid search marketing, search engine optimisation (SEO) and efficient new software solutions, marketing analytics has become more effective and easier to implement than ever.
:

N O T E S

You need to follow the three steps below to get the benefits of marketing analytics:
1. Practice a balanced collection of analytic methods
In order to get the best benefits from marketing analytics, you need an analytic portfolio that is balanced - that is, one that combines methods for:
♦ Covering the past: Use marketing analytics to study the past and answer queries such as: Which campaign element generated the most revenue last quarter?
♦ Exploring the present: Marketing analytics enables you to decide how your marketing activities are performing right now by asking questions such as: How are customers engaging? Which channels do customers use to gain maximum benefit? What is the reaction across different social media channels to the company's image?
♦ Predicting and influencing the future: Marketing analytics can be used to deliver data-driven predictions that shape the future by asking questions such as: How can we turn short-term wins into loyalty and continuous engagement? How many more sales representatives do we need to meet expectations? Which cities should we focus on next, given our present position?
2. Evaluate your analytical capabilities and fill in the gaps
Marketing organisations have access to a lot of analytic capabilities for supporting different marketing goals. Assessing your present analytic capabilities is necessary to attain these goals. It is important to know where you stand along the analytic spectrum, so that you can identify gaps and take steps to create a strategy for filling them.
Consider an example in which a marketing organisation is already gathering data from sources like the Internet and POS transactions but is giving no importance to the unstructured information coming from social media platforms. Such unstructured sources are very useful, and the technology for transforming unstructured data into actual insights is available to marketers today. A marketing organisation can plan and allocate budget for adding the analytic capabilities needed to fill that particular gap.
3. Take action on analytical findings
The information collected through marketing analytics is not useful until you act on it. In the continuous process of testing and learning, marketing analytics allows you to enhance the performance of your marketing program as a whole by:
♦ Determining deficiencies in a channel
♦ Adjusting strategies and tactics as and when required
♦ Optimising processes
Without the capability to test and evaluate the performance of your marketing programs, you cannot know what worked and what did not, whether things need to be changed, or in what manner. In other words, if you are using marketing analytics to evaluate success and then doing nothing with the insight gained, what is the point of using analytics? Marketing analytics enables better, more successful marketing for your efforts and investments. It can lead to better management, which helps in generating more revenue and greater profitability.
SELF ASSESSMENT QUESTIONS

7. SEO stands for
a. Search engine optimisation
b. Searching engine optimisation
c. Search engine operation
d. None of these
8. Marketing analytics enables you to decide how your marketing activities are performing at this moment. (True/False)
9. The information collected after performing marketing analytics remains useful whether or not you act on it. (True/False)
ACTIVITY
Prepare a report on the total sales and revenue generated by a
store at your nearby location by using marketing analytics.

10.5 HEALTHCARE ANALYTICS


Healthcare analytics is a term used to describe the analysis of healthcare activities using the data generated and collected from different areas of healthcare, such as pharmaceutical data, research and development (R&D) data, clinical data, and patient behaviour and sentiment data. In addition, data is generated from patients' buying behaviour in stores, claims made by patients, patients' preferences in selecting activities and more. Analytics is applied to this data to gain insight for providing healthcare services in a better way.

Organisations in the field of healthcare are quickly adopting data systems to enhance both business operations and clinical care. Many classes of data systems have developed in the healthcare sector, ranging from electronic medical records (EMRs) and specialty care management to supply chain systems.

Healthcare organisations are also implementing approaches such as Lean and Six Sigma to take a more patient-driven focus, lessen errors and waste and improve patient flow, with the objective of enhancing quality. The healthcare analytics industry is growing, and it is estimated that it will cross $18.7 billion by 2020 in the United States (US) alone. The industry also emphasises various areas such as financial analysis, clinical analysis, fraud analysis, supply chain analysis and HR analysis. Basically, healthcare analytics is based on identifying patterns in healthcare data to determine how clinical care can be enhanced while minimising excessive cost.

In addition to revealing data about present and past organisational performance, analytical tools are used to study large collections of information with statistical analysis procedures, to uncover and comprehend recorded information with an eye to predicting and enhancing operational performance later on.

Healthcare analytics is used as a measurement instrument for getting deeper insight into healthcare-related information, with the end goal of understanding past performance (i.e., operational execution or clinical results) and enhancing the quality and efficiency of clinical and business procedures in the future.

As the volume and accessibility of healthcare information keep increasing, healthcare organisations progressively need to depend on analytics as a key competency to understand and enhance their operations.

IMPLEMENTING REAL-TIME HEALTHCARE DATA ANALYTICS

Healthcare data is not easily available in a unified and informative way, which restricts the industry's endeavours to enhance quality and effectiveness in healthcare. Real-time analytics tools are used in healthcare to address these issues by bringing data from various sources into a single location and presenting it in a unified manner so that fruitful information can be derived from it.

Moreover, the insight gained from analysing the huge volumes of collected health information can provide noteworthy knowledge to enhance operational quality and effectiveness for providers, insurers and others. The healthcare industry is quickly transitioning from volume-based to value-based healthcare. Now more than ever, analytics is crucial for clinicians and health service providers so that they can identify and address gaps in care, quality and risk, and use it to support improvements in clinical and quality results and financial performance.

Real-time analytics is capable of continuous reporting that illustrates the status of patients and how to enhance the current quality of services. It can also give instant and exact insight into a patient's medical history, including past clinical conditions, diagnoses, medications, utilisation and outcomes, irrespective of the patient's geographical location.
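To make one such reporting measure concrete, the following is a minimal Python sketch of a common quality KPI, the 30-day readmission rate. The patient records are invented, they are assumed to be sorted by admission date, and counting every discharge in the denominator is a simplification of the clinical definition.

from datetime import date

admissions = [
    # (patient_id, admitted, discharged)
    ("P1", date(2017, 1, 3),  date(2017, 1, 8)),
    ("P1", date(2017, 1, 20), date(2017, 1, 25)),  # back within 30 days
    ("P2", date(2017, 2, 1),  date(2017, 2, 6)),
    ("P3", date(2017, 2, 10), date(2017, 2, 14)),
    ("P3", date(2017, 4, 1),  date(2017, 4, 5)),   # back after 30 days
]

readmissions, discharges = 0, 0
last_discharge = {}
for pid, admitted, discharged in admissions:
    # A readmission is a new admission within 30 days of the last discharge.
    if pid in last_discharge and (admitted - last_discharge[pid]).days <= 30:
        readmissions += 1
    last_discharge[pid] = discharged
    discharges += 1

print(f"30-day readmission rate: {readmissions / discharges:.0%}")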

SELF ASSESSMENT QUESTIONS

10. EMR stands for
a. Electrical Medical Records
b. Electronic Medical Records
c. Electronic Mediclaim Records
d. None of these
11. Healthcare analytics is based on the verification of patterns in healthcare data for determining how clinical care can be enhanced while minimising excessive cost. (True/False)
12. __________ analytics is capable of continuous reporting that illustrates where a patient stands and how to enhance the quality of services.

ACTIVITY
Study how healthcare analytics has helped in improving the care
delivered in your nearby hospital.

10.6 SUPPLY CHAIN ANALYTICS


A supply chain is an arrangement of organisations, individuals, activities, data and assets required to move an item or service from supplier to customer. Generally, a supply chain comprises suppliers, manufacturers, wholesalers, retailers and customers.

Intense competition and the compulsion to reduce cost have impelled organisations to maintain effective supply chain networks. Therefore, organisations came up with various tools and techniques for effectively managing a supply chain. Globalisation gave a major push to supply chain management. Organisations that operate in a highly competitive global environment need to have a highly effective supply chain management system in place. For example, Apple faces huge demand for its products as soon as they are announced in the market. Most Apple products are manufactured in China; therefore, Apple needs a highly efficient supply chain to ship items from China to different countries in the world.

It can be clearly concluded from the above discussion that a supply chain is a dynamic process in which various parties, such as suppliers and distributors, are involved in delivering products and services to fulfil customer requirements. Thus, in the absence of a supply chain, there would be disruptions in the flow of products and information.

It can be said that a supply chain plays an important role in an organisation. Thus, it is of utmost importance for an organisation to manage the activities involved in its supply chain. The activities in a supply chain convert raw material into a final product, which can then be delivered to the customer.

Almost every economy is getting globalised today, and companies are competing to increase their presence in the global market. The operations performed by global manufacturing and logistics teams are getting more intricate and challenging. Delays in shipments, ineffective planning and inconsistent supplies can lead to an increase in a company's supply chain cost. Some issues faced by supply chain organisations are as follows:
□ Visibility of the global supply chain and the various processes in logistics
□ Management of demand volatility
□ Fluctuations of cost in a supply chain

To overcome such challenges, supply chain analytics is used by most organisations and supply chain executives. Organisations are planning to increase their investment in analytics to perform better in the market than their competitors. With the improvements in supply chain analytics over the past few years, it helps in making decisions for critical tactical and strategic supply chain activities. The insights gained from these activities help supply chain organisations in reducing excessive cost and optimising their supply chains.

The various solutions provided by supply chain analytics to supply chain organisations are as follows:
□ Use of smarter logistics: The use of smarter logistics helps supply chain organisations gain more visibility in the global market. With the growth of businesses, opportunities have developed worldwide to attract customers by satisfying their product needs irrespective of geographical location. As customers are present worldwide, a complex web of supply chains has been created that must be monitored closely to remain competitive in business.

The use of advanced analytics-driven 'control metrics' allows the monitoring of real-time critical events and key performance indicators (KPIs) across various touch points. The integration of these metrics with predictive analytics provides large savings in areas such as freight optimisation. Organisations that invest in supply chain visibility can take decisions to increase supply chain responsiveness, optimise cost and minimise customer impact.
□ Managing customer demand through inventory management: Due to globalisation and variations in products to fulfil the requirements of globally distributed customers, demand volatility has risen to a significant level. Industries in sectors such as retail, consumer goods and automotive need daily or real-time predictions to perform better in the market. Advanced supply chain analytics can be applied in these sectors or related industries to forecast demand more precisely and to define and monitor policies related to supply and replenishment. It is also used for planning the inventory flow of goods and services (a minimal reorder-point sketch follows this list).
□ Reducing cost by optimising sourcing and logistics activities: The cost involved in the supply chain is a major portion of a company's overall cost. Supply chain costs significantly impact various financial metrics such as the cost of goods sold, working capital and cash flow. There is a constant requirement to improve the financial performance of organisations that manage huge amounts of inventory. The main areas where costs can be handled using analytics-driven intelligence include materials, logistics and sourcing. Analytical tools help in providing better visibility into the actual total component cost of products, which is necessary for making decisions regarding the buying and selling of products. With complete information at their fingertips, organisations can reduce material purchases through improved supply chain practices and better price negotiation. Fluctuating patterns of customer demand and an increased base of suppliers and logistics partners make organisations redesign their logistics network planning. Companies can make strong ROI improvements by using analytics-driven planning tasks, which include route optimisation, load planning, fleet sizing and freight cost settlement. As a business grows, its suppliers also increase; companies can apply analytics to play these suppliers against each other and get the lowest price from them. Supply chain managers can use sophisticated analytics programs, which provide real-time supplier performance management data, to improve their strategies.

SELF ASSESSMENT QUESTIONS

13. Intense competition and the compulsion to reduce cost have impelled organisations to maintain an effective supply chain network. (True/False)
14. __________ supply chain analytics can be implemented in these sectors or related industries to more precisely forecast demand and to define and monitor policies related to supply and replenishment.

ACTIVITY

Prepare a report on how the use of business analytics tools in the supply chain has helped in improving production in the manufacturing industry.

10.7 WEB ANALYTICS
Web analytics refers to the measuring, collecting, analysing and reporting of Web data to understand and optimise Web usage. However, Web analytics is not restricted to the measurement of Web traffic; it can also be utilised as a method of performing business and market research.

Web analytics also helps companies in measuring the outcomes of traditional print or broadcast advertising campaigns, in estimating how traffic to a website changes after the launch of a new advertising campaign, in providing accurate figures for visitors to a website and page views, and in gauging Web traffic and popularity patterns, which are useful in market research. The four basic steps of Web analytics are as follows:
□ Collection of information: This stage involves gathering basic or elementary data. This data typically involves counting things.
□ Processing of data into information: The purpose of this stage is to process the collected data and derive information from it.
□ Developing KPIs: This stage focuses on combining the derived information with business methodologies, referred to as key performance indicators (KPIs).
□ Formulating an online strategy: This stage emphasises setting online goals, objectives and standards for the organisation or business. It also lays emphasis on making and saving money and increasing market share.

There are two categories of Web analytics: off-site Web analytics and on-site Web analytics. Off-site Web analytics allows Web measurement and analysis irrespective of whether you own or maintain a website. It includes the measurement of a website's potential audience, its visibility and the comments being made about it on the Internet. On-site Web analytics, on the other hand, is used to measure the behaviour of a visitor once on the website. It is used to measure the effectiveness and performance of your website in a commercial context.

The data generated is then compared against KPIs for performance and is used for the improvement of a website. Google Analytics and Adobe Analytics are popular on-site Web analytics services. Heat maps and session replays are examples of newer on-site Web analytics tools.

There are mainly two technical methods of gathering the data. The first is server log file analysis, in which the log files that the Web server uses to record file requests from browsers are read and analysed. The second method, known as page tagging, uses JavaScript embedded in the Web page to track it. Both methods gather data that can be processed to generate Web traffic reports, though the second method generally provides more accurate results than the first.

Web analytics is helpful to any business for deciding market segments, determining the target market, analysing market trends and understanding the behaviour of site visitors. It is additionally helpful for understanding visitors' interests and priorities. Some important uses of Web analytics for business growth are as follows:
□ Measure Web traffic: Web analytics can track the number of users visiting the site and identify the sources from which they are coming. It also focuses on the keywords that visitors use to search for items on the website. It also shows the number of visitors reaching the Web page through diverse sources such as search engines, email, social media and promotions.
□ Estimate visitor count: Visitors, on the Internet, refer to the number of unique individuals who visited the site. Frequent or large numbers of visits show the traffic the site is getting. A Web analytics tool helps in determining how frequently a visitor returned to a site and which pages of the site were given more preference by visitors. It additionally reveals various attributes of a visitor, such as country and language. Web analytics also provides reports on the time spent by a particular visitor on the website, or the total time spent by visitors as a whole. Such reports help to enhance pages and reduce their bounce rate (or low engagement). They additionally show which pages have high engagement time and indicate the items or services a visitor may be interested in.
□ Track bounce rate: A bounce describes a situation in which a visitor lands on a page of the site and leaves without taking any action or clicking any links on that page. A high bounce rate could mean visitors were unable to find what they were searching for on the site.
□ Identify exit pages: An exit is the point at which a visitor, having viewed various pages on the site, leaves the site. A few pages on a site may legitimately have a high exit rate, such as the thank-you page shown on an e-commerce website after a purchase is completed successfully. However, a high exit rate on a particular page may indicate that the page has a problem and should be investigated quickly. Such pages should be examined to determine whether visitors are failing to find the information for which they visited the website. Web analytics tools help in finding such pages quickly and rectifying their problems.
□ Identify target market: It is essential for advertisers to understand their visitors and deliver information according to their requirements. The findings of analytics services reveal current market demands, which generally vary with geographic area. By utilising Web analytics, marketers can track the volume and geographical information of visitors and can offer things according to visitors' interests.

SELF ASSESSMENT QUESTIONS

15. __________ analytics helps in gauging Web traffic and popularity patterns, which are useful in market research.
16. The two categories of Web analytics are __________ Web analytics and __________ Web analytics.

ACTIVITY

Visit a Web hosting company and try to learn how Web analytics can help the company monitor the activity on the websites hosted on its servers.

10.8 SPORTS ANALYTICS
Sports analytics is the technique of analysing relevant and historical information in the field of sports, mainly to perform better than other teams or individuals. The information gathered in sports is analysed by coaches, players and other staff members for decision making both during and prior to sporting events. With the rapid advancement of technology in the past few years, data collection has become more precise and relatively easier than before. This advancement in data collection has also contributed to the growth of sports analytics, as it relies entirely on the collected pool of data. The growth in analytics has in turn led to the building of technologies such as fitness trackers and game simulators. Fitness trackers are smart devices that provide data about the fitness of players, on the basis of which coaches can decide whether or not to include particular players in the team. Game simulators help in practising the game before the actual sporting event takes place.

Sports analytics not only modifies the way a game is played but also changes the way the performance of players is recorded. National Basketball Association (NBA) teams are now using player tracking technology that can evaluate the efficiency of a team by analysing the movement of its players. As per the information provided by the SportVU software website, NBA teams have installed six cameras to track the movements of each player on the court, and of the basketball, 25 times per second. The data collected using these cameras provides a significant amount of innovative statistics based on speed, player separation and ball possession; for example, how fast a player moved, how much distance he covered during the game, how many times he passed the ball and much more. On the basis of the data collected, strategies are created to win the game or to improve performance in the game.

Sports analytics has also found application in the field of sports gambling. The availability of more accurate information about teams and players on websites has taken sports gambling to new levels. Analytics information helps gamblers in making better decisions and attaining accuracy in predicting the outcomes of games or the performance of a particular player. In addition to websites and Web pages, a number of companies also provide minute details of players or teams to gamblers to fulfil their betting requirements. Sports gambling contributes 13% of the global gambling industry, which is valued somewhere between $700 and $1,000 billion. Some of the popular websites that provide betting services to users are bet365, bwin, Paddy Power, Betfair and Unibet.

SELF ASSESSMENT QUESTIONS

17. Fitness trackers are __________ devices that provide data about the fitness of players.
18. Sports analytics does not contribute to the field of sports gambling. (True/False)

ACTIVITY
Discuss with your friends how analytics can be used in the field of
sports to enhance the energy of players while protecting them from
injuries.
10.9 ANALYTICS FOR GOVERNMENT AND NGOs
Data analytics is also playing its role in the government sector. It is not only important for governments but equally beneficial for non-governmental organisations. Data analytics is used by these organisations to get deeper insight from data. These insights are used by the organisations for modernising their services, tracking progress and determining solutions faster.

Big data analytics is used in almost every part of the world for deriving useful information from huge sets of data. Not only private organisations and industries are employing data analytics; many government enterprises are also adopting it to take smart decisions for the benefit of their citizens. A lot of data is generated in the government sector, and processing and analysing this data helps the government in improving its policies and services for citizens. Some benefits of data analytics in the government sector are as follows:
□ With the rise of national threats and criminal activities these days, it is important for any government to ensure the safety and security of its citizens. With the help of data analytics, intelligence organisations can detect crime-prone areas and be prepared to prevent or stop any kind of criminal activity.
□ Analytics also helps in detecting the possibility of cyber attacks, identifying criminals and detecting their patterns of attack. The government can, therefore, take appropriate action in advance to protect people from any kind of financial loss.
□ Government can use analytics to track and monitor the health of its citizens. It can also be used for tracking disease patterns. The government can launch proper healthcare facilities in advance in areas prone to diseases. It also helps in arranging and managing free medicines, vaccinations, etc., in order to save people's lives.
□ Real-time analysis and sensors help government departments in managing the city's water supply. Officials can detect issues in the flow of water and the pollution level in water, predict scarcity of water on the basis of usage, detect areas of leakage, etc. Government departments can take proper action to avoid these issues and ensure the supply of clean water in the city.
□ Government organisations also use analytics to detect tax fraud and predict revenue. The government can take the necessary steps to prevent tax fraud and increase revenue.
□ Government can also use analytics in the field of agriculture to know the appropriate time for cultivating crops, the fertilisers required for crops, etc. Moreover, the government can take prior action to prevent damage to crops in the face of various environmental challenges.

You can say that data analytics is helping governments build smart cities with the capability for fast detection and rectification of problems. For example, in India, the government led by Prime Minister Narendra Modi has been encouraging people to adopt the Digital India initiative. This will lead to ease of collection and quicker availability of data for analytics to detect flaws in money transactions and prevent people from becoming victims of fake currency.

Data analytics also helps NGOs in improving their services to needy or poor people. Mainly, NGOs help people in several ways, such as by providing free education, books, medicines, clothes, etc. NGOs use data analytics to become more efficient in raising and allocating funds, predicting trends and planning campaigns, identifying prospective donors, encouraging donors who have made contributions earlier, and so on. Consider the case of a non-profit organisation, the Akshaya Patra Foundation, which supplies food to government schools in Bangalore. The foundation was finding it difficult to supply food to government schools due to the high cost involved. Therefore, it looked for a cost-effective solution to deliver food to schools without any interruption.

According to Chanchalapathi Dasa, Vice-Chairman of Akshaya Patra, the foundation uses 34 routes for delivering food to government schools in Bangalore, and the expenditure on each route is approximately Rs. 60,000 per month. Therefore, the organisation decided to use data analytics to find a cost-effective solution to this problem.

While analysing the various parameters involved in food delivery, such as the number of vehicles utilised and the time and fuel used on each route, they found that Rs. 3 lakh could be saved by reducing the number of routes by five (five routes × Rs. 60,000 per month = Rs. 3,00,000 per month).

Besides Akshaya Patra, several other large NGOs, such as the Bill and Melinda Gates Foundation India, Save the Children India, and Child Rights and You (CRY), are also utilising data to raise their efficiency in obtaining and allocating funds, predicting trends and planning campaigns.

These NGOs often face difficulties with data collection because they use traditional methods. To overcome these challenges, NGOs have allotted mobile phones equipped with apps so that real-time collection and recording of data can take place. Data recorded in this manner is accurate and gives more precise information, on the basis of which further decisions or action plans can be made.

SELF ASSESSMENT QUESTIONS


19. NGO stands for
a. Non-governmental organisation
b. Non-governer organisation
c. Non-governing organisation
d. None of these
20. Analytics is helpful for government in building smart cities.
(True/False)

ACTIVITY

Visit a nearby NGO and try to learn how analytics has helped it improve its services and focus more on the overall development of the people or area it serves.

10.10 SUMMARY
□ Business analytics has expanded consistently over the previous decade, as confirmed by the constantly growing business analytics software market.
□ Fraud impacts organisations in several ways, which might be financial, operational or psychological.
□ Numerous organisations remain vulnerable to fraud and financial crime because they are not exploiting new capabilities to battle today's threats.
□ Organisations generally move to HR analytics and data-led solutions when there exist problems that cannot be resolved with current management practices.
□ Marketing analytics helps in providing deeper insight into customer preferences and trends. Despite these benefits, a majority of organisations fail to realise the benefits of marketing analytics.
□ Healthcare organisations are implementing approaches such as Lean and Six Sigma to take a more patient-driven focus, lessen errors and waste, and improve patient flow, with the objective of enhancing quality.
□ Organisations that operate in a highly competitive global environment need to have a highly effective supply chain management system in place.

□ The use of smarter logistics helps supply chain organisations gain more visibility in the global market.
□ Web analytics can provide accurate figures for visitors to a website and page views. It helps in gauging Web traffic and popularity patterns.

KEYWORDS
□ Capacity analytics: It helps in tracking how many operationally efficient people are in the business.
□ Employee churn analytics: It refers to the process of estimating staff turnover rates for predicting the future and reducing employee churn.
□ Employee performance analytics: It is used in assessing the performance of an individual employee.
□ Fraud analytics: It is used to detect whether a financial activity is fraudulent, so as to prevent any kind of financial loss.
□ Marketing analytics: It helps in providing deep insight into customer preferences and trends.

10.11 DESCRIPTIVE QUESTIONS
1. Discuss the importance of financial and fraud analytics for an organisation.
2. Describe the role of HR analytics in an organisation.
3. What do you understand by marketing analytics? Discuss the steps for getting the best assistance from marketing analytics.
4. How is healthcare analytics useful in the medical field? Explain with suitable examples.
5. Why is analytics required in the supply chain? Discuss with suitable reasons.
6. What is Web analytics? Enlist the steps involved in the Web analytics process.
7. Describe the importance of analytics in the field of sports.
8. Discuss the need for analytics for government and NGOs.

10.12 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                Q. No.  Answer
Financial and Fraud Analytics        1.      Whistleblower
                                     2.      Advanced analytics
                                     3.      True
HR Analytics                         4.      Talent
                                     5.      True
                                     6.      Capacity
Marketing Analytics                  7.      a. Search engine optimisation
                                     8.      True
                                     9.      False
Healthcare Analytics                 10.     b. Electronic Medical Records
                                     11.     True
                                     12.     Real-time
Supply Chain Analytics               13.     True
                                     14.     Advanced
Web Analytics                        15.     Web
                                     16.     Off-site, On-site
Sports Analytics                     17.     Smart
                                     18.     False
Analytics for Government and NGOs    19.     a. Non-governmental organisation
                                     20.     True

HINTS FOR DESCRIPTIVE QUESTIONS


1. Fraud impacts organisations in several ways which might be
related to financial, operational and psychological processes.
Refer to Section 10.2 Financial and Fraud Analytics.
2. HR analytics, additionally called talent analytics, is the application of sophisticated data mining and business analytics (BA) techniques to HR information. Refer to Section 10.3 HR Analytics.
3. Marketing analytics is the practice of measuring, managing and analysing marketing performance to maximise its effectiveness and improve return on investment (ROI). Refer to Section 10.4 Marketing Analytics.

4. Healthcare analytics is a term used to describe the analysis of


healthcare activities using the data generated and collected
from different areas in healthcare such as pharmaceutical
data, research and development (R&D) data, clinical data,
patient behavior and sentiment data, etc. Refer to Section
10.5 Healthcare Analytics.
5. Supply chain is an arrangement of organisations, individuals,
activities, data and assets required in moving an item or service
from supplier to the client. Refer to Section 10.6 Supply Chain
Analytics.
6. Web analytics refers to measuring, collecting, analysing and
reporting of Web data to understand and optimise the usage of
Web. Refer to Section 10.7 Web Analytics.
7. Sports analytics is the technique of analysing relevant and
historical information in the field of sports mainly to perform
better than any other team or individual. Refer to Section
10.8 Sports Analytics.
8. Data analytics is also playing its role in the government sector. It is not only important for government but equally beneficial for non-governmental organisations, which are also often called non-profit organisations. Data analytics is used by these organisations to get deeper insight from data. Refer to Section 10.9 Analytics for Government and NGOs.

10.13 SUGGESTED READINGS & REFERENCES


SUGGESTED READINGS
□ Yang, H., & Lee, E. K. (2016). Healthcare analytics: From data to knowledge to healthcare improvement. Hoboken, NJ: John Wiley & Sons, Inc.
□ Marketing analytics: Data-driven techniques with Microsoft Excel. (n.d.). Retrieved May 03, 2017, from http://www.wiley.com/WileyCDA/WileyTitle/productCd-111837343X.html
□ Predictive HR analytics: Mastering the HR metric. (n.d.). Retrieved May 3, 2017, from https://www.amazon.com/Predictive-HR-Analytics-Mastering-Metric/dp/0749473916

E-REFERENCES
□ Data analysis techniques for fraud detection. (2017, April 26). Retrieved May 03, 2017, from https://en.wikipedia.org/wiki/Data_analysis_techniques_for_fraud_detection
□ HR analytics. (2017, March 17). Retrieved May 03, 2017, from https://www-01.ibm.com/software/analytics/solutions/operational-analytics/hr-analytics/
□ Health care analytics. (2017, March 26). Retrieved May 03, 2017, from https://en.wikipedia.org/wiki/Health_care_analytics
□ What is marketing analytics? (n.d.). Retrieved May 03, 2017, from https://www.sas.com/en_us/insights/marketing/marketing-analytics.html
CONTENTS

Case Study 1: How CISCO IT uses Big Data Platform to Transform Data Management
Case Study 2: USDA used Data Mining to know the Patterns of Loan Defaulters
Case Study 3: Cincinnati Zoo used Business Analytics for Improving Performance
Case Study 4: Application of Business Analytics in Resource Management
Case Study 5: Role of Descriptive Analytics in the Healthcare Sector
Case Study 6: An Application of Predictive Analytics in Underwriting
Case Study 7: Unicredit Bank Applies Prescriptive Analytics for Risk Management
Case Study 8: Campaign Success of Mediacom
Case Study 9: Dundas BI Solution Helped Medidata and its Clients in Getting Better Data Visualisation
Case Study 10: Sports Analytics Helped in the Enrichment of Performance of Players
Case Study 11: Fraud Analytics Solution Helped in Saving the Wealth of Companies
Case Study 12: Big Data Analytics Allowing Users to Visualise the Future of Free Online Classifieds
CASE STUDY 1

HOW CISCO IT USES BIG DATA PLATFORM TO TRANSFORM DATA MANAGEMENT

This Case Study shows how Hadoop architecture based on Cisco UCS Common Platform Architecture (CPA) for Big Data is used for business insight. It is with respect to Chapter 1 of the book.

BACKGROUND
Cisco is one of the world's leading networking organisations and has transformed the way people connect, communicate and collaborate. Cisco IT has 38 global data centres that together comprise 334,000 square feet of space.

CHALLENGE
The company had to manage large datasets of information about customers, products and network activities, which actually comprise the company's business intelligence. In addition, there was a large quantity of unstructured data, measured in terabytes, in the form of Web logs, videos, emails, documents and images. To handle such a huge amount of data, the company decided to adopt Hadoop, an open-source software framework that supports distributed storage and processing of big datasets.
According to Piyush Bhargava, a distinguished engineer at Cisco IT who handles big data programs, "Hadoop behaves like an affordable supercomputing platform." He also says, "It moves compute to where the data is stored, which mitigates the disk I/O bottleneck and provides almost linear scalability. Hadoop would enable us to consolidate the islands of data scattered throughout the enterprise."
To implement the Hadoop platform for providing big data analytics services to Cisco business teams, Cisco IT first had to design and implement an enterprise platform that could support appropriate service-level agreements (SLAs) for availability and performance. Piyush Bhargava says, "Our challenge was adapting the open source Hadoop platform for the enterprise."
The technical requirements of the company for implementing the big data architecture were to:
□ have open-source components in place to establish the architecture
□ know the hidden business value of large datasets, whether the data is structured or unstructured

□ provide service-level agreements (SLAs) to internal customers who want to use big data analytics services
□ support multiple internal users on the same platform

SOLUTION
Cisco IT developed a Hadoop platform using the Cisco® UCS Common Platform Architecture (CPA) for Big Data.
According to Jag Kahlon, a Cisco IT architect, "Cisco UCS CPA for Big Data provides the capabilities we need to use big data analytics for business advantage, including high performance, scalability, and ease of management."
For computation, the building block of the Cisco IT Hadoop Platform is the Cisco UCS C240 M3 Rack Server, powered by Intel Xeon E5-2600 series processors, 256 GB of RAM and 24 TB of local storage.
Virendra Singh, a Cisco IT architect, says, "Cisco UCS C-Series Servers provide high-performance access to local storage, the biggest factor in Hadoop performance."
The present architecture contains four racks of servers, where each rack has 16 server nodes providing 384 TB of raw storage per rack. Kahlon says, "This configuration can scale to 160 servers in a single management domain supporting 3.8 petabytes of raw storage capacity."
Cisco IT server administrators are able to manage all elements of Cisco UCS, including servers, storage access, networking and virtualisation, from a single Cisco UCS Manager interface. Kahlon declares, "Cisco UCS Manager significantly simplifies management of our Hadoop platform. UCS Manager will help us manage larger clusters as our platform grows without increasing staffing."
Cisco IT uses the MapR Distribution for Apache Hadoop, with code written in advanced C++ rather than Java. Virendra Singh says, "Hadoop complements rather than replaces Cisco IT's traditional data-processing tools, such as Oracle and Teradata. Its unique value is to process unstructured data and very large data sets far more quickly and at far less cost."
The Hadoop Distributed File System (HDFS) manages the storage on all Cisco UCS C240 M3 servers in the cluster to create one large logical unit. HDFS then splits the data into smaller chunks for further processing and for performing ETL (Extract, Transform and Load) operations.
Hari Shankar, a Cisco IT architect, says, "Processing can continue even if a node fails because Hadoop makes multiple copies of every data element, distributing them across several servers in the cluster. Even if a node fails, there is no data loss." Hadoop can detect node failure automatically and create another parallel copy of the data, without disrupting any processes running on the remaining servers. In addition, the total volume of data is not increased, as Hadoop also compresses the data.
To handle tasks like job scheduling and the orchestration process, Cisco IT uses Cisco TES (Cisco Tidal Enterprise Scheduler), which works as an alternative to Oozie. Cisco TES connects Hadoop components automatically and eliminates the need to write Sqoop code manually to download data and move it to HDFS and then execute commands to load the data into Hive.
Singh says, "Using Cisco TES for job-scheduling saves hours on each job compared to Oozie because reducing the number of programming steps means less time needed for debugging." Another benefit of Cisco TES is that it operates on mobile devices, so that the company's end users can manage big data jobs from anywhere.

RESULTS
The main result of transforming the business using big data is that Cisco IT has introduced multiple big data analytics programs, which are based on the Cisco® UCS Common Platform Architecture (CPA) for Big Data.
The company's revenues from partner sales have increased. The company has started the Cisco Partner Annuity Initiative program, which is in production. Piyush says, "With our Hadoop architecture, analysis of partner sales opportunities completes in approximately one-tenth the time it did on our traditional data analysis architecture, and at one-tenth the cost."
The productivity of the company has been increased by making intellectual capital easier to find. Earlier, many employees who work as knowledge workers in Cisco used to spend a lot of time during the day searching for content on websites, as most of the content was not tagged with relevant keywords. Now, Cisco IT has replaced the static, manual tagging process with dynamic tagging on the basis of user feedback. This process uses machine-learning techniques to examine users' usage patterns and also acts on user suggestions for new search tags.
Moreover, the Hadoop platform analyses log data from collaboration tools, such as Cisco Unified Communications, email, Cisco TelePresence®, Cisco WebEx®, Cisco WebEx Social and Cisco Jabber®, to reveal commonly used communication methods and organisational dynamics.

LESSONS LEARNED
Cisco IT has come up with the following observations, shared with other organisations:
□ Hive is good for structured data processing but provides limited SQL support.
□ Sqoop easily moves a large amount of data to Hadoop.
□ Network File System (NFS) saves time and effort in managing a large amount of data.
□ Cisco TES simplifies the job-scheduling and orchestration process.
□ The library of user-defined functions (UDFs) provided by Hive and Pig increases developer productivity.
□ The knowledge of internal users is enhanced, as they can now analyse unstructured data from email, webpages, documents, etc., besides data stored in databases.

QUESTIONS
1. What were the challenges faced by Cisco?
(Hint: Open source components, service-level
agreements (SLAs) to the internal customers, etc.)
2. What are the lessons learned by Cisco?
(Hint: Hive is good for structured data processing, Cisco
TES simplifies job-scheduling and orchestration process,
Network File System (NFS) saves time and effort to
manage a large amount of data, etc.)
CASE STUDY 2

USDA USED DATA MINING TO KNOW THE PATTERNS OF LOAN DEFAULTERS

This Case Study discusses how a US-based rural welfare department
used the data mining technique in its programme of providing loans
to people for their welfare and development. It is with respect to
Chapter 2 of the book.
USDA Rural Development is headed by an Under-Secretary, who is
appointed directly by the US President and confirmed by the Senate
of the United States. The role of the Under-Secretary is to
provide executive direction and policy leadership so as to ensure
improved economic opportunities for the rural communities of
America. The department has a loan portfolio of more than
$216 billion for providing economic opportunities to the rural
communities of the nation.
The Rural Housing Service of USDA runs various programmes to
create and improve housing and other important community
facilities in rural areas. USDA also provides loans, permits and
loan guarantees for single- and multi-family housing, fire and
police stations, child care centres, hospitals, nursing homes,
libraries, schools, etc. The main aim of USDA and its partners
working together is to make sure that rural America is a better
place to live, work and raise a family.
The USDA's Rural Housing Service administers a loan program that
provides mortgage loans to people residing in rural areas. To
manage these nearly 6,00,000 loans, the department maintains
detailed information about each loan in its data warehouse. As
with other lending programs, some USDA loans performed better than
others, but it was difficult for the department to track the exact
status of those loans.
The USDA decided to adopt a data mining technique for a better
understanding of its loans, improved handling of its lending
program and a reduced occurrence of problem loans. Using the data
mining technique, the department wanted to determine patterns that
could differentiate borrowers who repay their loans punctually
from those who do not. Identifying such patterns could help
forecast the creditworthiness of a borrower.
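A minimal sketch of this kind of pattern mining is shown below: a small decision tree separates punctual borrowers from problem loans and prints the learned rules. The features and training data are synthetic stand-ins, not USDA's actual loan fields.

# Hedged sketch of the pattern mining described: fit a simple
# classifier that separates punctual borrowers from problem loans.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: income-to-payment ratio, years employed, prior delinquencies
X = [[3.1, 8, 0], [1.2, 1, 2], [2.5, 5, 0], [0.9, 0, 3],
     [2.8, 10, 1], [1.1, 2, 4], [3.5, 7, 0], [1.0, 1, 1]]
y = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = problem loan, 0 = repaid on time

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(model, feature_names=[
    "income_to_payment", "years_employed", "prior_delinquencies"]))

# Score a new application: probability that it becomes a problem loan.
print(model.predict_proba([[1.4, 3, 1]])[0][1])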
Commercial lenders in the US also use the data mining technique
for predicting loan default or poor repayment behaviour at the
time of providing loans to people. However, the main interest of
the USDA is somewhat different from that of commercial lenders, as
it is more interested in determining problems in loans that were
already granted.


Segregating problem loans allows the USDA to give more attention
and assistance to such borrowers, thereby reducing the possibility
that their loans will become problems.

1&1114iiMii
1. What were the motives behind setting up the USDA
Rural Development?
(Hint: Welfare of rural areas of America, etc.)
2. How could the data mining technique help the USDA?
(Hint: To determine problems in the already granted
loans, etc.)

CASE STUDY 3

CINCINNATI ZOO USED BUSINESS ANALYTICS FOR IMPROVING PERFORMANCE

This Case Study discusses how business analytics has made an
impression on midsized companies by improving their business
performance in real time. It is with respect to Chapter 3.

BACKGROUND
Opened in 1875, Cincinnati Zoo & Botanical Garden is a
world-famous zoo located in Cincinnati, Ohio, US. It has more than
1.3 million visitors every year.

CHALLENGE
In late 2007, the management of the zoo began a strategic planning
process to increase the number of visitors by enhancing their
experience, with the aim of generating more revenue. For this, the
management decided to increase the sales of food items and retail
outlets in the zoo by improving their marketing and promotional
strategies.
According to John Lucas, the Director of Operations at Cincinnati
Zoo & Botanical Garden, "Almost immediately, we realised we had a
story being told to us in the form of internal and customer data,
but we didn't have a lens through which to view it in a way that
would allow us to make meaningful changes."
Lucas and his team members were interested in finding business
analytics solutions to meet the zoo's needs. He said, "At the
start, we had never heard the terms 'business intelligence' or
'business analytics'; it was just an abstract idea. We more or
less stumbled onto it."
They looked at various providers but initially did not include
IBM, under the false assumption that they could not afford IBM.
Then somebody pointed out that it was completely free to talk to
IBM. They found that IBM not only suggested a solution that fit
their budget, but also the most appropriate solution for what they
were looking for.

SOLUTION
IBM provided the zoo's executive committee with a business
analytics solution that analyses data related to customer
memberships, admissions, food sales, etc., in order to gain a
better understanding of visitor behaviour. The solution also
analyses geographic and demographic information that helps in
customer segmentation and marketing.
The zoo's executive committee wanted a platform capable of
delivering the desired goals by combining and analysing data
related to ticketing and point-of-sale systems, memberships and
geographical facts. The entire project was handled by senior
executives of the zoo and consultants from IBM and BrightStar
Partners, an IBM Premier Business Partner.
Lucas said, "We already had a project vision, but the consultants
on IBM's pre-sales technology team helped us identify other
opportunity areas." During the project implementation, BrightStar
became the zoo's main point of contact; a platform was built on
IBM Cognos 8.4 in late 2010 and upgraded to Cognos 10 in early
2011.

OUTPUT
The result of implementing IBM's business analytics solution is
that the zoo's return on investment (ROI) has increased. Lucas
admits, "Over the 10 years we'd been running that promotion, we
lost just under $1 million in revenue because we had no visibility
into where the visitors using it were coming from."
The new business analytics solution has helped the zoo save costs;
for example, there was a saving of $40,000 in marketing in the
first year, visitor numbers increased by 50,000 in 2011, food
sales increased by at least 25%, and retail sales increased by at
least 7.5%.
By adopting the new operational management strategies of the
business analytics solution, there has been a remarkable increase
in attendance and revenues, which has resulted in an annual ROI of
411%. Lucas admits, "Prior to this engagement, I never would have
believed that an organisation of the size of the Cincinnati Zoo
could reach the level of granularity its business analytics
solution provides. These are Fortune 200 capabilities in my eyes."

QUESTIONS

1. What was desired by the Cincinnati Zoo & Botanical Garden in its business operations?
(Hint: It wanted to increase the sales of food items and retail outlets in the zoo by improving its marketing and promotional strategies.)
2. How did IBM help the zoo?
(Hint: IBM provided a business analytics solution to the zoo's executive committee, which helps in analysing data related to customer memberships, admissions, food sales, etc., in order to gain a better understanding of visitor behaviour.)

CASE STUDY 4

APPLICATION OF BUSINESS ANALYTICS IN RESOURCE MANAGEMENT

This Case Study discusses how a real estate company uses business
analytics for resource management. It is with respect to
Chapter 4.
Analytics expertise can cut across domains. This case study
presents an instance where a real estate company assisted a law
firm in choosing whether or not to relocate to a different office
space through the use of data-gathering devices. This was done
based on feedback from the law firm's employees, collected by the
internal analytics team of the real estate company. Such feedback
helped the real estate company come up with a lean employee
management program for the law firm.
In this one-of-a-kind example, the law firm wished to attract and
retain the most suitable employees, so the first factor to be
evaluated was personnel retention. The firm had received great
ratings for its brilliant services and consistent focus on
improving the customer service experience. Being a firm with
services of this range, it naturally faced the challenges usual to
any other resource-critical organisation. To deal with
space-related issues, the firm roped in the real estate company,
which went on not only to suggest the office space but also to
streamline resource operations, effectively demonstrating the
positive impact of adopting big data business analytics as the
firm's chief workforce driver.

METHOD
The company conducted a few surveys and questionnaires among the
group and came out with a solution to streamline and lean-manage
the teams within the law firm. For the office space, the real
estate company used the firm's resources to map out where the
employees were most often. The real estate company assisted the
law firm by utilising different location-aware mechanisms to keep
track of the whereabouts of the firm's personnel, with the data
accumulated based on employee preferences and activities. The end
result was that the law firm decided to relocate from the
high-rise office to a more affordable space based on the location
habits of its personnel. The new location was so convenient for
employees that it resulted in increased employee retention,
thereby saving costs for the firm. Apart from the above actions,
the following questionnaires were circulated across various
departments:
Questions for Management:
□ What evaluation methods should be employed to assess the
yearly performance of employees?



□ What cost or economic leaks can be present inside the system which can be fixed altogether to come up with a fool-proof plan?
□ How will rapid changes in existing SLAs be met along with a proper and due transition? (For upper-level executives)
Based on the knowledge and responses received, the organisation
began its studies, applying standard resourcing parameters and
concepts to work out the best possible scenario and its
implications. This helped the organisation go far in using
business analytics as an effective medium for whichever operations
it gears up for.

QUESTIONS
1. What were the initial challenges faced by the law firm?
(Hint: Office space relocation indecisiveness, employee
retention impact and resourcing issues)
2. What are the lessons learned from this case study?
(Hint: You can cite examples of cross-functionality deployed by the real estate team to denote the excellent all-round services provided by the real estate company.)
CASE STUDY 5

ROLE OF DESCRIPTIVE ANALYTICS IN THE HEALTHCARE SECTOR

This Case Study discusses the role of descriptive analytics in
overcoming challenges in the healthcare industry. It is with
respect to Chapter 5 of the book.
Across the world, healthcare organisations are focussed on
providing better-quality services to patients. Therefore, it is
necessary to define performance and determine the methods required
for improving quality in the healthcare sector. Many studies have
been performed with the aim of taking feedback from patients and
their families, healthcare professionals, planners and others on
patient outcomes, professional development, system performance,
etc.
Many standards and measurable attributes can be used for
defining performance and quality in the healthcare industry.
Some attributes are effectiveness, timeliness, safety, efficiency,
accessibility and availability. In addition to this, healthcare
organisations also consider patient and social preferences in
order to assess and assure quality in the healthcare sector.
The major challenge in the healthcare sector across the world is
the crowding of emergency rooms, which may lead to serious
consequences and complications. Overcrowding and poor performance
of emergency rooms lead to long waits for treatment from the time
patients arrive at the hospital.
Crowding of emergency rooms and reduced performance in this
essential service is a serious issue for both researchers and
professionals in the healthcare sector. A number of studies have
been conducted to analyse the factors associated with the
overcrowding of emergency rooms. Some researchers have classified
the factors that lead to overcrowding of emergency rooms into
three categories, namely input factors, throughput factors and
output factors. Other researchers have analysed the length of stay
(LOS) in emergency rooms by dividing it into the following three
intervals (a computational sketch follows the list):
□ Waiting time: It refers to the interval between the arrival of a patient and his/her examination by a physician in an emergency room.
□ Treatment time: It refers to the interval between the start of the examination by the physician and the decision to admit the patient to the hospital or discharge him/her.
□ Boarding time: It refers to the interval from the decision to admit a patient till he/she is shifted to an inpatient hospital bed.
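The sketch below illustrates how these three intervals could be computed from event timestamps using pandas; the column names and sample visits are assumptions for illustration only.

# Minimal sketch: deriving the three LOS intervals from event
# timestamps. Column names and sample data are invented.
import pandas as pd

visits = pd.DataFrame({
    "arrival":  pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:10"]),
    "exam":     pd.to_datetime(["2024-01-01 08:45", "2024-01-01 10:05"]),
    "decision": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:30"]),
    "bed":      pd.to_datetime(["2024-01-01 12:30", "2024-01-01 13:00"]),
})

visits["waiting_time"]   = visits["exam"]     - visits["arrival"]
visits["treatment_time"] = visits["decision"] - visits["exam"]
visits["boarding_time"]  = visits["bed"]      - visits["decision"]

# Descriptive summary a dashboard might show: the average interval.
for col in ["waiting_time", "treatment_time", "boarding_time"]:
    print(col, visits[col].mean())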


These conceptual models help in building strategies and solutions
to reduce crowding to a great extent. Besides the handling of
patients, other problems also contribute to overcrowding in
emergency rooms and prolonged LOS. Inadequate staffing and a
shortage of treatment areas make patients wait longer for their
turn or leave the hospital without examination or proper
treatment. Moreover, delays in accessing ancillary services, like
lab, radiology and other procedures, also contribute to
overcrowding.
Descriptive analytics has emerged as a data processing
method in modern healthcare organisations, which helps in
summarising historical data to extract meaningful information
from it and preparing the data for further analysis. Such
information helps in resolving various issues and making
appropriate decisions in healthcare organisations.
Descriptive analytics also helps in studying various decisions in
healthcare and their impact on service performance and clinical
results. Descriptive analytics is an easy and simple approach to
apply and the data is usually represented in terms of graphs and
tables, which display hospital occupancy rates, average time of
stay, indicators related to healthcare services, etc.
Moreover, descriptive analytics provides data visualisation, which
helps in answering specific queries or determining patterns of
care. It therefore provides a broader perception for
evidence-based clinical practice. This allows organisations to
handle real-time, or near real-time, data, which can be referred
to as operational content, and to capture visual data for all
patients. Descriptive analytics is also helpful in uncovering
patterns among patients that previously went unnoticed. Thus,
descriptive analytics plays its role in providing better services
to patients by providing deep insight into the data.

QUESTIONS
1. What were the challenges faced by hospitals in
emergency services?
(Hint: Overcrowding of patients, delay in providing health
services, etc.)
2. What are the advantages of descriptive analytics?
(Hint: To determine hidden patterns, better visualisation
of information, etc.)

CASE STUDY 6

AN APPLICATION OF PREDICTIVE ANALYTICS IN UNDERWRITING

This case study comes from the D&O (Directors and Officers
Liability) insurance industry, in which the executives of
Scottsdale Insurance Company received a precarious underwriting
proposal following the recession of 2008. The proposal stated that
liability insurance (compensation for damages or defence fee
loans, in a scenario in which an insured customer suffers a loss
as a result of a legal settlement) was to be paid to the
institution and/or its executives and administrators. Scottsdale
Insurance Company approved this proposition, and thus Freedom
Specialty Insurance Company was formed.
Freedom Specialty Insurance Company placed the industry as its top
priority. Using external predictive analytic data to calculate
risk, D&O claims could be foreseen from class action lawsuit data.
An exclusive, multimillion-dollar underwriting model was created,
the disbursements of which have proven profitable to Freedom to
the tune of $300 million in annual direct written premiums. Losses
have been kept at a minimum, with a 2012 rate below 49%, the
industry's average loss percentage. The model has proven
successful in all areas, with satisfied and assured employees at
all levels of the company, as well as contented reinsurers. This
case study is a great example of how predictive analytics helped
Freedom soar with a revamped and modernised underwriting model.
Many teams took part in developing the new policy: the predictive
model itself was constructed and assessed by an actuarial firm;
the user interface was crafted by an external technology supplier,
who also built the integration with the company's systems; and
technology from SAS supplied components such as data repositories,
statistical analytics engines, and reporting and visualisation
utilities. The refurbished system that Freedom employed consists
of the following components:
□ Data sources: The system assimilates six external sources of data (such as class action lawsuits and other financial material), and the data is acquired through executive company applications. The external sources are frequently utilised by the D&O industry; in the case of Freedom Specialty Insurance Company, there are also exclusive sources contributing to the predictive model. Freedom, in particular, spends a lot of time acquiring and retaining valuable information regarding merchant activities. Classification and back testing expose merchant flaws, as well as inconsistencies in their information. Freedom puts in extra time and energy collaborating with merchants to catalogue information and maintain it at a high worth. Freedom also keeps a close watch on its external data, applying stringent inspection to develop policy and claims information. The company upholds data independence from merchants' identification schemes. Although it takes more effort to decipher values, the process safeguards that Freedom can promptly terminate business with certain merchants if necessary.
□ Data scrubbing: Upon its delivery, information undergoes many "cleaning" processes, which ensures that it can be used to its maximum ability. For example, there is a review of 20,000 separate class action lawsuits per month to observe whether any variations have occurred. They were originally classified by different factors, but now they are gone through monthly. In the past, before the automated system was put into place and the process had to be carried out manually, it took weeks to finish. Now, with the modernised methods and information-cataloguing devices, everything can be completed within hours.
□ Back testing: This is one of the most important processes in determining the potential risk upon receiving a claim. The system runs the claim through the predictive model and analyses the selection criteria, altering tolerances as required. Over numerous uses, this positive feedback loop polishes the system.
□ Predictive model: Information is consolidated and run through a model, which defines the wisest range of appraisal and limits through the use of multivariate analysis. Algorithms assess the submission against numerous programmed thresholds (a simplified sketch of such scoring follows this list).
□ Risk selection analysis: This provides the underwriter with a brief analytical report of recommendations. Similar risks are shown and contrasted alongside various other risk factors, such as industry, size, monetary configuration and considerations. The platform's fundamental standard is driven by the underwriter's rationality, with the assistance of technology. In other words, the system is made to support, but not replace, the human underwriter.
□ Interface with company systems: Once a conclusion is made, designated data is delivered to the executives of the company. The policy distribution procedure is still generally done by hand, but is likely to be replaced by automated systems later on. The policy is distributed, and the statistical data is re-run through the data source element. More information is contributed to the platform as claims are filed through losses.
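To illustrate the threshold-based scoring mentioned under the predictive model component, the sketch below combines a few risk factors into a multivariate score and routes a submission to accept, refer or decline. The weights, thresholds and factors are invented; Freedom's actual model is proprietary.

# Hedged sketch of a multivariate score with programmed thresholds.
# Weights, thresholds and factor names are invented for illustration.
WEIGHTS = {"lawsuit_rate": 4.0, "leverage": 1.5, "volatility": 2.0}
THRESHOLDS = {"accept": 3.0, "refer": 6.0}   # above "refer" -> decline

def risk_score(submission):
    """Weighted sum of risk factors (a simple multivariate score)."""
    return sum(WEIGHTS[k] * submission[k] for k in WEIGHTS)

def route(submission):
    """Route a submission based on where its score falls."""
    s = risk_score(submission)
    if s <= THRESHOLDS["accept"]:
        return s, "accept"
    if s <= THRESHOLDS["refer"]:
        return s, "refer to underwriter"   # supports, not replaces, the human
    return s, "decline"

print(route({"lawsuit_rate": 0.2, "leverage": 0.8, "volatility": 0.5}))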
As is evident in all D&O processes, underwriters are required to
have a thorough understanding of technical insurance. While in the
past underwriters put a great deal of effort into acquiring,
organising and evaluating information, they now have to adapt to a
system in which enormous quantities of data are condensed onto a
few analytical pages.
Predictive analytics has greatly altered the responsibilities of
the customary underwriter, who now crosses over with policyholders
and negotiators in book and risk control. Although the technology
has simplified and cut down a lot of the manual work, additional
experienced technical personnel also had to be employed, people
with the legal and numerical awareness to construct predictive
models in the financial area. Integrating this model has enabled
Freedom to advance proficiency in the process across many zones.
Processes involved in managing information, such as data
scrubbing, back testing and classification, were all discovered
and learned by the people themselves and were originally carried
out by hand. However, they have been increasingly mechanised since
they were first conceived. Also, there is an ever-growing quantity
of external sources. Freedom is currently assessing the
incorporation of cyber security and intellectual property
lawsuits, with the predictive model continuously being enhanced
and improved.
The D&O industry has adopted many processes related to the
upkeep, feeding and preservation of the predictive model that
are utilised by other industries too. One situation in particular is
that following the actuarial originally constructing the predictive
model, Freedom achieved full fluency in the program's complex
processes over the course of many months. Operations were
implemented to efficiently overseeall external merchants together.
A number of external assemblies (including the actuarial firm,
the IT firm, data vendors, reinsurers and internal IT) came
together to refine and organise the predictive model together, all
of them in close collaboration with each other. It was a great feat
for Freedom to unite all of these individuals to take advantage of
their distinct expertise and understanding all together
simultaneously.

POSITIVE RESULTS OF THE MODEL


□ Freedom ended up having positive results from the implementation of its predictive analytics model, with many new opportunities and insights provided for the company.
□ Communication and correspondence with brokers and policyholders on the topic of risk management was boosted as a result of the highly detailed analytic results.
□ The model could even be expanded to cover other areas of liability, like property and indemnity.
□ Back testing and cataloguing mechanisms can also now be implemented to foresee other data components in the future.
□ The updated and automated model highlights Freedom as a tough contender among competitor companies, and has opened up windows to uncover even more possible data sources.

QUESTIONS
1. What were the initial challenges faced by Freedom
Specialty Insurance?
(Hint: Interface bottlenecks, manual processes, various
policies developed in silos, etc.)
2. What changes did the implementation of an advanced
predictive model bring in for the company?
(Hint: Integrated processes, easier claim tracking, etc.)

CASE STUDY 7

UNICREDIT BANK APPLIES PRESCRIPTIVE ANALYTICS FOR RISK MANAGEMENT

This Case Study discusses how an Italian bank, UniCredit, is using
Fico software to apply prescriptive analytics to risk management.
It is with respect to Chapter 7.
When analytics is combined with algorithms, it can make a great
impact on business. Italy's largest bank, UniCredit, has done
something similar, as it has figured out a model to handle a high
volume of data for its risk management processes. For the bank, it
was important to have the right information to handle its risk
management projects, as this may affect the data infrastructure.
The bank's goal was to replace the older decision-making process
with a new agile, flexible and productive technology framework.
Recently, the bank implemented Fico software, which works as a
decision engine to manage data related to credit cards, personal
loans and other small business loans. According to Ivan Cavinato,
head of credit risk methodologies for the Italian bank, "The
predictive analytics and decision management software will analyse
big data to improve customer lending decisions and capital
optimization."
The Fico software supports UniCredit's strategy of using data and
prescriptive analytics to enhance customer relationships and
credit risk management. Prescriptive analytics provides multiple
decision options along with their future opportunities and risks.
In short, prescriptive analytics provides better decision options
and improved, more accurate predictions.
Cavinato says, "Our goal is to get actionable insights resulting
in smarter decisions and better business outcomes. How you
architect business technologies and design data analytics
processes to get valuable, actionable insights varies. Fico allows
us to put in place a prescriptive analytic environment.
Prescriptive analytics automatically synthesizes big data,
mathematics and business rules to suggest decision options to take
advantage of the predictions."
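The following minimal sketch illustrates that idea in Python: a predicted default probability is combined with a business rule to rank lending options by expected profit, yielding decision options rather than a single answer. All numbers (default probability, loss-given-default, limits and rates) are invented, and this is not Fico's actual engine.

# Toy prescriptive-analytics sketch: prediction + business rule ->
# ranked decision options. All values are invented for illustration.
p_default = 0.08                       # from a predictive model (assumed)
options = [                            # candidate (credit limit, rate) pairs
    {"limit": 5000,  "rate": 0.14},
    {"limit": 10000, "rate": 0.12},
    {"limit": 20000, "rate": 0.10},
]

def expected_profit(opt):
    """Expected interest income minus expected default loss."""
    interest = opt["limit"] * opt["rate"]
    loss = opt["limit"] * 0.6          # assumed loss-given-default of 60%
    return (1 - p_default) * interest - p_default * loss

# Business rule: cap exposure when predicted default risk exceeds 5%.
feasible = [o for o in options if p_default <= 0.05 or o["limit"] <= 10000]
for o in sorted(feasible, key=expected_profit, reverse=True):
    print(o, round(expected_profit(o), 2))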
According to Cavinato, the Fico software is integrated with a new
vision of enterprise data infrastructure. He says, "We aim to
build a more flexible and agile architecture. That also means
displacing pieces of legacy software and embracing distributed
architecture, such as Hadoop. But let's be clear. That doesn't
necessarily mean dealing with unstructured data."
Cavinato also admits that Hadoop has created efficiency at the
processing and operational levels, so that the total time taken by
dependent tasks is automatically reduced. In addition,
re-engineering the data infrastructure using Hadoop and big data
paradigms has also reduced overall cost. The previous software
lacked all such advantages.
One of the major reasons for UniCredit accepting the Fico software
is that it can be modified to suit the requirements of other types
of businesses as well. Finally, Cavinato summarises, "Fico
underpins a software methodology that's largely dependent on
algorithms. It's a key building block to proceed to a complete
overhaul of the entire infrastructure, physical and logical, that
supports our data business. It helps redefine processes with
greater agility and granularity, bringing new opportunities and
greater performance."

QUESTIONS

1. What was the challenge faced by UniCredit?
(Hint: UniCredit required the right information in order to handle its risk management projects.)
2. How has UniCredit achieved its goal?
(Hint: By adopting Fico software that uses prescriptive
analytics to enhance customer relationships and credit
risk management.)
CASE STUDY 8

CAMPAIGN SUCCESS OF MEDIACOM

This Case Study discusses how MediaCom has taken the assistance of
Sysomos for planning and measuring data related to advertising
campaigns for its clients. It is with respect to Chapter 8.
MediaCom is one of the leading media agencies in the world, which
helps its clients plan and measure their advertising strategies
across all media channels. The company relies greatly on Sysomos
in planning and measuring the performance of its clients'
campaigns.
The main aim of the MediaCom agency was to improve the business
while gaining insight into data related to the audience's response
to its brands and issues.
Alejandro De Luna, Social Strategy Manager at MediaCom, says,
"The value Sysomos provides for us is very clear. We need to have
a bedrock of insights to justify how to approach content solutions
for different audiences and different platforms, and Sysomos helps
us to sell in our strategies by giving us a much clearer
understanding of how audiences feel about specific brands and
issues."
Sysomos has enabled MediaCom to analyse online conversations
without any limitations on keywords or results, drawing on a
database of over 550 billion social media posts. Now, MediaCom is
able to use social intelligence for planning and reporting. For
example, it can analyse data from social media discussions about a
campaign on Twitter and discussion forums to learn about consumer
opinions.
Sysomos has provided MediaCom with a tool, Buzzgraph, that helps
in gaining knowledge about the key concepts of online
conversations. Similarly, the Tweet Life tool helps in analysing
how a tweet goes viral on the Internet. With the help of Sysomos,
MediaCom can now easily convince its clients that its plans are
made on solid facts and figures. Sysomos has enabled MediaCom to
carry out complex analysis of a wide range of topics, without
restrictions on the number of search terms or the results
obtained, in order to gain insight and apply its campaign
strategies.
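As a generic illustration of what a Buzzgraph-style view computes, the sketch below counts which terms co-occur most often across posts about a brand. Sysomos's internal algorithms are not public; the sample posts and the crude tokeniser here are assumptions.

# Generic word co-occurrence sketch (not Sysomos's actual algorithm):
# count term pairs that appear together in the same post.
from collections import Counter
from itertools import combinations

posts = [
    "new ad campaign feels fresh and bold",
    "bold pricing but the campaign message is confusing",
    "loved the fresh look of the new campaign",
]
pairs = Counter()
for post in posts:
    words = {w for w in post.split() if len(w) > 3}   # crude tokeniser
    pairs.update(combinations(sorted(words), 2))

# The strongest edges of the "buzzgraph": most frequent co-occurrences.
for (a, b), n in pairs.most_common(5):
    print(f"{a} -- {b}: {n}")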

QUESTIONS

1. What was the basic aim of the MediaCom media agency?
(Hint: The main aim of the MediaCom agency was to improve the business while gaining insight into data related to the audience's response to its brands.)
2. How has Sysomos helped MediaCom?
(Hint: Sysomos has provided tools like Buzzgraph and Tweet Life to analyse online conversations without any limitations on keywords or results, drawing on a database of many billions of social media posts.)

CASE STUDY 9

DUNDAS BI SOLUTION HELPED MEDIDATA AND ITS CLIENTS IN GETTING BETTER DATA VISUALISATION

This Case Study discusses how a custom data visualisation
solutions provider is helping its clients get better visualisation
of the data stored in their databases. It is with respect to
Chapter 9 of the book.
Medidata is a Portugal-based company specialised in providing
ERP-based solutions to the Portuguese government. The company
believes in modernising the technology and software solutions used
by the Portuguese government to keep pace with the fast evolution
of the market. Medidata is committed to continuous development,
providing a variety of software products and services, including
back-office applications, support systems, etc., in order to
fulfil the requirements of the municipalities and residents of
Portugal. In addition to fulfilling the requirements of the
Portuguese government, Medidata has its own pool of customers who
use its ERP solutions and services for improving document
management and enhancing workflow.
Medidata started receiving demands from its clients for software
that could help them analyse and interact with the data generated
by the ERP software. Medidata felt it necessary to include a
Business Intelligence (BI) and analytics solution in its
collection of software solutions. The BI and analytics solution
would have the following advantages for Medidata's clients:
□ It would help them take better and more informed decisions
□ It would improve the efficiency and productivity of clients
□ It would be capable of redefining processes when required
□ It would be scalable, which means it could increase or decrease resources as and when required
In addition to fulfilling the needs of clients, Medidata also
wanted a BI and analytics solution for detecting its own issues
related to data quality. Medidata decided to migrate to the Dundas
BI solution for data visualisation. The decision was obvious
because Dundas had been a partner since 2009, when it was involved
in developing business intelligence components. The satisfaction
and belief in using Dundas legacy products helped Medidata migrate
to the Dundas BI solution for visualising data. Dundas has been
involved in creating and providing customised data visualisation
software for both Fortune 500 and start-up companies across the
world.
Before formalising the partnership, several meetings were held
between Medidata and Dundas to discuss how Medidata would
encourage its clients to use the Dundas BI solution for data
visualisation. After understanding Medidata's strategies for
selling and marketing the Dundas BI solution, Dundas decided to
provide the customised BI solution with full support, as per
Medidata's needs, to use and test it for a certain period of time.
Dundas also helped Medidata learn the use of the BI solution by
providing multimedia training content and webinars. This helped in
the rapid adoption of the BI solution across Medidata. The
interface of the BI solution for data visualisation is shown in
the following figure:

[Figure: Dundas BI data visualisation interface]

Some important features of the BI solution are:


□ Superb interactivity: The highly interactive environment of Dundas BI visualisation enables Medidata's clients to engage with and understand their data in a better way.
□ Data-driven alerts: Utilising alert notifications, built-in annotations and scheduled reports in Dundas BI, clients can collaborate with users through these tools.
□ Smart design tools: Dundas BI provides smart, built-in design tools with drag-and-drop functionality for quickly designing reports and dashboards.
□ Extensibility: Dundas BI provides connectivity with previously unsupported data sources.
□ Performance tuning: The BI solution provides the ability to store the output of data cubes within Dundas BI's data warehouse for better performance.

Due to the preceding features, some key benefits of the BI
solution for Medidata are as follows (a minimal sketch of such
data-quality checks follows the list):
□ Medidata can now validate those database attributes which were incorrect in some situations but not in others.
□ Medidata can now determine inconsistencies in its database.
□ Medidata also became able to resolve various issues related to data integrity.
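A minimal sketch of such data-quality checks is shown below: it flags missing, duplicate and out-of-range attribute values in a small table. The column names and rules are assumptions for illustration, not Medidata's actual schema.

# Minimal data-quality sketch: flag null, duplicate and out-of-range
# values. Columns and rules are invented, not Medidata's schema.
import pandas as pd

records = pd.DataFrame({
    "case_id":   [101, 102, 102, 104],
    "dept":      ["urbanism", None, "finance", "finance"],
    "days_open": [3, -1, 12, 7],
})

issues = {
    "missing_dept":  records["dept"].isna(),
    "duplicate_id":  records["case_id"].duplicated(keep=False),
    "negative_days": records["days_open"] < 0,
}
for name, mask in issues.items():
    print(name, records.loc[mask, "case_id"].tolist())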
The BI solution has resolved 60% of the data validity concerns
faced by Medidata. Not only has the BI solution for data
visualisation benefitted Medidata, it has also proved useful for
Medidata's clients:
□ It helped clients by increasing their ability to take data-driven actions.
□ It helped clients identify and understand their key performance indicators (KPIs).
□ It provided clients with dashboards that include KPIs, such as workflow performance, in addition to the ratio of outstanding workflow tasks, grouped by department.
□ By making information available quickly, it enabled clients' decision-makers to regulate resources in real time, track task execution time in different scenarios, and finally improve the ratio of overdue tasks.
"While using Dundas BI, I found I was able to accelerate the
time-to-market of my BI projects. The usability, self-contained
management and the easy way that, in a blink, I could see and
analyse data from various sources were a great and awesome
surprise!" - Luis Silva, Senior BI Consultant, Medidata.

QUESTIONS
1. Why is data visualisation important for companies?
(Hint: Timely action, resource allocation, etc.)
2. What should be the features of a good data visualisation
solution?
(Hint: Highly interactive, data-driven alerts, etc.)
CASE STUDY 10

SPORTS ANALYTICS HELPED IN THE ENRICHMENT OF PERFORMANCE OF PLAYERS
This Case Study discusses how real-time analytics from IBM have
been utilised by Team USA for measuring and improving their
athletes' performance. It is with respect to Chapter 10 of the
book.
A US-based cycling organisation, dedicated to the betterment of
advanced US cycling teams in the Olympics and other international
events, was looking for ways to get an edge over its well-funded
competitor organisations in events like the Women's Team Pursuit.
In the team pursuit event, there are four cyclists, with one in
the lead and the other three riding behind. The challenge appears
when riders change places, which causes disruption and slows down
the group. A delay of a fraction of a second can cost the race in
this extremely competitive sport.
USA Cycling depends entirely on private donations, unlike national
teams that are fully supported by government bodies. Coaches at
USA Cycling felt the need for analytics to analyse rider
performance while also managing the organisation's budget
efficiently. The challenge in front of USA Cycling was to quantify
performance in Team Pursuit track cycling events in real time;
these events are organised indoors in velodromes. Monitoring and
tracking a rider's performance outdoors is easier only when there
are no variations in wind or in the condition of the cycling
track.
"The single most important factor in winning a race is the power
that the riders are able to exert on the pedals. The bikes we use
have a power meter on the crank that measures the power
generated in watts," according to Andy Sparks, Direct.or of
Track Programs for USA Cycling. Collecting and applying data
analytics from bicyclists' sensors was a slow-going process that
usually took an hour in only collecting data per cyclist.
"At the end of a training session, the coach had to plug the head
unit of each bike into his PC, download the data, manually slice
it into half-second intervals, match those intervals to the events
that took place during the session - for example, when each rider
was pulling, versus when they were exchanging or pursuing - and
then calculate a variety of key metrics," according to Andy
Sparks. This meant that performance data and cycling analytics
would not be ready until the next training day.
USA Cycling's organisers were looking for a solution to overcome
these challenges. They decided to use real-time analytics to
analyse rider performance and achieve their goal. They started
working with IBM jStart to configure the flow of real-time data,
delivering instant analytics about rider performance to coaches on
their mobile dashboards. jStart is an IBM team with expertise in
offering intelligent business solutions built on the latest
emerging technologies.
Android smartphones were placed in the riders' pockets. The data
generated by the phones was transferred to IBM Watson's Internet
of Things platform for analysis. The analytics are then presented
on summary dashboards that display metrics like W-prime depletion,
which indicates how much of the anaerobic muscle capacity a rider
has used and how long the rider takes to regenerate it.
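A simplified version of a W-prime depletion calculation is sketched below: above an assumed critical power the rider's anaerobic reserve W' drains, and below it the reserve slowly recovers. The critical power, W' capacity and recovery rate are invented values, and IBM's exact computation is not published in this case study.

# Toy W-prime balance sketch: spend the anaerobic reserve above
# critical power (CP), recover slowly below it. Values are invented.
CP = 280.0          # critical power, watts (assumed)
W_PRIME = 18000.0   # anaerobic capacity, joules (assumed)

def w_balance(power_samples, dt=0.5):
    """Track remaining W' over half-second power readings."""
    w = W_PRIME
    for p in power_samples:
        if p > CP:
            w -= (p - CP) * dt                          # spend reserve
        else:
            w = min(W_PRIME, w + (CP - p) * dt * 0.1)   # slow recovery
        yield w

samples = [450] * 60 + [200] * 60   # 30 s hard pull, 30 s sitting in
for i, w in enumerate(w_balance(samples)):
    if i % 40 == 0:
        print(f"t={i * 0.5:5.1f}s  W' remaining = {w:8.0f} J")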
Further, the jStart team incorporated IBM Analytics for Apache
Spark to compute metrics while cyclists are moving at speed, so as
to derive their key metrics. As the cycling analytics data is
produced in real time and shown on a mobile dashboard, both
coaches and cyclists can access performance data during the
ongoing training session, together with proper feedback.
"The ability to get hold of the data immediately after the training
session has finished has completely changed my rel,ationship
with the team," according to Neal Henderson, a high
performance consultant with USA Cycling.
As USA Cycling can now view data promptly, it is much easier to
identify problems, make positive modifications and strengthen
winning behaviours that can be taken into the next session. The
analytics solution allows riders to review their performance, that
is, how efficiently they are performing. It has also helped
relieve stress among riders. With instant analytics, coaches can
give "riders a quick debrief after the first race, advise them on
tactics for the next one, and then just let them relax and
recover," says Henderson.
USA Cycling benefited so much from the analytics solution that its
team won a gold medal at the London World Championships.

QUESTIONS

1. Enlist challenges faced by USA Cycling.


(Hint: Delay in assessing the performance of players.)
2. Discuss the benefits of sports analytics for USA Cycling.
(Hint: Better utilisation of resources, better assessment
of players, etc.)
CASE STUDY 11

FRAUD ANALYTICS SOLUTION HELPED IN SAVING THE WEALTH OF COMPANIES

This Case Study discusses how IBM's fraud analytics helped
organisations detect frauds and save themselves from financial
losses. This case study is related to Chapter 10 of the book.
In 2011, industries in the US were suffering a huge financial loss
of approximately $80 billion annually. Issuers of credit and debit
cards in the US alone suffered a whopping loss of $2.4 billion.
Besides industries, financial frauds also took place with
individuals, which could take years to resolve.
Existing fraud detection systems were not effective, as they
function on a predefined set of rules, such as flagging ATM
withdrawals above a certain amount or credit card purchases made
outside the cardholder's country. These traditional methods helped
in reducing the number of fraudulent cases, but not all of them.
The research team at IBM decided to take the fraud detection
system to the next level, so that a larger number of fraudulent
financial transactions could be detected and prevented. The IBM
team created a virtual data detective solution using machine
learning and stream computing to prevent fraudulent transactions
and save industries and individuals from financial losses.
In addition to signalling a particular type of transaction, the
solution also analyses transactional data to create a model for
detecting fraudulent patterns. This model is then utilised for
processing and analysing a large number of financial transactions
as they occur in real time, which is termed 'stream computing'.
Each transaction is allocated a fraud score, which specifies the
likelihood of the transaction being fraudulent. The model is
customised to the client's data and then upgraded periodically to
cover new fraud patterns. The underlying analytics depend on
statistical analysis and machine-learning methods, which allow the
detection of unusual fraud patterns that could be missed by human
experts.
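The sketch below illustrates the scoring idea on a toy scale: a logistic model turns a few transaction features into a fraud score between 0 and 1 and flags scores above a threshold as the stream is processed. The weights, features and threshold are invented, not IBM's actual model.

# Toy streaming fraud-score sketch: a logistic model maps transaction
# features to a score in [0, 1]. Weights and features are invented.
import math

WEIGHTS = {"amount_z": 1.8, "foreign": 1.2, "night": 0.7}
BIAS = -3.0
THRESHOLD = 0.8

def fraud_score(txn):
    """Logistic score: higher means more likely fraudulent."""
    z = BIAS + sum(WEIGHTS[k] * txn[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

stream = [
    {"id": 1, "amount_z": 0.2, "foreign": 0, "night": 0},
    {"id": 2, "amount_z": 2.9, "foreign": 1, "night": 1},
]
for txn in stream:          # score each transaction as it arrives
    s = fraud_score(txn)
    print(txn["id"], round(s, 3), "FLAG" if s > THRESHOLD else "ok")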
Consider the example of a large US-based bank that used IBM
machine-learning technologies for analysing transactions on its
issued credit cards; the results are summarised in the following
image:
[Figure: A large US bank used IBM machine-learning technologies to analyse credit card transactions, resulting in a 15% increase in fraud detection and a 60% increase in total savings.]

Consider another case, of an online clothing retailer. If most
transactions made at the retailer were fraudulent, then there is a
high probability that future purchase transactions there would
also be fraudulent. The system is capable of gathering these
historical data points and analysing them to detect the
possibility of future fraudulent attempts. In addition to
preventing fraudulent attempts, the system has also cut down on
false alarms after analysing the relationship between suspected
fraudulent transactions and actual fraud.
"The triple combination of prevention, catching more incidents
of actual fraud, and reducing the number of false positives
results in maximum savings with minimal. hassle. In essence, we
are able to apply complicated logic that is outside the realm of
human analysis to huge quantities of streaming data," notes
Yaara Goldschmidt, manager, Machine Learning Technologies
group.
These machine-learning technologies are presently used for
detecting and preventing fraud in financial transactions,
including transactions related to credit cards, ATMs and
e-payments. The system is embedded in the client's infrastructure,
and a machine-learning model is developed using the client's
existing data to combat fraudulent transactions before they take
place.
"By identifying legal transactior1s that have a high probability of
being followed by a fra.ud·ulent transaction, a bank can take pro
active measures-warn acard owner or require extra measures for
approving a purchase," explains Dan Gutfreund, project technical
lead.
Machine learning and stream-computing technologies are
not capable of predicting the future, yet they enable financial
institutions to take effective decisions and work towards
preventing frauds before they occur.

QUESTIONS
1. What is the need for fraud analytics in organisations?


(Hint: To prevent fraudulent transactions.)
2. What are the benefits of machine learning and stream computing
for organisations?
(Hint: To identify the pattern of fraudulent
transactions, raise alarms, etc.)
CASE STUDY 12

BIG DATA ANALYTICS ALLOWING USERS TO VISUALISE THE FUTURE OF FREE ONLINE CLASSIFIEDS


This Case Study shows how data is integrated from hundreds of
countries and dozens of languages, providing users with powerful,
data-driven insight to predict the future of free online
classifieds. It is with respect to Chapter 1.

BACKGROUND
OLX is a popular, fast-growing online classified advertising
website. It is active in around 105 countries and supports over 40
languages. The website has more than 125 million unique visitors
per month across the world and generates approximately one billion
page hits per month. OLX allows its users to design and
personalise their advertisements and add them to their social
networking profiles, so its data requires big data analytics.

CHALLENGES
The main challenge for the OLX website was to find new ways to use
business analytics to handle the vast data of its customers. The
business users of OLX required numerous metrics to track their
customer data. To achieve this aim, they needed to build good
control over their data warehouse. OLX took the help of
Datalytics, Pentaho's partner vendor, in finding solutions for
extracting, transforming and loading data from around the world
and then creating an improved data warehouse. After creating such
a warehouse, OLX wanted to allow its customers to visualise the
stored data in real time without facing any technical error or
barrier. OLX knew that this would be difficult for people without
previous Business Intelligence (BI) knowledge, so it was essential
to use a visualisation tool for this purpose. According to
Franciso Achaval, Business Intelligence Manager at OLX, "While it
may be easy for a BI analyst to understand what's happening in the
numbers, to explain this to business users who are not versed in
BI or OLAP (Online Analytical Processing), you need
visualisations."

SOLUTIONS
OLX approached Pentaho, a business intelligence software company
that provides open source products and services to its customers,
such as data integration, OLAP services, reporting, information
dashboards, etc. Pentaho has a partnership with Datalytics, a
consulting firm based in Argentina. Datalytics provides data
integration, business intelligence and data mining solutions to
Pentaho's worldwide clients.
N O T E S CASE STUDY 12

The following solutions were provided to OLX:
□ Pentaho Data Integration and Mondrian are used to handle a huge amount of data from across the world. This data is extracted, transformed and loaded from multiple sources, like Google Analytics, and then stored in a data warehouse (a minimal ETL sketch follows this list).
□ Pentaho Business Analytics allows users to gain insight into the data and analyse key trends in real time.
□ Pentaho's partner, Datalytics, provided assistance to OLX in the deployment and design of the new analytics solution. They examined new ways to build out data analysis capabilities and integrate big data.
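As a conceptual illustration of the extract-transform-load flow in the first bullet above, the pure-Python sketch below reads a tiny stand-in export file, transforms the rows and loads them into an in-memory SQLite table standing in for the warehouse. Pentaho Data Integration jobs are actually designed graphically, so this only mirrors the shape of the pipeline; all file, field and table names are hypothetical.

# Conceptual ETL sketch: extract rows from a CSV export, transform
# them, and load them into a stand-in warehouse table.
import csv
import sqlite3

with open("ga_export.csv", "w", newline="") as f:   # tiny stand-in export
    w = csv.writer(f)
    w.writerow(["country", "language", "page_hits"])
    w.writerow(["IN", "hi", "120345"])
    w.writerow(["BR", "pt", "98777"])

def extract(path):
    """Yield raw rows from the export file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Coerce fields to the warehouse schema's types."""
    for r in rows:
        yield (r["country"], r["language"], int(r["page_hits"]))

conn = sqlite3.connect(":memory:")      # stand-in for the data warehouse
conn.execute("CREATE TABLE listings_traffic "
             "(country TEXT, language TEXT, page_hits INTEGER)")
conn.executemany("INSERT INTO listings_traffic VALUES (?, ?, ?)",
                 transform(extract("ga_export.csv")))
print(conn.execute("SELECT * FROM listings_traffic").fetchall())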

RESULTS
OLX realised that Datalytics' expertise and Pentaho's platform
enabled it to deploy the new analytics solution in less than a
month. OLX has seen the following changes with the new solution:
□ Pentaho Business Analytics enables OLX to let its users create easy and creative reports about key business metrics.
□ Instead of buying an expensive enterprise solution or investing time in building a new data warehouse internally, OLX was able to save time by focussing on data integration with analytics capabilities.
□ Pentaho Business Analytics provides end-user satisfaction.
□ Pentaho Business Analytics provides a scalable solution to OLX, as it can integrate any type of data from any data source as the business grows. In addition, Datalytics' assistance gives OLX the opportunity to experiment with big data.

QUESTIONS
1. What were the challenges faced by OLX?


(Hint: The main challenge for OLX website was to find
new ways to use business analytics to handle the vast
data of their customers.)
2. What was the result of implementing Pentaho Business
Analytics?
(Hint: Pentaho Business Analytics enables OLX to
facilitate its users and create easy and creative reports
about key business metrics.)
