
BIG DATA AND BUSINESS ANALYTICS

Module: 2

DESCRIPTIVE ANALYTICS

2.1 Framework for Data-Driven Decision Making

2.2 Data Pre-processing – Imputing Missing Values using SPSS/R

2.3 Measure of central tendency – Mean, Median and Mode. Measure of Variation – Range, IQR, Variance and Standard Deviation. Measure of Shape – Skewness and Kurtosis, Central Limit Theorem

2.4 Data Visualization – Univariate, Bivariate and Multivariate

FRAMEWORK FOR DATA-DRIVEN DECISION MAKING

What Does It Mean to be “Data-Driven”?

Perhaps one of the most common buzzwords today is “big data.”


But what is “big data,” really? The term is generally used to describe the
magnitude and complexity of information. Even a small amount of
content could be considered “big data” if a large amount of information
has been extracted from it.

Then what does it mean to be “data-driven?” This term describes a


decision-making process which involves collecting data, extracting

patterns and facts from that data, and utilizing those facts to make
inferences that influence decision-making.

Data-driven decision making (or DDDM) is the process of


making organizational decisions based on actual data rather than
intuition or observation alone.

Every industry today aims to be data-driven. No company, group, or organization says, "Let's not use the data; our intuition alone will lead to solid decisions." Most professionals understand that, without data, bias and false assumptions (among other issues) can cloud judgment and lead to poor decision making. And yet, in a recent survey, 58 percent of respondents said that their companies base at least half of their regular business decisions on gut feel or intuition instead of data.

How, then, can you ensure you're making data-driven decisions that are free of bias and focused on clear questions that empower your organization?

Meaning of DDDM

Data-driven decision making is a process that involves collecting data based on measurable goals or KPIs, analyzing patterns and facts from that data, and utilizing them to develop strategies and activities that benefit the business in several areas.

DDDM means working towards key business goals by leveraging


verified, analyzed data rather than merely shooting in the dark.

Three reasons why this matters:

1. Computers can process information more quickly than humans.
2. Data can help overcome biases.
3. Data can help refine your gut feeling.

Case Studies of DDDM in Practice

“After reading a report about the future of the Internet that projected
annual web commerce growth at 2,300%, Bezos created a list of 20
products that could be marketed online. He narrowed the list to what he
felt were the five most promising products, which included: compact
discs, computer hardware, computer software, videos, and books. Bezos
finally decided that his new business would sell books online, because
of the large worldwide demand for literature, the low unit price for
books, and the huge number of titles available in print.”

Amazon continues to use data to influence what products or


services they should offer. We also saw the same principles apply to the
release of AWS (Amazon Web Services). Amazon realized that their
ecommerce platform had value, so they started working on making it
accessible to other merchants through merchant.com. During the process, they realized that they needed to make their code cleaner and easier to manage. This process eventually led to the creation of AWS.

Netflix is another company that relies on data quite a bit. Their


first major hit, House of Cards, came to be because of data. Netflix was
able to see that Kevin Spacey was popular with its users (based on his
movies) and that the political thriller could be a good fit for their
audience. Netflix doesn't have to guess what is likely to do well; instead, they simply look at their data and combine that with experienced content creators.

Framework for Becoming More Data-Driven

Every company is unique, but they can approach their data in


similar ways. The goal of the 4-step framework below is to help you
uncover insights hidden in your existing data and figure out what gaps
you should be working to fix.

Step 1: Identify High Impact Areas

We need to start by prioritizing which areas could benefit from data. The low-hanging fruit tends to be any team that is directly tied to revenue, such as sales or marketing, or any team that is overwhelmed, such as a customer success team dealing with onboarding.

Step 2: Audit of Existing Data & Gaps

You then need to analyze what data currently exists and what
gaps need to be addressed over time. The goal here is to have enough

data to understand what is going on. For example, if we are looking at sales, we would want data around where deals come from, how long they take to close, why deals fail, common attributes of the best deals, and so on. If you can't answer important questions, these are gaps to be solved.

Step 3: Stress Test Existing Tools & Reports

The next step is to stress test your existing tools and reports. Are
you able to easily generate the reports that you need? Is there a better
tool for tracking data? We want velocity when analyzing data so any
bottlenecks should be removed.

Step 4: Close Skills Gaps with Training

Finally, you want to close any gaps in analysis capabilities with


training. This means that your team should be able to query the data
they need, understand their reports and know how to mine the data for
insights. This is the last step because we need to know what team to
focus on (step 1), what data we have (step 2) and what tools are
available (step 3).

Use the framework to systematically work through different


business units and over time, you’ll find yourself making fewer
decisions based on anecdotes and more on facts.

How to Make Data-Driven Decisions

To effectively utilize data, professionals must achieve the following:

1. Know your mission.

A well-rounded data analyst knows the business well and possesses sharp organizational acumen. Ask yourself what the problems are in your given industry and competitive market. Identify and understand them thoroughly. Establishing this foundational knowledge will equip you to make better inferences with your data later on.

Before you begin collecting data, you should start by identifying the business questions that you want to answer to achieve your organizational goals. By determining the precise questions you need answered to inform your strategy, you'll be able to streamline the data collection process and avoid wasting resources.

2. Identify data sources.

Put together the sources from which you’ll be extracting your data. You
might be coordinating information from different databases, web-driven
feedback forms, and even social media.

Coordinating your various sources seems simple, but finding common


variables among each dataset can present a tremendously difficult
problem. It can be easy to settle for the immediate goal of utilizing the
data for your current purpose alone, but it’s wise to determine whether
or not this data could also be used for additional projects in the future. If
so, you should strive to develop a strategy to present the data in a way
that’s accessible in other scenarios as well.

3. Clean and organize data.

Surprisingly, 80 percent of a data analyst’s time is devoted to
cleaning and organizing data, and only 20 percent is spent actually
performing analysis. This so-called “80/20 rule” illustrates the
importance of having clean, orderly information before you can attempt
to interpret what it might mean for your organization.

The term “data cleaning” refers to the process of preparing raw


data for analysis by removing or correcting data that is incorrect,
incomplete, or irrelevant. To do so, start by building tables to organize
and catalog what you’ve found. Create a data dictionary—a table that
catalogs each of your variables and translates them into what they mean
to you in the context of this project. This information could include data
type and other processing factors, as well.
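A data dictionary can be as simple as a small table. Below is a minimal sketch in R using the built-in airquality data set; the description column is written by hand and is purely illustrative:

df <- airquality

# one row per variable: name, storage type, and a hand-written meaning
data.dictionary <- data.frame(
  variable    = names(df),
  type        = sapply(df, class),
  description = c("Ozone concentration (ppb)", "Solar radiation (langleys)",
                  "Average wind speed (mph)", "Temperature (degrees F)",
                  "Month of year", "Day of month"),
  row.names   = NULL
)

data.dictionary   # print the dictionary as a reference table for the project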

4. Perform statistical analysis.

Once you’ve thoroughly cleaned the data, you can begin to


analyze the information using statistical models. At this stage, you will
start to build models to test your data and answer the business questions
you identified earlier in the process. Testing different models such as linear regressions, decision trees, random forest modeling, and others can help you determine which method is best suited to your data set.
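As a small illustration of this model-testing step, the sketch below fits a linear regression in R on the built-in mtcars data; the variables are chosen only for illustration, not as a recipe for any particular business question:

# fit a simple linear regression: fuel efficiency explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# inspect coefficients, significance and R-squared to judge how well the model fits
summary(fit)

# predictions for new (hypothetical) observations
predict(fit, newdata = data.frame(wt = c(2.5, 3.5), hp = c(100, 150)))

The same pattern applies to other model families: fit, inspect the diagnostics, and compare before settling on the method best suited to the data set.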

Here, you will also need to decide how to present the information to
answer the question at hand. There are three different ways to
demonstrate your findings:

Descriptive Information: Just the facts.

Inferential Information: The facts, plus an interpretation of what


those facts indicate in the context of a particular project.

Predictive Information: An inference based upon facts and advice


for further action based on your reasoning.

Clarifying how the information will be most effectively presented


will help you remain organized when it comes time to interpret the data.

5. Draw conclusions.

The last step in data-driven decision making is coming to a


conclusion. Ask yourself, “What new information did you learn from the
collection of statistics?” Despite pressure to discover something entirely
new, a great place to start is by asking yourself questions to which you
already know—or think you know—the answer.

Many companies make frequent assumptions about their products


or market. For example, they might believe, “A market for this product
exists,” or, “This is what our customers want.” But before seeking out
new information, first put existing assumptions to the test. Proving these
assumptions are correct will give you a foundation to work from.
Alternatively, disproving these assumptions will allow you to eliminate
any false claims that have, perhaps unknowingly, been negatively
impacting your company. Keep in mind that an exceptional data-driven
decision usually generates more questions than answers.

The conclusions drawn from your analysis will ultimately help
your organization make more informed decisions and drive strategy
moving forward. It is important to remember, though, that these
findings can be virtually useless if they are not presented effectively.
Thus, data analysts must become skilled in the art of data storytelling to
communicate their findings with key stakeholders as effectively as
possible.

Data-Driven Decisions and Organizational Success

Incidentally, most of the steps listed above do not generate


statistics. Most of these steps to effectively utilize data instead encourage
novice data analysts to become well-rounded in their role. This process
helps professionals become capable of not only analyzing, but
understanding data from a holistic perspective and providing insight
based upon the data as well.

Joel Schwartz, a nonteaching affiliate of Northeastern, adds that it's worth asking, 'Who isn't utilizing data-driven decision making in my industry?' because the most successful companies almost always are. He continues:

Consider Netflix, for example. The company started as a


mail-based DVD sharing business and, based on a data-driven
decision, grew to internet streaming—becoming one of the most
successful companies today. Without data, Netflix would not have
had the basis to make such an immense and impactful decision.
Moreover, without that decision, the company would not have
flourished at the rate or in the direction it did.

Amazon is another poignant example. What started as an
online bookstore has blossomed into a massive online hub for just
about any product a person could want or need. What drove them
to make such enormous decisions? Data. It’s no surprise that such
major (and successful) rebranding moves were made based on
data collection and the inferences made as a result.

Without the data-driven approach to decision making,


Netflix would still be mailing you an outdated mode of movie
content and Amazon would be a simple online bookstore. The
bottom line is that this data-driven approach is putting all other
methods out of business. The world is becoming data-driven, and
to not make data-driven decisions would be foolish.

Data pre-processing

Data pre-processing is a data mining technique that involves
transforming raw data into an understandable format.
Real-world data is often incomplete, inconsistent, and/or lacking
in certain behaviours or trends, and is likely to contain many
errors.
Data pre-processing is a proven method of resolving such issues.

Data preprocessing is an important step in the data mining process; it is the step in which the data gets transformed, or encoded, to bring it to such a state that the machine can easily parse it.

Input of Data Resource

EXTRACT, TRANSFORM & LOAD (ETL)

Extract

What data do you want in the DW?

Transform

In what form do you want the extracted data in the DW?

Load

Load the transformed, extracted data into the DW.
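A minimal sketch of the three ETL steps in R, assuming the DBI and RSQLite packages are installed and a hypothetical sales.csv file with amount and order_date columns; a local SQLite file stands in for the data warehouse here:

library(DBI)

# Extract: read the raw data we want in the DW (hypothetical source file)
raw <- read.csv("sales.csv", stringsAsFactors = FALSE)

# Transform: shape the extracted data into the form we want in the DW
clean <- raw[!is.na(raw$amount), ]             # drop rows with a missing amount
clean$order_date <- as.Date(clean$order_date)  # standardise the date column

# Load: write the transformed, extracted data into the DW
con <- dbConnect(RSQLite::SQLite(), "warehouse.sqlite")
dbWriteTable(con, "sales_fact", clean, overwrite = TRUE)
dbDisconnect(con)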

R PROGRAMMING
R is a programming language and free software developed by Ross Ihaka and Robert Gentleman in 1993. R possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, and statistical inference.

LIST OF R PACKAGES USED TO IMPUTE MISSING VALUES

1. MICE

MICE (Multivariate Imputation via Chained Equations) is one of the most commonly used packages by R users. Creating multiple imputations, as compared to a single imputation (such as the mean), takes care of uncertainty in missing values.
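A minimal sketch of multiple imputation with mice, run here on the built-in airquality data (which has missing Ozone and Solar.R values); the number of imputations and the seed are arbitrary choices for illustration:

library(mice)

df <- airquality

# create 5 imputed data sets using predictive mean matching ("pmm")
imp <- mice(df, m = 5, method = "pmm", seed = 123)

# extract the first completed data set for analysis
completed <- complete(imp, 1)
summary(completed)

Because several completed data sets are produced, an analysis can be repeated on each one and the results pooled, which is how the package accounts for the uncertainty in the missing values.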

2. Amelia

This package (Amelia II) is named after Amelia Earhart, the first female aviator to fly solo across the Atlantic Ocean. History says she mysteriously disappeared (went missing) while flying over the Pacific Ocean in 1937; hence the package, built to solve missing value problems, was named after her.

3. missForest

missForest is an implementation of the random forest algorithm. It's a non-parametric imputation method applicable to various variable types. So, what's a non-parametric method?

A non-parametric method does not make explicit assumptions about the functional form of f (an arbitrary function). Instead, it tries to estimate f such that it is as close to the data points as possible without being impractical.
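A short sketch with missForest; prodNA() from the same package is used only to knock out some values of the built-in iris data so the example has something to impute:

library(missForest)

# introduce roughly 10% missing values into iris for illustration
iris.mis <- prodNA(iris, noNA = 0.1)

# non-parametric imputation with random forests (handles numeric and factor columns)
imp <- missForest(iris.mis)

head(imp$ximp)   # the imputed data set
imp$OOBerror     # estimated out-of-bag imputation error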

4. Hmisc

Hmisc is a multi-purpose package useful for data analysis, high-level graphics, imputing missing values, advanced table making, model fitting & diagnostics (linear regression, logistic regression & Cox regression), etc. Amid the wide range of functions contained in this package, it offers two powerful functions for imputing missing values: impute() and aregImpute(). It also has a transcan() function, but aregImpute() is better to use.
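A minimal sketch of impute(), which fills a single vector with a simple statistic such as the mean; aregImpute() is the more powerful option but follows the same install-and-call pattern:

library(Hmisc)

x <- airquality$Ozone        # a vector with missing values

# replace missing entries with the mean (median or a constant also work)
x.imputed <- impute(x, fun = mean)

x.imputed                    # imputed entries are marked when printed
sum(is.imputed(x.imputed))   # how many values were filled in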

5. mi

The mi (Multiple imputation with diagnostics) package provides several features for dealing with missing values. Like other packages, it also builds multiple imputation models to approximate missing values, and it uses the predictive mean matching method.

SPSS(Statistical Package for Social Science )

SPSS is a statistical analysis program that is used in a variety of fields, from market research to government agencies. It allows you to perform a variety of functions on your data, but you need data before you can do any of that. There are several ways to enter data into SPSS, from entering it manually to importing it from another file.

SPSS means “Statistical Package for the Social Sciences” and was
first launched in 1968. Since SPSS was acquired by IBM in 2009, it's
officially known as IBM SPSS Statistics, but most users still just refer to it
as “SPSS”.

Refer : https://www.spss-tutorials.com/basics/

MEASURE OF CENTRAL TENDENCY - MEAN, MEDIAN, MODE

Measures of central tendency represent a single value that


attempts to describe a set of data by identifying the central position
within that set of data. As such, measures of central tendency are
sometimes called measures of central location. In other words, it is a
measure that tells us where the middle of a set of data is. The mean,
which is often called the average, is the most well-known of the
measures of central tendency.

Mean

The mean is the arithmetic average, and it is probably the measure of central tendency that you are most familiar with. Calculating the mean is very simple. You just add up all the values and divide by the number of observations in your dataset.

The calculation of the mean incorporates all values in the data. If


you change any value, the mean changes. However, the mean doesn’t
always locate the center of the data accurately.

Median

The median is the middle value. It is the value that splits

the dataset in half. To find the median, order your data from

smallest to largest, and then find the data point that has an

equal number of values above it and below it. The method for

locating the median varies slightly depending on whether your

dataset has an even or odd number of values.

(Figure: locating the median for an even or odd number of values)

Mode

The mode is the value that occurs the most frequently in your data
set. On a bar chart, the mode is the highest bar. If the data have multiple
values that are tied for occurring the most frequently, you have a
multimodal distribution. If no value repeats, the data do not have a
mode.
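A short R sketch of all three measures on a small vector; base R has no built-in function for the statistical mode, so a tiny helper (written here only for illustration) picks the most frequent value:

x <- c(1, 3, 5, 7, 9, 4, 2, 6, 3, 100)

mean(x)     # arithmetic average: sum of all values divided by the number of values
median(x)   # middle value of the sorted data

# mode: the value that occurs most frequently
get_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
get_mode(x) # 3 appears twice, more often than any other value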

MEASURE OF VARIATION - RANGE, IQR, VARIANCE AND STANDARD DEVIATION

Range

Range is nothing but the difference between the max and min values of the data set. For the data sets we considered above, the range is (15 - (-5)) = 20 and 7 - 3 = 4 respectively for Set1 and Set2.

Let's look at another data series with an outlier, which we used in central tendency:

1 3 5 7 9 4 2 6 3 100

The range for this data series is 100 - 1 = 99. But as you can visually see, most of the numbers in the series lie between 1 and 9. Hence, we can say that the range is very sensitive to outliers on either the left or the right side.

Interquartile Range (IQR)

The interquartile range is a measure of where the “middle fifty” is

in a data set. Where a range is a measure of where the beginning and

end are in a set, an interquartile range is a measure of where the bulk of

the values lie. That’s why it’s preferred over many other measures of

spread when reporting things like school performance or SAT scores.

We will follow the below steps to compute IQR

1. Sort the numbers in order.

2. Compute the median, or 50th percentile. This is called Q2.

3. Compute the lower and upper quartiles. The lower quartile (Q1) is the middle value of the data below Q2, and the upper quartile (Q3) is the middle value of the data above Q2.

4. The difference between Q3 and Q1 is called the IQR. (An R sketch of these steps follows below.)
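The same steps in R, using the outlier series from the range discussion above; note that quantile() has several calculation methods, so its quartiles may differ slightly from a hand calculation:

x <- c(1, 3, 5, 7, 9, 4, 2, 6, 3, 100)

# Range: difference between the maximum and minimum values
max(x) - min(x)      # 99
diff(range(x))       # same result

# IQR: Q3 - Q1, the spread of the "middle fifty"
quantile(x, 0.25)    # lower quartile Q1
quantile(x, 0.75)    # upper quartile Q3
IQR(x)               # Q3 - Q1 directly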

Variance

While the range and IQR use the extreme values of the data set, in most cases we want to measure the spread with respect to the mean. Variance is one such measure. Variance is the average squared difference between each data point and the mean.

Standard Deviation

The standard deviation is just the square root of the variance.

s = √Var

 The Standard Deviation is a measure of how spread out numbers


are.
 Its symbol is σ (the Greek letter sigma)

 The formula is easy: it is the square root of the Variance. So now
you ask, "What is the Variance?"

Variance

The Variance is defined as:

The average of the squared differences from the Mean.

To calculate the variance, follow these steps:

 Work out the Mean (the simple average of the numbers)


 Then for each number: subtract the Mean and square the result
(the squared difference).
 Then work out the average of those squared differences. (Why
Square?)
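These steps map directly onto a few lines of R. One caveat: var() and sd() in R divide by n - 1 (the sample formula), so the by-hand version below uses the same denominator to match:

x <- c(1, 3, 5, 7, 9, 4, 2, 6, 3, 100)

m <- mean(x)                      # step 1: work out the mean
sq.diff <- (x - m)^2              # step 2: squared difference for each number
sum(sq.diff) / (length(x) - 1)    # step 3: average the squared differences (n - 1)

var(x)                            # built-in sample variance, same value
sd(x)                             # standard deviation = square root of the variance
sqrt(var(x))                      # identical to sd(x)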

Reference :

https://www.mathsisfun.com/data/standard-deviation.html

MEASURE OF SHAPE - SKEWNESS AND KURTOSIS, CENTRAL LIMIT THEOREM

Skewness is a measure of asymmetry or distortion of symmetric
distribution. It measures the deviation of the given distribution of a
random variable from a symmetric distribution, such as normal
distribution. A normal distribution is without any skewness, as it is
symmetrical on both sides. Hence, a curve is regarded as skewed if it is
shifted towards the right or the left.

Types of Skewness

1. Positive Skewness

If the given distribution is shifted to the left, with its tail on the right side, it is a positively skewed distribution. It is also called a right-skewed distribution.

2. Negative Skewness

If the given distribution is shifted to the right, with its tail on the left side, it is a negatively skewed distribution. It is also called a left-skewed distribution.

Kurtosis

The degree of tailedness of a distribution is measured by kurtosis. It tells us the extent to which the distribution is more or less outlier-prone (heavier- or lighter-tailed) than the normal distribution.

Formula: β₂ = μ₄ / μ₂²

There are three different types of curves:

The normal curve is called a mesokurtic curve. If the curve of a distribution is more outlier-prone (or heavier-tailed) than a normal or mesokurtic curve, then it is referred to as a leptokurtic curve. If a curve is less outlier-prone (or lighter-tailed) than a normal curve, it is called a platykurtic curve.
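A small R sketch that computes both shape measures directly from their moment definitions (skewness = μ₃ / μ₂^(3/2), kurtosis β₂ = μ₄ / μ₂²); add-on packages such as e1071 or moments provide equivalent functions, but the moments are written out here so the example is self-contained:

set.seed(1)
x <- rnorm(1000)                 # example data; any numeric vector works

m   <- mean(x)
mu2 <- mean((x - m)^2)           # second central moment
mu3 <- mean((x - m)^3)           # third central moment
mu4 <- mean((x - m)^4)           # fourth central moment

mu3 / mu2^(3/2)                  # skewness: ~0 symmetric, >0 right-skewed, <0 left-skewed
mu4 / mu2^2                      # kurtosis: ~3 mesokurtic, >3 leptokurtic, <3 platykurtic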

Central limit theorem

The Central Limit Theorem (CLT) states that the distribution of the sample mean approximates a normal distribution as the sample size becomes larger, assuming that all the samples are of equal size, no matter what the shape of the population distribution is.
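A quick simulation of the theorem in R: samples are drawn from a strongly right-skewed (exponential) population, yet the distribution of their means comes out roughly bell-shaped once the sample size is reasonably large; the sizes used are arbitrary:

set.seed(42)

# a population that is clearly not normal
population <- rexp(100000, rate = 1)

# draw 1,000 samples of size 50 and record each sample mean
sample.means <- replicate(1000, mean(sample(population, size = 50)))

# histogram of the sample means is approximately normal, centred near 1
hist(sample.means, breaks = 30, main = "Distribution of sample means")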

DATA VISUALISATION - UNIVARIATE, BIVARIATE AND MULTIVARIATE

Data visualization is the graphical representation of information and data by using visual elements like charts, graphs, and maps. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

1. Univariate

This type of data consists of only one variable. The analysis of univariate data is thus the simplest form of analysis, since the information deals with only one quantity that changes. It does not deal with causes or relationships, and the main purpose of the analysis is to describe the data and find patterns that exist within it. An example of univariate data is height.

2. Bivariate data

This type of data involves two different variables. The analysis of this type of data deals with causes and relationships, and the analysis is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.

3. Multivariate data –

When the data involves three or more variables, it is categorized under multivariate. An example of this type of data: suppose an advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both men and women, and relationships between variables can then be examined.

It is like bivariate analysis but contains more than one dependent variable. The way to perform analysis on this data depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
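Minimal base R sketches of the three cases, using the built-in iris data so the snippet runs as-is; the choice of variables is purely illustrative:

data(iris)

# Univariate: distribution of a single variable
hist(iris$Sepal.Length, main = "Univariate: sepal length")

# Bivariate: relationship between two variables
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal length", ylab = "Petal length",
     main = "Bivariate: sepal vs petal length")

# Multivariate: pairwise relationships among several variables,
# coloured by a categorical variable
pairs(iris[, 1:4], col = iris$Species,
      main = "Multivariate: scatterplot matrix")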

Reference:

https://medium.com/analytics-vidhya/univariate-bivariate-and-multivariate-analysis-8b4fc3d8202c

What is data visualization?

Data visualization is the process of translating large data sets and metrics into
charts, graphs, and other visuals. The resulting visual representation of data makes it
easier to identify and share real-time trends, outliers, and new insights about the
information represented in the data.  

A dashboard is an information visualization tool. It helps you monitor events or activities at a glance by providing insights on one or more pages or screens. Unlike an infographic, which presents a static graphical representation, a dashboard conveys real-time information by pulling complex data points directly from large data sets. An interactive dashboard makes it easy to sort, filter, or drill into different types of data as needed. Data science techniques can be used to identify what is happening, why it is happening, and what will happen next, all at speed.

As the amount of big data increases, more people are using data visualization tools
to access insights on their computer and on mobile devices. Dashboards are used by
businesspeople, data analysts, and data scientists to make data-driven business
decisions.

Reference:

https://www.tableau.com/learn/articles/data-visualization

******************************************************

