Module 2 - BA
Module: 2
DESCRIPTIVE ANALYTICS
MODULE :03 1
patterns and facts from that data, and utilizing those facts to make
inferences that influence decision-making.
Meaning of DDDM
1. Computers can process information more quickly
2. Data can help overcome biases
3. Data can help refine your gut feeling
“After reading a report about the future of the Internet that projected
annual web commerce growth at 2,300%, Bezos created a list of 20
products that could be marketed online. He narrowed the list to what he
felt were the five most promising products, which included: compact
discs, computer hardware, computer software, videos, and books. Bezos
finally decided that his new business would sell books online, because
of the large worldwide demand for literature, the low unit price for
books, and the huge number of titles available in print.”
You then need to analyze what data currently exists and what
gaps need to be addressed over time. The goal here is to have enough
data to understand what is going on. For example, if we are looking at
sales, we would want data on where deals come from, how long
they take to close, why deals fail, attributes common to the best deals,
and so on. If you can’t answer important questions, these are gaps to be
solved.
The next step is to stress test your existing tools and reports. Are
you able to easily generate the reports that you need? Is there a better
tool for tracking data? We want velocity when analyzing data so any
bottlenecks should be removed.
To effectively utilize data, professionals must achieve the following:
A well-rounded data analyst knows the business well and possesses sharp
organizational acumen. Ask yourself what the problems are in your
given industry and competitive market. Identify and understand them
thoroughly. Establishing this foundational knowledge will equip you to
make better inferences with your data later on.
Before you begin collecting data, you should start by identifying the
business questions that you want to answer to achieve your
organizational goals. By determining the precise questions you need to
answer to inform your strategy, you’ll be able to streamline the data
collection process and avoid wasting resources.
Put together the sources from which you’ll be extracting your data. You
might be coordinating information from different databases, web-driven
feedback forms, and even social media.
Surprisingly, 80 percent of a data analyst’s time is devoted to
cleaning and organizing data, and only 20 percent is spent actually
performing analysis. This so-called “80/20 rule” illustrates the
importance of having clean, orderly information before you can attempt
to interpret what it might mean for your organization.
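As a rough illustration of what "cleaning and organizing" can involve, the following minimal sketch normalizes text fields, drops rows with a missing value, and removes duplicates. The records, field names, and the `clean` helper are hypothetical and invented for this example; real cleaning rules depend on the data at hand.

```python
# Hypothetical raw records, invented for illustration only.
raw_records = [
    {"customer": "  Alice ", "region": "north", "amount": "120.50"},
    {"customer": "Bob", "region": "South", "amount": ""},          # missing amount
    {"customer": "alice", "region": "North", "amount": "120.50"},  # duplicate of row 1
    {"customer": "Carol", "region": "EAST ", "amount": "75"},
]


def clean(records):
    """Normalize text fields, drop rows with missing amounts, remove duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        customer = row["customer"].strip().title()
        region = row["region"].strip().title()
        amount_text = row["amount"].strip()
        if not amount_text:
            continue  # skip rows whose amount is missing
        amount = float(amount_text)
        key = (customer, region, amount)
        if key in seen:
            continue  # skip rows that duplicate an earlier row after normalization
        seen.add(key)
        cleaned.append({"customer": customer, "region": region, "amount": amount})
    return cleaned


cleaned_records = clean(raw_records)
for row in cleaned_records:
    print(row)
```

Only after a pass like this does it make sense to compute totals or averages, since duplicates and blanks would otherwise distort them.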
Here, you will also need to decide how to present the information to
answer the question at hand. There are three different ways to
demonstrate your findings:
5. Draw conclusions.
The conclusions drawn from your analysis will ultimately help
your organization make more informed decisions and drive strategy
moving forward. It is important to remember, though, that these
findings can be virtually useless if they are not presented effectively.
Thus, data analysts must become skilled in the art of data storytelling to
communicate their findings with key stakeholders as effectively as
possible.
Amazon is another telling example. What started as an
online bookstore has blossomed into a massive online hub for just
about any product a person could want or need. What drove them
to make such enormous decisions? Data. It’s no surprise that such
major (and successful) rebranding moves were made based on
data collection and the inferences made as a result.
Data pre-processing
Data pre-processing is a data mining technique that involves
transforming raw data into an understandable format.
Real-world data is often incomplete, inconsistent, and/or lacking
in certain behaviours or trends, and is likely to contain many
errors.
Data pre-processing is a proven method of resolving such issues.
encoded, to bring it to such a state that now the machine can easily parse
it.
EXTRACT, TRANSFORM, & LOAD
Extract
Transform
Load
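The three stages named above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the CSV text, the column names, the `orders` table, and the functions `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an exported CSV file.
SOURCE_CSV = """order_id,product,quantity,unit_price
1,book,2,12.50
2,video,1,20.00
3,book,3,12.50
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert text to numbers and derive a total per order."""
    return [
        (int(r["order_id"]), r["product"], int(r["quantity"]) * float(r["unit_price"]))
        for r in rows
    ]


def load(rows, connection):
    """Load: write the transformed rows into a target table."""
    connection.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, total REAL)")
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)

# Once loaded, the data can be queried for analysis.
book_total = connection.execute(
    "SELECT SUM(total) FROM orders WHERE product = 'book'"
).fetchone()[0]
print(book_total)  # 62.5
```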
R PROGRAMMING
R is a programming language and free software developed by
Ross Ihaka and Robert Gentleman in 1993. R possesses an extensive
catalog of statistical and graphical methods. It includes machine learning
algorithms, linear regression, time series, statistical inference
1. MICE
2. Amelia
This package (Amelia II) is named after Amelia Earhart, the first
female aviator to fly solo across the Atlantic Ocean. She mysteriously
disappeared (went missing) while flying over the Pacific Ocean in 1937;
hence the name of this package, which is meant to solve missing value problems.
3. missForest
4. Hmisc
5. mi
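Assuming the list above enumerates R packages for handling missing values (as the Amelia description suggests), the basic idea of imputation can be illustrated with the simplest possible approach: replacing each missing entry with the mean of the observed values. This sketch is in Python, uses hypothetical data, and is not how these packages work internally; they generally rely on more sophisticated, model-based methods.

```python
from statistics import mean

# Hypothetical observations; None marks a missing value.
ages = [23, None, 31, 27, None, 35]


def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


imputed_ages = impute_mean(ages)
print(imputed_ages)
```

Mean imputation keeps the average unchanged but understates the variability of the data, which is one reason dedicated packages exist.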
SPSS means “Statistical Package for the Social Sciences” and was
first launched in 1968. Since SPSS was acquired by IBM in 2009, it's
officially known as IBM SPSS Statistics, but most users still just refer to it
as “SPSS”.
Refer: https://www.spss-tutorials.com/basics/
Mean
Median
The median is the middle value: it splits
the dataset in half. To find the median, order your data from
smallest to largest, and then find the data point that has an
equal number of values above it and below it. The method for
Mode
The mode is the value that occurs the most frequently in your data
set. On a bar chart, the mode is the highest bar. If the data have multiple
values that are tied for occurring the most frequently, you have a
multimodal distribution. If no value repeats, the data do not have a
mode.
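The three measures of central tendency can be checked quickly with Python's standard `statistics` module. The sketch below uses the ten-value series with an outlier that these notes use when discussing the range; the variable names are illustrative.

```python
from statistics import mean, median, multimode

data = [1, 3, 5, 7, 9, 4, 2, 6, 3, 100]

data_mean = mean(data)        # sum of the values divided by their count
data_median = median(data)    # middle of the sorted values (average of the two middle ones here)
data_modes = multimode(data)  # every value tied for most frequent

print(data_mean, data_median, data_modes)
```

The single outlier (100) pulls the mean up to 14, while the median (4.5) and the mode (3) stay close to the bulk of the data. Note that `multimode` returns every value when nothing repeats, whereas these notes treat that case as having no mode.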
MEASURES OF VARIATION: RANGE, IQR, VARIANCE AND
STANDARD DEVIATION
Range
Range is nothing but the difference between max and min values of
the data set. For the data sets we considered above the range is (15-(-
Let’s look at another data series with an outlier, which we used in the
central tendency section:
1 3 5 7 9 4 2 6 3 100
The range for this data series is 100 - 1 = 99. But as you can see,
most of the numbers in the series lie between 1 and 9. Hence, we can
say that the range is very sensitive to outliers on either the left or
the right side.
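The effect of the outlier on the range can be verified directly. In this small sketch the helper name `value_range` is illustrative:

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


series = [1, 3, 5, 7, 9, 4, 2, 6, 3, 100]
without_outlier = [v for v in series if v != 100]

print(value_range(series))           # 99
print(value_range(without_outlier))  # 8
```

Removing a single value changes the range from 99 to 8, which is what "sensitive to outliers" means in practice.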
the values lie. That’s why it’s preferred over many other measures of
We will follow the steps below to compute the IQR:
3. Compute the lower and upper quartiles. The lower quartile (Q1) is
the middle value of the data below Q2, and the upper quartile (Q3) is
the middle value of the data above Q2.
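A minimal sketch of this computation follows. It assumes the earlier steps are to sort the data and to find the median (Q2), and it excludes the middle value from both halves when the number of values is odd. Other quartile conventions exist and can give slightly different results; the function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # values above it (middle value skipped when n is odd)
    return median(lower), median(ordered), median(upper)


series = [1, 3, 5, 7, 9, 4, 2, 6, 3, 100]
q1, q2, q3 = quartiles(series)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For this series the IQR is 7 - 3 = 4. Unlike the range (99), it is unaffected by the outlier 100.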
Variance
While the Range and IQR are using extreme values of the data set,
Standard Deviation
s = √Var
The formula is easy: it is the square root of the Variance. So now
you ask, "What is the Variance?"
Variance
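The variance is the average of the squared differences from the mean, and the standard deviation is its square root. The sketch below works this out step by step for the ten-value series used in these notes. It shows both the population form (divide by N) and the sample form (divide by N - 1), since the notes do not state which one is intended; the variable names are illustrative.

```python
from math import isclose, sqrt
from statistics import pvariance, variance

series = [1, 3, 5, 7, 9, 4, 2, 6, 3, 100]

mean_value = sum(series) / len(series)
squared_deviations = [(x - mean_value) ** 2 for x in series]

population_variance = sum(squared_deviations) / len(series)    # divide by N
sample_variance = sum(squared_deviations) / (len(series) - 1)  # divide by N - 1

population_sd = sqrt(population_variance)  # s = sqrt(Var)
sample_sd = sqrt(sample_variance)

# Cross-check the step-by-step results against the standard library.
assert isclose(population_variance, pvariance(series))
assert isclose(sample_variance, variance(series))

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

Here the mean is 14 and the population variance is 827, almost all of which comes from the single outlier 100.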
Reference :
https://www.mathsisfun.com/data/standard-deviation.html
Skewness is a measure of the asymmetry of a distribution, that is, of its
distortion from a symmetric distribution. It measures the deviation of the
given distribution of a random variable from a symmetric distribution, such
as the normal distribution. A normal distribution has no skewness, as it is
symmetrical on both sides. Hence, a curve is regarded as skewed if its bulk is
shifted towards the right or the left.
Types of Skewness
1. Positive Skewness
If the bulk of the given distribution is shifted to the left, with its tail on
the right side, it is a positively skewed distribution. It is also called a
right-skewed distribution.
2. Negative Skewness
If the bulk of the given distribution is shifted to the right, with its tail on
the left side, it is a negatively skewed distribution. It is also called a
left-skewed distribution.
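The sign convention can be checked numerically. The notes do not give a formula for skewness, so this sketch assumes one common definition, the moment coefficient m3 / m2^1.5, where m2 and m3 are the second and third central moments. The data sets and the function name `skewness` are illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean_value = sum(values) / n
    m2 = sum((x - mean_value) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean_value) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]  # long tail on the right
left_tailed = [-x for x in right_tailed]  # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
print(skewness(symmetric))     # 0.0
```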
Kurtosis
Formula: β2 = μ4 / μ2²
is less outlier-prone (or lighter-tailed) than a normal curve, it is called
a platykurtic curve.
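A small sketch of the kurtosis coefficient follows, taking β2 as the fourth central moment divided by the square of the second central moment. It relies on the standard result that a normal curve has β2 = 3, so values below 3 indicate a lighter-tailed (platykurtic) shape and values above 3 a heavier-tailed one. The data sets and the function name `kurtosis` are illustrative.

```python
def kurtosis(values):
    """Kurtosis coefficient beta2 = mu4 / mu2**2 (population moments)."""
    n = len(values)
    mean_value = sum(values) / n
    mu2 = sum((x - mean_value) ** 2 for x in values) / n  # second central moment
    mu4 = sum((x - mean_value) ** 4 for x in values) / n  # fourth central moment
    return mu4 / mu2 ** 2


flat = [1, 2, 3, 4, 5]                        # evenly spread, no outliers
with_outlier = [1, 3, 5, 7, 9, 4, 2, 6, 3, 100]  # one extreme value

print(kurtosis(flat))          # below 3: lighter-tailed than a normal curve
print(kurtosis(with_outlier))  # above 3: heavier-tailed than a normal curve
```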
1. Univariate
2. Bivariate data
3. Multivariate data
When the data involves three or more variables, it is
categorized as multivariate. For example, suppose an
advertiser wants to compare the popularity of four
advertisements on a website. Their click rates could be
measured for both men and women, and the relationships between
the variables can then be examined.
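The advertising example can be sketched with three variables per observation: advertisement, gender, and click rate. All figures below are hypothetical and invented purely for illustration, as is the helper `mean_rate_by`.

```python
from statistics import mean

# Hypothetical click rates (percent); invented purely for illustration.
# Three variables per observation: advertisement, gender, click rate.
observations = [
    ("Ad A", "men", 2.1), ("Ad A", "women", 3.4),
    ("Ad B", "men", 1.8), ("Ad B", "women", 1.9),
    ("Ad C", "men", 4.0), ("Ad C", "women", 2.2),
    ("Ad D", "men", 2.5), ("Ad D", "women", 2.7),
]


def mean_rate_by(index):
    """Average click rate grouped by the variable at position `index` (0 = ad, 1 = gender)."""
    groups = {}
    for row in observations:
        groups.setdefault(row[index], []).append(row[2])
    return {key: mean(rates) for key, rates in groups.items()}


by_ad = mean_rate_by(0)
by_gender = mean_rate_by(1)
print(by_ad)
print(by_gender)
```

Grouping by one variable at a time is only a first look; examining the individual rows (for instance, one advertisement doing well with men but not with women) is what makes the analysis multivariate.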
Reference:
https://medium.com/analytics-vidhya/univariate-bivariate-and-multivariate-analysis-8b4fc3d8202c
Data visualization is the process of translating large data sets and metrics into
charts, graphs, and other visuals. The resulting visual representation of data makes it
easier to identify and share real-time trends, outliers, and new insights about the
information represented in the data.
As the amount of big data increases, more people are using data visualization tools
to access insights on their computers and mobile devices. Dashboards are used by
businesspeople, data analysts, and data scientists to make data-driven business
decisions.
Reference:
https://www.tableau.com/learn/articles/data-visualization
******************************************************