Data Visualization
Sum: the total of the values in the entire data set; applies to any data. Example: total sales for a company. Useful in knowing the total value.
Mean: the average of all values; applies to any data. Example: average sales per month. Useful in capturing the central tendency of the data set.
Median: the midpoint value when the data set is arranged from high to low; used for finding the midpoint in the distribution of data. Example: income per citizen of a country. Useful in finding the point where 50 percent of the data lies above and 50 percent below.
Mode: the most common value in the data set; used for knowing where values are highly repeated in the data set. Example: fixed annual salaries where a limited number of wage levels are used. Useful in identifying a common value in highly repetitive data sets.
Maximum/Minimum: the largest and smallest values, respectively; used to conceptualize the spread of the data's distribution. Example: the largest and smallest sales in a day. Useful in providing the scope or end points of the data.
Range: the difference between the maximum and minimum values; a crude estimate of the spread of the data's distribution. Example: the spread of unit sales during a given month. Useful as a simple estimate of dispersion.
Standard Deviation: the square root of the average of the squared differences between the mean value and each data value in the distribution; a precise estimate of the spread of the data around the mean, expressed in the units used in the computation. Example: a standard deviation in naira from average sales. The smaller the value, the less the variation and the more predictable the data set.
Variance: the average of the squared differences between the mean value and each observation in the data set; an estimate of the spread of the data around the mean, but not in the units used in the computation. Best used for comparing one variance with another. The smaller the value, the less the variation and the more predictable the data set.
Coefficient of Skewness: positive or negative; a measure of the symmetry (degree of asymmetry) of the data around the mean. A positive coefficient means the data distribution is positively skewed, and vice versa; the larger the coefficient, the greater the skewness. Example: as the population of a country ages (having more old people than young), the age distribution becomes negatively skewed. The closer the coefficient of skewness is to zero, the more symmetric the data; positively skewed data has most of its values concentrated to the left with a longer tail to the right, and vice versa.
Fortunately, we do not need to compute these statistics by hand to know how to use them. Computer software provides these descriptive statistics wherever they are needed or requested. Illustrations will be given in class using appropriate software and an example sales data set, as presented below.
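In the meantime, a minimal sketch in Python (one of the packages mentioned later in these notes) shows how directly these statistics can be obtained; the monthly sales figures in it are made up for illustration and are not the class data set.

# Descriptive statistics of a small (made-up) monthly sales series using pandas
import pandas as pd

sales = pd.Series([120, 135, 150, 150, 160, 175, 200])  # hypothetical monthly sales

print("Sum:", sales.sum())
print("Mean:", sales.mean())
print("Median:", sales.median())
print("Mode:", sales.mode().tolist())
print("Max/Min:", sales.max(), sales.min())
print("Range:", sales.max() - sales.min())
print("Standard deviation:", sales.std())
print("Variance:", sales.var())
print("Coefficient of skewness:", sales.skew())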
Histograms:
Histograms are useful for showing the distribution of a single scale variable. Data are binned and
summarized using a count or percentage statistic. A variation of a histogram is a frequency
polygon, which is like a typical histogram except that the area graphic element is used instead of
the bar graphic element.
1. Syntax in Stata: histogram var_name, normal
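A rough equivalent in Python uses matplotlib; the data values and bin count below are illustrative assumptions, not part of the Stata example.

# Histogram of a single scale variable with matplotlib (illustrative data)
import matplotlib.pyplot as plt

sales = [120, 135, 150, 150, 160, 175, 200, 210, 230]  # hypothetical values
plt.hist(sales, bins=5, edgecolor="black")  # bin the data and count observations per bin
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.title("Histogram of sales")
plt.show()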
Numerous other descriptive visualization techniques, such as Gantt charts, line trends, dot plots, etc., exist. We will look at more of these and demonstrate them using various statistical software applications such as SPSS, Stata, spreadsheets, Python, etc., as time permits.
PREDICTIVE MODELING
Predictive modeling means developing models that can be used to forecast or predict future
events. In business analytics, models can be developed based on logic or data.
Logic-Driven Models
A logic-driven model is one based on experience, knowledge, and logical relationships of
variables and constants connected to the desired business performance outcome situation. The
question here is how to put variables and constants together to create a model that can predict the
future. Doing this requires business experience. Model building requires an understanding of
business systems and the relationships of variables and constants that seek to generate a desirable
business performance outcome. To help conceptualize the relationships inherent in a business
system, diagramming methods can be helpful.
For example, the cause-and-effect diagram is a visual aid diagram that permits a user to
hypothesize relationships between potential causes of an outcome (see figure below). This
diagram lists potential causes in terms of human, technology, policy, and process resources in an
effort to establish some basic relationships that impact business performance. The diagram is
used by tracing contributing and relational factors from the desired business performance goal
back to possible causes, thus allowing the user to better picture sources of potential causes that
could affect the performance. This diagram is sometimes referred to as a fishbone diagram
because of its appearance.
Another useful diagram to conceptualize potential relationships with business performance
variables is called the influence diagram. According to Evans (2013, pp. 228–229), influence
diagrams can be useful to conceptualize the relationships of variables in the development of
models. An example of an influence diagram is presented in the next Figure. It maps the
relationship of variables and a constant to the desired business performance outcome of profit.
From such a diagram, it is easy to convert the information into a quantitative model with
constants and variables that define profit in this situation:
Profit = Revenue − Cost, or
Profit = (Unit Price × Quantity Sold) − [(Fixed Cost) + (Variable Cost × Quantity Sold)], or
P = (UP × QS) − [FC + (VC × QS)]
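This logic-driven model translates directly into code. The sketch below simply evaluates P = (UP × QS) − [FC + (VC × QS)] in Python for illustrative parameter values.

# Logic-driven profit model: P = (UP * QS) - (FC + VC * QS)
def profit(unit_price: float, quantity_sold: float,
           fixed_cost: float, variable_cost: float) -> float:
    revenue = unit_price * quantity_sold
    total_cost = fixed_cost + variable_cost * quantity_sold
    return revenue - total_cost

# Illustrative values only
print(profit(unit_price=50, quantity_sold=1000, fixed_cost=10000, variable_cost=30))
# -> 10000.0 (revenue of 50,000 minus total cost of 40,000)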
The relationships in this simple example are based on fundamental business knowledge.
Consider, however, how complex cost functions might become without some idea of how they
are mapped together. It is necessary to be knowledgeable about the business systems being
modeled in order to capture the relevant business behavior. Cause-and-effect diagrams and
influence diagrams provide tools to conceptualize relationships, variables, and constants, but it
often takes many other methodologies to explore and develop predictive models.
Data-Driven Models
Logic-driven modeling is often used as a first step to establish relationships through data-driven
models (using data collected from many sources to quantitatively establish model relationships).
Some of the popular techniques for developing and applying data-driven models include the following:
Regression modelling:
Regression analysis is a common tool used in business, finance and other fields to study variable
dependency. This means that it can help a professional in these areas understand the relationship
between key variables. Learning about regression and its various methods can help you gain the
analytic skills necessary to succeed in a data-driven position.
Regression analysis is a mathematically measured correlation of variables used as a predictive
modeling method. You use regression modeling to predict numerical values depending on
various inputs. For example, you can understand the relationship between an independent and
dependent variable, allowing you to predict how the dependent variable changes along with its
independent counterpart. In this case, the dependent variable is what you’re measuring and the
independent variable is the factor that causes change.
In business, regression analysis can help:
(1) Forecast trends,
(2) Predict strengths and areas of weakness or
(3) Establish cause-and-effect relationships to make informed business decisions and strategic
plans.
You often calculate regression analysis through machine learning or artificial intelligence,
though there are also mathematical equations you can use. There are different analysis types that
you can use based on the nature of the variables you are predicting and what information you
would like to gather from your analysis. The model could be a simple regression (having one independent and one dependent variable) or a multiple linear regression (involving more than one independent variable).
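As a minimal illustration, a simple linear regression can be sketched in Python with the statsmodels library; the advertising and sales figures below are invented purely for demonstration.

# Simple linear regression: predict sales from advertising spend (invented data)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "advertising": [10, 15, 20, 25, 30, 35],
    "sales": [110, 135, 160, 175, 205, 230],
})

model = smf.ols("sales ~ advertising", data=df).fit()   # ordinary least squares
print(model.params)                                     # intercept and slope estimates
print(model.predict(pd.DataFrame({"advertising": [40]})))  # forecast for a new input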
In addition, depending on how the dependent variable is measured (whether nominal, ordinal, count, or continuous), different regression techniques are suitable. For instance, for a binary dependent variable (i.e. a dependent variable coded 0 and 1), binary logit or binary probit regression is the appropriate technique. If the dependent variable is instead measured as a categorical variable with more than two categories (1, 2, 3, ...), nominal (multinomial) logit or probit regression is the appropriate technique, provided the categories are purely nominal, without any order, ranking, or weights assigned. If, however, the categories are ordered (ranked), then ordered logit or ordered probit regression is the appropriate technique to use.
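For the binary case, the same idea can be sketched with a logit model in statsmodels; the purchase and income data below are again invented for illustration.

# Binary logit regression: dependent variable coded 0/1 (invented data)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "purchased": [0, 0, 1, 0, 0, 1, 0, 1, 1, 1],             # binary outcome
    "income":    [15, 18, 20, 22, 25, 28, 30, 35, 40, 45],   # predictor
})

logit_model = smf.logit("purchased ~ income", data=df).fit()
print(logit_model.params)                                   # estimated coefficients
print(logit_model.predict(pd.DataFrame({"income": [27]})))  # predicted probability of purchase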
Correlation analysis
Correlation analysis is a statistical method used to discover whether there is a relationship (positive or negative) between two variables or data sets, and how strong that relationship may be.
Positive Correlation
Any score from +0.1 to +1 indicates a positive correlation, which means the two variables tend to increase together. The line of best fit, or trend line, is placed to best represent the data on the graph. In this case, it follows the data points upwards, indicating the positive correlation.
Negative Correlation
Any score between -0.1 and -1 indicates a negative correlation, which means that as one variable increases, the other decreases proportionally. The line of best fit again indicates the relationship; in these cases it slopes downwards from left to right.
No Correlation
Very simply, a score of 0 indicates that there is no correlation, or relationship, between the two
variables. The larger the sample size, the more accurate the result; this holds true no matter which formula is used.
As a rule of thumb, a correlation coefficient between ±0.7 and ±1 indicates a strong positive or negative correlation.
Correlation ≠ Causation
While a significant relationship may be identified by correlation analysis techniques, correlation
does not imply causation. The cause cannot be determined by the analysis, nor should this
conclusion be attempted. The significant relationship implies that there is more to understand and
that there are extraneous or underlying factors that should be explored further in order to search
for a cause. While it is possible that a causal relationship exists, it would be remiss of any
researcher to use the correlation results as proof of this existence.
Generally, correlation analysis:
i. Assesses the relationships between variables.
ii. Is useful in model development, as it is used to sieve out predictor variables that have a weak association and are therefore of little value to a forecasting model (see the sketch below).
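As a minimal illustration, the Pearson correlation coefficient can be computed in Python with pandas; the advertising and sales figures below are invented.

# Pearson correlation between two variables (invented data)
import pandas as pd

df = pd.DataFrame({
    "advertising": [10, 15, 20, 25, 30, 35],
    "sales":       [110, 135, 160, 175, 205, 230],
})

r = df["advertising"].corr(df["sales"])   # Pearson correlation coefficient
print(round(r, 3))                        # close to +1 -> strong positive correlation

# For model development, the full correlation matrix helps sieve out weak predictors
print(df.corr())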
Simulation:
Simulation analysis is a method in which a large number of calculations are made to estimate the possible outcomes, and their probabilities, for any chosen course of action.
In the context of business analytics, simulation analysis is a technique used to model business
processes, assess risks, and predict outcomes by creating and analyzing virtual scenarios. It
allows businesses to test different strategies, understand potential impacts, and make informed
decisions in a controlled, risk-free environment.
Key Aspects of Simulation Analysis in Business Analytics:
1. Creating Models of Business Processes:
Simulation analysis involves building models that replicate the key processes within a business,
such as sales, operations, or financial performance.
These models can incorporate variables like costs, demand, supply chain logistics, and other
operational factors.
2. Scenario Analysis and "What-If" Questions:
Businesses use simulation to explore various "what-if" scenarios by changing inputs or
conditions to see how they affect outcomes.
For example, a company might simulate the impact of a price change, a new marketing strategy,
or a disruption in the supply chain.
3. Monte Carlo Simulation:
A popular method in business analytics is Monte Carlo simulation, which uses random sampling
and statistical modeling to estimate the probability of different outcomes.
This approach is valuable for risk assessment, as it provides a range of possible results and their
likelihood, helping businesses understand the uncertainty and variability in their predictions.
4. Risk Analysis and Management:
Simulation analysis helps in identifying potential risks by showing how variations in key inputs
can impact the business.
By simulating various risk scenarios, businesses can develop strategies to mitigate potential
adverse effects.
5. Decision Support and Optimization:
Simulations support decision-making by providing insights into how different choices might
affect business performance.
For example, it can help optimize resource allocation, inventory levels, or production schedules
by simulating the outcomes of different strategies.
6. Sensitivity Analysis:
This involves examining how sensitive the results of a simulation are to changes in input
variables.
Sensitivity analysis helps identify the most critical factors that influence business outcomes,
guiding focus areas for improvement.
Applications in Business Analytics:
1. Financial Planning: Simulating different financial scenarios, such as changes in market
conditions, interest rates, or cash flow to forecast financial performance and guide
investment decisions.
2. Supply Chain Management: Modeling supply chain dynamics to optimize logistics,
inventory management, and reduce costs by predicting the impact of changes in demand
or supply disruptions.
3. Customer Behavior Analysis: Simulating customer interactions and purchase patterns to
predict the outcomes of marketing campaigns, pricing changes, or new product launches.
4. Operational Efficiency: Using simulations to model workflows and processes, identify
bottlenecks, and improve operational efficiency in manufacturing, service delivery, or
other business operations.
Example:
A retailer might use simulation analysis to predict the impact of a promotional discount on sales
volume. By modeling different discount levels, marketing spends, and consumer responses, the
retailer can identify the optimal discount that maximizes profit without excessively eroding
margins.
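A minimal Monte Carlo sketch of this kind of analysis is shown below in Python. The demand-response assumption, prices, costs, and probability distribution are all illustrative guesses, not figures from the case.

# Monte Carlo simulation of a promotional discount (all numbers are assumptions)
import numpy as np

rng = np.random.default_rng(42)
n_trials = 10_000

unit_price, unit_cost = 100.0, 60.0   # assumed base price and unit cost
base_demand = 1_000                   # assumed average demand at full price

def simulate_profit(discount: float) -> np.ndarray:
    """Simulate profit for one discount level over many random demand draws."""
    # Assumed response: each 1% of discount lifts mean demand by 4%, with noise
    mean_demand = base_demand * (1 + 4.0 * discount)
    demand = rng.normal(mean_demand, 0.10 * mean_demand, n_trials)
    price = unit_price * (1 - discount)
    return (price - unit_cost) * demand

for discount in (0.0, 0.05, 0.10, 0.20):
    profits = simulate_profit(discount)
    print(f"discount {discount:4.0%}: mean profit {profits.mean():12,.0f}, "
          f"5th-95th percentile {np.percentile(profits, 5):,.0f} to {np.percentile(profits, 95):,.0f}")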
Put simply, simulation projects the future behaviour of variables by simulating their past behaviour as captured in probability distributions.
Many other techniques and algorithms for predictive modelling exist.
Data Mining
Data mining is a discovery-driven software application process that provides insights into
business data by finding hidden patterns and relationships in big or small data and inferring rules
from them to predict future behavior. These observed patterns and rules guide decision-making.
This is not just numbers, but text and social media information from the Web. For example,
Abrahams et al. (2013) developed a set of text-mining rules that automobile manufacturers could
use to distill or mine specific vehicle component issues that emerge on the Web but take months
to show up in complaints or other damaging media. These rules cut through the mountainous
data that exists on the Web and are reported to provide marketing and competitive intelligence to
manufacturers, distributors, service centers, and suppliers. Identifying a product's defects and quickly recalling or correcting the problem before customers experience a failure reduces customer dissatisfaction when problems occur.
Data mining can be descriptive, predictive, or prescriptive. It is descriptive if the purpose is just to picture a given pattern in the data. Example: sorting sales data by gender just to identify which gender group patronizes which product; that is descriptive data mining. If, however, the sales record is used to identify a seasonal demand pattern, so as to predict when the company is likely to experience high demand, then the data mining can be categorized as predictive. Using the predicted pattern to optimize inventory levels and minimize stockouts and overstocking qualifies the mining as prescriptive. For this reason, some of the same tools used in the
descriptive analytics step may be used in the predictive step but are employed to establish a
model (either based on logical connections or quantitative formulas) that may be useful in
predicting the future.
Several methodologies for data mining exist. Depending on the type of information required, an appropriate technique may be adopted. See the sample below:
A grocery store used market basket analysis and found that men were likely to buy beer and diapers together. The store increased sales by placing the beer next to the diapers.
It sounds simple (and in many cases, it is). However, there are pitfalls to be aware of:
i. For large inventories (i.e. over 10,000 items), the combinations of items may explode into the billions, making the math almost impossible.
ii. Data is often mined from large transaction histories. Such large amounts of data are usually handled by specialized statistical software (see below).
Basic Terminology in Market Basket analysis
An itemset is the set of items a customer buys at the same time. An association rule over itemsets is typically stated as a logic rule like IF {bread, peanut butter} THEN {jelly}. An itemset can consist of anything from no items (a null set, which is usually ignored) to all items in the data set.
The support count is a count of how often the itemset appears in the transaction database.
The support is how often the itemset appears, stated as a probability. For example, if the support count is 21 out of a possible 1,000 transactions, then the support is 21/1,000, or 0.021.
The confidence is the conditional probability that the THEN items are purchased, given that the IF items have been purchased.
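These quantities are straightforward to compute. The short Python sketch below counts support and confidence for the rule IF {bread, peanut butter} THEN {jelly} over a handful of made-up transactions.

# Support and confidence for IF {bread, peanut butter} THEN {jelly} (made-up transactions)
transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk"},
    {"peanut butter", "jelly", "milk"},
    {"bread", "peanut butter", "jelly", "milk"},
]

antecedent = {"bread", "peanut butter"}   # the IF part of the rule
consequent = {"jelly"}                    # the THEN part of the rule

support_count = sum(1 for t in transactions if antecedent | consequent <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = support_count / len(transactions)      # P(bread, peanut butter, jelly)
confidence = support_count / antecedent_count    # P(jelly | bread, peanut butter)

print(f"support count = {support_count}, support = {support:.2f}, confidence = {confidence:.2f}")
# -> support count = 2, support = 0.40, confidence = 0.67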
Calculations
Calculations are rarely performed by hand, due to the large number of combinations possible from even relatively small data sets. Software that can perform market basket analysis includes:
SAS® Enterprise Miner (Association Analysis).
SPSS Modeler (Association Analysis).
R (Data Mining Association Rules).
Cluster analysis is an unsupervised learning algorithm, meaning that you don’t know how many
clusters exist in the data before running the model. Unlike many other statistical methods, cluster
analysis is typically used when there is no assumption made about the likely relationships within
the data. It provides information about where associations and patterns in data exist, but not what
those might be or what they mean.
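As a minimal illustration, hierarchical (agglomerative) clustering can be sketched in Python with SciPy; the toy customer records below are invented, and cutting the resulting hierarchy into two clusters at the end is itself an illustrative choice rather than something known in advance.

# Hierarchical cluster analysis of toy customer data (invented values)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical customer records: [annual spend, visits per month]
customers = np.array([
    [200, 2], [220, 3], [250, 2],      # apparently low-spend customers
    [900, 8], [950, 9], [1000, 10],    # apparently high-spend customers
])

Z = linkage(customers, method="ward")             # build the hierarchy of merges
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into at most two clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]: which customers group together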
CASE STUDY 2.
Suppose a grocery store has collected a big data file on what customers put into their baskets at
the market (the collection of grocery items a customer purchases at one time). The grocery store
would like to know if there are any associated items in a typical market basket. (For example, if a
customer purchases product A, she will most often associate it or purchase it with product B.) If
the customer generally purchases product A and B together, the store might only need to
advertise product A to gain both product A’s and B’s sales. The value of knowing this
association of products can improve the performance of the store by reducing the need to spend
money on advertising both products. The benefit is real if the association holds true. Finding the
association and proving it to be valid requires some analysis. From the descriptive analytics
analysis, some possible associations may have been uncovered, such as product A’s and B’s
association. With any size data file, the normal procedure in data mining would be to divide the
file into two parts. One is referred to as a training data set, and the other as a validation data set.
The training data set develops the association rules, and the validation data set tests and proves
that the rules work.
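A minimal sketch of that split in Python, using scikit-learn's train_test_split, is shown below; the data frame contents and the 70/30 split ratio are illustrative assumptions, not part of the case.

# Splitting a data file into training and validation sets (illustrative)
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical market-basket file with one row per transaction
baskets = pd.DataFrame({
    "product_A": [1, 0, 1, 1, 0, 1, 0, 1],
    "product_B": [1, 0, 1, 0, 0, 1, 1, 1],
})

# 70% of the rows develop the association rules, 30% validate them
train, validate = train_test_split(baskets, test_size=0.30, random_state=1)
print(len(train), "training rows;", len(validate), "validation rows")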