Business Analytics
What does a Business Analyst do?
Learning goals
What is BA?
Types of BA
Data for BA
Value creation
100 - Introduction
110 - Definition of Business Analytics
111 - Analytics
Analytics can be considered any data-driven process that provides insight.
Analytics
Reporting (business intelligence) & performance management tend to focus on what happened—
that is, they analyse and present historical information.
Advanced analytics aims to understand why things are happening and predict what will happen.
The distinguishing characteristic between these two is the use of higher-order statistical and
mathematical techniques including
Operations research
Parametric or nonparametric statistics
Multivariate analysis
Algorithmically based predictive models (such as decision trees, regressions, etc.)
Analytics provides insights (new knowledge) and answers to one question at a time.
BA provides relevant and valuable insights (real and measurable) given the business’s strategic
and tactical objectives.
More importantly, BA is about the sustained delivery of value to the organisation.
Consider that a model that is 80 percent accurate but can be acted on creates far more value than
an extremely accurate model that cannot be deployed.
Definition of BA
Business analytics is the use of data-driven insight to generate value. It does so by requiring
business relevancy, the use of actionable insight, and performance measurement and value
measurement.
Business analytics is a process of transforming data into actions through analysis and
insights in the context of organizational decision making and problem solving.
Business analytics is the use of data, information technology, statistical analysis, quantitative
methods, and mathematical or computer-based models to help managers gain improved
insight about their business operations and make better, fact-based decisions.
BA:
Pricing: setting prices for consumer and industrial goods, government contracts, and
maintenance contracts
Customer segmentation: identifying and targeting key customer groups in retail, insurance, and
credit card industries
Merchandising: determining brands to buy, quantities, and allocations
Location: finding the best location for bank branches and ATMs, or where to service industrial
equipment
Supply Chain Design: determining the best sourcing and transportation options and finding the
best delivery routes
Staffing: ensuring appropriate staffing levels and capabilities, and hiring the right people
Health care: scheduling operating rooms to improve utilization, improving patient flow and waiting
times, purchasing supplies, and predicting health risk factors
120 - Types of BA
121 - 3 Types of BA
What items?
When to reduce the price? What form of price discounts?
To whom
Ex:
Descriptive analytics: examine historical data for similar successful and unsuccessful sales
promotion campaigns
Predictive analytics: predict sales based on discounted prices
Prescriptive analytics: find the best sets of pricing and advertising to maximize sales revenue
Data: numerical or textual facts and figures that are collected through some type of measurement
process.
Information: result of analyzing data; that is, extracting meaning from data to support evaluation
and decision making.
Annual reports
Accounting audits
Financial profitability analysis
Economic trends
Marketing research
Operations management performance
Human resource measurements
Web behavior --> page views, visitor’s country, time of view, length of time, origin and destination
paths, products they searched for and viewed, products purchased, what reviews they read, and
many others.
Big data refers to massive amounts of business data from a wide variety of sources, much of
which is available in real time, and much of which is uncertain or unpredictable. IBM calls these
characteristics volume, variety, velocity, and veracity.
“The effective use of big data has the potential to transform economies, delivering a new wave of
productivity growth and consumer surplus. Using big data will become a key basis of competition
for existing companies, and will create new competitors who are able to attract employees that
have the critical skills for a big data world.” - McKinsey Global Institute, 2011
Types of Metrics
Measurement Scales
Examples:
A tire pressure gage that consistently reads several pounds of pressure below the true value is not
reliable, although it is valid because it does measure tire pressure.
The number of calls to a customer service desk might be counted correctly each day (and thus is
a reliable measure) but not valid if it is used to assess customer dissatisfaction, as many calls may
be simple queries.
A survey question that asks a customer to rate the quality of the food in a restaurant may be
neither reliable (because different customers may have conflicting perceptions) nor valid (if the
intent is to measure customer satisfaction, as satisfaction generally includes other elements of
service besides food).
Value creation:
Companies that put data at the center of marketing and sales decisions improve marketing ROI by
12 – 20% (McKinsey, “Big Data, Analytics, and the Future of Marketing and Sales”, 2013)
Companies that successfully use data outperform peers by up to 20% (EY, “Ready for takeoff”,
2014).
Companies that adopt data-driven decision making have output and productivity that is 5-6%
higher than what would be expected given other investments (MIT, “Strength in Numbers: How
does DDD Affect a Firm’s Performance”, 2011).
The background
The problems /opportunities
Project objectives
Success criteria
Question
Assess situations
Go into details of
Resources available
Requirements and constraints (assumptions)
Risks and contingencies
Costs and benefits
Question
Business Understanding
Determine Business Objectives
Background: PLE is a private company producing traditional lawn mowers. They developed a
medium-size diesel power lawn tractor to provide for niche markets such as large estates (golf
courses, clubs, resorts, building...)
Business Objectives: improve the business performance of the firm by improving sales and
stakeholder satisfaction
Business Success Criteria:
Improve sales in the Pacific Rim and Euro markets
Maintain market share in the North and South American markets
Increase customer and employee satisfaction rate
Assess Situation
Inventory of Resources
Requirements, Assumptions, and Constraints
Risk and Contingencies
Terminology
Costs and Benefits
Project plan
Business problems
Business goals
Resources & constraints
Data analysis goals
Initial assessments of tools and techniques
Question
High level
Low level
Explore data
Look for patterns
Use uni- or bi-variate analysis to establish relationships (often using visualisation tools)
Test hypotheses
Identify anomalies
233 - Data Preparation
Select data
Clean data
Correct, remove or ignore noise
Special values and their meaning
Outliers and aggregation
Construct data
Create new attributes from available attributes
Integrate data from multiple sources
Format data: e.g. string values to numerical values
Question
234 - Modelling
Question
Modelling approaches
Hypothesis led
Do Directors (& their staff) know everything?
Partitioning a data set is splitting the data randomly into two, sometimes three smaller data sets:
Training, Validation and Test.
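The partitioning step can be sketched in a few lines of Python. This is a minimal illustration only: the 60/20/20 proportions, the seed and the function name `partition` are assumptions, not values given in these notes.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Randomly split rows into training, validation and test sets.

    The proportions and the seed are illustrative assumptions.
    """
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # random order, reproducible via the seed
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])    # whatever remains is the test set

records = list(range(100))  # stand-in for 100 observations
train_set, validation_set, test_set = partition(records)
print(len(train_set), len(validation_set), len(test_set))
```

The model is fitted on the training set, tuned or compared on the validation set, and the test set is held back for a final check on unseen data.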
235 - Evaluation
Evaluate results
Review process:
activities missed or repeated
steps followed,
failures and misleading results, etc.
Determine next steps
Potential for deployment of each result
Potential for improvement
Alternative continuations
Refine process plan
Take ACTIONS!
Question
236 - Deployment
Plan deployment
Plan monitoring and maintenance
What could change in the environment?
How will accuracy be monitored?
When should model(s) not be used anymore?
What if business objectives of the use of the model change?
Produce final report
Review project
Question
Could be simple or complex (e.g. when embedding in an operational system to predict in real
time and automate decisions).
Important to distinguish between a model in the modelling and deployment phases:
In modelling phases, often many different models and modelling options are built and
evaluated.
In the deployment phase, often the winning models are fixed (in the short term)
On-going evaluation (monitoring) if models are to be used over time.
Some models have a longer life than others.
Development of models that adapt themselves to changing environments/circumstances.
Data: numbers or textual data that are collected through some type of measurement process
Information: result of analyzing data; that is, extracting meaning from data to support evaluation
and decision making
The nominal scale of measurement defines the identity property of data. The data can be placed
into categories. Examples: eye colour and country of birth.
This scale doesn’t have any form of numerical meaning. The data can’t be multiplied, divided,
added or subtracted from one another. It’s not possible to measure the difference between data
points.
Nominal data can be broken down again into three categories:
Nominal with order: data can be sub-categorised in order, e.g. “cold, warm, hot and very
hot.”
Nominal without order: data can be sub-categorised as nominal without order, such as male
and female.
Dichotomous: having only two categories or levels, such as “yes” and “no”.
The interval scale contains properties of nominal and ordered data, but the difference between
data points can be quantified.
This type of data shows both the order of the variables and the exact differences between the
variables.
They can be added to or subtracted from each other, but not multiplied or divided. For example,
40 degrees is not 20 degrees multiplied by two.
On the interval scale, zero is a meaningful value on the scale rather than the absence of the
quantity: for example, if you measure temperature in degrees, zero degrees is still a
temperature.
Data points on the interval scale have the same difference between them. The difference on the
scale between 10 and 20 degrees is the same between 20 and 30 degrees.
This scale is used to quantify the difference between variables, whereas the other two scales are
used to describe qualitative values only.
Ratio scales of measurement include properties from all four scales of measurement.
The data is nominal, can be classified in order, contains intervals and can be broken down into
exact value. Examples: weight, height and distance.
Data in the ratio scale can be added, subtracted, divided and multiplied.
Ratio scales also differ from interval scales in that the scale has a ‘true zero’.
◦ The number zero means that the data has no value point. An example of this is height or weight,
as someone cannot be zero centimetres tall or weigh zero kilos – or be negative centimetres or
negative kilos.
◦ Examples of the use of this scale are calculating shares or sales.
Of all types of data on the scales of measurement, data scientists can do the most with ratio data
points.
The purpose of sampling is to obtain sufficient information to draw a valid inference about a population.
Cross sectional data
data: collected from several entities at the same time
Time series data
data: collected over several time periods
--> Combine to make panel data
Primary data is collected “at source” and specifically for the research at hand.
Secondary data is that which has been previously collected for a purpose that is not specific to
the research at hand.
Examples: sales records, industry reports, and interview transcripts from past research are data
that would continue to exist whether or not the project at hand had come to fruition.
Annual reports
Accounting audits
Financial profitability analysis
Economic trends
Marketing research
Operations management performance
Human resource measurements
Web behaviours: page views, visitor’s country, time of view, length of time, origin and destination
paths, products they searched for and viewed, products purchased, what reviews they read, and
many others
334 - Big Data
Big data refers to massive amounts of business data (volume) from a wide variety of sources
(variety), much of which is available in real time (velocity), and much of which is uncertain or
unpredictable (veracity).
The effective use of big data has the potential to transform societies, economies, industries, and
organisations.
For businesses, using big data will become a key basis of competition for existing companies.
Data Reliability - data are consistent and accurate (or accurately collected/measured).
Data Validity - data correctly measures what it is supposed to measure.
Introduction to models
360 - Models in Business Analytics
A model is an abstraction or representation of a real system, idea, or object.
361 - Example
Example 1
The sales of a new product, such as a first-generation iPad or 3D television, often follow a common
pattern.
Verbal description: The rate of sales starts small as early adopters begin to evaluate a new
product and then begins to grow at an increasing rate over time as positive customer feedback
spreads. Eventually, the market begins to become saturated, and the rate of sales begins to
decrease.
Example 2
Example 3
Mathematical model:
where S is sales, t is time, e is the base of natural logarithms, and a, b and c are constants
Often we use data to estimate this equation, i.e. to estimate the values for a, b, and c.
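The equation itself is not reproduced in these notes. One form that fits the symbols listed (S, t, e, a, b, c) and the S-shaped verbal description is the Gompertz curve, S = a * e^(b * e^(c*t)). Treat this form as an assumption. The sketch below evaluates it for hypothetical constants to show the slow start, the acceleration and the saturation.

```python
import math

def gompertz_sales(t, a, b, c):
    """Sales at time t under the ASSUMED form S = a * e^(b * e^(c*t))."""
    return a * math.exp(b * math.exp(c * t))

# Hypothetical constants: a is the saturation level; b < 0 and c < 0 give an S-shape.
a, b, c = 1000.0, -5.0, -0.5

sales = [gompertz_sales(t, a, b, c) for t in range(0, 21)]
growth = [later - earlier for earlier, later in zip(sales, sales[1:])]

for t in (0, 5, 10, 20):
    print(t, round(sales[t], 1))
```

In practice the constants a, b and c would be estimated from observed sales data rather than chosen by hand.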
We abstract the essential features of the real world, leaving behind all the nonessential detail and
complexity.
We then construct our laboratory by combining our abstractions with specific assumptions and
building a model of the essential aspects of the real world.
This is the process of model formulation.
Data of
Structure refers to the logic and the mathematics that link the elements of our model together.
A simple example might be the equation P = R - C, in which profit is calculated as the difference
between revenue and cost.
Another example might be the relationship F = I + P - S, in which final inventory is calculated from
initial inventory, production, and shipments.
2. Analysis
Once built, we can use the model to test ideas & evaluate solutions.
This process applies logic to take us from our assumptions and abstractions to a set of derived
conclusions. It also relies on mathematics and reason to explore the implications of our
assumptions. This exploration process leads, hopefully, to insights about the problem confronting
us.
Sometimes, these insights involve an understanding of why one solution is beneficial, and another
is not; at other times, the insights involve understanding the sources of risk in a particular solution.
In another situation, the insights involve identifying the decisions that are most critical to a good
result, or identifying the inputs that have the strongest influence on a particular outcome.
3. Interpretation
To make the model insights useful, we must first translate them into the terms of the real world
and then communicate them to the actual decision makers involved.
Only then do model insights turn into useful managerial insights. And only then can we begin the
process of evaluating solutions in terms of their impact on the real world.
Descriptive models explain behaviour and allow users to evaluate potential decisions by asking
“what-if?” questions.
Example: An outsourcing decision model
Predictive models focus on what will happen in the future. Many predictive models are developed by
analysing historical data and assuming that the past is representative of the future.
A sales-promotion decision model in the grocery industry: managers typically need to know how
best to use pricing, coupons, and advertising strategies to influence sales.
Grocers often study the relationship of sales volume to these strategies by conducting controlled
experiments to identify the relationship between them and sales volumes. That is, they implement
different combinations of pricing, coupons, and advertising, observe the sales that result, and use
analytics to develop a predictive model of sales as a function of these decision strategies.
Prescriptive models help decision makers identify the best solution to a decision problem. “Best”
here refers to the objective of the optimisation problem at hand.
Optimization - finding values of decision variables that minimize (or maximize) something such as
cost (or profit)
Objective function - the equation for the quantity of interest that is to be minimized (or maximized)
Optimal solution - values of the decision variables at the minimum (or maximum) point
A firm wishes to determine the best pricing for one of its products in order to maximize revenue.
Analysts determined the following model:
D = a - bP
where D is the demand, P is the unit price, a is a constant that estimates the demand when the
price is zero, and b is the slope of the demand function.
Total revenue is then P × D = aP - bP^2. Identify the price that maximizes total revenue.
An alternative model assumes price elasticity is constant (constant ratio of % change in demand to % change in price):
D = c * P^(-d)
where c is the demand when the price is 1 (the function is undefined at a price of 0) and d > 0 is the price elasticity.
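For the linear demand case, assuming the model takes the form D = a - bP, total revenue is R = P * D = aP - bP^2. Setting the derivative a - 2bP to zero gives the revenue-maximizing price P* = a / (2b). The sketch below checks this with hypothetical values of a and b.

```python
def revenue(price, a, b):
    """Total revenue = price * demand, with ASSUMED linear demand D = a - b * price."""
    return price * (a - b * price)

def revenue_maximizing_price(a, b):
    """Setting dR/dP = a - 2*b*P to zero gives P* = a / (2*b), valid for b > 0."""
    return a / (2 * b)

a, b = 1000.0, 2.0  # hypothetical demand parameters
best_price = revenue_maximizing_price(a, b)
best_revenue = revenue(best_price, a, b)
print(best_price, best_revenue)
```

Because revenue is a downward-opening parabola in P when b > 0, any price above or below P* yields lower revenue.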
Mean is often used for quantitative data unless outliers exist or data is skewed.
Median is often used in conjunction with the mean since it is not affected by outliers. Comparing
mean with median gives us an idea of skewness.
Mode is mainly used for qualitative data, rarely used for numerical data. There may be no mode,
multiple modes, or the mode may not be close to the centre of the data.
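A small sketch of how the three measures react to an outlier, using Python's standard `statistics` module. The data values are hypothetical.

```python
import statistics

# Hypothetical data with one large outlier (40).
data = [2, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(data)      # pulled upwards by the outlier
median_value = statistics.median(data)  # middle value, not affected by the outlier
modes = statistics.multimode(data)      # list of the most frequent value(s)

print(mean_value, median_value, modes)

# A mean well above the median suggests the data is skewed to the right.
right_skewed = mean_value > median_value
```

Here the mean sits well above the median, which is the comparison the notes suggest for gauging skewness.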
Excel’s Aggregate function
Skewness
Dispersion = Variation = Spread: refers to the degree of variation in the data; that is, the numerical
spread (or compactness) of the data.
1. Range
2. Interquartile Range
3. Percentiles
4. Standard deviation
5. Coefficient of variation
Range: the difference between the minimum and maximum value in the data – sensitive to outliers
Interquartile Range: the range of the middle 50% of the data – the difference between the third
quartile and first quartile in the data (Q3 minus Q1) – not sensitive to outliers
3. Percentiles
The position in the dataset where p% of observations are below it and (100-p)% are above it, when
ordered from smallest to largest
Useful for analysing specific points along the distribution
Most common percentiles are quartiles (i.e. 25th, 50th, 75th percentiles) or deciles (i.e. 10th,
20th,…, 90th percentiles)
More extreme percentiles are affected by outliers
=PERCENTILE.EXC( datarange , percentile )
Make sure you put the percentile in as a fraction (e.g. 20th percentile is 0.2)
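The range, interquartile range and percentiles can also be sketched in Python. The data is hypothetical. The `method="exclusive"` option of `statistics.quantiles` uses (n + 1)-based positions, which should correspond to the PERCENTILE.EXC convention, although this is worth checking against Excel on your own data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations

# Range: maximum minus minimum (sensitive to outliers).
data_range = max(data) - min(data)

# n=4 returns the three quartile cut points (25th, 50th, 75th percentiles).
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1  # spread of the middle 50% of the data

# n=10 returns the nine decile cut points (10th, 20th, ..., 90th percentiles).
deciles = statistics.quantiles(data, n=10, method="exclusive")
p20 = deciles[1]  # 20th percentile

print(data_range, q1, q3, iqr, p20)
```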
4. Standard deviation
Difficult to interpret on its own, but assuming the data is approximately bell-shaped (normally
distributed):
68% of observations are situated within ± 1 standard deviation from the mean
95% of observations are situated within ± 2 standard deviations from the mean
99.7% of observations are situated within ± 3 standard deviations from the mean
= STDEV.S( datarange )
5. Coefficient of Variation
The coefficient of variation (CV) expresses the standard deviation of data relative to (divided by)
its mean
Useful for comparisons of variation across different sets of data (e.g. between returns on different
investments)
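A minimal sketch of the coefficient of variation, comparing two hypothetical series measured on very different scales. `statistics.stdev` is the sample standard deviation, the same quantity as STDEV.S.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical series: B has the larger standard deviation but the smaller relative spread.
series_a = [5, 7, 9, 11, 13]
series_b = [90, 95, 100, 105, 110]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)
print(round(cv_a, 3), round(cv_b, 3))
```

Series B varies more in absolute terms, yet relative to its mean it is the more stable of the two, which is the kind of comparison the CV is meant for.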
Sometimes we are interested in seeing where individual observations sit relative to the mean.
The Z-score tells us how many standard deviations away from the mean an observation sits
Use the =STANDARDIZE( x , mean , stdev ) function in Excel
a z-score of 1.0 (a positive value) means that the observation is one standard deviation
above the mean;
a z-score of -1.5 means that the observation is 1.5 standard deviations below the mean.
Useful for checking if individual observations are outliers.
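The z-score calculation can be sketched as follows. The data and the cut-off of 2 standard deviations are illustrative assumptions (3 is another common choice); the function mirrors what =STANDARDIZE( x , mean , stdev ) computes.

```python
import statistics

def standardize(x, mean, stdev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / stdev

values = [10, 12, 11, 13, 12, 11, 30]  # hypothetical observations
mean_value = statistics.mean(values)
stdev_value = statistics.stdev(values)  # sample standard deviation

z_scores = [standardize(v, mean_value, stdev_value) for v in values]

threshold = 2.0  # assumed cut-off for flagging possible outliers
outliers = [v for v, z in zip(values, z_scores) if abs(z) > threshold]
print(outliers)
```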
Outliers
Whether to remove outliers is contentious and depends on the context.
If we study income or wealth inequality, we definitely do not remove (mild) outliers.
But if we assess whether education affects income, it is reasonable to remove outliers, and
extreme outliers should definitely be removed.
430 - Measures of Association
Real-world questions
A plot to gauge correlation by looking at how closely all the data points sit to the line of best fit.
Two variables have a strong statistical relationship with one another if they appear to move
together.
When two variables appear to be related, you might suspect a cause-and-effect relationship.
Sometimes, however, statistical relationships exist even though a change in one variable is not
caused by a change in the other.
432 - Covariance
Covariance is a measure of the linear association between two variables, X and Y. Like the variance,
different formulas are used for populations and samples.
Population covariance
Sample covariance
The covariance between X and Y is the average of the product of the deviations of each pair of
observations from their respective means.
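The two formulas are not reproduced in these notes. The standard definitions divide the sum of the products of deviations by N for a population and by n - 1 for a sample. A minimal sketch with hypothetical data:

```python
def covariance(x, y, sample=True):
    """Covariance of paired observations.

    Divides by n - 1 for a sample and by N for a population.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be the same length with at least two pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return products / (len(x) - 1 if sample else len(x))

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

sample_cov = covariance(x, y, sample=True)
population_cov = covariance(x, y, sample=False)
print(sample_cov, population_cov)
```

A positive value means the two variables tend to move together; the size of the value depends on the units of X and Y, which is why correlation is usually easier to interpret.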
433 - Correlation
Correlation is a measure of the linear relationship between two variables, X and Y, which does not
depend on the units of measurement.
Correlation is measured by the correlation coefficient, also known as the Pearson product moment
correlation coefficient.
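A minimal sketch of the Pearson correlation coefficient, computed as the covariance divided by the product of the two standard deviations. The data is hypothetical; the last lines illustrate that rescaling a variable (changing its units) leaves the coefficient unchanged.

```python
import math

def pearson_correlation(x, y):
    """Pearson r: sum of deviation products over the root of the two sums of squares."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

r = pearson_correlation(x, y)

# Changing the units of x (e.g. metres to centimetres) does not change r.
r_rescaled = pearson_correlation([xi * 100 for xi in x], y)
print(round(r, 4), round(r_rescaled, 4))
```

The n or n - 1 divisors cancel between numerator and denominator, so the same value results whether the data is treated as a sample or a population.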
When using the CORREL function, it does not matter if the data represent samples or populations. In
other words,
Excel computes the correlation coefficient between all pairs of variables in the Input Range. Input
Range data must be in contiguous columns.
Interpreting Correlation Coefficient
For example:
Caution
When two variables appear to be related, you might suspect a cause-and-effect relationship.
Sometimes, however, statistical relationships exist even though a change in one variable is not
caused by a change in the other.
Correlation does NOT imply CAUSATION
Dependence:
Variables are dependent on each other if the value of one variable gives information about
the distribution of the other.
What are key statistics of a distribution? For example normal distribution?
Is that statistical correlation always meaningful, especially for prediction purposes? (i.e. predictive
analytics)
Remember that “correlation does not imply causation”
520 - Causality
Causality describes a relationship between two (or more) things (phenomena, events, variables,
etc.) in which a change in one causes a change in another.
In this diagram, A causes B under certain conditions.
So, if we observe an effect, we can infer that there is a cause prior to the effect.
If there is a cause, the effect will not necessarily come about.
But if a cause is present and all the other required conditions are met, it is very likely that
the cause will produce its effect(s).
1. Situational assessment
Consider any business situation (i.e. business problem that needs to be solved)
We would like to assess that situation, then we often ask “how did that happen?”
Often used in Root Cause Analysis
2. Interventions
analyse the equipment’s data output under regular “healthy” operating conditions,
detect “anomalies” (i.e. any pattern of deviation from “healthy” conditions),
to predict the “behavioral” pattern of the anomaly,
if the predicted values exceed the “normal” threshold, an alert is sent.
Applications: early detection of safety issues, machine failures, more efficient electrical consumption,
predicting quality deviation, adjusting process to prevent material waste, etc.
We can convert circles into directed acyclic graphs in which we have a time dimension.
1. Chain
2. Fork
3. Collider
Causal associations
Non-causal associations
Suppose you are asked to make an assessment of the size of the market for laptop computers.
The following variables are relevant:
Do you expect that higher advertising expenditure will lead to higher sales (market volume)?
But how about the impact of advertising and number of customers on sales?
Also how about the effects of advertising and media hype on sales?
Now we put all elements together, this is our causal model for situational assessment.
Note that there is no (business) goal / objective in terms of optimisation or decision making.
Rather it assesses how causal factors affect the market value.
Example 2: instead of doing a situational assessment, you are now asked to decide how much to spend
on advertising for these products.
You need to set an objective, e.g. high market share (the proportion of sales through your retailers
to the total number sold).
So the decision variable is “Advertise”.
Simplify the intervention decision: (1) run an advertising campaign or (2) do not run one.
Further simplify that you will know the price at the time you set “Advertise”.
530 - Influence diagram
Often, rectangle shape refers to strategic option (i.e. decision point, choice variable, value
directly controlled by a strategic agent – decision making agent)
Hexagon shape refers to objective (e.g. profit, value, market share, etc.). Decisions are made to
optimise the objective.
Circle shape refers to probabilistic variables that are chance variables, uncertain quantities,
environmental factors and other elements outside the direct control of strategic agents.
540 - From causal diagrams to mathematical equations
This equation fails to capture the actual relationship among independent variables (x1 --> x6)
541 - Shortcomings
Consider X1, X2 and X4: associations among these variables are clear, so we say that this
model suffers from a multicollinearity problem.
Also, we cannot use standard significance tests to reliably determine which independent
variables exert the most influence.
1. Stage 1
2. Stage 2: using estimated value of the independent variable obtained from stage 1 regression.
Summary
Causal relationships are crucial for (1) situational assessments and (2) interventions, as part
of business analytics.
If there is a cause-and-effect relationship between two variables x and y, there is statistical
association.
But (statistical) correlation/association does not necessarily imply causation.
Causal thinking and graphs are very useful because
They capture both causality and statistical association
They assist with both situational assessment and intervention tasks in business analytics
From a managerial perspective, they allow identification of relevant stakeholders (agents,
people, departments, etc.) involved in analytics projects, as well as resource allocation.
Happiness/satisfaction matters in every corner of our lives: overall life, work, school, business, etc.
Overall aims are to increase satisfaction.
Situational analysis informs interventions: how?
Discuss the following questions from your own experience and knowledge
What makes you happy = what are the causes of your own happiness?
What makes you sad = what are the causes of your own sadness?
Right click on the data series and choose Add Trendline from the pop-up menu
Check the boxes Display Equation on chart and Display R-squared value on chart
Residuals are the observed errors associated with estimating the value of the dependent variable
using the regression line.
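A minimal sketch of fitting a regression line by least squares and computing its residuals. The data is hypothetical and is not the happiness data set used in the example.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a simple linear regression."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [1, 2, 3, 4, 5]  # hypothetical independent variable
y = [2, 4, 5, 4, 5]  # hypothetical dependent variable

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]

# Residual = observed value minus the value estimated by the regression line.
residuals = [yi - pi for yi, pi in zip(y, predicted)]
print(round(intercept, 3), round(slope, 3), [round(e, 3) for e in residuals])
```

With an intercept in the model, least-squares residuals sum to zero, so a residual plot should scatter around zero without a visible pattern if the linear form is adequate.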
We would like to test whether the coefficient on log(GDP) is statistically significantly different from zero.
If Coefficient (β1) = 0, what does this mean?
If Coefficient (β1) ≠ 0, what does this mean? (you should consider one tailed tests)
Test statistics:
P-value approach
Confidence intervals (Lower 95% and Upper 95% values in the output) provide information about
the unknown values of the true regression coefficients, accounting for sampling error.
For this example, a 95% confidence interval for the income variable is [0.638;0.845].
Prediction
if the true population parameters are at the extremes of the confidence intervals, the estimate might be
as low as
-1.411 + 0.638 * 9.392 = 4.581
or as high as
-1.411 + 0.845 * 9.392 = 6.525
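The two bounds above can be reproduced directly. The variable names are illustrative, and 9.392 is assumed to be the value of the independent variable at which the prediction is made.

```python
intercept = -1.411                      # estimated intercept from the output
lower_coef, upper_coef = 0.638, 0.845   # ends of the 95% confidence interval
x_value = 9.392                         # assumed value of the independent variable

low_estimate = intercept + lower_coef * x_value
high_estimate = intercept + upper_coef * x_value
print(round(low_estimate, 3), round(high_estimate, 3))
```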
Residual analysis
A linear regression model with more than one independent variable is called a multiple linear
regression model.
ANOVA tests for significance of the entire model. That is, it computes an F-statistic testing the
hypotheses
α2 = 0.175, p-value = 0.074
If wealth (log of GDP per capita) increases by 1 unit, holding all the other independent variables
constant, the value of happiness will increase by 0.175, significant at the 10% level.
α3 = 3.55, p-value = 0.000
If social support increases by 1 unit, holding all the other independent variables constant, the
value of happiness will increase by 3.55, significant at the 1% level.
Some argue that a good regression model should include only significant independent variables.
But not always clear exactly what will happen when we add or remove variables from a model:
variables that are (or are not) significant in one model may (or may not) be significant in another.
Should not consider dropping all insignificant variables at one time,
Should take a more structured approach.
Adding an independent variable to a regression model often increases the value of R-square
Adjusted R-square reflects both the number of Xi variables and sample size.
Adjusted R-square may either increase or decrease when an Xi variable is added or dropped.
An increase in adjusted R-square indicates the model has improved.
But some prefer models that are simpler (i.e. having fewer Xi variables) when there are only minor
differences in the adjusted R-square scores.
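The usual formula, which is not shown in these notes, is adjusted R-square = 1 - (1 - R-square) * (n - 1) / (n - k - 1), where n is the sample size and k the number of Xi variables. The sketch below uses hypothetical numbers to show a case where adding a variable raises R-square slightly but lowers adjusted R-square.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-square for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than independent variables plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical comparison: a fourth variable lifts R-square only from 0.800 to 0.801.
adj_three_vars = adjusted_r_squared(0.800, n=50, k=3)
adj_four_vars = adjusted_r_squared(0.801, n=50, k=4)
print(round(adj_three_vars, 4), round(adj_four_vars, 4))
```

The penalty for the extra variable outweighs the tiny gain in fit here, so the simpler model would be preferred on this criterion.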
A linear regression model with more than one independent variable is called a multiple linear
regression model.
Few examples:
A dummy variable indicates whether an observation belongs to a particular category in the data.
Suppose X1 is a dummy variable for gender: X1 takes a value of either zero (not female) or one
(female)
2. Four dummy variables for five categories
Department (nominal)
1 --> Admin
2 --> Production
3 --> Sales
4 --> R&D
5 --> Warehouse
3. General rules for dummy variable number
When a categorical variable has k (> 2) levels/categories, we need to add (k−1) additional dummy
variables to the model.
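A minimal sketch of the k - 1 rule using the five departments listed above. Treating the first category (Admin) as the baseline is an assumption made for illustration; any category could serve as the baseline.

```python
def dummy_encode(values, categories):
    """Encode each value as k - 1 dummy variables; the first category is the baseline."""
    baseline, *others = categories
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value}")
        # The baseline category is represented by all dummies being zero.
        rows.append({f"is_{c}": int(value == c) for c in others})
    return rows

departments = ["Admin", "Production", "Sales", "R&D", "Warehouse"]
observations = ["Admin", "Sales", "Warehouse"]  # hypothetical employees

encoded = dummy_encode(observations, departments)
for department, row in zip(observations, encoded):
    print(department, row)
```

Including a fifth dummy for the baseline would make the dummies sum to one for every observation, which duplicates the intercept and prevents the coefficients from being estimated separately.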
Curvilinear models may be appropriate when scatter charts or residual plots show nonlinear
relationships.
A second order polynomial might be used:
Y = β0 + β1X + β2X^2 + ε
Here β1 represents the linear effect of X on Y and β2 represents the curvilinear effect.
Case: Months employed and Sales Data: Reynolds Quadratic regression model
Let’s estimate the maximum sales!!
The knot/ the breakpoint: the value of the independent variable at which the relationship between the
independent variable and the dependent variable changes
X1 = -β1 / (2β2)
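A minimal sketch of locating the turning point of a fitted quadratic, assuming a model of the form Y = β0 + β1*X + β2*X^2. The coefficients below are hypothetical and are not the Reynolds estimates. The turning point is a maximum only when β2 is negative.

```python
def turning_point(b1, b2):
    """Value of X at which the quadratic b0 + b1*X + b2*X**2 peaks or bottoms out."""
    if b2 == 0:
        raise ValueError("b2 must be non-zero for a quadratic model")
    return -b1 / (2 * b2)

def predict(x, b0, b1, b2):
    return b0 + b1 * x + b2 * x ** 2

# Hypothetical coefficients with b2 < 0, so the curve has a maximum.
b0, b1, b2 = 10.0, 4.0, -0.125

x_star = turning_point(b1, b2)
max_y = predict(x_star, b0, b1, b2)
print(x_star, max_y)
```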
Difference in mean sales between advertising expenditures of $50,000 and $100,000 depends on the
price of the product.
At higher selling prices, the effect of increased advertising expenditure diminishes.
For example:
Income depends on MBA and Age but you believe that the effect of Age on income differs
between two groups: with and without MBA degree
If MBA = 0, then
632 - Categorical by Categorical
For example:
The coefficient of the interaction term of two categorical variables controls for the differences in the
union of the two groups. This coefficient acts as a constant that shifts the model if the observation is in
the union of the two categorical variables.
For example
Clearly, the partial effect of one of the continuous variables in the interaction term depends on the size
of the term it is interacted with.
Multicollinearity occurs when we have two or more independent variables that are highly
correlated with one another.
If those variables capture similar things, for example income and disposable income or body size
and weight, then we should consider using only one (the better) variable.
Multicollinearity affects the statistical significance of t-tests but does not have much impact on
predictive power.
If you have two independent variables that are highly correlated with one another, think about:
whether they measure the same thing
They could be results of a chain or a fork
Avoiding multiple variables that capture the same thing helps us avoid
(1) multicollinearity problems (i.e. when the model is used for situational assessment) and
(2) “over-fitting the model” in predictive analytics.
Shortcomings
Consider X1, X2 and X4: associations among these variables are clear, so we say that this
model suffers from a multicollinearity problem.
Also, we cannot use standard significance tests to reliably determine which independent variables
exert the most influence.
Possible to use structural equation model (SEM) (stepwise regression) via a two-stage regression.
1. Stage 1
2. Stage 2: using estimated value of the independent variable obtained from stage 1 regression.
650 - Assessment of model assumptions