
up::

tags:: #data_analysis #business_analytics

Business Analytics
What does a Business Analyst do?

Learning goals

What is BA?
Types of BA
Data for BA
Value creation

100 - Introduction
110 - Definition of Business Analytics

111 - Analytics

Analytics can be considered any data-driven process that provides insight.

Data Science & Business Analytics


Analytics: data-driven process that provides business insight --> objectives of the firm
Data Science: strong in data and technical

Analytics

Reporting (business intelligence) & performance management tend to focus on what happened—
that is, they analyse and present historical information.
Advanced analytics aims to understand why things are happening and predict what will happen.
The distinguishing characteristic between these two is the use of higher-order statistical and
mathematical techniques including
Operations research
Parametric or nonparametric statistics
Multivariate analysis
Algorithmically based predictive models (such as decision trees, regressions, etc.)

112 - Business Analytics

Business analytics leverages all forms of analytics to achieve business outcomes.


Business analytics adds to analytics by requiring:
Business relevancy – BA makes a distinction between relevant and irrelevant knowledge for
improved business operations.
Actionable insight – BA to identify insights that can guide actions.
Performance measurement – Performance, or improvement in performance, needs to be
“measurable”
Value measurement – BA needs to create value.

Analytics vs Business Analytics

Analytics provides insights (new knowledge) and answers to a question (one at a time)
BA provides relevant and valuable insights (real and measurable) given the business’s strategic
and tactical objectives.
More importantly, BA is about the sustained delivery of value to the organisation.
A model that is 80 percent accurate but can be acted on creates far more value than
an extremely accurate model that cannot be deployed.
Definition of BA

Business analytics is the use of data-driven insight to generate value. It does so by requiring
business relevancy, the use of actionable insight, and performance measurement and value
measurement.
Business analytics is a process of transforming data into actions through analysis and
insights in the context of organizational decision making and problem solving.
Business analytics is the use of data, information technology, statistical analysis, quantitative
methods, and mathematical or computer-based models to help managers gain improved
insight about their business operations and make better, fact-based decisions.

(Business) Analytics is the use of:

data,
information technology,
statistical analysis,
quantitative methods, and
mathematical or computer-based models
to help managers gain improved insight about their business operations and make better, fact-
based decisions.

BA:

Transform data into actions (problem solving and decision making)


Use data, build models to improve insights

Discussion: Why do you need data?

Step 1: Use data


Step 2: Define value
Step 3: Use data to generate value (Value creation)

113 - Examples of applications

Pricing: setting prices for consumer and industrial goods, government contracts, and
maintenance contracts
Customer segmentation: identifying and targeting key customer groups in retail, insurance, and
credit card industries
Merchandising: determining brands to buy, quantities, and allocations
Location: finding the best location for bank branches and ATMs, or where to service industrial
equipment
Supply Chain Design: determining the best sourcing and transportation options and finding the
best delivery routes
Staffing: ensuring appropriate staffing levels and capabilities, and hiring the right people
Health care: scheduling operating rooms to improve utilization, improving patient flow and waiting
times, purchasing supplies, and predicting health risk factors

Case study: Netflix

Netflix is a data company


Use data to improve business services to recommend the best suited content, tailored for each
user

120 - Types of BA
121 - 3 Types of BA

Descriptive analytics: understand past and current performance to make decisions
Predictive analytics: predict the future by examining historical data
Prescriptive analytics: identify the best alternatives to minimize or maximize objectives

| | Descriptive | Predictive | Prescriptive |
|---|---|---|---|
| Questions | What happened? What is happening? | What will happen? Why will it happen? | What should I do? Why should I do it? |
| Enables | Business reporting; Dashboards; Scorecards; Data warehousing | Data mining; Text mining; Web/media mining; Machine learning | Optimization; Simulation; Decision modelling; Network science |
| Outcomes | Well-defined business problems and opportunities | Accurate projections of future events and outcomes | Accurate projections of future events and outcomes |

Discussion: Retail markdown decisions

Most department stores clear seasonal inventory by reducing prices.


Business problems
Business questions
Value created by any possible BA

What items?
When to reduce the price? What form of price discounts?
To whom?
Ex:

Descriptive analytics: examine historical data for similar successful and unsuccessful sales
promotion campaigns
Predictive analytics: predict sales based on discounted prices
Prescriptive analytics: find the best sets of pricing and advertising to maximize sales revenue

122 - Visual Perspective of Business Analytics


Learned Topics

123 - Drivers of BA growth

1. A need to make better decisions (how about faster decisions?)


2. Cultural change towards evidence-based management
3. The availability of large volumes of data, or “big data”
4. Advances in computing power and machine learning algorithms.
124 - BA capability is very crucial

130 - Data for Business Analytics

131 - Data and Information

Data: numerical or textual facts and figures that are collected through some type of measurement
process.
Information: result of analyzing data; that is, extracting meaning from data to support evaluation
and decision making.

132 - Examples of Data Sources and Uses

Annual reports
Accounting audits
Financial profitability analysis
Economic trends
Marketing research
Operations management performance
Human resource measurements
Web behavior --> page views, visitor’s country, time of view, length of time, origin and destination
paths, products they searched for and viewed, products purchased, what reviews they read, and
many others.

133 - Data Sets and Databases

Data set - a collection of data.


Examples: Marketing survey responses, a table of historical stock prices, and a collection of
measurements of dimensions of a manufactured item.
Database - a collection of related files containing records on people, places, or things.
A database file is usually organized in a two-dimensional table, where the columns
correspond to each individual element of data (called fields, or attributes), and the rows
represent records of related data elements.

A Sales Transaction Database File


134 - Big Data

Big data refers to massive amounts of business data from a wide variety of sources, much of
which is available in real time, and much of which is uncertain or unpredictable. IBM calls these
characteristics volume, variety, velocity, and veracity.
“The effective use of big data has the potential to transform economies, delivering a new wave of
productivity growth and consumer surplus. Using big data will become a key basis of competition
for existing companies, and will create new competitors who are able to attract employees that
have the critical skills for a big data world.” - McKinsey Global Institute, 2011

135 - Metrics and Data Classification

Metric - a unit of measurement that provides a way to objectively quantify performance.


Measurement - the act of obtaining data associated with a metric.
Measures - numerical values associated with a metric.

Types of Metrics

Discrete metrics - one that is derived from counting something.


For example, a delivery is either on time or not; an order is complete or incomplete; or an
invoice can have one, two, three, or any number of errors. Some discrete metrics would be
the proportion of on-time deliveries; the number of incomplete orders each day, and the
number of errors per invoice.
Continuous metrics are based on a continuous scale of measurement.
Any metrics involving dollars, length, time, volume, or weight, for example, are continuous.

Measurement Scales

Nominal data (Categorical) - sorted into categories according to specified characteristics.


Ordinal data - can be ordered or ranked according to some relationship to one another.
Interval data - ordinal but have constant differences between observations and have arbitrary
zero points.
Ratio data - continuous and have a natural zero.
136 - Data Reliability and Validity

Data Reliability - data are accurate and consistent.


Data Validity - data correctly measures what it is supposed to measure.

Examples:

A tire pressure gage that consistently reads several pounds of pressure below the true value is not
reliable, although it is valid because it does measure tire pressure.
The number of calls to a customer service desk might be counted correctly each day (and thus is
a reliable measure) but not valid if it is used to assess customer dissatisfaction, as many calls may
be simple queries.
A survey question that asks a customer to rate the quality of the food in a restaurant may be
neither reliable (because different customers may have conflicting perceptions) nor valid (if the
intent is to measure customer satisfaction, as satisfaction generally includes other elements of
service besides food).

140 - Value Creation


141 - Definition of Value Creation

Value: what does this mean to you?

Value creation:

New value creation: BA can create something entirely new


New product, new service, new market
More value creation: BA can make existing process more efficient
Lower costs --> (holding revenues constant) --> more profit
Higher revenue (through higher prices, more quantities of sales, etc.) --> (holding cost
increases less) --> more profit
Better value creation: BA can make existing process “better”
Better customer services through higher product quality, better quality control, faster
responses to customers’ needs, etc.

142 - The value of BA: without action there is no value

Companies that put data at the center of marketing and sales decisions improve marketing ROI by
12 – 20% (McKinsey, “Big Data, Analytics, and the Future of Marketing and Sales”, 2013)
Companies that successfully use data outperform peers by up to 20% (EY, “Ready for takeoff”,
2014).
Companies that adopt data-driven decision making have output and productivity that is 5-6%
higher than what would be expected given other investments (MIT, “Strength in Numbers: How
does DDD Affect a Firm’s Performance”, 2011).

143 - The Gartner Business Value Model

3 business aspects (aggregate measures), each with many prime metrics

Values: the perspective of firms

To increase the efficiency of converting inputs into value-added outputs


Cost, revenue, profits, productive efficiency, productivity, profitability
Innovations, new products and services, etc.
Market share and other competitive advantages including customer relationships
To deliver a return to shareholders
To jointly benefit other stakeholders in the supply chain and the economy

Values: the perspective of public sectors

To intervene to mediate and resolve various forms of market failure


Natural environment: pollution, natural resources, natural disasters, climate change, etc.
Health and safety and others
To increase social welfare
To provide for national security and economic stability

200 - Business in practice and The CRISP-DM framework


210 - Definition of CRISP-DM framework

CRISP-DM – CRoss Industry Standard Process for Data Mining.


Developed in the 1990s by a group of five companies: SPSS, Teradata, Daimler AG, NCR, and
OHRA.
An open standard process model that describes common approaches used by data mining experts
and is the most widely-used analytics model.

220 - Data Mining


Data mining is a complex process of examining large sets of data for identifying patterns and
then using them for valuable business insights.
Data mining can be considered part of descriptive and predictive analytics.
In descriptive analytics, data-mining tools identify patterns in data.
In predictive analytics, data-mining tools predict the future or make better decisions that
will impact future performance.

230 - A CRISP-DM framework


The arrows show the most important dependencies between stages.
This sequence is not fixed.
The cycle reflects the ongoing nature of data mining work
231 - Business Understanding

Determine business objectives

The background
The problems /opportunities
Project objectives
Success criteria

Question

What are business problems or opportunities?


Set clear objectives (there could be more than one objective)
Agree on success criteria for BA proposal (initiatives)
What types of analytics involved?

Assess situations

Go into details of

Resources available
Requirements and constraints (assumptions)
Risks and contingencies
Costs and benefits

Question

Identify and document what resources are available?


Data
Staff & experience
Budget
What resources are required and then identify constraints
Other risks involved?
Evaluate costs and benefits
Understand how action can be taken based on the likely outcomes (how to deploy?)
Not only in development stage but also in deployment stage

Case study: PLE

Business Understanding
Determine Business Objectives
Background: PLE is a private company producing traditional lawn mowers. They developed a
medium-size diesel-powered lawn tractor to serve niche markets such as large estates (golf
courses, clubs, resorts, buildings...)
Business Objectives: improve the business performance of the firm by improving sales and
stakeholder satisfaction
Business Success Criteria:
Improve sales in the Pacific Rim and Euro markets
Maintain market shares in North and South America markets
Increase customer and employee satisfaction rate

Assess Situation
Inventory of Resources
Requirements, Assumptions, and Constraints
Risk and Contingencies
Terminology
Costs and Benefits

Project plan

Business problems
Business goals
Resources & constraints
Data analysis goals
Initial assessments of tools and techniques

Question

Data analysis goals


Tools and techniques
Classification of customers
Linear regression, logistic regression, regression tree
Optimisation (what levels of incentives to obtain higher total revenue or total profit)

232 - Data Understanding

Collect initial data


Describe data
Data volume and properties
Accessibility and availability of attributes
Attribute types, range, correlation and identities
Basic descriptive analysis
Explore data
Visualise and identify relationships
Verify data quality

Question

High level

Identify data sources and data fields


Review data strategy and documentation
What data are relevant and in what formats (database, text files, excel etc.)
Crucially, target data fields that map to business/analytical objectives, e.g.

Low level

Explore data
Look for patterns
Use uni- or bi-variate analysis to establish relationships (often using visualisation tools)
Test hypotheses
Identify anomalies
233 - Data Preparation

Select data
Clean data
Correct, remove or ignore noise
Special values and their meaning
Outliers and aggregation
Construct data
Create new attributes from available attributes
Integrate data from multiple sources
Format data: e.g. string values to numerical values

Question

Data Understanding helps design this step.


This could take more time than expected, especially for newer projects
Create new data variables (e.g. revenue / cost = profitability)

234 - Modelling

Select modelling techniques: e.g. regression


Generate test design: e.g. splitting data into training, test and validation sets
Build models
Assess models

Question

Apply a variety of modelling techniques


Choice of techniques depends on
Analytical and business objectives
Data quantities and data types
Modelling approaches:
Hypothesis led: theories, knowledge or experience tell us what variables (data fields) to
use
Data led: add more fields at the beginning and incrementally reduce them, or let the
algorithms do that.
In many cases it is best to combine these two approaches.
Assess models against agreed objectives and success criteria.

Modelling approaches

Hypothesis led
Do Directors (& their staff) know everything?

Partitioning a data set is splitting the data randomly into two, sometimes three smaller data sets:
Training, Validation and Test.

Training: The subset of data used to create (build) a model


Validation: the subset of data that remains unseen when building the model and is used to tune
the model parameter estimates.
Test (hold-out): A subset of data used to measure overall model performance and compare the
performance among different candidate models.
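
A minimal sketch of this partitioning in Python, assuming scikit-learn is available (the 60/20/20 split and the synthetic data are illustrative, not part of the lecture):

```python
# Split a hypothetical data set into 60% training, 20% validation and 20% test.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))   # hypothetical features
y = rng.normal(size=1000)        # hypothetical target

# First hold out the test set, then split the remainder into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```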

235 - Evaluation

Evaluate results
Review process:
activities missed or repeated
steps followed,
failures and misleading results, etc.
Determine next steps
Potential for deployment of each results
Potential for improvement
Alternative continuations
Refine process plan
Take ACTIONS!

Question

Essential that the models are tested against unseen data.


Evaluate against the success criteria agreed in the Business Understanding phase.
Evaluate how well the model performs against a given value criterion, e.g. revenue.

236 - Deployment

Plan deployment
Plan monitoring and maintenance
What could change in the environment?
How will accuracy be monitored?
When should model(s) not be used anymore?
What if business objectives of the use of the model change?
Produce final report
Review project

Question

Could be simple or complex (e.g. when embedding in an operational system to predict in real
time and automate decisions).
Important to distinguish between a model in the modelling and deployment phases:
In modelling phases, often many different models and modelling options are built and
evaluated.
In the deployment phase, often the winning models are fixed (in the short term)
On-going evaluation (monitoring) if models are to be used over time.
Some models have a longer life than others.
Development of models that adapt themselves to changing environments/circumstances.

240 - CRISP-DM process reprised


The process is iterative.
Meaning: doing again and again to improve both the process and outcomes.
The order of stages in the process is not fixed.

IBM’s new ASUM-DM framework that extends CRISP-DM


Steps if one variable is found to be insignificant

1. Check whether the measurement of the variable is correct


2. If the check shows no error, the variable is indeed insignificant

Systematic Model Building Approach

1. Consider causal graphs


2. Descriptive analysis & checking out for outliers in both Y and X variables
3. Correlation matrix of all available variables
4. Construct a model with all available independent variables and examine the value of coefficients
and p-values for each coefficient.
5. If a p-value > 10%, consider removing that variable and run step 4 again. Check the adjusted
R-squared again (a code sketch of steps 4-6 follows this list).
6. Once majority (or all) x variables are statistically significant and the signs of coefficients are
consistent with expectations, then you are closer to a good model.
7. Check all assumptions (next week learning)
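
A hedged sketch of steps 4-6 using Python's statsmodels; the file name, column names, and the 10% cut-off below are assumptions for illustration only:

```python
# Fit a model with all candidate predictors, drop the least significant predictor
# while its p-value exceeds 0.10, and watch the adjusted R-squared as variables leave.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("sales.csv")                         # hypothetical data set
y = df["sales"]
X = sm.add_constant(df[["price", "advertising", "promotion", "store_size"]])

while True:
    model = sm.OLS(y, X).fit()
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()
    if pvals[worst] <= 0.10:                          # all remaining predictors significant
        break
    print(f"dropping {worst}; adj R-squared was {model.rsquared_adj:.3f}")
    X = X.drop(columns=[worst])                       # remove and re-run step 4

print(model.summary())                                # check signs and p-values (step 6)
```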
300 - Overview of using data

Two main goals

Data types, sources, privacy


Introduction to models

Data types, sources, privacy


310 - Overview of using data
311 - Data for Business Analytics

Data: numbers or textual data that are collected through some type of measurement process
Information: result of analyzing data; that is, extracting meaning from data to support evaluation
and decision making

312 - Metrics and Data Classification

Metric - a unit of measurement that provides a way to objectively quantify performance.


Measurement - the act of obtaining data associated with a metric.
Measures - numerical values associated with a metric.

313 - Types of Metrics

Discrete metrics - one that is derived from counting something.


A delivery is either on time or not
An order is complete or incomplete
An invoice can have one, two, three, or any number of errors
The number of incomplete orders each day (or the number of errors per invoice)
Continuous metrics are based on a continuous scale of measurement.
Any metrics involving dollars, length, time, volume, or weight, for example, are continuous.
320 - Properties and Scales of measurement
Scales of measurement is how variables are defined and categorized.
Four common scales of measurement: nominal, ordinal, interval and ratio.
Properties:
Each scale of measurement has properties that determine how to properly analyse
the data.
Four properties are of our interest: identity, magnitude, equal intervals and a minimum value
of zero.

321 - Four properties of data

Identity: Identity refers to each value having a unique meaning.


Magnitude: Magnitude means that the values have an ordered relationship to one another, so
there is a specific order to the variables.
Equal intervals: Equal intervals mean that data points along the scale are equal, so the difference
between data points one and two will be the same as the difference between data points five and
six.
A minimum value of zero: A minimum value of zero means the scale has a true zero point.
Degrees, for example, can fall below zero and still have meaning. But if you weigh nothing, you
don’t exist.

322 - Scales of measurement


Nominal scale of measurement

The nominal scale of measurement defines the identity property of data. The data can be placed
into categories. Examples: eye colour and country of birth.
This scale doesn’t have any form of numerical meaning. The data can’t be multiplied, divided,
added or subtracted from one another. It’s not possible to measure the difference between data
points.
Nominal data can be broken down again into three categories:
Nominal with order: data can be sub-categorised in order, e.g. “cold, warm, hot and very
hot.”
Nominal without order: data can be sub-categorised as nominal without order, such as male
and female.
Dichotomous: having only two categories or levels, such as ‘yes’ and ‘no’.

Ordinal scale of measurement

The ordinal scale defines data that is placed in a specific order.


While each value is ranked, there’s no information that specifies what differentiates the categories
from each other.
These values can’t be added to or subtracted from.
Examples:
satisfaction data points in a survey, where ‘one = happy, two = neutral, and three = unhappy.’
Where someone finished in a race: first place, second place or third place
Data show what orders the runners finished in but each value doesn’t specify how far the
first-place finisher was in front of the second-place finisher.

Interval scale of measurement

The interval scale contains properties of nominal and ordered data, but the difference between
data points can be quantified.
This type of data shows both the order of the variables and the exact differences between the
variables.
They can be added to or subtracted from each other, but not multiplied or divided. For example,
40 degrees is not 20 degrees multiplied by two.
The number zero is an existing variable. In the ordinal scale, zero means that the data does not
exist. In the interval scale, zero has meaning – for example, if you measure degrees, zero has a
temperature.
Data points on the interval scale have the same difference between them. The difference on the
scale between 10 and 20 degrees is the same between 20 and 30 degrees.
This scale is used to quantify the difference between variables, whereas the other two scales are
used to describe qualitative values only.

Ratio scale of measurement

Ratio scales of measurement include properties from all four scales of measurement.
The data is nominal, can be classified in order, contains intervals and can be broken down into
exact values. Examples: weight, height and distance.
Data in the ratio scale can be added, subtracted, divided and multiplied.
Ratio scales also differ from interval scales in that the scale has a ‘true zero’.
The number zero means that the data has no value point. An example of this is height or weight,
as someone cannot be zero centimetres tall or weigh zero kilos – or be negative centimetres or
negative kilos.
Examples of the use of this scale are calculating shares or sales.
Of all types of data on the scales of measurement, data scientists can do the most with ratio data
points.

323 - Populations and Samples

Population - all items of interest for a particular decision or investigation


all married drivers over 25 years old
all subscribers to Netflix
Sample - a subset of the population
a list of individuals who rented a comedy from Netflix in the past year

The purpose of sampling is to obtain sufficient information to draw a valid inference about a population.
Cross-sectional data: collected from several entities at the same time
Time series data: collected over several time periods
--> Combined, they make panel data
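
A small pandas sketch of how the three structures differ (firms, years and revenue figures are made up):

```python
# Cross-sectional, time series, and panel data layouts.
import pandas as pd

# Cross-sectional: several entities at the same point in time.
cross_section = pd.DataFrame({"firm": ["A", "B", "C"], "year": 2023, "revenue": [10, 12, 9]})

# Time series: one entity over several time periods.
time_series = pd.DataFrame({"firm": "A", "year": [2021, 2022, 2023], "revenue": [8, 9, 10]})

# Panel: several entities, each observed over several periods (entity x time index).
panel = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "year": [2022, 2023, 2022, 2023],
    "revenue": [9, 10, 11, 12],
}).set_index(["firm", "year"])

print(panel)
```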

330 - Data collection


Data collection for research and analytics can broadly be divided into two major types: primary data
and secondary data.

331 - Primary data

Primary data is collected “at source” and specifically for the research at hand.

The data source could be individuals, groups, organizations, etc.


Data would be actively elicited or passively observed and collected.
Surveys, interviews, and focus groups all fall under the ambit of primary data.
The main advantage of primary data is that it is tailored specifically to the questions posed by the
research project.
The disadvantages are cost and time.

332 - Secondary data

Secondary data is that which has been previously collected for a purpose that is not specific to
the research at hand.
Examples: sales records, industry reports, and interview transcripts from past research are data
that would continue to exist whether or not the project at hand had come to fruition.

333 - Examples of Data Sources

Annual reports
Accounting audits
Financial profitability analysis
Economic trends
Marketing research
Operations management performance
Human resource measurements
Web behaviours: page views, visitor’s country, time of view, length of time, origin and destination
paths, products they searched for and viewed, products purchased, what reviews they read, and
many others
334 - Big Data

Big data refers to massive amounts of business data (volume) from a wide variety of sources
(variety), much of which is available in real time (velocity), and much of which is uncertain or
unpredictable (veracity).
The effective use of big data has the potential to transform societies, economies, industries, and
organisations.
For businesses, using big data will become a key basis of competition for existing companies.

Seven V's characteristics of Big Data analytics


340 - Reliability and Validity
These two concepts can be applied to data and research design & testing. With respect to data (in the
field of statistics)

Data Reliability - data are consistent and accurate (or accurately collected/measured).
Data Validity - data correctly measures what it is supposed to measure.

350 - Data Privacy: how can or how should?


Legal Standards: established by law, order, or rule to compel treatment of certain classes of data
Ethical Standards: established by industry bodies or professional organizations which seek to
establish non-legally binding treatment of information
Most academic/ Science/ Medical/ Legal fields have broad ethical standards-making bodies,
some of which address use of data
Marketing/Advertising associations and alliances or initiatives that also provide some broad
guidelines
Policy Standards: established by a company or agency’s own published Data Privacy policy
Companies should have formal privacy policies that are actively disclosed to consumers.
Generally, these policies outline what is captured & shared, and outline opt-out or opt-in
procedures
Good Judgment Standards: one should always stop to ask “Is this a good idea?” and “What might
be the consequences?”

351 - Personally Identifiable Information (PII)

Definition: Any information about an individual maintained by an agency, including


any information that can be used to distinguish or trace an individual‘s identity, such as name,
social security number, date and place of birth, mother‘s maiden name, or biometric records;
and
any other information that is linked or linkable to an individual, such as medical, educational,
financial, and employment information

352 - PII-Related Regulations

| Country/Region | PII Law or Regulation |
|---|---|
| United States | HIPAA, FCPA, COPPA, GLBA, Privacy Act, state laws (e.g., CA, UT, CO, VA) |
| Europe | General Data Protection Regulation (GDPR) |
| Australia | Privacy Act of 1988 |
| India | Digital Personal Data Protection Bill |
| Brazil | General Data Protection Law (LGPD) |
| Canada | Personal Information Protection and Electronic Documents Act (PIPEDA) |
| China | Personal Information Protection Law (PIPL) |

353 - Other sources of consumer information

Consumer financial information:


A consumer provides to a financial institution to obtain a financial product or service from the
institution
Results from a transaction between the consumer and the institution involving a financial
product or service
A financial institution otherwise obtains about a consumer in connection with providing a
financial product or service
Data collected by telecommunications companies about a consumer's telephone calls.
It includes the time, date, duration and destination number of each call, the type of network a
consumer subscribes to, and any other information that appears on the consumer's telephone
bill.
Protected health information

Introduction to models
360 - Models in Business Analytics
Model is an abstraction or representation of a real system, idea, or object.

Captures the most important features

Three forms of a model:


a written or verbal description,
a visual representation,
a mathematical formula, or a spreadsheet

361 - Example

Example 1

The sales of a new product, such as a first-generation iPad or 3D television, often follow a common
pattern.

Verbal description: The rate of sales starts small as early adopters begin to evaluate a new
product and then begins to grow at an increasing rate over time as positive customer feedback
spreads. Eventually, the market begins to become saturated, and the rate of sales begins to
decrease.
Example 2

The sales of a new product, such as a first-generation iPad or 3D television, often follow a common
pattern.

Visual model: A sketch of sales as an S-shaped curve over time

Example 3

The sales of a new product, such as a first-generation iPad or 3D television, often follow a common
pattern.

Mathematical model: an S-shaped growth curve such as $S = ae^{be^{ct}}$,

where S is sales, t is time, e is the base of natural logarithms, and a, b and c are constants
Often we use data to estimate this equation, i.e. to estimate the values for a, b, and c.
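
To make "estimate a, b and c from data" concrete, here is a sketch that fits a Gompertz-type S-curve with scipy; the sales figures and starting values are invented, and the Gompertz form is only one common choice of S-shaped model, not necessarily the exact one on the slide:

```python
# Fit an S-shaped (Gompertz) sales curve S(t) = a * exp(b * exp(c * t)) to made-up data.
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, a, b, c):
    return a * np.exp(b * np.exp(c * t))

t = np.arange(1, 13)                                                # months since launch
sales = np.array([12, 20, 35, 60, 95, 140, 190, 235, 270, 290, 300, 305], dtype=float)

params, _ = curve_fit(gompertz, t, sales, p0=(320.0, -5.0, -0.3), maxfev=10_000)
a, b, c = params
print(f"estimated a={a:.1f}, b={b:.2f}, c={c:.3f}")
```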

362 - Real World vs Model World

A model is an abstraction, or simplification, of the real world.


Model is a laboratory—an artificial environment—in which we can experiment and test ideas
without the costs and risks of experimenting with real systems and organizations.
1. Formulation

We abstract the essential features of the real world, leaving behind all the nonessential detail and
complexity.
We then construct our laboratory by combining our abstractions with specific assumptions and
building a model of the essential aspects of the real world.
This is the process of model formulation.

370 - Decision models


371 - Inputs

Data of

Uncontrollable inputs - quantities that can change but cannot be controlled


Decision variables (options) - controllable and selected at the discretion of the decision maker

372 - Features of a model

Data for all variables used in the model:


Uncontrollable inputs - quantities that can change but cannot be controlled
Decision options: decision variables which refer to possible choices, or courses of action, that
we might take.
Outcomes refer to the consequences of decisions – the performance measures we use to
evaluate the results of taking action. Examples include revenue, cost, profit, efficiency,
market share, etc.
Structure.

373 - Model structure

Structure refers to the logic and the mathematics that link the elements of our model together.
A simple example might be the equation P = R - C, in which profit is calculated as the difference
between revenue and cost.
Another example might be the relationship F = I + P - S, in which final inventory is calculated from
initial inventory, production, and shipments.
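
These structural equations translate directly into a spreadsheet or a few lines of code; a trivial sketch with hypothetical input values:

```python
# Model structure links inputs to outcomes: P = R - C and F = I + P - S.
def profit(revenue: float, cost: float) -> float:
    return revenue - cost

def final_inventory(initial: float, production: float, shipments: float) -> float:
    return initial + production - shipments

print(profit(revenue=120_000, cost=85_000))                          # 35000
print(final_inventory(initial=500, production=300, shipments=450))   # 350
```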

2. Analysis

Once built, we can use the model to test ideas & evaluate solutions.
This process applies logic to take us from our assumptions and abstractions to a set of derived
conclusions. It also relies on mathematics and reason to explore the implications of our
assumptions. This exploration process leads, hopefully, to insights about the problem confronting
us.
Sometimes, these insights involve an understanding of why one solution is beneficial, and another
is not; at other times, the insights involve understanding the sources of risk in a particular solution.
In another situation, the insights involve identifying the decisions that are most critical to a good
result, or identifying the inputs that have the strongest influence on a particular outcome.

3. Interpretation

To make the model insights useful, we must first translate them into the terms of the real world
and then communicate them to the actual decision makers involved.
Only then do model insights turn into useful managerial insights. And only then can we begin the
process of evaluating solutions in terms of their impact on the real world.

380 - Descriptive, Predictive & Prescriptive models


381 - Descriptive model

Descriptive models explain behaviour and allow users to evaluate potential decisions by asking
“what-if?” questions.
Example: An outsourcing decision model

An outsourcing decision model

382 - Predictive Models

Predictive models focus on what will happen in the future. Many predictive models are developed by
analysing historical data and assuming that the past is representative of the future.

A sales-promotion decision model in the grocery industry: managers typically need to know how
best to use pricing, coupons, and advertising strategies to influence sales.
Grocers often study the relationship of sales volume to these strategies by conducting controlled
experiments to identify the relationship between them and sales volumes. That is, they implement
different combinations of pricing, coupons, and advertising, observe the sales that result, and use
analytics to develop a predictive model of sales as a function of these decision strategies.

A Sales-Promotion Decision Model

383 - Prescriptive Models

Prescriptive models help decision makers identify the best solution to a decision problem. “Best”
here refers to objectives in the optimisation problems at hand.
Optimization - finding values of decision variables that minimize (or maximize) something such as
cost (or profit)
Objective function - the equation that minimizes (or maximizes) the quantity of interest
Optimal solution - values of the decision variables at the minimum (or maximum) point

A Prescriptive Pricing Model

A firm wishes to determine the best pricing for one of its products in order to maximize revenue.
Analysts determined the following model:
Identify the price that maximizes total revenue.
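
The slide's fitted demand equation is not reproduced here, so the sketch below uses an assumed linear sales model, sales = a - b*price, and searches for the revenue-maximising price over a grid:

```python
# Hypothetical prescriptive pricing: maximise revenue = price * (a - b * price).
import numpy as np

a, b = 3000.0, 2.5                         # assumed demand parameters (illustration only)

prices = np.linspace(0.0, a / b, 10_001)   # candidate prices from 0 up to where demand hits 0
revenue = prices * (a - b * prices)

best = np.argmax(revenue)
print(f"revenue-maximising price ~ {prices[best]:.2f}")   # analytically a / (2b) = 600
print(f"maximum revenue ~ {revenue[best]:,.0f}")
```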

384 - Model assumptions

Assumptions are made to


simplify a model and make it more tractable (i.e. able to be analysed or solved).
better characterize historical data or past observations.
The task of the modeler is to select or build an appropriate model that best represents the
behaviour of the real situation.
Economic theory tells us that demand for a product is negatively related to its price. Thus, as
prices increase, demand falls, and vice versa (modelled by price elasticity - the ratio of the
percentage change in demand to the percentage change in price).
A key assumption in developing a model is the type of relationship between demand and price.
It is CRUCIAL to check whether assumptions are reasonable and hold in the real world. For example,
if P goes up, does D go down for all goods and services?

390 - Demand Prediction Model


391 - A Linear Demand Prediction Model

As price increases, demand falls. A simple model is $D = a - bP$,

where D is the demand, P is the unit price, a is a constant that estimates the demand when the
price is zero, and b is the slope of the demand function.

392 - A Non-Linear Demand Prediction Model

Assumes price elasticity is constant (constant ratio of % change in demand to % change in price): $D = cP^{-d}$,
where c is the demand when the price is 1 and d > 0 is the price elasticity.
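
A short comparison of the two functional forms with made-up parameter values:

```python
# Linear demand D = a - b*P versus constant-elasticity demand D = c * P**(-d).
import numpy as np

P = np.array([10.0, 20.0, 30.0, 40.0])     # candidate unit prices

a, b = 1000.0, 15.0                        # assumed linear-demand parameters
c, d = 5000.0, 1.2                         # assumed constant-elasticity parameters

D_linear = a - b * P
D_elastic = c * P ** (-d)

for price, dl, de in zip(P, D_linear, D_elastic):
    print(f"P={price:5.1f}  linear D={dl:7.1f}  constant-elasticity D={de:7.1f}")
```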

400 - Descriptive statistics


Used to describe and summarise a variable or variables for a sample of data.

For categorical or grouped data: The proportion.


Measures of Central Tendency: mean, median, and mode
Measures of Dispersion: range, interquartile range, standard deviation, coefficient of variation,
percentiles, z-scores
Measures of Shape: skewness, kurtosis
Measures of Association: covariance, correlation
410 - Measures of Central Tendency

411 - The Proportion

The proportion, p, is the percentage of observations that have a certain characteristic


Very useful for categorical or grouped data
Take the number of observations with a characteristic (X) and divide it by the total number of
observations (N)
412 - Measures of central tendency

Three different measures of the “typical” or “representative” value in a dataset

Mean vs Median vs Mode

Mean is often used for quantitative data unless outliers exist or data is skewed.
Median is often used in conjunction with the mean since it is not affected by outliers. Comparing
mean with median gives us an idea of skewness.
Mode is mainly used for qualitative data, rarely used for numerical data. There may be no mode,
multiple modes, or the mode may not be close to the centre of the data.
Excel’s Aggregate function

Syntax AGGREGATE( function_num , options , ref1,[ref2], …) --> ignore error value


Array form: AGGREGATE( function_num , options , array , [k])
function_num : A number from 1 to 19 that specifies which function to use. For example:
1: AVERAGE
2: COUNT
3: COUNTA
4: MAX
5: MIN
6: PRODUCT
7: STDEV.S (standard deviation for a sample)
8: STDEV.P (standard deviation for a population)
9: SUM
10: VAR.S (variance for a sample)
11: VAR.P (variance for a population)
12: MEDIAN
13: MODE.SNGL (mode for a single value)
14: LARGE (k-th largest value)
15: SMALL (k-th smallest value)
16: PERCENTILE.INC (percentile, inclusive)
17: QUARTILE.INC (quartile, inclusive)
18: PERCENTILE.EXC (percentile, exclusive)
19: QUARTILE.EXC (quartile, exclusive)
options : A numerical value that determines which values to ignore in the evaluation range for the
function
0 or omitted: Ignore nested SUBTOTAL and AGGREGATE functions.
1: Ignore hidden rows, nested SUBTOTAL, and AGGREGATE functions.
2: Ignore error values, nested SUBTOTAL, and AGGREGATE functions.
3: Ignore hidden rows, error values, nested SUBTOTAL, and AGGREGATE functions.
4: Ignore nothing.
5: Ignore hidden rows.
6: Ignore error values.
7: Ignore hidden rows and error values.

420 - Measures of dispersion and shape


421 - Measures of shape

Skewness

Measures symmetry relative to a bell-shaped (normal) distribution.


Normal distribution: bell shape; median = mode = mean; no skewness
If the mean is different to the median, this implies skewness. As a general rule, a value for
skewness:
< -1 or > 1 is highly skewed
Between -1 and -0.5 or between 0.5 and 1 is moderately skewed
Between -0.5 and 0 or between 0 and 0.5 is approximately symmetric

Income inequality in Australia/Vietnam has been increasing recently

What would you show in your analysis?

Think of a specific context:


(global vs) national vs state level
by socio-economic demographic factors: gender, ethnicity, skills, education & qualification,
efforts etc.
Think of a specific data set
entire population vs income groups
Think of specific measures (metrics/ indicators/ variables)
mean – median – mode – skewness etc.

422 - Measures of dispersion (variation)

Dispersion = Variation = Spread: refers to the degree of variation in the data; that is, the numerical
spread (or compactness) of the data.

Five key measures

1. Range
2. Interquartile Range
3. Percentiles
4. Standard deviation
5. Coefficient of variation

1. Range and Interquartile Range

Range: the difference between the minimum and maximum value in the data – sensitive to outliers
Interquartile Range: the range of the middle 50% of the data – the difference between the third
quartile and first quartile in the data (Q3 minus Q1) – not sensitive to outliers

2. Percentiles

The position in the dataset where p% of observations are below it and (100-p)% are above it, when
ordered from smallest to largest
Useful for analysing specific points along the distribution
Most common percentiles are quartiles (i.e. 25th, 50th, 75th percentiles) or deciles (i.e. 10th,
20th,…, 90th percentiles)
More extreme percentiles are affected by outliers
=PERCENTILE.EXC( datarange , percentile )
Make sure you put the percentile in as a fraction (e.g. 20th percentile is 0.2)
3. Standard deviation

Difficult to interpret on its own, but assuming the data is approximately bell-shaped (normally
distributed):

68% of observations are situated within ± 1 standard deviation from the mean
95% of observations are situated within ± 2 standard deviation from the mean
99.7% of observations are situated within ± 3 standard deviation from the mean
= STDEV.S( datarange )

Real world business uses of SD

Banking and finance:


Standard deviation is often used as a measure of a relative riskiness of an asset.
A volatile stock has a high standard deviation, while the deviation of a stable stock is usually
rather low.
Actuaries calculate standard deviation of healthcare usage to know how much variation in usage
to expect in a given period (month, quarter, or year)
Real estate agents calculate the standard deviation of house prices in a particular area to inform
their clients of the type of variation in house prices they can expect.
Human Resource managers often calculate the standard deviation of salaries in a certain field to
know what type of variation in salaries to offer to new employees.

4. Coefficient of Variation

The coefficient of variation (CV) expresses the standard deviation of data relative to (divided by)
its mean
Useful for comparisons of variation across different sets of data (e.g. between returns on different
investments)

423 - Standardized Values (Z-scores)

Sometimes we are interested in seeing where individual observations sit relative to the mean.
The Z-score tells us how many standard deviations away from the mean an observation sits
Use the =STANDARDIZE( x , mean , stdev ) function in Excel
a z-score of 1.0 (a positive value) means that the observation is one standard deviation
above the mean;
a z-score of -1.5 means that the observation is 1.5 standard deviations below the mean.
Useful for checking if individual observations are outliers.
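
A compact Python equivalent of these Excel calculations, on an invented sample (scipy's skew is used for the shape measure):

```python
# Central tendency, dispersion, shape, and z-scores for a small hypothetical sample.
import numpy as np
from scipy import stats

x = np.array([12, 15, 14, 16, 18, 21, 19, 17, 45, 13], dtype=float)

mean, median = x.mean(), np.median(x)
std = x.std(ddof=1)                 # sample standard deviation (like STDEV.S)
cv = std / mean                     # coefficient of variation
skewness = stats.skew(x)
z_scores = (x - mean) / std         # like Excel's STANDARDIZE

print(f"mean={mean:.2f} median={median:.2f} std={std:.2f} CV={cv:.2f} skew={skewness:.2f}")
print("z-scores:", np.round(z_scores, 2))
```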

Outliers

Skewness may indicate the presence of outliers.


No standard definition of what constitutes an outlier
Several good rules of thumb are:
Z-scores greater than +3 or less than −3
Extreme outliers: more than 3 x IQR to the left of Q1 or right of Q3
Mild outliers: between 1.5 x IQR and 3 x IQR to the left of Q1 or right of Q3
Visual – an individual data point sit relative to the rest of the data

When to remove Outliers?

Whether we remove outliers is a contentious debate and depends on the context

Consider income or wealth inequality issues: we definitely do not remove (mild) outliers.
But if we assess whether education affects income, it is reasonable to remove outliers, and
extreme outliers should definitely be removed
430 - Measures of Association

Real-world questions

Is it true that…

bottled water sales increase as temperature increases?


older houses are worth less?
those that earn more consume more?

We can gain insights by looking at measures of association: covariance and correlation

Covariance measures the direction of a relationship between two quantitative variables.


Correlation measures both the direction and strength of the relationship between two quantitative
variables.

A scatter plot helps gauge correlation by showing how close the data points sit to the line of best fit.

431 - Linear or Non-Linear Relationship

Two variables have a strong statistical relationship with one another if they appear to move
together.
When two variables appear to be related, you might suspect a cause-and-effect relationship.
Sometimes, however, statistical relationships exist even though a change in one variable is not
caused by a change in the other.

432 - Covariance

Covariance is a measure of the linear association between two variables, X and Y. Like the variance,
different formulas are used for populations and samples.

Population covariance: $\operatorname{cov}(X,Y) = \dfrac{\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)}{N}$

Excel function: =COVARIANCE.P( array1 , array2 )

Sample covariance: $s_{XY} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$

Excel function: =COVARIANCE.S( array1 , array2 )

The covariance between X and Y is the average of the product of the deviations of each pair of
observations from their respective means.

433 - Correlation

Correlation is a measure of the linear relationship between two variables, X and Y, which does not
depend on the units of measurement.

Correlation is measured by the correlation coefficient, also known as the Pearson product moment
correlation coefficient.

Correlation coefficient for a population: $\rho_{XY} = \dfrac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$

Correlation coefficient for a sample: $r_{XY} = \dfrac{s_{XY}}{s_X s_Y}$

The correlation coefficient is scaled between -1 and 1.

Excel function: =CORREL( array1 , array2 )


Notes on the CORREL Function

When using the CORREL function, it does not matter if the data represent samples or populations. In
other words,

CORREL(array1,array2) = COVARIANCE.P(array1,array2) / ( STDEV.P(array1) * STDEV.P(array2) )
and
CORREL(array1,array2) = COVARIANCE.S(array1,array2) / ( STDEV.S(array1) * STDEV.S(array2) )
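
The same identity can be checked in Python with numpy (illustrative arrays only):

```python
# Correlation equals covariance divided by the product of the standard deviations.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.1, 5.9, 8.2, 9.8])

cov_sample = np.cov(x, y, ddof=1)[0, 1]               # like COVARIANCE.S
corr = cov_sample / (x.std(ddof=1) * y.std(ddof=1))   # like CORREL

print(round(corr, 4), round(np.corrcoef(x, y)[0, 1], 4))   # the two values agree
```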

Excel Correlation Tool

Data > Data Analysis > Correlation

Excel computes the correlation coefficient between all pairs of variables in the Input Range. Input
Range data must be in contiguous columns.
Interpreting Correlation Coefficient

Direction of the relationship: positive or negative


Strength of the relationship: no, weak, moderate, strong, very strong, perfect.

For example:

Correlation of 0.4 indicates a moderate and positive linear relationship


Correlation of -0.72 indicates a strong and negative linear relationship

Caution

When two variables appear to be related, you might suspect a cause-and-effect relationship.
Sometimes, however, statistical relationships exist even though a change in one variable is not
caused by a change in the other.
Correlation does NOT imply CAUSATION

500 - Linear Regression


510 - Dependence or correlation

Dependence:
Variables are dependent on each other if the value of one variable gives information about
the distribution of the other.
What are key statistics of a distribution? For example normal distribution?
Is statistical correlation always meaningful, especially for prediction purposes? (i.e. predictive
analytics)
Remember that “correlation does not imply causation”
520 - Causality
Causality describes a relationship between two (or more) things (phenomena, events, variables,
etc.) in which a change in one causes a change in another.
In this diagram, A causes B under certain conditions.
So, if we observe an effect, we can necessarily infer that there is a cause prior to the effect.
If there is a cause, the effect will not necessarily come about.
But if a cause and all other required conditions are present, it is very likely that the cause will
produce its effect(s).

511 - Causal thinking & business decision making

Two related scenarios

1. Situational assessment
Consider any business situation (i.e. business problem that needs to be solved)
We would like to assess that situation, then we often ask “how did that happen?”
Often used in Root Cause Analysis
2. Interventions

Advanced analytics & root cause analysis

The machine learning model can be trained to

analyse the equipment’s data output under regular “healthy” operating conditions,
detect “anomalies” (i.e. any pattern of deviation from “healthy” conditions),
to predict the “behavioral” pattern of the anomaly,
if the predicted values exceed the “normal” threshold, an alert is sent.

Applications: early detection of safety issues, machine failures, more efficient electrical consumption,
predicting quality deviation, adjusting process to prevent material waste, etc.

512 - Causality & interventions

Important business decisions involve the use of limited (scarce) resources.


The trade-off in the form of a resource-allocation decision:
Should resources (time, equipment, land, …) be devoted to project A or project B?
We can loosen a constraint but that typically requires other scarce resources.
A decision on which objects to control or change (i.e. managerial intervention) typically precedes
any decision on how to control or change them.
Understanding causality is crucial to making effective interventions

513 - Causal modelling

Consider a decision on the purchase of a new equipment:


Quality has two levels: high or low
High quality equipment can perform more tasks, hence increase production productivity but the
parts are more expensive.
Maintenance cost: the greater the quality of the equipment, the more expensive the parts, and
hence the higher the maintenance cost.

514 - Foundations for causal graphs

Causal graphs are directed acyclic graphs (DAGs). They have

A set of vertices (or nodes) representing variables in the model


A set of edges (or links) presenting the connections between variables.
Directed path between two nodes: arrow shows a direction from a cause to its effect.

There are no cycles in DAGs.

Feedback loops & time dimensions

Consider a relationship between joy and physical exercise.

Is there any causal relationship between them?


If yes, which variable is cause and which is effect?

We can convert circles into directed acyclic graphs in which we have a time dimension.

At period 0: joy is a cause leading to more exercise


At period 1: feedback from exercise (period 0) to joy (period 1)
515 - Structures in causal graphs

There are three building blocks: Chain, Fork, Collider

1. Chain

Example: X learning efforts, Y employability, Z chance of getting a job.

Y depends on X for its value (hence X and Y are dependent)


Z depends on Y for its value (hence Y and Z are dependent)
Z depends on Y which depends on X,
hence X and Z are also dependent: dependence of X and Z is due to Y being able to change.
What if we hold Y constant (fixed): then changes in X are no longer linked to changes in Z. Therefore
statistically we say that X and Z are conditionally independent given Y.

2. Fork

Example: X is temperature, Y sales of ice cream and Z sales of fan.

Y depends on X for its value (X and Y are dependent)


Z depends on X for its value (X and Z are dependent)
We can still say that Y and Z are (statistically) dependent because changes in Y
reflect changes in X which lead to changes in Z.
If you calculate correlation values, what would you expect?
Again correlation does not imply causation.
It is easy to see that if we hold X fixed, changes in Y are no longer linked to changes in Z.

3. Collider

X is competence (at work), Y is Networking , Z: Promotion (at work)


Both X and Y are causes of Z
X and Z (similarly Y and Z) are dependent
X and Y are independent: they neither cause the other nor have a common cause.
However, statistically we can see that if we hold Z fixed and X changes, then Y must also change
in a certain way. Why?
Hence we say X and Y are conditionally dependent given Z.
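
This conditional dependence can be demonstrated with a quick simulation; the variables and the promotion rule below are purely illustrative:

```python
# Collider: competence (X) and networking (Y) are independent, but both cause promotion (Z).
# Conditioning on Z (looking only at promoted people) induces a spurious association.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                               # competence
y = rng.normal(size=n)                               # networking, independent of x
z = (x + y + rng.normal(scale=0.5, size=n)) > 1.0    # promoted if the combined score is high

print("corr(X, Y) overall:    ", round(np.corrcoef(x, y)[0, 1], 3))        # ~ 0
print("corr(X, Y) | promoted: ", round(np.corrcoef(x[z], y[z])[0, 1], 3))  # clearly negative
```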

516 - Observed associations

We can observe associations between two variables in the data.


However, these associations have two mechanisms

Causal associations
Non-causal associations

So again, correlation (association) does not imply causation.

Draw assumptions before making conclusions!

Consider 3 variables, how many possible causal models?


Statistical association does not imply causation.
Hence, it is better to use knowledge to draw assumptions (causal graphs) prior to making
conclusions regarding causality.
520 - Causal modelling for Situational assessment and Intervention
521 - Causal modelling for market volume #Situational_assessment

Suppose you are asked to make an assessment of the size of the market for laptop computers.
The following variables are relevant:

Price: average price per unit


Advertising: the amount of money spent on advertising products
Number of Customers visiting the shop
Media Hype: whether independent media sources report on or display related products
Market Volume: the total amount of goods sold for your product category

1. Price & Volume

The causal relationship between Price & Volume?


How about the Number of Customers visiting the shop and Volume?
Any relationship between Price and Number of Customers?

2. Advertising & Volume

Do you expect that higher advertising expenditure will lead to higher sales (market volume)?
But how about the impact of advertising and number of customers on sales?
Also how about the effects of advertising and media hype on sales?

3. Causal model for assessing market volume

Now we put all elements together, this is our causal model for situational assessment.
Note that there is no (business) goal / objective in terms of optimisation or decision making.
Rather it assesses how causal factors affect the market volume.

522 - Causal modelling for Interventions #Intervention

Example 2: instead of doing a situational assessment, you are now asked to decide how much to spend
on advertising for these products.

You need to set an objective, e.g. high market share (the proportion of sales through your retailers
to the total number sold).
So the decision variable is “Advertise”.
Simplify the intervention decision: (1) run an advertising campaign or (2) do not.
Further simplify that you will know the price at the time you set “Advertise”.
530 - Influence diagram
Often, rectangle shape refers to strategic option (i.e. decision point, choice variable, value
directly controlled by a strategic agent – decision making agent)
Hexagon shape refers to objective (e.g. profit, value, market share, etc.). Decisions are made to
optimise the objective.
Circle shape refers to probabilistic variables that are chance variables, uncertain quantities,
environmental factors and other elements outside the direct control of strategic agents.
540 - From causal diagrams to mathematical equations

The simplest form of empirical model would be a regression model, as below.

This equation fails to capture the actual relationships among the independent variables (X1 to X6).

541 - Shortcomings

Consider X1, X2 and X4: the associations among these variables are clear, hence we say that this
model suffers from a multicollinearity problem.
Also, we cannot use standard significance tests to reliably determine which independent
variables exert the most influence.

A solution (not discussed further in this unit)


It is possible to use a structural equation model (SEM), estimated stepwise via a two-stage regression (a code sketch follows the two stages below).

1. Stage 1

2. Stage 2: use the estimated values of the intermediate independent variable obtained from the stage 1 regression.
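
A rough sketch of the two-stage idea is below. The variable names (x1, x2, x4, y) and the data are placeholders, and this is only meant to illustrate the mechanics, not a full SEM workflow.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x4 = 0.8 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)   # x4 is caused by x1 and x2
y  = 1.0 + 2.0 * x4 + rng.normal(size=n)

# Stage 1: model the intermediate variable from its causes
stage1 = sm.OLS(x4, sm.add_constant(np.column_stack([x1, x2]))).fit()
x4_hat = stage1.fittedvalues

# Stage 2: use the fitted values from stage 1 in the outcome equation
stage2 = sm.OLS(y, sm.add_constant(x4_hat)).fit()
print(stage2.params)   # intercept and effect of x4 on y
```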

Summary

Causal relationships are crucial for (1) situational assessments and (2) interventions, as part
of business analytics.
If there is a cause-and-effect relationship between two variables x and y, there is statistical
association.
But (statistical) correlation/association does not necessarily imply causation.
Causal thinking and causal graphs are very useful because
They capture both causality and statistical association
They assist with both situational assessment and intervention tasks in business analytics
From a managerial perspective, they help identify the relevant stakeholders (agents,
people, departments, etc.) involved in analytics projects, and they support resource allocation.

Analytics & Happiness

What values do business analytics deliver?

Happiness/satisfaction matters in every corner of our lives: overall life, work, school, business, etc.
The overall aim is to increase satisfaction.
Situational analysis informs interventions: how?

We use the happiness case study to illustrate regression analysis.

Your satisfaction (happiness) matters!

Discuss the following questions from your own experience and knowledge

What makes you happy = what are the causes of your own happiness?
What makes you sad = what are the causes of your own sadness?

Draw a causal graph (with directed paths)


Happiness and Income

Excel Trendline Tool

Right-click on the data series and choose Add Trendline from the pop-up menu
Check the boxes Display Equation on chart and Display R-squared value on chart

550 - Simple linear regression using least-square

Simple linear regression model
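
In standard notation the model is:

$$ Y = \beta_0 + \beta_1 X + \varepsilon $$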

We estimate the parameters (βs) from the sample data


Once estimated, we can

Assess/explain if X is an important factor explaining Y,


“predict” the value of Y given a specific value of X

551 - Least square regression

Residuals are the observed errors associated with estimating the value of the dependent variable
using the regression line.

The best-fitting line minimizes the sum of squares of the residuals.
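
In standard notation, the least-squares estimates that achieve this minimum are:

$$ b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$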

Simple Linear Regression with Excel

Using Analysis Toolpak: Data > Data Analysis > Regression
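
If you want to check the Excel output in code, a minimal equivalent is sketched below; the happiness and log-GDP numbers are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical country-level data: log GDP per capita and happiness score
log_gdp   = np.array([8.5, 9.0, 9.4, 10.1, 10.6, 10.9, 11.2])
happiness = np.array([4.2, 4.9, 5.5, 6.0, 6.8, 7.0, 7.3])

model = sm.OLS(happiness, sm.add_constant(log_gdp)).fit()
print(model.summary())        # coefficients, R-squared, ANOVA F-test, p-values
print(model.conf_int(0.05))   # 95% confidence intervals for the coefficients
```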


Results: Regression Statistics (metrics)

Interpreting Regression Statistics

F-test (Analysis of Variance)

ANOVA conducts an F-test to determine whether variation in Y is due to varying levels of X.


ANOVA is used to test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0
Excel reports the p-value (Significance F).
Rejecting H0 indicates that X explains variation in Y.
Interpreting Coefficients

Intercept: often not important


Log GDP per capita: 3 elements
Direction of the relationship: positive value
The magnitude of the relationship: 0.742, meaning that for each one-point increase in the Log
GDP per capita, the happiness level increases by 0.742.
Statistical strength of the relationship: can be assessed using the hypothesis testing

Testing Hypotheses for Regression Coefficients

We would like to test whether the coefficient on log(GDP) is statistically significantly different from zero.
If the coefficient (β1) = 0, what does this mean?
If the coefficient (β1) ≠ 0, what does this mean? (If you expect a particular sign, e.g. β1 > 0, consider a one-tailed test.)
Test statistics:
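
In standard form, the test statistic for a simple regression slope is:

$$ t = \frac{b_1 - 0}{SE(b_1)}, \qquad df = n - 2 $$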

P-value approach

Log GDP per capita:


We can use the p-value to assess the two-tailed test.
H0: β1 = 0 vs H1: β1 ≠ 0
In this example the p-value is nearly zero (< 5%), so there is sufficient evidence to conclude that
the true β1 is not zero. This means that there exists a relationship between the happiness level
and log of GDP per capita.
We can also conduct a one tailed test.

Confidence Intervals for Regression Coefficient

Confidence intervals (Lower 95% and Upper 95% values in the output) provide information about
the unknown values of the true regression coefficients, accounting for sampling error.
For this example, a 95% confidence interval for the income variable is [0.638;0.845].
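
In general terms (the formula is not shown in the Excel output itself), the interval for the slope is:

$$ b_1 \pm t_{\alpha/2,\, n-2}\, SE(b_1) $$

so we are 95% confident that the true β1 lies between 0.638 and 0.845.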

Prediction

Intercept: often not important


If you know the value of Log GDP per capita (e.g. 9.392 for Vietnam), you can predict the value of the Happiness level.
Predicted Happiness level for Vietnam = -1.411 + 0.742 * 9.392 = 5.558
Confidence Intervals & Prediction

Although our point prediction for Vietnam is


-1.411 + 0.742 * 9.392 = 5.558

if the true population parameters are at the extremes of the confidence intervals, the estimate might be
as low as
-1.411 + 0.638 * 9.392 = 4.581

or as high as
-1.411 + 0.845 * 9.392 = 6.525

Residual analysis

Residual = Actual Y value − Predicted Y value

Outliers: standard residuals outside ± 2 or ± 3 are potential outliers
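
One common definition (Excel's exact scaling may differ slightly) is:

$$ \text{standard residual}_i = \frac{e_i}{s_e}, \qquad e_i = y_i - \hat{y}_i $$

where s_e is the standard deviation of the residuals.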

Residual Outputs – Residual Plot


560 - Multiple linear regression
Consider the case study of happiness (at national level)
What are possible “causes” and/or “factors” that can explain variations in the Happiness level
across countries?

A linear regression model with more than one independent variable is called a multiple linear
regression model.
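In standard notation the model is:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon $$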

- Y is the dependent variable, Xi are the independent (explanatory) variables;


- βi are the regression coefficients for the independent variables, ε the error term.

Using sample data, we estimate the partial regression coefficients bi.


561 - ANOVA for Multiple Regression

ANOVA tests for significance of the entire model. That is, it computes an F-statistic testing the
hypotheses
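
In standard form these are:

$$ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad \text{vs} \qquad H_1: \text{at least one } \beta_j \neq 0 $$

Rejecting H0 means the model as a whole explains a significant share of the variation in Y.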
α₂ = 0.175, p-value = 0.074
If wealth (log of GDP per capita) increases by 1 unit, holding all the other independent variables
constant, the value of happiness will increase by 0.175; significant at the 10% level.
α₃ = 3.55, p-value = 0.000
If social support increases by 1 unit, holding all the other independent variables constant, the
value of happiness will increase by 3.55; significant at the 1% level.

562 - Should I include a new Xi variable?

Some argue that a good regression model should include only significant independent variables.

But it is not always clear exactly what will happen when we add or remove variables from a model:
variables that are (or are not) significant in one model may (or may not) be significant in another.
Do not drop all insignificant variables at once;
take a more structured approach instead.

Steps if you find out a variable is insignificant

Step 1: Check whether the variable is measured correctly and re-check the test results (p-value, F-test)


Step 2: If nothing is wrong, conclude that the variable is insignificant (and consider removing it)

Using adjusted R-square

Adding an independent variable to a regression model often increases the value of R-square.
Adjusted R-square reflects both the number of Xi variables and sample size.
Adjusted R-square may either increase or decrease when an Xi variable is added or dropped.
An increase in adjusted R-square indicates the model has improved.
But some prefer simpler models (i.e. with fewer Xi variables) when there are only minor
differences in the adjusted R-square scores.

563 - Systematic Model Building Approach

1. Consider causal graphs


2. Descriptive analysis & checking out for outliers in both Y and X variables
3. Correlation matrix of all available variables
4. Construct a model with all available independent variables and examine the value of coefficients
and p-values for each coefficient.
5. If a p-value is greater than 10%, consider removing that variable and running step 4 again. Check the
adjusted R-square again (a code sketch follows this list).
6. Once majority (or all) x variables are statistically significant and the signs of coefficients are
consistent with expectations, then you are closer to a good model.
7. Check all assumptions (next week learning)
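
A minimal sketch of steps 4-6 in code is below. The data are simulated (log GDP, social support and a deliberately irrelevant variable), and the 10% threshold and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset: happiness explained by several candidate X variables
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "log_gdp":        rng.normal(9, 1, n),
    "social_support": rng.uniform(0, 1, n),
    "noise_var":      rng.normal(size=n),   # irrelevant variable, included on purpose
})
df["happiness"] = 1 + 0.7 * df["log_gdp"] + 3.0 * df["social_support"] + rng.normal(size=n)

X_cols = ["log_gdp", "social_support", "noise_var"]
while True:
    model = sm.OLS(df["happiness"], sm.add_constant(df[X_cols])).fit()
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()
    if pvals[worst] <= 0.10:       # all remaining X variables are significant: stop
        break
    X_cols.remove(worst)           # drop the least significant variable and refit

print(model.summary())
print("Adjusted R-squared:", model.rsquared_adj)
```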
600 - Regression
610 - Independent categorical variables
611 - Multiple linear regression

A linear regression model with more than one independent variable is called a multiple linear
regression model.

Y is the dependent variable, Xi are the independent (explanatory) variables;


βi are the regression coefficients for the independent variables, ε is the error term
Using sample data, we estimate the regression coefficients bi

612 - Regression with Independent Categorical Variables

Few examples:

Employee income depends on “managerial/supervisory” positions


Employee income depends on “having or not having” an MBA degree
The decision to approve a loan depends on whether the applicant owns a house or not.
The price (value) of properties depends on “types of the property”

These independent variables are categorical.


In regression analysis, we include them using dummy variables.

613 - Case study

Consider Salary Dataset


Preliminary data understanding (reflecting the Data Understanding and Modelling sections of your
group assessments as well):
Types of data / variables: do you understand them?
Descriptive statistics
Detection of outliers
Correlation matrix among variables (using Data Analysis Toolpak, built-in Excel add-in)
Plot and diagram analysis
Hypothesis testing
Confidence interval estimate

1. Dummy variable for two categories

A dummy variable indicates whether an observation belongs to a particular category in the data.

Gender: the dummy variable could be coded 1 if female, and 0 if not female.


Run analysis: current salary is a function of (beginning salary, …., gender)

Consider X1 as the dummy variable for gender: X1 takes a value of either zero (not female) or one
(female)
2. Four dummy variables for five categories

Consider “Departments” variable

How many levels / categories?


Administration – Production - Sales – Research & Development - Warehouse
How many dummy variables should we use?
Determine the “Reference” or “Base” category
The remaining four are benchmarked against this reference category
How to interpret each of these dummy variables?

Warehouse as the base category

D1 = 1 if Department = Admin, otherwise 0


D2 = 1 if Department = production, otherwise 0
D3 = 1 if Department = sales, otherwise 0
D4 = 1 if Department = R&D, otherwise 0
=> D1 = D2 = D3 = D4 = 0 => Warehouse

Beta 1 --> difference between Admin and Warehouse


Beta 2 --> difference between Production and Warehouse
=> Use dummy coding for categorical (nominal) data only

Department (nominal)
1 --> Admin
2 --> Production
3 --> Sales
4 --> R&D
5 --> Warehouse
3. General rules for dummy variable number

When a categorical variable has k (> 2) levels/categories, we need to add (k−1) additional dummy
variables to the model.

The variable “Department” has five levels


Choose one level, such as “Research and Development” as the reference level/base level
Four dummy variables are needed
Each dummy compares to the reference level.
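
In code, one common way to build the k−1 dummies is pandas' get_dummies; the salary figures and department labels below are made up, and Warehouse is set as the base category.

```python
import pandas as pd

df = pd.DataFrame({
    "salary":     [52_000, 61_000, 45_000, 70_000, 48_000],
    "department": ["Admin", "Production", "Sales", "R&D", "Warehouse"],
})

# Declare the categories so the first listed level becomes the base category
df["department"] = pd.Categorical(
    df["department"],
    categories=["Warehouse", "Admin", "Production", "Sales", "R&D"],
)

# drop_first=True keeps k-1 dummies; the dropped level is the reference level
dummies = pd.get_dummies(df["department"], prefix="dept", drop_first=True)
print(dummies)   # Warehouse is the omitted reference level
```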
620 - Non-linear relationship
Case: Months employed and Sales. Data: Reynolds.
Predicted scales sold = intercept + coefficient of months employed × months employed

622 - Regression Models with Nonlinear Terms

Curvilinear models may be appropriate when scatter charts or residual plots show nonlinear
relationships.
A second order polynomial might be used
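In standard notation:

$$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon $$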

Here β1 represents the linear effect of X on Y and β2 represents the curvilinear effect.

Case: Months employed and Sales. Data: Reynolds. Quadratic regression model.
Let’s estimate the maximum sales!!
The knot/ the breakpoint: the value of the independent variable at which the relationship between the
independent variable and the dependent variable changes

Set the derivative equal to zero to find the maximum (basic high-school calculus):

X at the maximum = −β1 / (2β2)
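
A quick sketch of this calculation is below; the months-employed and scales-sold figures are made up, not the actual Reynolds data.

```python
import numpy as np

# Hypothetical months-employed and scales-sold data with a curvilinear pattern
months = np.array([2, 10, 20, 40, 60, 80, 100, 110])
scales = np.array([60, 150, 240, 350, 400, 410, 390, 370])

# Fit Y = b0 + b1*X + b2*X^2 (np.polyfit returns the highest power first)
b2, b1, b0 = np.polyfit(months, scales, deg=2)

x_star = -b1 / (2 * b2)                       # turning point: derivative equals zero
max_scales = b0 + b1 * x_star + b2 * x_star**2
print(f"Sales peak at about {x_star:.1f} months, predicted scales sold {max_scales:.0f}")
```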

630 - Interaction terms between independent variables


An interaction occurs when the effect of one variable depends on another variable; that is, the
relationship between the dependent variable and one independent variable differs at different
values of a second independent variable.
We can test for an interaction by defining a new variable as the product of the two variables and
testing whether this new variable is significant, which leads to an alternative model.

Case: Advertising expenditure – Price and Sales Data: Tyler

Difference in mean sales between advertising expenditures of $50,000 and $100,000 depends on the
price of the product.
At higher selling prices, the effect of increased advertising expenditure diminishes.

Let’s consider the effect when price increases by $1

Let’s consider the effect when Advertising expenditure increases by $1,000
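
Written out, the usual specification of this interaction model (the exact form used in the Tyler case is assumed here) is:

$$ \text{Sales} = \beta_0 + \beta_1\,\text{Price} + \beta_2\,\text{Adv} + \beta_3\,(\text{Price} \times \text{Adv}) + \varepsilon $$

so a one-unit increase in Price changes expected Sales by β1 + β3·Adv, and a one-unit increase in Adv changes expected Sales by β2 + β3·Price; each effect depends on the level of the other variable.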

631 - Categorical by Continuous

For example:

Income depends on MBA and Age, but you believe that the effect of Age on income differs
between two groups: those with and those without an MBA degree.

If MBA = 1, then we have

If MBA = 0, then
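
Written out, the model and the two cases above are (a sketch, assuming Income is regressed on MBA, Age and their product):

$$ \text{Income} = \beta_0 + \beta_1\,\text{MBA} + \beta_2\,\text{Age} + \beta_3\,(\text{MBA} \times \text{Age}) + \varepsilon $$

If MBA = 1: Income = (β0 + β1) + (β2 + β3)·Age + ε
If MBA = 0: Income = β0 + β2·Age + ε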
632 - Categorical by Categorical

For examples:

MBA with Gender


Gender with Managerial / Supervisory Position

The coefficient of the interaction term between two categorical variables captures the additional difference for
observations that belong to both groups at once (the intersection of the two categories). This coefficient acts as a
constant that shifts the model when an observation falls in both categories.

633 - Continuous by Continuous

For example

Experience and Years of Education

Take the MLR model:

The partial effect of X1 is given by:

And the partial effect of X2 is given by:

Clearly, the partial effect of one of the continuous variables in the interaction term depends on the size
of the term it is interacted with.
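
In standard notation, the model and partial effects referred to above are:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon $$

$$ \frac{\partial Y}{\partial X_1} = \beta_1 + \beta_3 X_2, \qquad \frac{\partial Y}{\partial X_2} = \beta_2 + \beta_3 X_1 $$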

640 - A note on multicollinearity


641 - Co- and multi-collinearity problems

Multicollinearity occurs when we have two or more independent variables that are highly
correlated with one another.
If those variables capture similar things, for example income and disposable income, or body size
and weight, then we should consider using only one of them (the better variable).
Multicollinearity affects the statistical significance of t-tests but does not have much impact on
predictive power.
If you have two independent variables that are highly correlated with one another, think about:
whether they measure the same thing
whether they could be the result of a chain or a fork
Avoiding multiple variables that capture the same thing helps us avoid
(1) multicollinearity problems (i.e. for situational assessment) and
(2) over-fitting the model in predictive analytics.
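
A quick way to spot such pairs before fitting is a correlation matrix of the X variables. The sketch below uses simulated data in which disposable_income is deliberately a near-duplicate of income; the column names and the 0.8 threshold are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
income = rng.normal(60_000, 10_000, n)
df = pd.DataFrame({
    "income":            income,
    "disposable_income": income * 0.75 + rng.normal(0, 1_000, n),  # near-duplicate of income
    "advertising":       rng.normal(5_000, 1_000, n),
})

corr = df.corr()
print(corr.round(2))

# Flag pairs of X variables that are highly correlated with one another
pairs = [(a, b, round(corr.loc[a, b], 2))
         for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > 0.8]
print("Candidates to drop or merge:", pairs)
```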

642 - From causal diagrams to mathematical equations

The simplest form of empirical model would be a regression model, as below.


This equation fails to capture the actual relationships among the independent variables (X1 to X6).

Shortcomings

Consider X1, X2 and X4: the associations among these variables are clear, hence we say that this
model suffers from a multicollinearity problem.
Also, we cannot use standard significance tests to reliably determine which independent variables
exert the most influence.

A solution (not discussed further in this unit)

It is possible to use a structural equation model (SEM), estimated stepwise via a two-stage regression.

1. Stage 1

2. Stage 2: use the estimated values of the intermediate independent variable obtained from the stage 1 regression.
650 - Assessment of model assumptions
