Business Data Analytics Part 4

Part 4.

Analyze data
Tasks
Strong collaboration between a data scientist and a
business analyst ensures that the analytics work is
performed within the correct business context.

1/ Develop data analysis plan

2/ Prepare data

3/ Explore data

4/ Perform data analysis

5/ Assess the analytics and system approach taken
Develop data analysis plan
“If you fail to plan, you are planning to fail!”
― Benjamin Franklin

Image source:
https://artsandculture.google.com/asset/ZgEyj5EEKdux-g
When developing the data analysis plan, the analyst
determines:

1/ which techniques to use

2/ which models will be used

3/ which data sources will be used

4/ how data will be preprocessed and cleaned
Who creates the plan?

A delivery professional (such as a project manager or a business analyst) provides insights into the plan or may draft the initial plan for review by the data scientist. The data scientist possesses deep technical expertise to decide how the data analysis will be conducted.

Metrics and KPIs can be used to assist the data scientist in determining if the outcomes from data analysis are producing the results required to address the business need. Organizational knowledge helps business analysis professionals provide the context for the data scientist's work.
Technique: Linear regression
Linear regression models the linear relationship between an independent and a dependent variable, usually visualised as a straight line fitted through a plot of the data.

Example fit: Y = 3.69*X - 9.59, R² = 0.992


Usage considerations
Strengths:
● A proven method that is used extensively
● Easy to understand and explain

Limitations:
● Simple construct - may perform poorly
● The variables must be truly independent
● The variables should not be serially correlated
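To make the technique concrete, here is a minimal Python sketch, assuming NumPy and scikit-learn are installed; the data points are invented for illustration and are not taken from the slide's plot.

```python
# Minimal linear regression sketch (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations of one independent variable X and a dependent variable y.
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([-6.0, -2.2, 1.5, 5.3, 8.9, 12.6])

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0])        # estimated coefficient of X
print("intercept:", model.intercept_)  # estimated intercept
print("R^2:", model.score(X, y))       # goodness of fit on the same data
```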
Technique: Seasonality analysis
Linear (or any other simple) regression will not always work, especially when we talk about
cyclic trends, such as seasonality over a timeframe. ARIMA is a forecasting algorithm based
on the idea that the information in the past values of the time series can predict future
values.
Usage considerations
Strengths:
● Can handle time-series data with trends

Limitations:
● Is slowly being phased out by more accurate algorithms
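A minimal forecasting sketch, assuming the statsmodels library is available; the monthly series and the ARIMA order (1, 1, 1) are illustrative choices, not recommendations.

```python
# ARIMA forecasting sketch (illustrative monthly series).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly sales with a mild trend and a seasonal pattern.
sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
     115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
    index=pd.date_range("2022-01-01", periods=24, freq="MS"),
)

# The (p, d, q) order would normally be chosen from ACF/PACF plots or a search.
model = ARIMA(sales, order=(1, 1, 1)).fit()
print(model.forecast(steps=3))  # point forecasts for the next three months
```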
Technique: Classification
Logistic regression is a statistical method for predicting binary classes - in simple words, it
helps attribute an observation to one of the two potential outcomes.

Image source:
https://commons.wikimedia.org/wiki/File:Exam_pass_logistic_curve.jpeg
Usage considerations
Strengths:
● Used for binary classification

Limitations:
● Can have high bias towards model assumptions
● Requires preprocessing and normalization of data
● There are other means of classification that work better under specific circumstances
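A minimal Python sketch, loosely following the exam-pass example from the linked figure; the hours and outcomes below are invented for illustration and scikit-learn is assumed to be installed.

```python
# Logistic regression sketch: hours studied vs. exam outcome (invented data).
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])  # 0 = fail, 1 = pass

clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[2.75]]))        # predicted class for a new student
print(clf.predict_proba([[2.75]]))  # probability of fail vs. pass
```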
Naive Bayes is a technique for constructing classifiers: models that assign a class based on a set of features.

E.g. predict if a person is male or female based on height, weight, foot size.

For each individual, the classifier combines the prior probability of a person being of a specific gender with the probability of observing the given measurements for that gender, and assigns the more probable class.
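A sketch of the gender example using Gaussian Naive Bayes from scikit-learn; the measurements below are invented placeholders, not real data.

```python
# Gaussian Naive Bayes sketch for the gender example (invented measurements).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features: height (cm), weight (kg), foot size (EU).
X = np.array([
    [183, 82, 44], [178, 79, 43], [170, 77, 42], [175, 75, 43],   # male
    [160, 55, 38], [165, 60, 39], [158, 52, 37], [172, 63, 40],   # female
])
y = np.array(["male"] * 4 + ["female"] * 4)

clf = GaussianNB().fit(X, y)
print(clf.predict([[168, 62, 40]]))        # most probable class for a new person
print(clf.predict_proba([[168, 62, 40]]))  # posterior probability per class
```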
A decision tree is a decision support tool that uses a tree-like model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility.

[Diagram: a decision tree repeatedly splitting the dataset A1B2C3D4 into smaller subsets.]
Random forests are an ensemble learning method for classification, regression and other
tasks that operates by constructing a multitude of decision trees at training time.

Each tree follows 2 rules:

- Bagging: create each tree by randomly sampling your original data
- Feature randomness: use a subset of all possible features for each tree

[Diagram: bootstrap samples drawn from the dataset A1B2C3D4; elements may repeat because the sampling is done with replacement.]

Image source:
https://en.wikipedia.org/wiki/Random_forest#/media/File:Random_forest_diagram_complete.png
Example - step 1: get the training data

#   Red    Green  Blue   Size (cm)  Stripes  Class
1   1      1      0      6          0        Apple
2   0      1      0      8          0        Apple
3   0      0      1      0.5        0        Blueberry
4   0      0      1      0.5        0        Blueberry
5   0.65   0.35   0      8          0        Orange
6   0.65   0.35   0      10         0        Orange
7   0      1      0      32         1        Watermelon
8   0      1      0      35         1        Watermelon
Example - step 2: generate random forest using bagging and feature randomness

Tree rules for this sample:
1. blue > 0 -> Blueberry
2. size > 8 -> Watermelon

#   Red    Green  Blue   Size (cm)  Stripes  Class
1   1      1      0      6          0        Apple
2   0      1      0      8          0        Apple
3   0      0      1      0.5        0        Blueberry
4   0.1    0      0.9    0.3        0        Blueberry
5   0.65   0.35   0      8          0        Orange
6   0.65   0.35   0      10         0        Orange
7   0      1      0      32         1        Watermelon
8   0      1      0      35         1        Watermelon
Example - step 2: generate random forest using bagging and feature randomness

Tree rules for this sample:
1. stripes > 0 -> Watermelon
2. green < 1 -> Orange

#   Red    Green  Blue   Size (cm)  Stripes  Class
1   1      1      0      6          0        Apple
2   0      1      0      8          0        Apple
3   0      0      1      0.5        0        Blueberry
4   0.1    0      0.9    0.3        0        Blueberry
5   0.65   0.35   0      8          0        Orange
6   0.65   0.35   0      10         0        Orange
7   0      1      0      32         1        Watermelon
8   0      1      0      35         1        Watermelon
Example - step 2: generate random forest using bagging and feature randomness

Tree rules for this sample:
1. size > 10 -> Watermelon
2. green > 0 -> Orange

#   Red    Green  Blue   Size (cm)  Stripes  Class
1   1      1      0      6          0        Apple
2   0      1      0      8          0        Apple
3   0      0      1      0.5        0        Blueberry
4   0.1    0      0.9    0.3        0        Blueberry
5   0.65   0.35   0      8          0        Orange
6   0.65   0.35   0      10         0        Orange
7   0      1      0      32         1        Watermelon
8   0      1      0      35         1        Watermelon

* use the Gini impurity metric to find the best split


Example - step 3: apply the trees and pick a winner

New observation:
Red    Green  Blue   Size (cm)  Stripes  Class
0.7    0.3    0      10         0        ?

The three trees vote:
● size > 10 -> Watermelon; green > 0 -> Orange    => Orange
● stripes > 0 -> Watermelon; green < 1 -> Orange  => Orange
● blue > 0 -> Blueberry; size > 8 -> Watermelon   => Watermelon

Majority vote: Orange
Usage considerations
Strengths:
● Easy to visualise and understand (trees)
● Works in most cases with high accuracy

Limitations:
● May fall victim to generalisation errors (may perform poorly if future data is significantly different from the training data)
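The fruit example above can be reproduced in a few lines with scikit-learn, which applies bagging and feature randomness internally; treat this as a sketch rather than an exact replay of the hand-built trees.

```python
# Random forest sketch on the toy fruit data from the example.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: red, green, blue, size (cm), stripes.
X = np.array([
    [1, 1, 0, 6, 0], [0, 1, 0, 8, 0],               # apples
    [0, 0, 1, 0.5, 0], [0, 0, 1, 0.5, 0],           # blueberries
    [0.65, 0.35, 0, 8, 0], [0.65, 0.35, 0, 10, 0],  # oranges
    [0, 1, 0, 32, 1], [0, 1, 0, 35, 1],             # watermelons
])
y = np.array(["Apple", "Apple", "Blueberry", "Blueberry",
              "Orange", "Orange", "Watermelon", "Watermelon"])

# Each tree is grown on a bootstrap sample and considers a random feature
# subset at every split (bagging + feature randomness).
forest = RandomForestClassifier(n_estimators=3, random_state=42).fit(X, y)
print(forest.predict([[0.7, 0.3, 0, 10, 0]]))  # the unseen fruit from step 3
```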
Other classification tools
● K-Nearest Neighbors Algorithm - grouping
observations together based on similarity in
a set of parameters. Often used to find
“similar items”, or “more like this product”.
● Support vector machine - an algorithm to
find a hyperplane in an N-dimensional space
(N — the number of features) that distinctly
classifies the data points.
● Perceptron - like SVM, it tries to find a hyperplane that classifies the data points, but it uses different math behind it, which allows it to keep training over time.

Image sources:
https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:SVM_margin.png
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm#/media/File:Map1NN.png
https://en.wikipedia.org/wiki/Perceptron#/media/File:Perceptron_example.svg
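As a flavour of these alternatives, here is a minimal k-nearest-neighbours sketch with scikit-learn; the item features and labels are invented for illustration.

```python
# k-nearest neighbours sketch for "more like this" style grouping (invented data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two parameters per item, e.g. price and average rating.
X = np.array([[10, 4.5], [12, 4.7], [11, 4.4], [45, 3.1], [50, 2.9], [48, 3.3]])
y = np.array(["budget", "budget", "budget", "premium", "premium", "premium"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[14, 4.2]]))  # label of the majority among the 3 nearest items
```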
Prepare data
Preparing data involves obtaining access to
the planned data sources and establishing
the relationships and linkages between
sources in order to create a coherent dataset.
— Guide to Business Data Analytics, IIBA
Preparing data

1. Understand relationships
2. Establish joins/linkages
3. Normalize
4. Standardize
5. Scale
6. Convert
7. Cleanse
8. Validate
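A pandas sketch touching several of the steps above (join, convert, cleanse, validate, scale); the file and column names are hypothetical examples, not from the source.

```python
# Data preparation sketch (hypothetical files and columns).
import pandas as pd

orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount, order_date
customers = pd.read_csv("customers.csv")  # customer_id, region

# Establish joins/linkages between the two sources.
df = orders.merge(customers, on="customer_id", how="left")

# Convert types so dates and numbers behave as expected.
df["order_date"] = pd.to_datetime(df["order_date"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Cleanse and validate: drop duplicates and rows missing key fields.
df = df.drop_duplicates(subset="order_id").dropna(subset=["customer_id", "amount"])

# Scale a numeric column to the 0-1 range for later modelling.
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (df["amount"].max() - df["amount"].min())
```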
Explore data
Exploring data involves performing an initial
exploratory analysis to ensure the data being
collected is what was expected from the data
sources.
— Guide to Business Data Analytics, IIBA
The data scientist assesses the data quality to
determine the course of action using the following
checkpoints:

1/ Data integrity

2/ Data validity

3/ Data reliability

4/ Data bias
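A quick exploratory sketch of these checkpoints with pandas, assuming the prepared dataset from the previous step; the file and column names are hypothetical.

```python
# Exploratory data-quality sketch (hypothetical file and columns).
import pandas as pd

df = pd.read_csv("prepared_orders.csv")

print(df.shape)          # integrity: did we receive the expected volume of rows?
print(df.dtypes)         # validity: are the types what the sources promised?
print(df.isna().mean())  # reliability: share of missing values per column
print(df.describe())     # quick look at ranges and possible outliers
print(df["region"].value_counts(normalize=True))  # bias: is any segment over-represented?
```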
Perform data analysis
[Flow: the original question is reframed as a math question, which is then answered with a model.]

Modelling approaches:
1. Statistical tests
2. Regression analysis
3. Machine learning

Statistical foundations:
1. Types of data
2. Organisation of data
3. Central tendency
4. Deviation
5. Probability and its distribution
Technical
visualisations
Technical visualizations are used by data scientists to evolve their analysis into the detailed findings that drive insights. They may not be useful for communicating insights to business stakeholders, but technical visualizations deepen the team's understanding.
Image source:
https://en.wikipedia.org/wiki/Autocorrelation#/media/File:Acf_new.svg
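An autocorrelation plot like the one referenced above can be produced with statsmodels; the series below is synthetic, generated only to show a seasonal pattern.

```python
# Autocorrelation plot sketch, a typical technical visualisation (synthetic series).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(0)
t = np.arange(200)
# Seasonal signal plus noise, so the ACF shows a visible 12-step cycle.
series = np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=t.size)

plot_acf(series, lags=40)  # correlation of the series with its lagged copies
plt.show()
```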
Assess the Analytics and System
Approach Taken
Analytics team: we have explored the data, we have analysed the data - but have we answered the research question?

[Flow: data exploration -> data analysis -> "Comfortable with the data being used?" If yes, finish the analysis; if no, loop back to exploration and analysis.]
Technique: Simulation
Observe the real world -> Build a model -> Test the model -> Use the model for decision making
Types of simulation:

1/ Risk simulation

2/ Event-based simulation

3/ Dynamic simulation
Usage considerations
Strengths:
● Cause-action-reaction chains can be modelled without disrupting the business.
● Complex business situations can be modelled with accurate inputs.
● Simulations are computationally efficient and involve lower data acquisition cost.
● They are accurate for business scenarios with many contributing factors and a low amount of data.
● Simulations can be used in modelling prescriptive actions and predictions under business constraints.

Limitations:
● Creating effective simulations requires expert knowledge of the system being simulated.
● The outcome of a simulation experiment can be difficult to explain due to the many variables involved.
● Other types of modelling techniques are considered more effective.
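A minimal Monte Carlo risk-simulation sketch in Python; every distribution and figure below is a hypothetical illustration, not real business data.

```python
# Monte Carlo risk-simulation sketch (hypothetical business figures).
import numpy as np

rng = np.random.default_rng(42)
n_runs = 100_000

# Uncertain demand and unit cost drive uncertain profit.
demand = rng.normal(loc=10_000, scale=1_500, size=n_runs)  # units sold
unit_cost = rng.uniform(low=4.0, high=6.0, size=n_runs)    # cost per unit
price = 9.0                                                 # fixed sale price
profit = demand * (price - unit_cost)

print("expected profit:", profit.mean())
print("5th percentile (downside risk):", np.percentile(profit, 5))
print("probability of a loss:", (profit < 0).mean())
```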
Technique: Optimisation
Optimization can be described as choosing
the best possible option among multiple
available options under some constraints.

— Guide to Business Data Analytics, IIBA


[Diagram: a decision model combines decision variables, an objective/cost/error function, and constraints.]
Usage considerations
Strengths:
● Optimization is the mathematical basis of most of the predictive, prescriptive, and operations research analytical models.
● Optimization methods converge rapidly (equating to finding the optimum solutions faster) when applied to large-scale and complex problems using many variables.

Limitations:
● The optimized solution may not be the best solution available.
● More complex formulations are difficult to explain to the stakeholders.
● The process requires very accurate formulation of the constraints.
● The optimization process at large scale requires processing power and time.
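A small linear-programming sketch with SciPy showing decision variables, an objective function, and constraints together; the product-mix numbers are hypothetical.

```python
# Linear-programming sketch (hypothetical product-mix problem).
from scipy.optimize import linprog

# Decision variables: units of product A and product B.
# Objective: maximise 30*A + 40*B; linprog minimises, so the profits are negated.
c = [-30, -40]

# Constraints: 2A + 3B <= 120 machine hours, A + 2B <= 70 labour hours.
A_ub = [[2, 3], [1, 2]]
b_ub = [120, 70]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("units of A and B:", result.x)
print("maximum profit:", -result.fun)
```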
Case study: simulation and optimisation
