Business Data Analytics Part 4
Analyze data
Tasks
Strong collaboration between a data scientist and a
business analyst ensures that the analytics work is
performed within the correct business context.
2/ Prepare data
3/ Explore data
Image source:
https://artsandculture.google.com/asset/ZgEyj5EEKdux-g
When developing the data analysis plan, the analyst
determines:
A delivery professional (such as a project manager or a business analyst) provides insights into the plan or may draft the initial plan for review by the data scientist.
The data scientist possesses deep technical expertise to decide how the data analysis will be conducted.
Strengths:
● A proven method that is used extensively
● Easy to understand and explain
Limitations:
● Simple construct - may perform poorly
● The variables must be truly independent
● The variables should not be serially correlated
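A minimal sketch of simple linear regression with scikit-learn; the advertising-spend figures below are illustrative, not taken from the course material:

# Minimal sketch: simple linear regression (scikit-learn), illustrative data only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical example: predict monthly sales (k$) from advertising spend (k$).
ad_spend = np.array([[10], [20], [30], [40], [50]])  # feature matrix, one feature
sales = np.array([25, 45, 62, 85, 105])              # target values

model = LinearRegression().fit(ad_spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("forecast for spend=60:", model.predict([[60]])[0])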
Technique: Seasonality analysis
Linear (or any other simple) regression will not always work, especially when we talk about
cyclic trends, such as seasonality over a timeframe. ARIMA is a forecasting algorithm based
on the idea that the information in the past values of the time series can predict future
values.
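A minimal sketch of a seasonal ARIMA forecast with statsmodels' SARIMAX; the monthly demand series and the model orders are illustrative assumptions, not tuned values from the source:

# Minimal sketch: seasonal ARIMA (SARIMA) forecasting with statsmodels.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical example: three years of monthly demand with a yearly cycle.
rng = np.random.default_rng(0)
months = np.arange(36)
demand = 100 + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 36)

# (p, d, q) non-seasonal order and (P, D, Q, s) seasonal order with a 12-month cycle.
model = SARIMAX(demand, order=(1, 0, 0), seasonal_order=(1, 1, 0, 12))
fitted = model.fit(disp=False)
print(fitted.forecast(steps=6))  # forecast the next six months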
Usage considerations
Strengths:
● Can handle time-series data with trends
Limitations:
● Slowly gets phased out by more accurate algorithms
Technique: Classification
Logistic regression is a statistical method for predicting binary classes - in simple words, it
helps attribute an observation to one of the two potential outcomes.
Image source:
https://commons.wikimedia.org/wiki/File:Exam_pass_logistic_curve.jpeg
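A minimal sketch of logistic regression for binary classification with scikit-learn, echoing the exam-pass curve referenced above; the hours-studied data is illustrative:

# Minimal sketch: logistic regression on illustrative hours-studied vs. pass/fail data.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])  # 1 = passed the exam

clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[2.75]]))        # predicted class for 2.75 hours of study
print(clf.predict_proba([[2.75]]))  # probability of each of the two outcomes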
Usage considerations
Strengths:
● Used for binary classification
Limitations:
● Can have high bias towards model assumptions
● Requires preprocessing and normalization of data
● There are other means of classification that work better under specific circumstances
Naive Bayes is a technique for constructing classifiers: models that assign a class to an observation based on a set of features.
E.g. predict if a person is male or female based on height, weight, foot size.
For each individual, the classifier combines the prior probability of each gender with the probability of observing the given measurements for that gender, and assigns the gender with the higher resulting probability.
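A minimal sketch of the gender example with scikit-learn's GaussianNB; the height, weight and foot-size measurements are illustrative assumptions:

# Minimal sketch: Gaussian Naive Bayes on illustrative body measurements.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features: height (cm), weight (kg), foot size (EU).
X = np.array([
    [182, 82, 44], [178, 79, 43], [170, 75, 42], [175, 80, 43],  # male
    [160, 55, 37], [165, 60, 38], [158, 52, 36], [168, 62, 39],  # female
])
y = ["male"] * 4 + ["female"] * 4

clf = GaussianNB().fit(X, y)
print(clf.predict([[172, 68, 40]]))        # most probable class
print(clf.predict_proba([[172, 68, 40]]))  # posterior probability per class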
A decision tree is a decision support tool that uses a tree-like model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility.
[Diagram: a decision tree splitting the set A1B2C3D4 into A1B2 and C3D4, then into the subsets 12, AB, 34 and CD]
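A minimal sketch of a decision tree classifier with scikit-learn; the loan-style features and thresholds are hypothetical:

# Minimal sketch: decision tree on illustrative loan data, printed as readable rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [income (k$), existing_debt (k$)].
X = [[20, 15], [35, 10], [50, 5], [70, 2], [25, 20], [60, 30]]
y = ["reject", "reject", "approve", "approve", "reject", "approve"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "existing_debt"]))
print(tree.predict([[45, 8]]))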
Random forests are an ensemble learning method for classification, regression and other
tasks that operates by constructing a multitude of decision trees at training time.
Image source:
https://en.wikipedia.org/wiki/Random_forest#/media/File:Random_forest_diagram_complete.png
Example - step 1: get the training data
id  red  green  blue  size  stripes  fruit
1   1    1      0     6     0        Apple
2   0    1      0     8     0        Apple
3   0    0      1     0.5   0        Blueberry
4   0    0      1     0.5   0        Blueberry
7   0    1      0     32    1        Watermelon
8   0    1      0     35    1        Watermelon
Example - step 2: generate random forest using bagging and feature randomness
Each tree is trained on a bootstrap sample of the training data, for example:
id  red  green  blue  size  stripes  fruit
2   0    1      0     8     0        Apple
3   0    0      1     0.5   0        Blueberry
7   0    1      0     32    1        Watermelon
8   0    1      0     35    1        Watermelon
Example - step 3: classify a new observation by majority vote
red  green  blue  size  stripes  fruit
0.7  0.3    0     10    0        ?
Tree 1: 1. size > 10 -> Watermelon; 2. green > 0 -> Orange
Tree 2: 1. stripes > 0 -> Watermelon; 2. green < 1 -> Orange
Tree 3: 1. blue > 0 -> Blueberry; 2. size > 8 -> Watermelon
Tree votes: Orange, Orange, Watermelon
Majority vote: Orange
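A minimal sketch of the same fruit example with scikit-learn's RandomForestClassifier. The column names follow the headers inferred above, and because the Orange rows (ids 5 and 6) do not appear in the extracted training table, this sketch can only vote among Apple, Blueberry and Watermelon:

# Minimal sketch: random forest on the fruit data shown above (column names inferred).
from sklearn.ensemble import RandomForestClassifier

# Features: [red, green, blue, size, stripes]
X = [
    [1, 1, 0, 6, 0],     # Apple
    [0, 1, 0, 8, 0],     # Apple
    [0, 0, 1, 0.5, 0],   # Blueberry
    [0, 0, 1, 0.5, 0],   # Blueberry
    [0, 1, 0, 32, 1],    # Watermelon
    [0, 1, 0, 35, 1],    # Watermelon
]
y = ["Apple", "Apple", "Blueberry", "Blueberry", "Watermelon", "Watermelon"]

# Bagging and feature randomness are handled internally by the ensemble.
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)
print(forest.predict([[0.7, 0.3, 0, 10, 0]]))  # the trees' predictions are combined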
Usage considerations
Strengths:
● Easy to visualise and understand (trees)
● Works in most cases with high accuracy
Limitations:
● May fall victim to generalisation errors (may perform poorly if future data is significantly different from the training data)
Other classification tools
● K-Nearest Neighbors algorithm - groups observations together based on similarity across a set of parameters. Often used to find “similar items” or “more like this product”.
● Support vector machine - an algorithm that finds a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
● Perceptron - like the SVM, it tries to find a hyperplane that classifies the data points, but it uses different mathematics, which allows the model to keep training over time.
Image source:
https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:SVM_margin.png
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm#/media/File:Map1NN.png
https://en.wikipedia.org/wiki/Perceptron#/media/File:Perceptron_example.svg
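A minimal sketch that fits the three classifiers above on one small, illustrative dataset (scikit-learn assumed):

# Minimal sketch: KNN, SVM and perceptron on the same illustrative 2-D points.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron

X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [8, 6]]  # two features per observation
y = [0, 0, 0, 1, 1, 1]                                # two classes

for model in (KNeighborsClassifier(n_neighbors=3), SVC(kernel="linear"), Perceptron()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[5, 5]]))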
Prepare data
Preparing data involves obtaining access to
the planned data sources and establishing
the relationships and linkages between
sources in order to create a coherent dataset.
— Guide to Business Data Analytics, IIBA
Preparing data
1. Understand relationships
2. Establish joins/linkages
3. Normalize
4. Standardize
5. Scale
6. Convert
7. Cleanse
8. Validate
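A minimal sketch of several of the preparation steps above with pandas and scikit-learn; the tables and column names are illustrative assumptions:

# Minimal sketch: join, convert, cleanse, standardize and validate a small dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler

orders = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": ["10.5", "20.0", None, "7.25"]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["SMB", "ENT", "SMB"]})

df = orders.merge(customers, on="customer_id", how="left")  # establish joins/linkages
df["amount"] = pd.to_numeric(df["amount"])                  # convert to numeric
df = df.dropna(subset=["amount"])                           # cleanse missing values
df["amount_scaled"] = StandardScaler().fit_transform(df[["amount"]]).ravel()  # standardize/scale
assert df["customer_id"].notna().all()                      # validate the linked result
print(df)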
Explore data
Exploring data involves performing an initial
exploratory analysis to ensure the data being
collected is what was expected from the data
sources.
— Guide to Business Data Analytics, IIBA
The data scientist assesses the data quality to
determine the course of action using the following
checkpoints:
1/ Data integrity
2/ Data validity
3/ Data reliability
4/ Data bias
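A minimal sketch of quick exploratory checks mapped to the four checkpoints above; the DataFrame and its columns are illustrative assumptions:

# Minimal sketch: data quality checkpoints on an illustrative extract.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, -5.0, 20.0, None],
    "region": ["EU", "EU", "EU", "EU"],
})

print(df["order_id"].is_unique)                   # integrity: duplicate keys?
print((df["amount"].dropna() >= 0).all())         # validity: values in the expected range?
print(df.isna().mean())                           # reliability: share of missing values per column
print(df["region"].value_counts(normalize=True))  # bias: is one group over-represented?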
Perform data analysis
Original question -> Math question -> Model
1. Statistical tests
2. Regression analysis
3. Machine learning
1. Types of data
2. Organisation of data
3. Central tendency
4. Deviation
5. Probability and its distribution
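A minimal sketch of the central tendency, deviation and distribution concepts above, with illustrative numbers:

# Minimal sketch: descriptive statistics on an illustrative revenue series.
import numpy as np

revenue = np.array([12, 15, 14, 80, 13, 16, 15])  # hypothetical daily revenue (k$)

print("mean:", revenue.mean())                    # central tendency
print("median:", np.median(revenue))              # robust to the outlier (80)
print("std deviation:", revenue.std(ddof=1))      # deviation / spread
print("quartiles:", np.percentile(revenue, [25, 50, 75]))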
Technical visualisations
Technical visualizations are used by data scientists to evolve their analysis, which in turn becomes the detailed basis for driving insights. They may not be useful for communicating insights to business stakeholders, but technical visualizations deepen the team's understanding.
Image source:
https://en.wikipedia.org/wiki/Autocorrelation#/media/File:Acf_new.svg
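A minimal sketch of one such technical visualization, an autocorrelation plot like the one referenced above; the series is illustrative, and statsmodels plus matplotlib are assumed to be available:

# Minimal sketch: plot the autocorrelation function (ACF) of an illustrative series.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(0)
months = np.arange(60)
series = 100 + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 60)

plot_acf(series, lags=24)  # spikes around lag 12 reveal the yearly cycle
plt.show()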
Assess the Analytics and System Approach Taken
[Diagram: the analytics team confirms “We have explored the data” and “We have analysed the data”, then asks: comfortable with the data being used?]
Technique: Simulation
Observe real world -> Build a model -> Test a model
1/ Risk simulation
2/ Event-based simulation
3/ Dynamic simulation
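A minimal sketch of a Monte Carlo risk simulation with NumPy; the revenue and cost distributions are illustrative assumptions:

# Minimal sketch: risk simulation by sampling uncertain inputs many times.
import numpy as np

rng = np.random.default_rng(0)
n_runs = 10_000

# Hypothetical project: uncertain revenue and cost, each modelled as a distribution.
revenue = rng.normal(loc=120, scale=15, size=n_runs)  # k$, mean 120, sd 15
cost = rng.normal(loc=100, scale=10, size=n_runs)     # k$, mean 100, sd 10
profit = revenue - cost

print("expected profit:", profit.mean())
print("probability of a loss:", (profit < 0).mean())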
Decision model
Usage considerations
Strengths:
● Optimization is the mathematical basis of most of the predictive, prescriptive, and operations research analytical models.
● Optimization methods converge rapidly (equating to finding the optimum solutions faster) when applied to large scale and complex problems using many variables.
Limitations:
● The optimized solution may not be the best solution available.
● More complex formulations are difficult to explain to the stakeholders.
● The process requires very accurate formulation of the constraints.
● Optimization at large scale requires significant processing power and time.
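A minimal sketch of a linear optimization with SciPy's linprog; the production-mix objective and constraints are illustrative assumptions:

# Minimal sketch: maximize profit subject to capacity constraints (linear programming).
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2 -> linprog minimizes, so negate the objective.
c = [-40, -30]
# Constraints: 2*x1 + 1*x2 <= 100 (machine hours), 1*x1 + 1*x2 <= 80 (labour hours).
A_ub = [[2, 1], [1, 1]]
b_ub = [100, 80]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal mix:", result.x, "maximum profit:", -result.fun)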
Case study: simulation and optimisation