Business Analytics - Notes
LECTURE NOTES
Subjects
What to expect?
➢ Many data sets can be found everywhere (e.g. dataoverheid.nl)
Descriptive
➢ Descriptive analytics
➢ All techniques that describe what has happened in the past
➢ Ex. Visualisation techniques, dashboards,
Predictive
➢ Predictive analytics
➢ Techniques that use data in the past to predict behaviour in the future, determine the
impact of factors
➢ Goes one step further than descriptive
➢ Ex. Use of historical sales to predict future sales
o Use of purchasing behaviour of consumers to predict market shares
o What are the risk factors of a new disease?
o Which characteristics determine whether a soccer team is able to win the
match?
Prescriptive
➢ Prescriptive analytics
➢ Indicates the best course of action to take
➢ Based on the data, what is the optimal thing to do given the restrictions that
are present?
➢ Ex. What is the best pricing strategy for a company?
o At which location should a factory be opened to meet customer requirements at
minimum cost?
o What financial investments need to be made to achieve superior returns with as
little risk as possible?
o More advanced than descriptive and predictive
Big data
Sub domains of Business analytics
➢ Data – facts and figures (information) collected, analysed, and summarized for
presentation and interpretation
➢ Variable – a quantity of interest that can take on many values
➢ Observation – a set of values corresponding to a set of variables
➢ Variation – variety in your data
Data
Subjects
❖ Introduction to Oracle
❖ Data visualisation techniques in Oracle
➢ Managing data
➢ Oracle’s top three customers: SailGP (adrenaline-fuelled boat racing, powered by
wind), the Premier League and Red Bull
➢ 1. Challenge in data management – 1 day of sailing creates 40 billion rows of data from
thousands of sensors
o Solution – Oracle combines all the different types of data into one database
o Use case of Britain – the UK team crashed their boat in Spain. Why? Because all the data
points were scattered around the place
o Foiling is when boats start ‘flying’
➢ 2. Analytics – too much data to convert
o Converting 900 data points per second for each boat into real-time data
o Because of these ‘data streams’, sailors are able to see how fast other competitors
sail and get more insights on the racetrack
o Automatically calculates speed v distance for sailors, shows them how they can
get to the final destination the quickest (through different destinations)
➢ 3. Machine learning – predicting boat speeds, etc.
o To predict the optimal boat speed and the optimal course to take
➢ Oracle labs
➢ Check under week 2 (Oracle material) on Brightspace for Oracle manual and practice
Subjects
❖ Data visualisation
❖ Cluster analysis
Data visualization
➢ Of vital importance
➢ Logical follow-up of 1st week (measures of location, variability, and shapes)
➢ Summary table for the data
➢ Good means to communicate the message to others
➢ A famous data visualisation book discusses examples of misleading charts and
graphs
Table vs graphs
Data-ink ratio
Tables
Visualisations in tables
➢ Adding sparklines to your tables
➢ Heat maps
➢ Table + graph
Graphs: 4 types
➢ The majority of graphs fit into one of the four types
➢ Points
o Scatter charts display the relationship between two variables
o Correlation relationships
➢ Lines
o Nominal scale – categories that are not sorted
o Ordinal scale – categories that are sorted
o Interval scale – underlying measure of time
o Line charts should be used with an interval scale
o Suitable for trends and patterns
o Only directly connect values in adjacent intervals
o Time-series relationships
o Whether or not you include ‘0’ depends on how you illustrate the values
➢ Bars
o Serve to highlight individual values
o Suitable for comparing/ranking categorical values
o Good for: ranking relationships, part-to-whole and nominal comparisons
o Can be misleading
o What can be improved when using a bar chart? Include or start with ‘0’
(quantitative scale)
o Bar vs column charts
o Pie charts – avoid them when you want to compare categorical data properly; avoid 3D and
many slices
➢ Boxes
o Boxplots show the entire distribution of a variable
➢ Nice visualisation tool -> Gapminder.com
➢ Bubble chart
➢ Geospatial relationships in 2D and 3D in Excel or Oracle
Choropleths
➢ Spaghetti graph
Cluster analysis
➢ Concept of similarity
➢ Goal/aim is to segment observations into similar (homogeneous) groups (the
clusters)
o Ex. Market segmentation
➢ A doctor may reason about a new difficult case by recalling a similar case (either treated
personally or documented in a journal) and its diagnosis
Similarity
➢ Measuring similarity
➢ Observations within a cluster are similar to each other, but dissimilar to
observations in other clusters
➢ Determine the cluster variables
➢ For quantitative variables, the Euclidean distance is the most common distance
measure
➢ Another option: Manhattan distance
➢ Distance measures:
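The two distance measures named above can be sketched in a few lines. This is a minimal illustration, not the course's Oracle workflow: the function names and the two example customers are made up, and the z-score helper assumes the sample standard deviation is used for standardising.

```python
import math
from statistics import mean, stdev


def euclidean(a, b):
    """Straight-line distance: root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def z_scores(values):
    """Standardise one variable so that its scale does not dominate the distance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical observations: (age in years, income in thousands)
customer_a = (25, 40)
customer_b = (28, 44)

print(euclidean(customer_a, customer_b))  # 5.0
print(manhattan(customer_a, customer_b))  # 7
print(z_scores([2, 4, 6]))                # [-1.0, 0.0, 1.0]
```

Standardising to z-scores first matters when the variables are measured in very different units, because otherwise the variable with the largest scale drives the distance.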
Similarity measures
Distance measures
z-scores
▪ Single linkage -> Similarity between clusters is determined by the
shortest distance between two observations (nearest)
▪ Complete linkage -> Similarity between clusters is determined by the
longest distance between two observations (furthest)
▪ Group average linkage -> Similarity between clusters is determined by
the average distance across all pairs
▪ Centroid linkage -> Similarity between clusters is determined by the
distance between the "centroids"
➢ Dendrogram shows the output of a hierarchical clustering
o Vertical axis shows distance between clusters
o indicates where "natural" clusters are present
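To make the four linkage rules concrete, the sketch below computes each of them for two small clusters. The clusters and function names are hypothetical, and Euclidean distance is assumed as the underlying distance measure.

```python
import math
from itertools import product


def pair_distances(c1, c2):
    """Euclidean distance for every pair with one observation from each cluster."""
    return [math.dist(a, b) for a, b in product(c1, c2)]


def single_linkage(c1, c2):
    return min(pair_distances(c1, c2))   # nearest pair


def complete_linkage(c1, c2):
    return max(pair_distances(c1, c2))   # furthest pair


def average_linkage(c1, c2):
    d = pair_distances(c1, c2)
    return sum(d) / len(d)               # average over all pairs


def centroid(cluster):
    """Mean of each variable over the observations in the cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_linkage(c1, c2):
    return math.dist(centroid(c1), centroid(c2))


# Two hypothetical clusters of 2-D observations
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (5, 0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(cluster_a, cluster_b), 3))
```

In a bottom-up hierarchical clustering, the two clusters with the smallest linkage distance are merged at each step; the dendrogram records the distance at which each merge happens.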
➢ K-Means clustering assigns each observation to one of the k clusters in such a way
that the observations within a cluster are as similar as possible
➢ Number of clusters k is specified in advance
➢ The “cluster centroids” are calculated (the “means”)
➢ Steps 1 to 4 [figures]
➢ When do we have a good solution?
o Silhouette score -> rule of thumb: the ratio of between-cluster distance to average
within-cluster distance should exceed 1.0 for useful clusters
➢ Elbow-method
➢ How close are the observations to the centroids?
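A minimal sketch of the usual k-means loop, together with the quantity the elbow method looks at (how close the observations are to their centroids). It may differ in detail from the steps shown in the lecture: the starting centroids are assumed to be k randomly chosen observations, and the data and names are hypothetical.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)    # assumption: start from k random observations
    assignment = None
    for _ in range(max_iter):
        # Assign every observation to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:  # nothing changed: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances to the centroids
    wcss = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, wcss


# Hypothetical data with two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

# Elbow method: look for the k after which the drop in WCSS levels off
for k in range(1, 5):
    _, _, wcss = k_means(points, k)
    print(k, round(wcss, 2))
```

Because the starting centroids are random, different seeds can give different solutions on less clear-cut data, which is the disadvantage hinted at in the comparison of the two methods.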
➢ Both methods (hierarchical and k-means) depend on how similar observations
are to each other
➢ Hierarchical
o Start with all observations as clusters (bottom-up)
o Suitable for smaller datasets
o Visually appealing
o Multiple types of variables
➢ K-means
o Start with k clusters (random; disadvantage?)
o Suitable when you know how many clusters you want and suitable for larger
datasets – also less computationally intensive
o Suitable when one wants to summarize the data with k average observations
with minimal margin of error
o Quantitative/numerical variables only
Week 4: Statistics and regression analysis (Predictive Analytics)
Subjects
Samples
Population parameter
o p is the probability
➢ Point estimates
o Used to arrive at an estimate of a population parameter
➢ We usually know the point estimates; the parameter is most of the time
unknown
➢ Multiple estimates are possible
A distribution of all sample means
o Sampling distribution – the idea that different samples give different averages, which
we compare together in one graph
Statistical testing
➢ Situation in the population:
o 𝐻0 true (no difference in grades) or 𝐻0 not true (there is a difference in grades)
➢ We could find evidence either in favour of or against the null hypothesis
➢ Conclusions based on the samples:
o Do not reject 𝐻0 or reject 𝐻0
o Type I error plays an important role (the null hypothesis is true, yet you reject it)
o Significance level α: probability of making a type I error (usually 0.01 or 0.05)
o Always say ‘reject’ or ‘do not reject’; we do not talk about ‘accepting the null
hypothesis’
o Ex. Hypothesis about the mean in the population
o 𝑥̅ = sample mean
o µ = the hypothesized population mean in the null hypothesis
o SE = standard error
o p-value = probability of obtaining results at least as extreme as the observed
results, assuming the null hypothesis is correct (the smaller the p-value, the stronger the
evidence in favour of the alternative hypothesis)
➢ t = (𝑥̅ − µ) / SE is used to determine whether 𝑥̅ deviates from µ
➢ The p-value is very important -> it is the probability of finding a sample mean at least as
far from µ as our 𝑥̅ if the null hypothesis is true
➢ The p-value is also known as the probability of exceeding t
➢ A smaller p-value means you are far out in the left or right tail -> more evidence against
the null hypothesis
➢ α = 0.05
➢ Looking at the graph, the p-value is 0.0039, which means p ≤ α. We can reject the null
hypothesis
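The test on a mean can be sketched as follows, with t computed as (sample mean minus hypothesized mean) divided by the standard error. The grades are hypothetical, and the p-value uses the normal distribution as a large-sample stand-in for the t-distribution, which is an assumption made here to stay within the Python standard library.

```python
import math
from statistics import NormalDist, mean, stdev


def t_statistic(sample, mu0):
    """How many standard errors the sample mean lies from the hypothesized mean."""
    se = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
    return (mean(sample) - mu0) / se


def approx_two_sided_p(t):
    """Probability of a value at least as extreme as t (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical grades; H0: the population mean grade is 6.5
grades = [6.5, 7.0, 7.5, 8.0, 6.0, 7.0, 7.5, 8.5]
alpha = 0.05

t = t_statistic(grades, mu0=6.5)
p = approx_two_sided_p(t)
print("t =", round(t, 2), "p =", round(p, 4))
print("reject H0" if p <= alpha else "do not reject H0")
```

For small samples the exact p-value should come from the t-distribution with n − 1 degrees of freedom, which gives a somewhat larger p-value than this approximation.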
Overview of steps
Intervals
o 90% of the values are within 1.645 standard deviations from the mean
➢ Interval = estimate ± margin of error
➢ Estimate could be 𝑥̅
o 𝑥̅ is normally distributed
➢ Confidence level
Estimates
➢ Standard error
➢ T-distribution (takes account of uncertainty)
➢ Interval = estimate ± margin of error
➢ Interval = estimate ± (t-value x standard error)
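A small sketch of "interval = estimate ± (critical value x standard error)". The sample is hypothetical; the critical value is passed in, so it can be a z-value (such as the 1.645 mentioned above for 90%) or a t-value looked up in a table.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, critical_value):
    """Return (lower, upper) = estimate -/+ critical value x standard error."""
    estimate = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))
    margin = critical_value * se
    return estimate - margin, estimate + margin


# z-value that leaves 5% in each tail, i.e. a 90% interval (about 1.645)
z_90 = NormalDist().inv_cdf(0.95)

sample = [1, 2, 3, 4, 5]  # hypothetical observations
print(confidence_interval(sample, z_90))

# With only 5 observations the t-value is more appropriate:
# 2.776 is the table value for 95% confidence and 4 degrees of freedom
print(confidence_interval(sample, 2.776))
```

The t-value is larger than the corresponding z-value, so the interval gets wider; that is how the t-distribution takes the extra uncertainty into account.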
Correlation coefficient
➢ Rules of thumb:
o Between -0.10 and 0.10 -> small strength
o Between 0.10 and 0.30 (-0.10 and -0.30) -> medium
o Between 0.30 and 0.50 (-0.30 and -0.50) -> large
o More than 0.70 -> too large, may cause multicollinearity
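The correlation coefficient itself can be computed directly from its definition. The sketch below uses made-up numbers; the last example illustrates that a perfect but non-linear relationship can still give a correlation of 0.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their own variation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


print(pearson_r([1, 2, 3], [2, 4, 6]))   # 1.0  (perfect positive, linear)
print(pearson_r([1, 2, 3], [6, 4, 2]))   # -1.0 (perfect negative, linear)

# y = x squared: y is fully determined by x, yet the linear correlation is 0
x = [-2, -1, 0, 1, 2]
print(pearson_r(x, [v ** 2 for v in x]))  # 0.0
```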
Non-linear
➢ Two variables could both be driven by a third variable, which creates the relationship
between them
Multiple variables
Regressions at Netflix
➢ They created a competition to look for the best regression model to predict user ratings
for films, based on previous ratings
Linear regression model
o y = β0 + β1x + Ɛ
➢ β are parameters
o Assess how y and x are associated with each other
o Estimation, expectation
➢ Ɛ is error term
o There is variation in your data set
o The expected/estimated value of y is β0 + β1x (ŷ)
➢ Generating estimates
➢ It is about the difference between the actual value y of observation i and the predicted
value of observation i
➢ If x increases by 1, the predicted y changes by b1
➢ b1 (slope) is the estimated change in the mean of y when the independent
variable x increases by 1
➢ b0 is the estimated mean of y when the independent variable x equals 0
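A minimal sketch of how b0 and b1 are obtained by least squares for one independent variable, i.e. by making the squared differences between actual and predicted values as small as possible. The data and function name are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y-hat = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx          # the fitted line passes through (mean of x, mean of y)
    return b0, b1


# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print("intercept b0:", round(b0, 2))   # estimated mean of y when x = 0
print("slope b1:", round(b1, 2))       # estimated change in mean of y per unit of x

# Residuals: actual value minus predicted value for each observation
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(r, 2) for r in residuals])
```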
The fit
➢ Coefficient of determination
➢ Denoted by R²
o Assesses how good the model is
o How good is x at predicting y?
o How much do we improve on simply drawing a horizontal line (at the mean of y)?
➢ Between 0 and 1 (0% to 100%)
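The coefficient of determination compares the model's errors with the errors of the horizontal line at the mean of y. A small sketch, assuming the predicted values are already available (the numbers are hypothetical):

```python
from statistics import mean


def r_squared(y, y_hat):
    """Share of the variation in y that the predictions account for."""
    my = mean(y)
    sse = sum((a - p) ** 2 for a, p in zip(y, y_hat))  # errors left by the model
    sst = sum((a - my) ** 2 for a in y)                # errors of the horizontal line
    return 1 - sse / sst


y = [2, 4, 5, 4, 5]                  # hypothetical actual values
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]    # hypothetical predictions from a fitted line

print(round(r_squared(y, y_hat), 2))  # 0.6
print(r_squared(y, y))                # 1.0: perfect predictions
print(r_squared(y, [mean(y)] * 5))    # 0.0: no better than the horizontal line
```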
Interpretation
➢ bj is the estimated change in the mean of y when variable xj increases by 1
➢ Control variables (ceteris paribus condition -> other things equal)
➢ Why do we want to add control variables?
o The variables are held constant
o You could be interested in a certain variable yet can freely add other variables
for comparison
o Sometimes difficult to choose
➢ H0: βj = 0
➢ Reject the null hypothesis when there is a significant relationship between y and xj
➢ Do not reject when there is no significant relationship (y does not change when x
changes)
➢ A large |t| goes together with a small p, so reject when |t| is large
➢ Reject H0: βj = 0 when p is smaller than α (often 0.05)
Interpretation with FIFA data
➢ Steps to find the relationship between value (y) and age (x, the independent
variable):
o In conclusion, reject H0: βage = 0
o There is a significant negative (linear) relationship between age and value
o Why negative? Because the coefficient is negative (the t value is -31.10)
o The mean of value decreases by 173 thousand when age increases by 1 year
▪ While holding the values of all other independent variables constant
➢ For acceleration, the coefficient is not significant because the p-value is 0.16 > 0.05. We
do not reject H0: βacceleration = 0
Subjects
❖ Logistic regression
❖ Classification and classification trees
❖ Correlation and causality
Overfitting
➢ Overfitting occurs when the model is good at explaining the data but does not predict well
o Underfitting is too general, and overfitting is too specific
➢ How can this happen?
o When you overlook the general picture/relationship of the dataset with many
variables
o Too complex
o Too focused on some variables only
o Occurs often in regression analysis
➢ Why is this a problem?
o Negatively impacts the model’s ability to generalize
o Can produce misleading values
o Accuracy decreases
➢ Solution: divide the data into a training set (explaining, 80% of the data), a
validation set (making predictions, 20% of the data) and a test set (usually for new
data sets)
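A sketch of the 80/20 split described above. The rows are shuffled first so that both sets are random samples of the data; the function name, the seed and the 100 example rows are assumptions for illustration.

```python
import random


def train_validation_split(rows, train_share=0.8, seed=42):
    """Shuffle the rows and cut them into a training and a validation part."""
    rng = random.Random(seed)      # fixed seed so the split can be reproduced
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))            # stand-in for 100 observations
train, validation = train_validation_split(rows)
print(len(train), len(validation))  # 80 20
```

The model is estimated on the training rows only; the validation rows are kept aside to check how well it predicts observations it has not seen.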
➢ Supervised learning covers regression and classification techniques -> there is always an
outcome variable (y)
o Predictive analytics
o Explain and predict well
o Training and validation sets
➢ Unsupervised learning covers clustering and text mining techniques -> for descriptive
purposes, explaining rather than predicting
o Descriptive analytics
o Explaining
o Training set
➢ What’s the likelihood of getting in a car accident in the UK? (1= gets into accident 0=not
getting into accident)
➢ Lebron James says he was ‘frustrated’ by false positive COVID test (1=positive
0=negative)
➢ Wearable sensors can tell when you are getting sick (1=getting sick 0=not getting sick)
➢ Classifying spam emails (1=yes 0=no)
➢ And many more…
➢ Errors always happen
Confusion matrix
➢ Assess the matrix by means of measures
➢ False positive -> predicted a 1 instead of 0
➢ False negative -> predicted a 0 instead of 1
Accuracy
➢ The accuracy of the model is 1 minus the overall error rate -> the percentage of true
positives and true negatives among all observations
➢ If the accuracy is 80% then the error rate is 20%
➢ Are you always interested in a high accuracy value? If so, why? If not, why/when not?
o A high accuracy is not that useful when the zeros dominate the table (90+900 = 990
zeros)
o If we focus only on the ones, there are just 10+40 = 50 of them, and the accuracy
within that group can be quite low
o The ones are usually more important than the zeros
➢ Sensitivity (recall): percentage of true positives within category 1 (correctly predicted
ones)
➢ Specificity: percentage of true negatives within category 0 (correctly predicted zeros)
➢ What happens to the error rates of both categories when we lower the cut-off value?
Why?
▪ In the graph above, class 1 error rate is 20%, which means that the
sensitivity is 80%
▪ Class 0 error rate is 70%, which means specificity is 30%
o When you lower the cut-off value, more observations are classified as ones (sensitivity
increases)
o However, fewer zeros are classified correctly (specificity decreases)
o And vice versa
o So, changing the cut-off value impacts the ability to predict the ones and zeros
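The effect of the cut-off value can be checked with a small sketch. The actual classes and predicted probabilities below are made up; the functions build the confusion matrix for a given cut-off and derive accuracy, sensitivity and specificity from it.

```python
def confusion(actual, probs, cutoff=0.5):
    """Count (TP, FP, TN, FN) when probabilities >= cutoff are classified as 1."""
    tp = fp = tn = fn = 0
    for y, p in zip(actual, probs):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1          # false positive: predicted a 1 instead of a 0
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1          # false negative: predicted a 0 instead of a 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of actual ones predicted correctly
        "specificity": tn / (tn + fp),   # share of actual zeros predicted correctly
    }


# Hypothetical validation set: 4 ones and 6 zeros with predicted probabilities
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for cutoff in (0.5, 0.3):
    print(cutoff, rates(*confusion(actual, probs, cutoff)))
```

In this example, lowering the cut-off from 0.5 to 0.3 raises sensitivity from 0.5 to 1.0 while specificity falls to 0.5, and the accuracy stays at 0.7. Accuracy alone would hide the trade-off.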
➢ Given any class 0 error rate, you would like to have a high sensitivity
➢ Given any sensitivity level, you would like a low class 0 error rate
➢ Measures that focus on predicting the ones
o Precision and F1 (a combination of precision and sensitivity (recall))
So there is a trade-off
➢ Same as changing the cut-off value, there is a trade-off between sensitivity and
precision
➢ Makes it difficult to predict well
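Precision and F1 follow from the same confusion-matrix counts. A minimal sketch with hypothetical counts; F1 is the harmonic mean of precision and recall, so it is only high when both are high.

```python
def precision(tp, fp):
    """Of all observations predicted as 1, the share that really is a 1."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of all actual ones, the share predicted as 1 (sensitivity)."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall


# Hypothetical counts: 40 true positives, 10 false positives, 40 false negatives
print(precision(40, 10))                # 0.8
print(recall(40, 40))                   # 0.5
print(round(f1_score(40, 10, 40), 3))   # 0.615
```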
Oracle output
➢ Mainly about comparing different models rather than looking at the values/percentage
individually
Logistic regression
o Focus on the y-axis: winner of the best picture
o Predicted values from a linear regression can be larger than 1 or smaller than 0, whereas
in this case we are only interested in values between 0 and 1
o Linear regression puts no restriction on the predictions (so the values may not always be
useful)
➢ Probability of a 1 is p
o Ex. The probability of receiving a spam email, probability of getting a certain
disease, etc.
➢ Value between 0 and 1
o Use “odds” -> a ratio of 2 probabilities
o p/(1−p) (so p is the probability of receiving a spam email and 1−p of not receiving a
spam email)
o Will be larger than 0
➢ ln(p/(1−p))
o The log-odds can take all values (negative as well as positive)
➢ Logistic regression model: ln(p/(1−p)) = β0 + β1x1 + … + βkxk
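The chain from probability to odds to log-odds, and back again, can be sketched as below. The logistic regression model sets the log-odds equal to a linear function of x; the coefficients used here (-1.0 and 0.5) are arbitrary illustration values, not estimates from any data set.

```python
import math


def odds(p):
    """Ratio of the probability of a 1 to the probability of a 0."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds; can be any negative or positive number."""
    return math.log(odds(p))


def probability(beta0, beta1, x):
    """Invert the log-odds: p = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    z = beta0 + beta1 * x
    return 1 / (1 + math.exp(-z))


print(odds(0.5), log_odds(0.5))   # 1.0 0.0

# Hypothetical coefficients: the predicted p always stays between 0 and 1
for x in (-10, 0, 3, 10):
    print(x, round(probability(-1.0, 0.5, x), 4))
```

Whatever value the linear part takes, the resulting p lies strictly between 0 and 1, which is exactly what a linear regression on a 0/1 outcome cannot guarantee.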
Question
➢ What do the training set, validation set, and test set have to do with this?
➢ Training set: to obtain estimates of the betas and to assess the “fit” (R²/Mallows’ Cp)
o Is about explaining
➢ Validation set: to transform p into classifications and to create a confusion matrix
o Prediction purposes (p)
➢ Test set: to apply the best model to newly obtained data
o Combines the two techniques (explaining and predicting) on a different data set
Extra
Ho and Ha
➢ 0 is in the middle
➢ This is the situation of the null hypothesis (the beta is 0)
➢ The further a value is from 0, the stronger the evidence against the null
hypothesis
➢ The red area contains the “surprising values”, which are far from our expectations
➢ If the observed value falls in the red area (whose size is the alpha), we can reject H0
➢ Even though we rejected H0, we could still be wrong
Difference between correlation and causation
➢ 1. The correlation is a coincidence
o Ex. There is a correlation between the no. of Nobel laureates and chocolate
consumption per country
o Does it actually mean that there will be more Nobel prize winners if a country
consumes more chocolate? No
➢ Spurious correlations exist
➢ 2. There is a reverse causal relationship (from y to x rather than from x to y)
➢ 3. A factor is missing, a third variable (“omitted variable”)
P-values
➢ Usually “Reject H0: βj = 0 when p is small”
➢ Still, errors can occur
➢ With α = 0.05 there is a 5% chance of rejecting H0 when βj = 0 is actually true
Subjects
❖ Missing data
❖ Text analysis
Types of non-response
➢ Incomplete data
o Ex. For case ID 3, there are missing numbers in V1, V3 and V5
o Very common in practice, incomplete data often occurs
➢ Missing completely at random (MCAR)
o Probability of a missing value does not depend on the variables in the dataset (it
is entirely random / unsystematic)
o Representativeness is maintained
o Usually nothing much needs to be done here because it is not a major problem
➢ Missing at random (MAR)
o Probability of missing does not depend on the variable itself, but on some other
variable(s) in the data
o If there are missing values for variable y, then the probability of such a missing value
depends on variable x
o Systematic differences
o Ex. missing data on the blood pressure among young people because it is mainly
measured among the elderly
o Easy to solve as we can observe the variable x
➢ Missing not at random (MNAR)
o Probability of a missing value depends on the variable y itself
o In the case above, the low IQ scores are the ones that are missing
o Ex. voting preferences
o Why is this the most complicated form?
▪ We don’t really know the pattern of the missing y’s
Solutions
Imputation
➢ Imputation with the average
o Suppose there are missing values for birth weight (y): what will the chart look like after
this type of imputation?
➢ Regression imputation
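Both imputation approaches can be sketched in a few lines. The data are hypothetical (gestation in weeks as x, birth weight in grams as y, with None marking a missing value), and the regression imputation assumes a single predictor fitted by least squares on the complete cases.

```python
from statistics import mean


def impute_mean(y):
    """Replace every missing value with the average of the observed values."""
    m = mean([v for v in y if v is not None])
    return [m if v is None else v for v in y]


def impute_regression(x, y):
    """Replace missing y values with predictions from a line fitted on complete cases."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = mean([a for a, _ in pairs])
    my = mean([b for _, b in pairs])
    b1 = sum((a - mx) * (b - my) for a, b in pairs) / sum((a - mx) ** 2 for a, _ in pairs)
    b0 = my - b1 * mx
    return [b0 + b1 * a if b is None else b for a, b in zip(x, y)]


weeks = [36, 38, 40, 42, 41]                # hypothetical x: always observed
weight = [2600, 3000, 3400, 3800, None]     # hypothetical y: one value missing

print(impute_mean(weight)[-1])               # 3200
print(impute_regression(weeks, weight)[-1])  # 3600.0
```

With mean imputation every imputed point lands on the same horizontal level in a chart of y against x, regardless of x; regression imputation instead places the imputed points on the fitted line.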
(Dis)advantages of imputation
Trump’s Tweets
Text data
➢ Data mining
o Using analysis techniques to better understand patterns and relationships in a
large data set
➢ Text data are everywhere
➢ The purpose of text mining is to translate unstructured text data into useful (numerical)
information – which is then used in data mining
o Words mean something different in other contexts
o Emotions, emojis, etc.
o Linguistic structure, some word order plays a role
➢ Texts can be dirty
o We need to clean the text before analysing it
o Spelling, unexpected punctuation, etc
o Synonyms, abbreviations, etc.
Corpus
Preparing text
➢ Tokenization
o Splitting text into tokens (such as words/terms)
➢ Normalization: standardizing text
o Lowercase, no punctuations, restore/remove other characters
o Canada-> canada
o 2mrrw -> tomorrow
➢ Stopwords are removed
o Words that often occur in a language but are not relevant to the analysis
o Ex. a, for, that, the, is, etc.
➢ Stemming of words
o Keeping the “root”
o Prefixes and suffixes are removed
o Plurals become singulars
o Ex. Cars and car are the same term
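The preparation steps can be chained as in the sketch below. It is deliberately crude: the stopword list only contains the examples from these notes, the "stemmer" just strips a plural s, and real tools use much larger word lists and proper stemming algorithms.

```python
import re

# Assumption: only the example stopwords mentioned above
STOPWORDS = {"a", "for", "that", "the", "is"}


def tokenize(text):
    """Split text into tokens on whitespace."""
    return text.split()


def normalize(token):
    """Lowercase and drop punctuation and other special characters."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def stem(token):
    """Very rough stemming: turn simple plurals into singulars."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def prepare(text):
    tokens = [normalize(t) for t in tokenize(text)]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return [stem(t) for t in tokens]


print(prepare("The cars for Canada: that car is fast!"))
# ['car', 'canada', 'car', 'fast']
```

After these steps "Cars" and "car" count as the same term, which is what the term-frequency counts rely on.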
Term frequency
➢ Counting terms
➢ Binary term-document matrix
o Whether a word occurs in a document or not
➢ Frequency term-document matrix
o How often does a word occur in a document
o Relative measure is used
IDF
Visualizing IDF
TF-IDF
What is the advantage of using TF-IDF values rather than TF values when analysing the State of
the Union addresses of US presidents?
➢ TF-IDF shows you the uniqueness of the speeches of each president
➢ And also their political priorities/agendas
➢ Clustering documents
o Which documents form a cluster of similar reviews
➢ Word clouds
o Visualisation of the most common words
➢ Distance between documents
o How (dis)similar are two documents d1 and d2
➢ N-grams
o List of n consecutive words
o Words that are always mentioned after each other
o Bigrams, trigrams, …
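A compact sketch of term frequency, TF-IDF and n-grams on three tiny made-up documents. The IDF formula used here, ln(number of documents / number of documents containing the term), is one common variant and is an assumption; the lecture's exact definition may differ.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of every term within one document."""
    counts = Counter(tokens)
    return {term: c / len(tokens) for term, c in counts.items()}


def inverse_document_frequencies(docs):
    """ln(N / number of documents containing the term): rare terms score high."""
    n = len(docs)
    vocabulary = {term for doc in docs for term in doc}
    return {t: math.log(n / sum(1 for doc in docs if t in doc)) for t in vocabulary}


def tf_idf(docs):
    idf = inverse_document_frequencies(docs)
    return [{t: tf * idf[t] for t, tf in term_frequencies(doc).items()} for doc in docs]


def n_grams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical, already prepared documents
docs = [
    ["economy", "jobs", "economy"],
    ["economy", "security"],
    ["economy", "health", "jobs"],
]

for weights in tf_idf(docs):
    print({t: round(w, 3) for t, w in weights.items()})

print(n_grams(["state", "of", "the", "union"], 2))
```

A term such as "economy" that occurs in every document gets a TF-IDF of 0, however often it is used, while a term that only one document uses stands out. That is why TF-IDF highlights what is unique to each speech.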
Sentiment analysis
➢ Extracting reviews
➢ Ex. What people think about a new product in a market, opinions about a political party,
how people react to certain marketing actions, etc.
➢ Polarity measures sentiment as a numerical value (between -1 and +1)
o Difference between positive and negative words (as percentage)
o Often: negative, neutral, positive
➢ Subjectivity, whether it is opinion or fact based
➢ Sentiment lexicons
➢ Stance analysis (whether people are in favour of a measure or not in percentage)
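A lexicon-based polarity score can be sketched as the number of positive words minus the number of negative words, divided by the total number of words. The two tiny lexicons are made up for illustration; real sentiment lexicons contain thousands of scored words.

```python
# Hypothetical mini lexicons
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def polarity(tokens):
    """Share of positive words minus share of negative words: between -1 and +1."""
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def label(score):
    """Turn the numerical polarity into negative / neutral / positive."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


review = ["great", "phone", "bad", "battery", "love", "it"]
score = polarity(review)
print(round(score, 3), label(score))
```

A word-count score like this ignores context: "not good" still counts as positive, which is one of the challenges of sentiment analysis.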
We covered a lot
Mindmap
Overview of lectures
Oracle: infographics
GIS in Oracle
➢ Geographic information systems charts – merging maps and statistics to present data
collected over different geographical regions
➢ Can be done in numerous ways in Oracle
Dashboards
Preparations
Silhouette score
➢ “One rule of thumb is that the ratio of between-cluster distance … to average within-cluster
distance should exceed 1.0 for useful clusters”
o So if the distances are 0 then not useful
Other criteria
➢ Similarity in 0/1
➢ Matching coefficient
➢ Jaccard’s
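For 0/1 variables the two similarity measures differ in how they treat shared zeros. A minimal sketch with two hypothetical customers described by five yes/no characteristics:

```python
def matching_coefficient(a, b):
    """Share of variables on which the two observations have the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard_coefficient(a, b):
    """Shared ones divided by variables where at least one has a 1 (0-0 matches ignored)."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0  # assumption: 0 when neither has any 1


# Hypothetical customers: 1 = bought the product, 0 = did not
customer_1 = [1, 0, 0, 1, 0]
customer_2 = [1, 0, 0, 0, 0]

print(matching_coefficient(customer_1, customer_2))  # 0.8
print(jaccard_coefficient(customer_1, customer_2))   # 0.5
```

Jaccard's coefficient is the more natural choice when a shared 0 (neither customer bought something) says little about similarity.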
➢ Ceteris paribus
Correlation ≠ causality
➢ 1. Coincidental correlations
➢ 2. There is a reverse causal relationship
➢ 3. A factor is missing
Overfitting
➢ Overfitting: the model explains very well but does not predict well
➢ How can this happen? Why is it a problem?
➢ Solution: Training set; Validation set; Test set
➢ Which situations lead to poor predictions on new datasets?
Confusion matrix
➢ True positives, true negatives, false negatives, and false positives
Confidence interval
Line charts
Lie factor
Text analysis
➢ The text is analysed so that we can identify the profile of the text (DNA of the text)
➢ We're able to investigate fraudulent or non-fraudulent firms
Sentiment analysis
Challenges of sentiment analysis