Business Analytics - Notes

Class of 2022, Leiden University.

Uploaded by

Naomi Zucker
Copyright
© © All Rights Reserved

BUSINESS ANALYTICS 2021-2022

LECTURE NOTES

Week 1: Introduction to Business Analytics; Data properties (Descriptive Analytics)

Subjects

❖ Trends in Business Analytics


❖ Relevance of Business Analytics
❖ Business Analytics applications
❖ Datasets and variables
❖ Summarizing relevant information

What to expect?

➢ The role of big data and business analytics


➢ Data visualization
➢ Theoretical background and how to apply the tools
➢ Make informed decisions in business
➢ Three software tools: Grasple (exercises and practice), Excel, and Oracle (data visualisation)
➢ Mostly about data selection, statistics
➢ Breaking down of big data
➢ BA can also help improve the health of people and operations in hospitals

Developments and trends

➢ Three developments spurred massive growth in the use of analytical methods in business operations
o Tech advances (easier to produce large data)
o Methodological developments (easier to explore and visualize data, faster
algorithms)
o More computing power and storage capability

What is Business Analytics?

➢ “Scientific process of transforming data into insight for making better decisions”
➢ Raw data is translated into some insights, policies, or recommendations
➢ Fact driven
➢ Ex. Amazon possesses huge databases of purchases, preferences, and recommendations, which contain information about potential buyers
o Ex. When entrepreneurs apply for a bank loan, how likely are they to pay it back? The bank uses the entrepreneur’s historical information to estimate the probability that they are creditworthy

➢ Many data sets are publicly available, e.g. on dataoverheid.nl

Descriptive

➢ Descriptive analytics
➢ All techniques that describe what has happened in the past
➢ Ex. Visualisation techniques, dashboards

Predictive

➢ Predictive analytics
➢ Techniques that use data in the past to predict behaviour in the future, determine the
impact of factors
➢ Goes one step further than descriptive
➢ Ex. Use of historical sales to predict future sales
o Use of purchasing behaviour of consumers to predict market shares
o What are the risk factors of a new disease?
o Which characteristics determine whether a soccer team is able to win the
match?

Prescriptive

➢ Prescriptive analytics
➢ Indicates the best course of action to take
➢ Based on the data, what is the optimal thing to do given the restrictions that are present?
➢ Ex. What is the best pricing strategy for a company?
o At which location should a factory be opened to meet customer requirements at minimum cost?
o What financial investments need to be made to achieve superior returns with as
little risk as possible?
o More advanced than descriptive and predictive

Big data

➢ Big data are extremely large data sets


➢ The 4V’s
o Volume- volume and file size (data at rest)
o Velocity- speed at which data become available (data in motion)
o Variety- all kinds of forms of data (data in many forms)
o Veracity- uncertainty in the data (data in doubt)
➢ Sometimes a fifth V is added: the Value of data

Sub domains of Business analytics

➢ Finance- prediction of performance


➢ HR – how can we make sure the well-being level of employees is high
➢ Health care analytics- how do we speed up the diagnostics
➢ Sports analytics – how do we optimize the performance of a team, ticket prices
throughout the seasons
➢ Web analytics – where do I place my ads? Where are they most useful, web traffic
➢ Legal applications for courts and lawyers – predicting outcomes of supreme court decisions

What do datasets look like?

➢ Data- facts and figures (information) collected, analysed, and summarized for presentation and interpretation
➢ Variable- quantities of interest that can take on many values
➢ Observation- set of values corresponding to a set of variables
➢ Variation- variety in your data

Data

➢ Data often come from a sample rather than the entire population
➢ Population – all elements of interest
➢ Sample- subset of the population; with random sampling the goal is to gather a representative sample of the population data
➢ Quantitative data
➢ Categorical data
➢ How to collect data? Usually, questionnaires and surveys (observational studies)
o Day reconstruction – asking what people do per day

Week 2: Data visualisation in Oracle (Descriptive Analytics)

Subjects

❖ Introduction to Oracle
❖ Data visualisation techniques in Oracle

Accelerating analytics (Oracle software)

➢ Creating a relational database

➢ Managing data
➢ Oracle’s top three customers: SailGP (adrenaline-fueled, wind-powered boat racing), Premier League and Red Bull

What challenges are SailGP facing?

➢ 1. Challenge in data management – 1 day of sailing created 40 billion rows of data from thousands of sensors
o Solution – Oracle combines all these types of data into one database
o Use case of Britain – the UK team crashed their boat in Spain. Why? Because all the data points were scattered around the place
o Foiling is when boats start ‘flying’
➢ 2. Analytics – too much data to convert
o Converting 900 data points per second for each boat into real-time data
o Because of these ‘data streams’, sailors are able to see how fast other competitors sail and get more insights on the racetrack
o Automatically calculates speed vs distance for sailors and shows them how they can get to the final destination the quickest (through different destinations)
➢ 3. Machine learning- predicting boat speeds, etc.
o To predict the optimal boat speed and optimal course to take

Collaboration with the Premier League

➢ Calculating win probability

Oracle cloud infrastructure

➢ Oracle labs
➢ Check under week 2 (Oracle material) on Brightspace for Oracle manual and practice

Week 3: Data visualisation and cluster analysis (Descriptive Analytics)

Subjects

❖ Data visualisation
❖ Cluster analysis

Data visualization

➢ Of vital importance
➢ Logical follow-up of 1st week (measures of location, variability, and shapes)

➢ Summary table for the data
➢ Good means to communicate the message to others

How to lie with statistics

➢ Famous book about misleading statistics, with examples of misleading charts and graphs

Tables vs graphs

➢ Use tables when:


o Communicating exact numerical values
o Compare few values and numbers
o Precision
o Displaying all kinds (range) of variables of different units of observation
➢ Use graphs when:
o Communicating patterns or trends over time
o Demonstrate how certain phenomena associate with each other
o Relationships between variables

Data-ink ratio

➢ About info that is displayed in a table or a graph


➢ Proportion of ‘data-ink’ to total amount of ink
o Goal is to have high data ink ratio
➢ Data-ink ratio-> fraction of data ink compared to the total amount of ink
➢ Data-ink -> ink used in a table or chart that is necessary to convey the meaning of the
data to the audience
➢ Non-data-ink -> no useful purpose in conveying data

Tables

➢ What can you do to increase the data-ink ratio in tables?


➢ Use horizontal lines only to separate variable names from the data, or when a calculation has been performed
➢ Otherwise avoid horizontal or vertical lines
o Work with white spaces
➢ Don’t use too many decimals (e.g., 2.298381717)
➢ Do not repeat dollar (money) signs, percentages, etc. (should only indicate this once)
➢ Align numbers to the right and texts to the left


Visualisations in tables

➢ Use when adding something or to emphasize extreme values
➢ Highlighting values
➢ Adding sparklines to your tables
➢ Heat maps
➢ Table + graph

Improving data-ink ratio

➢ Working with colours


➢ Remove borders around the chart
➢ Use data markers only with a purpose
➢ Provide clear axes
➢ Directly label the data
➢ Before & after:

Graphs: 4 types


➢ The majority of graphs fit into one of four types
➢ Points
o Scatter charts display the relationship between two variables
o Correlation relationships
➢ Lines
o Nominal scale – categories that are not sorted
o Ordinal scale – categories that are sorted
o Interval scale – underlying measure of time

o Line charts should be used with an interval scale
o Suitable for trends and patterns
o Only directly connect values in adjacent intervals
o Time-series relationships
o Whether or not you include ‘0’ depends on how you illustrate the values
➢ Bars
o Serve to highlight individual values
o Suitable for comparing/ranking categorical values
o Good for: ranking relationships, part-to-whole and nominal comparison
o Can be misleading
o What can be improved while using bar chart? Include or start with ‘0’
(quantitative scale)
o Bar vs column charts
o Pie charts – avoid this chart type when comparing categorical data; in particular avoid 3D and many slices

➢ Boxes
o Boxplots show the entire distribution of a variable
➢ Nice visualisation tool -> Gapminder.com
➢ Bubble chart
➢ Geospatial relationships in 2D and 3D in excel or oracle

Choropleths

➢ A choropleth map is a geographic visualization that uses shades of colour, different colours, or symbols to indicate the values of a quantitative or categorical variable associated with a geographic region or area
o Vulnerable to area bias
➢ Spaghetti graph

Declutter and focus

Cluster analysis

➢ More advanced (k-means)


➢ Concept of similarity

➢ Goal/aim is to segment observations into similar (homogeneous) groups (the clusters)
o Ex. Market segmentation
➢ A doctor may reason about a new difficult case by recalling a similar case (either treated
personally or documented in a journal) and its diagnosis

Similarity

➢ Measuring similarity
➢ Observations within a cluster are similar to each other, but dissimilar to observations in other clusters
➢ Determine the cluster variables
➢ For quantitative variables, the Euclidean distance is the most common distance
measure
➢ Another option: Manhattan distance
➢ Distance measures:

Similarity measures

➢ Similarity in case of 0/1 variables


➢ Matching coefficient- how similar are observations when these variables are binary
variables?
➢ A variant of this is “Jaccard’s coefficient”
➢ Distance = 1 − similarity
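The notes name the two coefficients without giving their formulas. A minimal sketch, assuming the usual textbook definitions: the matching coefficient counts all variables on which two observations agree, while Jaccard's coefficient ignores the variables on which both are 0. The function names and the example data are made up for illustration.

```python
def matching_coefficient(u, v):
    """Share of binary variables on which u and v have the same value."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


def jaccard_coefficient(u, v):
    """Like the matching coefficient, but 0/0 agreements are left out."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    considered = len(u) - both_zero
    return both_one / considered if considered else 0.0


# Hypothetical customers described by five yes/no (1/0) variables
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(matching_coefficient(u, v))  # 0.8
print(jaccard_coefficient(u, v))   # 0.5
```

Jaccard's coefficient is the stricter of the two here because three of the four agreements are shared zeros, which it does not count as evidence of similarity.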

Distance measures

➢ Each observation (e.g. a person) contains measures for q variables


➢ For person u we have (u1, u2, … , uq) and for person v we have (v1, v2, … , vq)
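With the notation (u1, …, uq) and (v1, …, vq), the Euclidean and Manhattan distances mentioned earlier can be sketched as follows. The two example persons and their values are hypothetical.

```python
import math


def euclidean(u, v):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def manhattan(u, v):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))


# Two hypothetical persons measured on q = 2 variables
u = (25, 40)
v = (28, 44)
print(euclidean(u, v))  # 5.0
print(manhattan(u, v))  # 7
```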

z-scores

➢ Are useful to “correct” for the different units of measurement
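A small sketch of why z-scores help: the same pattern measured in different units gives identical z-scores, so no variable dominates the distance just because of its unit. The sample standard deviation is assumed here, and the age and income values are invented.

```python
import statistics


def z_scores(values):
    """Standardize: subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


ages = [20, 30, 40]                # years
incomes = [20000, 30000, 40000]    # euros
print(z_scores(ages))     # [-1.0, 0.0, 1.0]
print(z_scores(incomes))  # [-1.0, 0.0, 1.0]
```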

Cluster analysis techniques

➢ Hierarchical clustering: as a starting point, each observation belongs to its own cluster
o The most similar clusters are then combined, step by step, to arrive at nested clusters
o Clusters need to be merged; how similar two clusters are depends on the linkage method:
▪ Single linkage -> Similarity between clusters is determined by the
shortest distance between two observations (nearest)
▪ Complete linkage -> Similarity between clusters is determined by the
longest distance between two observations (furthest)
▪ Group average linkage -> Similarity between clusters is determined by
the average distance across all pairs
▪ Centroid linkage -> Similarity between clusters is determined by the
distance between the "centroids"
➢ Dendrogram shows the output of a hierarchical clustering
o Vertical axis shows distance between clusters
o indicates where "natural" clusters are present
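The four linkage methods differ only in how the distance between two clusters is summarized. A minimal sketch with two hypothetical clusters of 2D points; the function names are illustrative and Euclidean distance is assumed.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """All distances between an observation in A and an observation in B."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(a, b):
    return min(pairwise_distances(a, b))      # nearest pair


def complete_linkage(a, b):
    return max(pairwise_distances(a, b))      # furthest pair


def average_linkage(a, b):
    distances = pairwise_distances(a, b)
    return sum(distances) / len(distances)    # average over all pairs


def centroid_linkage(a, b):
    def centroid(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]
    return math.dist(centroid(a), centroid(b))


cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
print(centroid_linkage(cluster_a, cluster_b))  # 4.5
```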

➢ K-Means clustering assigns each observation to one of the k clusters in such a way
that the observations within a cluster are as similar as possible
➢ Number of clusters k is specified in advance
➢ The “cluster centroids” are calculated (the “means”)
➢ Steps 1 to 4 were shown as figures (not preserved here)
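The step-by-step figures are not available, so the following is only a sketch of one common formulation of k-means (Lloyd's algorithm); it may differ in detail from the steps shown in the lecture. The starting centroids are passed in explicitly, which is an assumption made to keep the example reproducible (in practice they are often chosen at random).

```python
import math


def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(max_iter):
        # Assign every observation to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical observations with two obvious groups, k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Because the result depends on the starting centroids, running the algorithm from several starting points and comparing the solutions is a common precaution.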
➢ When do we have a good solution?
o Silhouette score -> the ratio of between-cluster distance to average within-cluster distance should exceed 1.0 for useful clusters
➢ Elbow-method
➢ How close are the observations to the centroids?
➢ Both methods (hierarchical and k-means) depend on how similar observations
are to each other

Comparing the two

➢ Hierarchical
o Start with all observations as clusters (bottom-up)
o Suitable for smaller datasets
o Visually appealing
o Multiple types of variables
➢ K-means
o Start with k clusters (random; disadvantage?)
o Suitable when you know how many clusters you want and suitable for larger
datasets – also less computationally intensive
o Suitable when one wants to summarize the data with k average observations
with minimal margin of error
o Quantitative/numerical variables only

Week 4: Statistics and regression analysis (Predictive Analytics)

Subjects

❖ Samples and statistical testing


❖ Associations
❖ Linear regression analysis

Samples and testing

➢ Statistical tests always start with data, for example from surveys

Samples

➢ Researchers are interested in a certain underlying population
➢ The question is whether the sample you choose is representative
➢ Samples are used to draw conclusions about a population characteristic
o Ex. How LU students feel about the covid crisis
o A census is expensive, takes time, can become outdated or can be unnecessary
➢ Representativeness is important: the sample should reflect the characteristics (variation) of the population
➢ Samples need to have a specific variation; representativeness is sometimes neglected

Population parameter

o p is the probability
➢ Point estimates
o Used to arrive at an estimate of the parameter
➢ We usually know the point estimates; the parameter itself is unknown most of the time
➢ Multiple estimates are possible

A distribution of all sample means

➢ Different averages for different samples


o Sampling distribution – the idea that different samples give different averages, which can be shown together in one graph

Statistical testing

➢ Statistical testing is about comparisons


o Ex. Comparing English proficiency in different central European countries (is it
true that the Netherlands has the highest proficiency in Europe?)
➢ Questions such as: Is there an inequality? A difference among countries? Is it the same?
➢ We work with null hypothesis (𝐻0 ) -> assumption about a characteristic in a
population
o Challenging hypothesis
o Opposite of the null hypothesis is alternative hypothesis (𝑯𝒂 )
➢ The aim is to falsify or not falsify the null hypothesis, given the sample at hand
(evidence)
o Ex. For this course, do I find evidence that supports or rejects the hypothesis
‘the grades for this course have improved every year’
➢ Often there is ‘no relationship’ as the null hypothesis
o ‘No difference’ in grades basically

Mistakes one can make


➢ Situation in the population:
o 𝐻0 true (no difference in grades) or 𝐻0 not true (there is a difference in grades)

➢ It could be that we find evidence in favour of or against the null hypothesis
➢ Conclusions based on the samples:
o Do not reject 𝐻0 or reject 𝐻0
o Type I error plays an important role (the null hypothesis is true, yet you reject it)
o Significance level α: probability of making type I error (usually 0.01 or 0.05)
o Always ‘reject’ and ‘do not reject’, we do not talk about ‘accepting the null
hypothesis’


o Ex. Hypothesis about the mean in the population
o 𝑥̅ = sample mean
o µ = the hypothesized population mean in the null hypothesis
o SE = standard error
o p value = probability of obtaining results at least as extreme as the observed
results assuming null hypothesis is correct (smaller p value the stronger evidence
in favor of alternative hypothesis)
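The formula linking these quantities is not preserved in the notes. A sketch assuming the standard one-sample test statistic t = (𝑥̅ − µ) / SE with SE = s / √n, where s is the sample standard deviation; the grades below are hypothetical. The p-value would then be looked up in a t-table or computed by statistical software.

```python
import math
import statistics


def t_statistic(sample, mu0):
    """How many standard errors the sample mean lies from the hypothesized mean."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return (x_bar - mu0) / se


grades = [7, 8, 6, 9, 5]            # hypothetical sample, mean 7
print(round(t_statistic(grades, mu0=6), 3))  # 1.414
```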


➢ t is used to determine whether 𝑥̅ deviates from µ
➢ The p-value is very important -> it is the probability of finding a result at least as extreme as our 𝑥̅ if the null hypothesis is true
➢ The p-value is also known as the probability of exceeding t
➢ A smaller p-value means you are in the far left or right tail -> more evidence against the null hypothesis
➢ α= 0.05
➢ Looking at the graph, the p-value is 0.0039, which means p ≤ α. We can reject the null
hypothesis

Overview of steps


Sample and uncertainty

➢ Uncertainty due to sampling


➢ Do not interpret the sample result as an exact representation of reality; take the result with a grain of salt

Intervals

➢ Uncertainty always matters


o 90% of the values are within 1.645 standard deviations from the mean
➢ Interval = estimate ± margin of error
➢ Estimate could be 𝑥̅
o 𝑥̅ is normally distributed
➢ Confidence level

Estimates

➢ Standard error

➢ T-distribution (takes account of uncertainty)


➢ Interval = estimate ± margin of error
➢ Interval = estimate ± (t-value x standard error)
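A minimal sketch of the interval formula above, assuming the estimate is the sample mean and SE = s / √n. The t-value is passed in as a parameter because it has to come from a t-table or statistical software; the sample and the value 2.776 (95% confidence, 4 degrees of freedom) are illustrative.

```python
import math
import statistics
from statistics import NormalDist


def confidence_interval(sample, critical_value):
    """Interval = estimate ± (critical value x standard error)."""
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = critical_value * se
    return x_bar - margin, x_bar + margin


grades = [7, 8, 6, 9, 5]                      # hypothetical sample
low, high = confidence_interval(grades, critical_value=2.776)
print(round(low, 2), round(high, 2))

# For large samples the normal distribution gives almost the same critical
# value; for 90% confidence this is the 1.645 quoted in the notes.
print(round(NormalDist().inv_cdf(0.95), 3))   # 1.645
```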

Big n and uncertainty

➢ More data (bigger sample size n) means less uncertainty


➢ Practical significance -> what is the real-world impact of the result
o Practical significance should always be considered in conjunction with statistical
significance

Correlation coefficient


➢ Rules of thumb:
o Between -0.10 and 0.10 -> small strength
o Between 0.10 and 0.30 (-0.10 and -0.30) -> medium
o Between 0.30 and 0.50 (-0.30 and -0.50) -> large
o More than 0.70 -> too large, may cause multicollinearity

Non-linear

➢ Only measures straight line


➢ Correlation coefficient only looks for one line, measuring linear relationship (limitation)
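The limitation can be made concrete with a small sketch: a perfectly linear relationship gives a correlation of 1, while a perfect but curved relationship (y = x²) gives a correlation of 0. The function below implements the usual Pearson correlation coefficient; the data are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
linear = [2 * a + 1 for a in x]   # straight line
curved = [a ** 2 for a in x]      # perfect relationship, but not linear
print(pearson_r(x, linear))  # 1.0
print(pearson_r(x, curved))  # 0.0
```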


What do you ignore when calculating correlations?

➢ Two variables could be impacted by the third variable driving the relationship

Multiple variables

➢ Regression analysis -> incorporates multiple variables: how a variable is influenced by a large group of variables
➢ Y variable is something you want to be predicted (dependent)
o Ex. The popularity of FB posts
o Sales of product
o Which department performs best
o Performance of a sports team
➢ X variables are what you use to predict your y: time, locations, etc. (independent)
o Ex. time of the day, day in the week, week of the year
o Properties of locations
o Age, education
o Characteristics

Regressions at Netflix

➢ They created a competition to look for the best regression model to predict user ratings
for films, based on previous ratings

Linear regression model

➢ Y is related to an intercept (constant), x and an error term: y = β0 + β1x + Ɛ
➢ β are parameters
o Assess how y and x are associated with each other
o Estimation, expectation
➢ Ɛ is the error term
o There is variation in your data set
o The expected/estimated value of y is β0 + β1x (ŷ)

Least squares method

➢ Generating estimates
➢ It is about the difference between the actual value y of observations i and the predicted
value of observation i


➢ If x increases by 1, the predicted y changes by b1
➢ b1 (slope) is the estimated change in the mean of y when the independent variable x increases by 1
➢ b0 is the estimated mean of y when the independent variable x equals 0
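The estimates b0 and b1 that minimize the sum of squared differences between actual and predicted y have a closed form for a single x variable. A sketch using the standard formulas; the data are hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared prediction errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
          / sum((a - mean_x) ** 2 for a in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

Here the estimated mean of y is 2.2 when x equals 0, and it increases by 0.6 for every increase of 1 in x.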

The fit

➢ Coefficient of determination
➢ Denoted by r²
o Assesses how good the model is
o How good is x at predicting y?
o How much do we improve compared with simply drawing a horizontal line (the mean of y)?
➢ Between 0 to 1 (0% to 100%)
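The notes do not give the formula; a sketch assuming the standard definition r² = 1 − SSE/SST, where SSE is the squared error of the model's predictions and SST is the squared error of predicting the mean of y for everyone. The data and the fitted line ŷ = 2.2 + 0.6x are hypothetical.

```python
def r_squared(y, y_hat):
    """Share of the variation in y that the predictions account for."""
    mean_y = sum(y) / len(y)
    sse = sum((actual - pred) ** 2 for actual, pred in zip(y, y_hat))
    sst = sum((actual - mean_y) ** 2 for actual in y)
    return 1 - sse / sst


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * a for a in x]     # predictions from a fitted line
baseline = [sum(y) / len(y)] * len(y)   # horizontal line at the mean of y
print(round(r_squared(y, fitted), 2))   # 0.6
print(r_squared(y, baseline))           # 0.0
```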

Multiple regression model

➢ Adding multiple x variables
➢ Goal is to find estimates of β1, β2, β3, …, βq

Interpretation


➢ bj is the estimated change in the mean of y when variable xj increases by 1
➢ Control variables (ceteris paribus condition -> other things equal)
➢ Why do we want to add control variables?
o The variables are held constant
o You could be interested in a certain variable yet can freely add other variables
for comparison
o Sometimes difficult to choose

How are we going to interpret the results of a linear regression?

➢ H0: βj = 0 (no relationship between y and xj)


➢ Reject the null hypothesis when there is a significant relationship between y and xj
➢ Do not reject when there is no significant relationship (y does not change when x
changes)
➢ Large t is the same as a small p, so reject when t is large
➢ Reject H0: βj = 0 when p is smaller than α (often 0.05)

Different types of variables

➢ What does βj mean for different types of variables?


➢ Dummy variables = 1/0
o Ex. Married (yes=1/no=0)
o For married people (1) the mean salary is βj higher than for non-married (0)

Interpretation with FIFA data


➢ Steps to find the relationship between value (y) and age (x, the independent
variable):

o In conclusion, reject H0: βage = 0
o There is a significant negative (linear) relationship between age and value
o Why negative? Because the t value is -31.10
o The mean of value decreases by 173 thousand when age increases by 1 year
▪ While holding the values of all other independent variables constant
➢ For acceleration, the coefficient is not significant because the p-value is 0.16 > 0.05. We do not reject H0: βacceleration = 0

Week 5: Logistic regression and classification trees (Predictive Analytics)

Subjects

❖ Logistic regression
❖ Classification and classification trees
❖ Correlation and causality

Overfitting

➢ Overfitting occurs when the model is good at explaining the data it was built on but does not predict well
o Underfitting is too general, and overfitting is too specific
➢ How can this happen?
o When you overlook the general picture/relationship of the dataset with many
variables
o Too complex
o Too focused on some variables only
o Occurs often in regression analysis
➢ Why is this a problem?
o Negatively impacts the model’s ability to generalize
o Can produce misleading values
o Accuracy decreases
➢ Solution: divide the data into a training set (explaining, 80% of the data), a validation set (making predictions, 20% of the data) and a test set (usually for new data sets)
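A minimal sketch of the 80/20 division, assuming the rows are shuffled before splitting so that both sets resemble the whole data set. The function name, the fixed seed, and the stand-in rows are illustrative.

```python
import random


def split_data(rows, train_share=0.8, seed=42):
    """Shuffle the rows, then cut them into a training and a validation set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    cut = round(len(rows) * train_share)
    return rows[:cut], rows[cut:]


observations = range(100)               # stand-in for 100 data rows
training, validation = split_data(observations)
print(len(training), len(validation))   # 80 20
```

The model is estimated on the training rows only; the validation rows are held back to check how well it predicts data it has not seen.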

(Un) supervised learning


➢ Supervised learning covers regression and classification techniques -> there is always an outcome variable (y)
o Predictive analytics
o Explain and predict well
o Training and validation sets
➢ Unsupervised learning covers clustering and text mining techniques -> for descriptive purposes, explaining rather than predicting
o Descriptive analytics
o Explaining
o Training set

Predictions at the RIVM

Examples of 1/0 outcomes

➢ What’s the likelihood of getting in a car accident in the UK? (1= gets into accident 0=not
getting into accident)
➢ LeBron James says he was ‘frustrated’ by false positive COVID test (1=positive
0=negative)
➢ Wearable sensors can tell when you are getting sick (1=getting sick 0=not getting sick)
➢ Classifying spam emails (1=yes 0=no)
➢ And many more…..

Classifying ones and zeros

➢ How are we going to classify observations as 1 or 0?


➢ Classify observations in category 1 (positive) when the predicted probability is above
a certain value (cut-off)
➢ The default cut-off value is usually 0.50


➢ Error always happens

Confusion matrix

➢ How are we going to select the best model?


➢ The matrix shows the overview of the prediction qualities


➢ Assess the matrix by means of measures
➢ False positive -> predicted a 1 instead of 0
➢ False negative -> predicted a 0 instead of 1

Accuracy

➢ How accurate is the classification?


➢ “Overall error rate” -> Percentage of false positives and false negatives
➢ The accuracy of the model is 1 minus the overall error rate -> Percentage of true positives and true negatives
➢ If the accuracy is 80% then the error rate is 20%
➢ Are you always interested in a high accuracy value? If so, why? If not, why/when not?
o A high accuracy is not that informative when one category dominates (in the table the zeros are very frequent: 90+900 = 990)
o If we focus only on the ones, there are just 10+40 = 50, and the accuracy within that group can be quite low
o The ones are usually more important than the zeros
➢ Sensitivity (recall): percentage of true positives within category 1 (calculating correct
ones)

➢ Specificity: percentage of true negatives within category 0 (calculating zeros)
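The measures above follow directly from the four cells of the confusion matrix. In the sketch below the counts reuse the numbers mentioned in the notes (990 zeros, 50 ones), but how they divide into correct and incorrect predictions is an assumption made purely for illustration, since the original table is not preserved.

```python
def classification_measures(tp, fn, fp, tn):
    """Summary measures computed from the four cells of a confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # correct ones within category 1
        "specificity": tn / (tn + fp),   # correct zeros within category 0
    }


# Assumed split: 10 ones found, 40 ones missed, 90 false alarms, 900 correct zeros
measures = classification_measures(tp=10, fn=40, fp=90, tn=900)
print(measures["accuracy"])               # 0.875
print(measures["sensitivity"])            # 0.2
print(round(measures["specificity"], 3))  # 0.909
```

Under this assumed split the accuracy looks high (87.5%) even though only 20% of the ones are detected, which is the point made above about accuracy being driven by the zeros.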

Shift the cut-off value

➢ What happens to the error rates of both categories when we lower the cut-off value?
Why?


▪ In the graph above, class 1 error rate is 20% which means that the
sensitivity is 80%
▪ Class 0 error rate is 70% which means specificity is 30%
o When you lower the cut-off value, it is easier to classify ones (sensitivity
increases)
o However, the prediction for zeros will be decreased (specificity decreases)
o And vice versa
o So, changing the cut-off value impacts the ability to predict the ones and zeros
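The trade-off can be sketched with a handful of hypothetical predicted probabilities. Observations are classified as 1 when the probability is at least the cut-off (an assumption; "strictly above" would work the same way here).

```python
def sensitivity_specificity(actual, probabilities, cutoff):
    """Classify with the given cut-off and return (sensitivity, specificity)."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    tp = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 1)
    tn = sum(1 for a, y in zip(actual, predicted) if a == 0 and y == 0)
    positives = sum(actual)
    negatives = len(actual) - positives
    return tp / positives, tn / negatives


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for cutoff in (0.5, 0.3):
    sens, spec = sensitivity_specificity(actual, probabilities, cutoff)
    print(cutoff, sens, round(spec, 2))
# 0.5 0.5 0.83
# 0.3 1.0 0.5
```

Lowering the cut-off from 0.5 to 0.3 raises the sensitivity from 50% to 100%, but the specificity drops from 83% to 50%.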

Quality classification method

➢ Many cut-off values (thresholds)


➢ The best way is to draw a graph called the ROC curve
o For every cut-off value/threshold, we can see the performance in terms of
sensitivity and specificity
o Also shows the error rates for different cut-off values
o Area below the blue line = AUC (Area Under ROC Curve)

➢ Given any class 0 error rate, you would like to have a high sensitivity
➢ Given any sensitivity level, you would like a low class 0 error rate
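A sketch of how the ROC curve and the AUC could be computed by hand: each distinct predicted probability is used as a cut-off, giving one point (class 0 error rate, sensitivity), and the area under the resulting curve is obtained with the trapezoid rule. The data are hypothetical, and statistical software normally does this for you.

```python
def roc_points(actual, probabilities):
    """One (class 0 error rate, sensitivity) point per cut-off value."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(probabilities), reverse=True):
        tp = sum(1 for a, p in zip(actual, probabilities) if a == 1 and p >= cutoff)
        fp = sum(1 for a, p in zip(actual, probabilities) if a == 0 and p >= cutoff)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
print(round(area_under_curve(roc_points(actual, probabilities)), 3))  # 0.833
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect classifier, so a higher AUC indicates a better model across all cut-off values.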

Even more measures


➢ Measures to predict just ones
o Precision and F1 (combination of precision and sensitivity (recall))
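The notes do not spell these out; a sketch assuming the standard definitions: precision is the share of predicted ones that are actually ones, and F1 is the harmonic mean of precision and sensitivity (recall). The counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Measures that look only at the ones (true negatives play no role)."""
    precision = tp / (tp + fp)   # of everything predicted as 1, how much is right
    recall = tp / (tp + fn)      # of all actual ones, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=10, fp=90, fn=40)
print(precision, recall, round(f1, 3))  # 0.1 0.2 0.133
```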

So there is a trade-off

➢ Same as changing the cut-off value, there is a trade-off between sensitivity and
precision
➢ Makes it difficult to predict well

Oracle output


➢ Mainly about comparing different models rather than looking at the values/percentage
individually

Logistic regression

➢ How do we arrive at a confusion matrix?


o By using logistic regression (a little bit different from linear regression)
➢ The goal is to predict a categorical outcome (1 versus 0)
➢ Analogous to the Multiple Regression Model
➢ Why can’t we use linear regression to predict 1 and 0?

o Focus on the y-axis (winner of best picture: 1 or 0)
o The values predicted by linear regression can be more than 1 or smaller than 0, whereas here we are only interested in values between 0 and 1
o There is no restriction on the predictions (the values may not always be useful)

Advantages of logistic regression

➢ Rather than a straight line we are drawing an s-shape


➢ It avoids predicting values above 1 or below 0
➢ Better in terms of probabilities


How do we create the S?

➢ Probability of a 1 is p
o Ex. The probability of receiving a spam email, probability of getting a certain
disease, etc.
➢ Value between 0 and 1
o Use “odds” -> a ratio of 2 probabilities
o p/(1−p) (so p is the probability of receiving a spam email and 1−p the probability of not receiving one)
o will be larger than 0
➢ ln(p/(1−p))
o This one takes all values
➢ Logistic regression model: ln(p/(1−p)) = β0 + β1x1 + … + βqxq
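The chain from probability to odds to log-odds, and back to the S-shaped curve, can be sketched as follows. It assumes the standard logistic model in which the log-odds are a linear function of x; the coefficients b0 = −1 and b1 = 0.5 are made up.

```python
import math


def odds(p):
    """Ratio of two probabilities: p against (1 - p). Always larger than 0."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds. Can take any value, negative or positive."""
    return math.log(odds(p))


def predicted_probability(x, b0, b1):
    """Invert the log-odds: p = 1 / (1 + e^-(b0 + b1*x)), the S-shaped curve."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


for x in (-4, 2, 10):
    p = predicted_probability(x, b0=-1.0, b1=0.5)
    print(x, round(p, 3), round(log_odds(p), 3))
# -4 0.047 -3.0
# 2 0.5 0.0
# 10 0.982 4.0
```

The log-odds move in a straight line with x, while the probability itself stays between 0 and 1.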

Question

➢ What do the training set, validation set, and test set have to do with this?
➢ Training set: to obtain estimates of the betas and to assess the “fit” (R²/Mallows’ Cp)
o Is about explaining
➢ Validation set: to transform p into classifications and to create a confusion matrix
o Prediction purposes (p)
➢ Test set: to use the best model with new information obtained
o Combines the two techniques to use in another different data set

Extra

Ho and Ha


➢ 0 is in the middle
➢ This is the situation of the null hypothesis (the beta is 0)
➢ The further a value lies from 0, the stronger the evidence against the null hypothesis
➢ The red area contains the “surprising values”, which are far from what we expect under H0
➢ Since the observed value lies in the red area (whose size is alpha), we can reject H0
➢ Even though we rejected the H0 we could still be wrong

Difference between correlation and causation

➢ Correlation implies that there is a statistical association/relationship between two variables
o Even though there is a correlation, it does not mean that one variable causes the other
o Correlation does not imply causation
➢ Causation implies that there is a causal relationship: a change in one variable directly causes a change in the other

Correlation ≠ Causality, because

➢ 1. It is a coincidence
o Ex. There is a correlation between the no. of Nobel laureates and chocolate
consumption per country
o Does it actually mean that there will be more Nobel prize winners if a country
consumes more chocolate? No
➢ Spurious correlation exists
➢ 2. There is a reverse causal relationship (from y to x rather than from x to y)
➢ 3. A factor is missing, a third variable (“omitted variable”)

2 and 3 (based on YouTube link)

➢ Ice cream sales -> drowning


o What is wrong here? There is a third factor missing (Z) which is the high
temperatures of the weather
➢ Marrying -> life expectancy
o Reverse relationship - life expectancy influences marital status
▪ Y -> X
➢ Night light -> short-sightedness
o A factor missing (Z) - short-sightedness genes from parents
➢ Self-esteem -> high grades
o Reverse relationship - high grades lead to self-esteem or any other combination
▪ Y -> X

P-values


➢ Usually “Reject H0: βj = 0 when p is small”
➢ Still, error can occur
➢ With α = 0.05 there is a 5% chance of rejecting H0 even though βj = 0 (type I error)

Week 6: Missing data and text analysis

Subjects

❖ Missing data
❖ Text analysis

Too few people participate

➢ Smaller sample than desired


➢ What is the disadvantage?
o Higher margins of error and larger confidence intervals (lower precision)
➢ When is non-response a problem?
o When it is not random – but selective
o Excluding people from the sample who refuse to answer the survey

Types of non-response

➢ No contact – impossible to reach


➢ Refusal – people who refuse to provide an answer
➢ Not able – people who are unable to respond
➢ The larger the sample, the smaller the margin of error (at a given confidence level)
➢ Incomplete data

o Ex. For case ID 3, there are missing numbers in V1, V3 and V5
o Very common in practice, incomplete data often occurs

Types of missing data

➢ Missing completely at random (MCAR)

o Probability of a missing value does not depend on the variables in the dataset (it
is entirely random / unsystematic)
o Maintain representativeness
o Usually nothing much to be done here to solve because it is not a major problem

➢ Missing at random (MAR)
o Probability of missing does not depend on the variable itself, but on some other variable(s) in the data
o If there are missing values for variable y, then the probability of such a missing value depends on variable x
o Systematic differences
o Ex. missing data on the blood pressure among young people because it is mainly
measured among the elderly
o Easy to solve as we can observe the variable x
➢ Missing not at random (MNAR)

o Probability of a missing value depends on the variable y itself
o In the case above, low IQ scores are the missing values
o Ex. voting preferences
o Why is this the most complicated form?
▪ We don’t really know the pattern of missing y’s

Solutions

➢ Ignore/delete observations with missing value


➢ Ignore/delete variables with missing values
➢ Systematically filling in missing values with estimated values
o For MCAR & MAR
o Imputation
➢ Others

Imputation

➢ Missing values for the birth weight of the baby (stage of pregnancy)
➢ Imputation with the average
o So there are missing values for birth weight (y); what will the chart look like after this type of imputation?
➢ Regression imputation
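The two imputation approaches named above can be sketched for the birth-weight example. The data are invented, and regression imputation is shown with a simple one-variable least-squares line (birth weight on weeks of pregnancy), which is an assumption about how the lecture applied it.

```python
def mean_imputation(y):
    """Replace every missing value (None) by the average of the observed values."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]


def regression_imputation(x, y):
    """Replace missing y values by predictions from a line fitted on complete cases."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    b1 = (sum((a - mean_x) * (b - mean_y) for a, b in pairs)
          / sum((a - mean_x) ** 2 for a, _ in pairs))
    b0 = mean_y - b1 * mean_x
    return [b0 + b1 * a if b is None else b for a, b in zip(x, y)]


weeks = [36, 38, 40, 42]              # stage of pregnancy (x)
weight = [2600, None, 3400, 3800]     # birth weight in grams (y), one missing

print(round(mean_imputation(weight)[1]))               # 3267
print(round(regression_imputation(weeks, weight)[1]))  # 3000
```

Mean imputation fills in the same value regardless of the stage of pregnancy, whereas regression imputation uses the observed relationship with x to fill in a value that fits the pattern.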

(Dis)advantages of imputation


Examples of text analysis

➢ Suppose in a bankruptcy there is a suspicion of fraud


➢ Has the CEO sent emails from employee e-mail boxes? Was the CEO involved in the
fraud or was it the employee?
➢ Possibly money has been channelled away (the fraudulent e-mail boxes)
➢ Company mailboxes
o Analyze the “DNA” of the emails from the CEO
▪ Number of punctuation marks used
▪ Average no. of words in a sent mail
▪ Assessing emotions based on positive and negative words
o Also from other employees
➢ So, what are the chances that a certain email is from the CEO or the employee?

Trump’s Tweets

➢ Text analysis can assess sentiment and can also be used for prediction


➢ Trump’s twitter account -> Tweets from Trump (Android) himself usually had negative
emotions while tweets from his staff (iPhone) were more positive

Text data

➢ Data mining
o Using analysis techniques to better understand patterns and relationships in a
large data set
➢ Text data are everywhere
➢ The purpose of text mining is to translate unstructured text data into useful (numerical)
information – which is then used in data mining

Why are text data so complex?

➢ Text data are unstructured

o Words mean something different in other contexts
o Emotions, emojis, etc.
o Linguistic structure, some word order plays a role
➢ Texts can be dirty
o We need to clean the text before analysing it
o Spelling, unexpected punctuation, etc
o Synonyms, abbreviations, etc.

Corpus

➢ A collection of documents is called a corpus


o A document can be a comment, a tweet, a hotel review, a page, or a chapter from a book,
etc.
➢ A document consists of individual terms
➢ Terms are often words but can also be multiple words or sentences
➢ “Bag of words”
o Treat each document as a collection of individual words
o Ignore grammar, word order, sentence structure, punctuation
o Words can occur several times
o Works well for a lot of tasks
o Often used in sentiment analysis
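
A bag of words can be sketched with a simple counter. The review text is invented, and the cleaning rule (anything that is not a letter or digit becomes a space) is an assumption made for brevity.

```python
from collections import Counter

def bag_of_words(document):
    """Count how often each word occurs, ignoring order, grammar and punctuation."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in document)
    return Counter(cleaned.split())

review = "Great hotel, great staff. The hotel was clean."
bag = bag_of_words(review)
print(bag)
```

The counter keeps how often each word occurs but loses the order, which is exactly the simplification the bag-of-words approach accepts.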

Preparing text

➢ Tokenization
o Splitting text into tokens (such as words/terms)
➢ Normalization: standardizing text
o Lowercase, no punctuation, restore/remove other characters
o Canada-> canada
o 2mrrw -> tomorrow
➢ Stopwords are removed
o Words that often occur in a language but are not relevant to the analysis
o Ex. a, for, that, the, is, etc.
➢ Stemming of words
o Keeping the “root”
o Prefixes and suffixes are removed
o Plurals become singulars
o Ex. Cars and car are the same term
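
The preparation steps above can be sketched as a small pipeline. The stopword list is a short illustrative sample, and `naive_stem` is a deliberately crude stand-in that only strips a plural "s". Real stemmers such as the Porter stemmer handle far more cases.

```python
import re

# Small illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "for", "that", "the", "is", "are", "and"}

def tokenize(text):
    """Normalise to lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Very crude stemmer: only strips a plural 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def prepare(text):
    """Tokenise, remove stopwords, then stem what is left."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]

terms = prepare("The cars are for Canada, and that car is red.")
print(terms)
```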

Term frequency

➢ Counting terms
➢ Binary term-document matrix
o Whether a word occurs in a document or not

➢ Frequency term-document matrix
o How often does a word occur in a document
o Relative measure is used
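
A sketch of both term-document matrices for three invented, already-prepared documents. The relative measure is assumed here to be the term count divided by the document length; the notes do not spell out the exact definition.

```python
# Three tiny, already-prepared documents (invented for illustration).
docs = {
    "d1": ["good", "hotel", "good", "staff"],
    "d2": ["bad", "hotel"],
    "d3": ["good", "breakfast"],
}

# Columns of the matrix: every distinct term in the corpus, in alphabetical order.
vocabulary = sorted({term for tokens in docs.values() for term in tokens})

# Binary matrix: 1 if the term occurs in the document, otherwise 0.
binary = {d: [int(t in tokens) for t in vocabulary] for d, tokens in docs.items()}

# Frequency matrix with a relative measure: count divided by document length.
relative = {
    d: [tokens.count(t) / len(tokens) for t in vocabulary]
    for d, tokens in docs.items()
}

print(vocabulary)
for d in docs:
    print(d, binary[d], relative[d])
```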

IDF

➢ A word is less important if it occurs too often in the corpus


➢ Inverse document frequency (IDF): “a boost for terms being unique”
➢ No. of documents divided by no. of documents containing the term t
➢ If 1 document contains t, IDF is large -> there is a boost
➢ If all documents contain t, IDF is small -> no boost
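➢ Formally: IDF(t) = N / n_t, where N is the number of documents in the corpus and n_t the
number of documents containing t (the logarithm of this ratio is commonly used)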

Visualizing IDF

➢ 100 documents in the corpus
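
A small sketch of how IDF falls as a term appears in more of the 100 documents. It assumes the common logarithmic variant, log10(N / n_t); the log base and any smoothing differ between tools, so the exact values are illustrative only.

```python
import math

N = 100  # number of documents in the corpus, as in the example above

def idf(n_docs_with_term, n_docs=N):
    """IDF as log10(N / n_t); assumes the term occurs in at least one document."""
    return math.log10(n_docs / n_docs_with_term)

for n_t in (1, 10, 50, 100):
    print(n_t, round(idf(n_t), 3))
```

A term found in a single document gets the largest boost, while a term found in all 100 documents gets an IDF of zero under this variant.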

TF-IDF

➢ How often does a term t occur in a document d?


➢ Multiplying the term frequency with the IDF
➢ TF-IDF
➢ High value of TF-IDF means that the term t often appears in a document but not often in
other documents/corpus
➢ A low value (close to 0) indicates that term t does not appear often in the document and
appears often in others
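
The multiplication of term frequency and IDF can be sketched as follows. The documents are invented, TF is taken as a relative frequency, and IDF uses log10(N / n_t). Both choices are assumptions, since conventions vary.

```python
import math

# Invented, already-prepared documents.
docs = {
    "d1": ["tax", "tax", "economy", "people"],
    "d2": ["war", "people", "peace"],
    "d3": ["economy", "people", "jobs"],
}

def tf(term, tokens):
    """Relative frequency of the term within one document."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """log10(N / n_t); assumes the term occurs in at least one document."""
    n_t = sum(term in tokens for tokens in docs.values())
    return math.log10(len(docs) / n_t)

def tf_idf(term, doc_id):
    return tf(term, docs[doc_id]) * idf(term)

for term in ("tax", "economy", "people"):
    print(term, round(tf_idf(term, "d1"), 3))
```

In d1, "tax" scores highest because it is frequent there and absent elsewhere, while "people" scores zero because every document contains it.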

Another variation of Term-document matrix

➢ TF-IDF values for each document d and each term t

What is the advantage of using TF-IDF values rather than TF values when analysing the State of
the Union addresses of US presidents?

➢ TF-IDF shows you the uniqueness of the speeches of each president
➢ And also their political priorities/agendas

What else can we do

➢ Clustering documents
o Which documents form a cluster of similar reviews

➢ Word clouds
o visualization of most common words

➢ Distance between documents
o How (dis)similar are two documents d1 and d2

➢ N-grams
o list of n consecutive words

o Words that frequently occur directly after each other
o Bigrams, trigrams, etc.
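
Two of the ideas above, n-grams and the distance between documents, can be sketched together. The notes do not say which distance measure the lecture used; cosine distance on word counts is assumed here as one common choice, and the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_distance(tokens_a, tokens_b):
    """1 minus the cosine similarity of the two word-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (norm_a * norm_b)

d1 = "the room was very clean".split()
d2 = "the room was very dirty".split()
d3 = "breakfast ended too early".split()

print(ngrams(d1, 2))  # bigrams of d1
print(round(cosine_distance(d1, d2), 2), round(cosine_distance(d1, d3), 2))
```

Note that d1 and d2 come out as very similar even though they express opposite opinions, which is one reason why word counts alone are not enough for sentiment analysis.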

Sentiment analysis

➢ Extracting reviews
➢ Ex. What people think about a new product in a market, opinions about a political party,
how people react to certain marketing actions, etc.
➢ Polarity measures sentiment as a numerical value (between -1 and +1)
o Difference between positive and negative words (as percentage)
o Often: negative, neutral, positive
➢ Subjectivity, whether it is opinion or fact based


➢ Sentiment lexicons
➢ Stance analysis (whether people are in favour of a measure or not in percentage)
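
A minimal sketch of a lexicon-based polarity score, computed as the difference between positive and negative words as a share of all words. The lexicon is a tiny hypothetical sample and the neutral threshold is arbitrary; dedicated sentiment tools use much larger lexicons and also handle negations.

```python
# Tiny hypothetical sentiment lexicon; real lexicons contain thousands of words.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def polarity(text):
    """(positive words - negative words) / total words, a value in [-1, +1]."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def label(score, threshold=0.05):
    """Map a polarity score to negative / neutral / positive (threshold is arbitrary)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

review = "Great phone, good camera, terrible battery."
score = polarity(review)
print(round(score, 2), label(score))
```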

Week 7: Recap of the course; Exam preparation

Role of data and analytics in companies

➢ Two books uploaded on BS

We covered a lot

➢ Role of data, analytics and visualisation in firms


➢ Techniques to analyse

Mindmap

➢ Good idea to relate the topics to each other

Overview of lectures

Oracle: infographics

➢ Popular way to communicate data visualization message to audience

GIS in Oracle

➢ Geographic information systems charts – merging maps and statistics to present data
collected over different geographical regions
➢ Can be done in numerous ways in Oracle


GIS in Oracle: heat maps

Dashboards

➢ Data dashboard: data-visualization tool that illustrates multiple metrics and
automatically updates these metrics as new data become available
➢ Dashboards have become more widespread since COVID-19


Preparations

➢ You should be able to interpret tables and graphs


➢ Interpret indicators/PivotTables and visualisations (1), regression and confusion
matrices (2) and text analysis (3)
➢ Not needed to calculate t-values and p-values
o The intuition behind it does matter
o know how to do something in Oracle/Excel/Voyant
o Memorize formula
➢ Confusion matrix

Silhouette score
➢ “One rule of thumb is that the ratio of between-cluster distance … to average within-cluster
distance should exceed 1.0 for useful clusters”


o So if the distances are 0 then not useful

Other criteria

➢ Size of the clusters


➢ Interpretation clusters
➢ How close are the observations to the centroids?

➢ Similarity in 0/1
➢ Matching coefficient
➢ Jaccard’s
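
The two similarity measures for 0/1 variables can be sketched as below, using their standard definitions: the matching coefficient counts all agreements (1-1 and 0-0), whereas Jaccard's coefficient ignores 0-0 matches. The two customers are invented examples.

```python
def matching_coefficient(a, b):
    """Share of variables on which both observations agree (1-1 and 0-0)."""
    agree = sum(x == y for x, y in zip(a, b))
    return agree / len(a)

def jaccard_coefficient(a, b):
    """Share of 1-1 matches among variables where at least one observation has a 1."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

# Each position is a yes/no variable, e.g. whether a customer bought a product.
customer_a = [1, 0, 0, 1, 0]
customer_b = [1, 0, 0, 0, 0]

print(matching_coefficient(customer_a, customer_b))
print(jaccard_coefficient(customer_a, customer_b))
```

Jaccard's coefficient is lower here because the shared zeros (products neither customer bought) do not count as evidence of similarity.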

Interpret the coefficients


➢ Ceteris paribus

Large samples and p-values

➢ P-value approaches 0 with large n, even for very small effects


➢ “Practical significance should always be considered in conjunction with statistical
significance.” It is “the real-world impact the result of statistical inference will have
on business decisions.”

Correlation ≠ causality

➢ 1. Coincidental correlations
➢ 2. There is a reverse causal relationship
➢ 3. A factor is missing

Overfitting

➢ Overfitting: the model explains very well but does not predict well
➢ How can this happen? Why is it a problem?
➢ Solution: Training set; Validation set; Test set
➢ Which situations lead to poor predictions on new datasets?

Confusion matrix

➢ Shows the correct and incorrect classifications

➢ True positives, true negatives, false negatives, and false positives
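
A sketch of how the four cells of a confusion matrix are counted and how the usual measures follow from them. The actual and predicted labels are invented; accuracy, precision and recall use their standard definitions.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)   # share of correct classifications
precision = cm["TP"] / (cm["TP"] + cm["FP"])     # how many predicted positives are real
recall = cm["TP"] / (cm["TP"] + cm["FN"])        # how many real positives are found

print(cm, accuracy, precision, recall)
```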

Confidence interval

Line charts

➢ Suitable for trends and patterns


➢ Interval scale

Lie factor

Last tutorial (24 December)

Text analysis

➢ The text is analysed so that we can identify the profile of the text (DNA of the text)
➢ We're able to investigate fraudulent or non-fraudulent firms

Sentiment analysis

➢ Used in extracting reviews, politics, and social media


➢ Focus on subjectivity and polarity (whether they are positive, negative, or neutral)

Challenges of sentiment analysis

➢ Language aspects, cannot deal with sarcasm well


➢ Implicit opinions
➢ TF is not the solution; it is more about the associations between words
➢ Negations
➢ Emojis, quotes

Some final remarks

➢ MNAR is the most difficult


➢ k represents the number of neighbors
➢ Should be able to interpret tables and graphs
➢ No calculations
➢ Nothing on Oracle/Excel/Voyant
➢ Study the confusion matrix well (and their respective measures)
➢ Multiple choice 10-15 questions
➢ Open questions will be questions from the weekly lecture themes
➢ Practice on Grasple
