Standardization & Probability: Empirical Methodologies & Theory of Science

Uploaded by

Sophia Lindholm

Standardization & Probability

Empirical Methodologies & Theory of Science


04.10.2024 2

What we’re doing today


1. Standardization
a. What?
b. Why?
c. How?
2. Probability
3. Worked example
04.10.2024 4

What do you think standardization is?


04.10.2024 5

N: “How was your kebab?”


R: “It was a 5!”
> Raw scores make little sense
> Standardization helps!
04.10.2024 6

Standardization is the process of establishing and applying a set of
uniform criteria, guidelines, or practices to ensure consistency and
quality across different entities, processes, or products.
In industry, the goal of standardization is to ensure that products,
services, or processes are reliable, compatible, and understandable.
We can translate that to statistics.
04.10.2024 7

*
In statistics, standardization is a way to transform data so that it
becomes easier to compare different variables, especially when they’re
on different scales or units.
For example, you might have one set of data measured in kilograms and
another in centimeters—standardizing helps you compare them more
easily by putting them on the same scale.
04.10.2024 8

N: “How was your kebab?”


R: “It was a 5 out of 5!”
04.10.2024 9

*
However!
Even with standardization, data may
require attentive interpretation!

• Implicit standards
• Manipulation
• Bad questions or measurements
04.10.2024 12

Why do you think we standardize?


04.10.2024 13

Imagine you have two types of measurements:


• Kebab quality measured in stars: 0 to 5.
• Spiciness measured in Scoville heat units: 0 to 9 million.

And now you want to know whether kebab quality depends on spiciness!
04.10.2024 15

Since kebab quality and spiciness are measured on very different scales,
comparing or analyzing them directly could be misleading.
Standardization helps by transforming these values to a common scale:
one where the mean (average) is zero and the spread (standard deviation) is
one for every variable.
04.10.2024 16

*Standardization – Why?
1. Comparability: By transforming variables to a common scale, you can compare different
datasets or features that originally have different ranges or units.
We want to compare kebab quality and spiciness score!
2. Modeling: Many machine learning (ML) algorithms work better, or only work at all, when input
features are standardized, because it prevents features with large ranges from dominating.
We want to be able to put it into ML algorithms (bc that’s everything today)!
3. Normalization of Distributions: Standardizing data puts different distributions
on the same scale, which helps identify patterns or anomalies.
We want to see whether e.g. some spicy kebabs are always better!
4. Interpretation of Z-scores: Z-scores tell you how many standard deviations a data point
is from the mean. This lets you interpret its relative position within its distribution.
We want to be able to tell whether the Kebabistan kebab is really above average! ;)
04.10.2024 17

Real Life Example

Below you see my kebab ratings of Nørrebrogade places and an estimated
spiciness.
Now, we’ll do some stats!
1. Calculate the mean for the scores.
2. Calculate the difference to the mean for each observation.
3. Calculate the mean of the differences (absolute differences = without sign).
   That’s (roughly) the standard deviation!

Kebab place           Kebab score   Standard kebab formula   Standardized values
Kösem                 3.6
Kebabistan            3.4
Dürüm synfonie        3.6
Kebab bar             2.0
Fuldkorn kebab        2.8
Ramo's                3.0
Gaza grill            3.2
Durum bar             3.0
Flamingo              3.0
Berlin Döner Kebab    2.4
04.10.2024 18

Ras’ Slide (relevant for the exercise)


• Calculate the difference to the mean for these observations
• Now: Calculate the average of the differences
• (absolute differences = without sign)
• I get 9.62 cm

Height (cm)     Distance to the mean |height − mean|
177.0           |177.0 − 170| = 7.0
159.5           |159.5 − 170| = 10.5
182.0           |182.0 − 170| = 12.0
159.9           |159.9 − 170| = 10.1
170.5           |170.5 − 170| = 0.5
166.9           |166.9 − 170| = 3.1
163.0           |163.0 − 170| = 7.0
152.8           |152.8 − 170| = 17.2
190.5           |190.5 − 170| = 20.5
178.3           |178.3 − 170| = 8.3
Mean: 170.04    Mean: 9.62
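
The exercise can also be checked in a few lines of Python. This is only a sketch: the helper name mean_absolute_difference is made up for illustration, and the last line prints Python's formal population standard deviation purely for comparison with the simplified rule used on this slide.

```python
from statistics import mean, pstdev

# Heights in cm, copied from the table on this slide.
heights = [177.0, 159.5, 182.0, 159.9, 170.5, 166.9, 163.0, 152.8, 190.5, 178.3]


def mean_absolute_difference(values):
    """Average distance to the mean, ignoring the sign (hypothetical helper)."""
    centre = mean(values)
    return mean(abs(v - centre) for v in values)


avg_height = mean(heights)
avg_distance = mean_absolute_difference(heights)

print(f"mean height: {avg_height:.2f} cm")                 # 170.04 cm
print(f"mean absolute difference: {avg_distance:.2f} cm")  # 9.62 cm
# The formal definition squares the differences, so it gives a different number.
print(f"formal population SD: {pstdev(heights):.2f} cm")
```

The slide subtracts the rounded mean (170) while the code uses the exact mean (170.04); for these ten heights both give 9.62 cm.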


04.10.2024 19

Ras’ Slide
(relevant for the exercise)
• The average height is 175 cm but people’s heights differ
• Standard deviation: How far are people’s heights from the average
• ...on average?
• (NOT!!! the exact definition but a useful mnemonic rule)
• Full definition next time
• Approx. ⅔ of the observations lie within ±1 SD (standard deviation) of the mean
• … IF the observations follow a normal distribution (bell curve)
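
The "approx. ⅔" figure can be checked numerically. The sketch below assumes a perfect normal distribution and uses the error function from Python's math module; the function name share_within is made up for illustration.

```python
import math


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


print(f"within 1 SD: {share_within(1):.4f}")  # 0.6827, i.e. roughly two thirds
print(f"within 2 SD: {share_within(2):.4f}")  # 0.9545
```

Real data are rarely exactly normal, so these shares are a rule of thumb rather than a guarantee.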
04.10.2024 20

• The mean kebab score is 3 stars but kebab scores differ!
• Standard deviation: How far are kebab scores from the mean
• ...on average?
• Kebab scores have a standard deviation of 0.36 (using the mnemonic rule: the mean of the absolute differences).
• We write σ (sigma) = 0.36.
04.10.2024 22

How do you think we standardize?


04.10.2024 23

*
This is a standard normal distribution.
We want our data on this scale (important:
look at the values on the axis!).
04.10.2024 24

* Standardization — How?
As for everything, we have a formula. For this we need variables.
1. Z is what we want to get out, the Ztandardised Zcore (z-score).
2. σ (sigma) is the standard deviation.
3. X is the data point.
4. μ (mu) is the mean.
Standardization Formula: Z = (X − μ) / σ
04.10.2024 25

Kebab                 Kebab score   Standard kebab formula   Standardized values
Kösem                 3.6           Z = (3.6 − 3)/0.36       3.6 → 1.67
Kebabistan            3.4           Z = (3.4 − 3)/0.36       3.4 → 1.11
Dürüm synfonie        3.6           Z = (3.6 − 3)/0.36       3.6 → 1.67
Kebab bar             2.0           Z = (2.0 − 3)/0.36       2.0 → -2.78
Fuldkorn kebab        2.8           Z = (2.8 − 3)/0.36       2.8 → -0.56
Ramo's                3.0           Z = (3.0 − 3)/0.36       3.0 → 0.00
Gaza grill            3.2           Z = (3.2 − 3)/0.36       3.2 → 0.56
Durum bar             3.0           Z = (3.0 − 3)/0.36       3.0 → 0.00
Flamingo              3.0           Z = (3.0 − 3)/0.36       3.0 → 0.00
Berlin Döner Kebab    2.4           Z = (2.4 − 3)/0.36       2.4 → -1.67
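
The same table can be computed in Python. This is a minimal sketch that plugs in the mean (3) and the σ (0.36) from the slides as fixed constants; the names MU, SIGMA and z_score are chosen here for illustration only.

```python
# Kebab scores from the table on this slide.
scores = {
    "Kösem": 3.6,
    "Kebabistan": 3.4,
    "Dürüm synfonie": 3.6,
    "Kebab bar": 2.0,
    "Fuldkorn kebab": 2.8,
    "Ramo's": 3.0,
    "Gaza grill": 3.2,
    "Durum bar": 3.0,
    "Flamingo": 3.0,
    "Berlin Döner Kebab": 2.4,
}

MU = 3.0      # mean kebab score (from the slides)
SIGMA = 0.36  # spread (from the slides, simplified rule)


def z_score(x, mu=MU, sigma=SIGMA):
    """Z = (X - mu) / sigma: distance from the mean in units of sigma."""
    return (x - mu) / sigma


# Round to two decimals, as in the table.
z_scores = {place: round(z_score(score), 2) for place, score in scores.items()}

for place, z in z_scores.items():
    print(f"{place:<20} {z:+.2f}")
```

A positive value means the place scores above the mean, a negative value below it, and 0 means exactly average.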


04.10.2024 26

And if we would plot this now, it’d look like this:
(And if we’d do the same to the Scoville values, we’d have
them on the same scale and can compare them :))
04.10.2024 28

Probability
04.10.2024 29

Probability
I love probability, it’s everywhere.
And it has the power to express uncertainty.

Traditional examples are usually coin flips, dice rolls, or card draws.


But there are many more examples!

Task: Everyone comes up with an example now (30 secs).


04.10.2024 30

*How do we calculate probability?


3. Sample Space (S)
The sample space is the set of all possible outcomes.
E.g., for rolling a six-sided die, the sample space is S = {1, 2, 3, 4, 5, 6}.
4. Event Space (E)
An event is an outcome, or a group of outcomes, that we are interested in; the event space is the set of those outcomes.
E.g., for a die roll, an event could be rolling an even number (E = {2, 4, 6}).
• In real examples mapping event spaces can be tricky (as you’ll see in a sec).
5. Probability (P)
Probability is a measure of how likely an event is to occur. When all outcomes are equally likely, it is defined as:
P(E) = Number of outcomes (of interest) / Total number of possible outcomes
This is called Simple Probability.
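
As a small illustration, the counting rule can be written out in Python. The sketch assumes every outcome in the sample space is equally likely, and the function name simple_probability is made up for this example.

```python
from fractions import Fraction


def simple_probability(event, sample_space):
    """P(E) = outcomes of interest / all possible outcomes (equally likely outcomes assumed)."""
    sample_space = set(sample_space)
    event = set(event) & sample_space  # only outcomes that can actually occur count
    return Fraction(len(event), len(sample_space))


die = {1, 2, 3, 4, 5, 6}   # sample space S
even = {2, 4, 6}           # event space E

print(simple_probability(even, die))  # 1/2
print(simple_probability({6}, die))   # 1/6
```

Using Fraction keeps the result exact, so it reads the same way as the formula on the slide.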
04.10.2024 31

Example of Simple Probability


Is it my birthday today?
Sample space: {1, 2, …, 365}
Event space: {X}, where X is the day of my birthday
Theoretical Probability: 1 / 365 ≈ 0.0027
04.10.2024 32

*
Conditional Probability
Conditional probability is the probability of an event happening given that
another event has already occurred.
It's a way to update our probabilities when we
have additional information.
Task: Everyone comes up with an example of
conditional probability now (30 secs).
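
One possible example, sketched in Python: the chance of a six changes once we are told the roll was even. The sketch assumes a fair die, and the function name conditional_probability is hypothetical.

```python
from fractions import Fraction


def conditional_probability(event, given, sample_space):
    """P(event | given) for equally likely outcomes: count only outcomes where 'given' holds."""
    given = set(given) & set(sample_space)
    if not given:
        raise ValueError("the condition can never occur")
    return Fraction(len(set(event) & given), len(given))


die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}

print(Fraction(1, 6))                               # P(six) with no extra information: 1/6
print(conditional_probability({6}, even, die))      # P(six | roll is even): 1/3
```

The extra information shrinks the set of outcomes we still consider possible from six to three, which is why the probability goes up.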
04.10.2024 33

Simple Probability —
What is the chance of rolling a six?
Throw: 1 2 3 4 5 6
X marks the six: 1 of the 6 possible outcomes, so P = 1/6 ≈ 0.17.
04.10.2024 35

Conditional Probability —
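What is the chance of rolling two sixes?
Two throws: a grid with the first throw (1 to 6) across and the second throw (1 to 6) down.
X marks the single cell (6, 6): 1 of the 36 possible outcomes, so P = 1/6 × 1/6 = 1/36 ≈ 0.028.
Given that the first throw is a six, the chance that the second one is also a six is still 1/6.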
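
The grid can be enumerated in Python to count the outcomes. This is a sketch that assumes two fair, independent throws; all variable names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
throws = list(product(faces, repeat=2))  # all 36 (first throw, second throw) pairs

double_six = [t for t in throws if t == (6, 6)]
first_is_six = [t for t in throws if t[0] == 6]

# Chance of two sixes among all 36 pairs.
p_double_six = Fraction(len(double_six), len(throws))
# Chance of a second six once we already know the first throw was a six.
p_second_six_given_first = Fraction(len(double_six), len(first_is_six))

print(p_double_six)              # 1/36
print(p_second_six_given_first)  # 1/6
```

Knowing the first throw does not change the odds for the second one, because the throws are independent.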
04.10.2024 37

A Classic Example: The Birthday Problem


What is the chance that two people in here have
the same birthday?
• Not the same age, just the same day ;)
• Can be any day in the year!
Task: Discuss for 3 minutes!
04.10.2024 38

Let’s test! ;)
1. Go to https://www.random.org/integers/ (or scan the QR Code)
2. Make a list of X numbers between 1 and 365, organized in a single column;
X = number of people in this room
3. It should look like this (to the right).
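
The same experiment can be repeated many times in Python instead of once per person. This is only a sketch: it uses Python's built-in pseudo-random generator rather than random.org, treats all 365 days as equally likely, and the function names, the 10,000 repetitions and the seed are arbitrary choices.

```python
import random


def draw_birthdays(n_people, rng):
    """One integer between 1 and 365 per person, like the generated list."""
    return [rng.randint(1, 365) for _ in range(n_people)]


def has_duplicate(numbers):
    """True if at least one number appears more than once."""
    return len(set(numbers)) < len(numbers)


def share_with_duplicate(n_people, n_rooms=10_000, seed=0):
    """Fraction of simulated rooms in which at least two people share a number."""
    rng = random.Random(seed)
    hits = sum(has_duplicate(draw_birthdays(n_people, rng)) for _ in range(n_rooms))
    return hits / n_rooms


# Replace 23 with the number of people in the room.
print(share_with_duplicate(23))
```

Because it is a simulation, the printed share varies slightly with the seed and the number of repetitions.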
04.10.2024 39

• How many of you had no duplicate numbers?


• How does that reflect on your estimate for the minimum
number required for two to share a birthday?
04.10.2024 40

TIME FOR KNIME


04.10.2024 41

Opposite Approach (no shared birthdays)


It's easier to first calculate the opposite: that no one shares a birthday, and then subtract it
from 1 (1 is always 100%, all possibilities).
Imagine everyone standing in a row. The very first person A cannot share a birthday with
themselves. Person B must have a different birthday from person A. So, there are 364
days available. The probability of not sharing a birthday with person A is 364 / 365.
For person C, they must have a different birthday from persons A and B, so there are 363
days left. The probability of no shared birthday for the third person is 363 / 365.
So we write down 364 / 365 * 363 / 365 * … until we’ve reached the end of the row of people.
In other words, until we’ve reached X people. For X = 23, P(no shared birthday) ≈ 0.4927.
So P(shared birthday for 23 people) = 1 - P(no shared birthday) ≈ 1 - 0.4927 = 0.5073
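
The product described above is tedious by hand but short in code. A minimal Python sketch, assuming 365 equally likely birthdays and no leap years; the function names are made up for illustration.

```python
def p_no_shared_birthday(n_people, days=365):
    """Multiply 365/365 * 364/365 * 363/365 * ... for n_people in a row."""
    p = 1.0
    for k in range(n_people):
        p *= (days - k) / days  # k birthdays are already taken
    return p


def p_shared_birthday(n_people, days=365):
    """Opposite approach: 1 minus the chance that nobody shares a birthday."""
    return 1.0 - p_no_shared_birthday(n_people, days)


print(f"{p_no_shared_birthday(23):.4f}")  # 0.4927
print(f"{p_shared_birthday(23):.4f}")     # 0.5073
```

Calling p_shared_birthday with the number of people in the room gives the theoretical value to compare with the class experiment.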
04.10.2024 43

Thanks! :)
