DV Stat
Contents
1 Introduction
2 Describing a Single Set of Data
3 Histogram
4 Dispersion
5 Correlation
Introduction
Describing a Single Set of Data
Histogram
Some statistics on data
Central Tendencies
Central Tendencies (Contd.)
Median
We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).
Note
The data points should be sorted.
Note
Notice that, unlike the mean, the median doesn’t fully depend on every value
in your data. For example, if you make the largest point larger (or the smallest
point smaller), the middle points remain unchanged, which means so does
the median.
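A short sketch of a median function along these lines (in Python, as in the rest of the course; the names are illustrative):

```python
from typing import List

def median(xs: List[float]) -> float:
    """The middle-most value (or the average of the two middle-most values)."""
    sorted_xs = sorted(xs)          # the data points must be sorted first
    n = len(sorted_xs)
    midpoint = n // 2
    if n % 2 == 1:
        return sorted_xs[midpoint]  # odd count: single middle value
    # even count: average the two middle values
    return (sorted_xs[midpoint - 1] + sorted_xs[midpoint]) / 2
```

Note that `median([1, 2, 5, 9, 10])` and `median([1, 2, 5, 9, 1000])` are both 5 — enlarging the largest point leaves the median unchanged, as described above.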
Central Tendencies
Quantile
Mode
Dispersion
Variance
Variance
A more complex measure of dispersion is the variance, which is computed as:

S² = Σᵢ (xᵢ − x̄)² / (n − 1)

(Here, x̄ is the mean of the data set.)
This can be implemented as:
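A minimal sketch in Python (the `mean` helper and the names are illustrative, not fixed by the slides):

```python
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def variance(xs: List[float]) -> float:
    """Sum of squared deviations from the mean, divided by n - 1."""
    assert len(xs) >= 2, "variance requires at least two data points"
    n = len(xs)
    x_bar = mean(xs)
    deviations = [x - x_bar for x in xs]
    return sum(d ** 2 for d in deviations) / (n - 1)
```

For example, `variance([1, 2, 3, 4, 5])` is 10 / 4 = 2.5.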
Standard Deviation
The range, similarly, is in the same units as the data. The variance, on the other hand, has units that are the square of the original units.
Standard Deviation
We often look instead at the standard deviation, which is simply
the square root of the variance:
import math
from typing import List

def standard_deviation(xs: List[float]) -> float:
    # square root of the variance (uses the variance function from earlier)
    return math.sqrt(variance(xs))
Inter-quartile Range
Covariance
Covariance
Variance measures how a single variable deviates from its mean
whereas covariance measures how two variables vary alongside
each other from their means:
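A sketch of how this might look in Python (the sum-of-paired-products form is one standard way to write it; the `mean` helper and names are illustrative):

```python
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def covariance(xs: List[float], ys: List[float]) -> float:
    """Average (over n - 1) of the paired products of deviations from the means."""
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
```

A positive covariance means the variables tend to be above (or below) their means together; a negative one means they tend to sit on opposite sides of their means.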
Covariance (Contd.)
Correlation
Correlation
It’s more common to look at the correlation, which divides out the
standard deviations of both variables:
Note
The correlation is unitless and always lies between –1 (perfect anti-correlation) and 1 (perfect
correlation).
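A self-contained sketch in Python (helper names are illustrative; the zero-variation convention is a common choice, not dictated by the slides):

```python
import math
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def covariance(xs: List[float], ys: List[float]) -> float:
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def standard_deviation(xs: List[float]) -> float:
    x_bar = mean(xs)
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1))

def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance divided by the standard deviations of both variables."""
    stdev_x, stdev_y = standard_deviation(xs), standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    return 0.0  # if either variable has no variation, correlation is zero
```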
More about Correlation
More about Correlation
Correlation tells you nothing about how large the relationship is.
For example:
The variables:
x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]
are perfectly correlated, but (depending on what you’re measuring)
it’s quite possible that this relationship isn’t all that interesting.
Correlation and Causation
Contents
1 Introduction
2 Dependence and independence of events
3 Conditional Probability
4 Bayes’s Theorem
5 Random Variables
6 Continuous Distributions
7 Probability Density Function
8 Cumulative Distribution Function
9 The Normal Distribution
10 The Central Limit Theorem
Introduction
Dependence and independence of events
Conditional Probability
Conditional Probability (Contd.)
When E and F are independent, you can check that this gives:
P(E | F ) = P(E)
which is the mathematical way of expressing that knowing F
occurred gives us no additional information about whether E
occurred.
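To see this concretely, here is a small sketch (the two-coin example is my illustration, not from the slides): with two fair coin flips, let E be "first flip is heads" and F be "second flip is heads", and enumerate the sample space:

```python
from itertools import product

outcomes = list(product("HT", repeat=2))  # the four equally likely outcomes

def p(event):
    """Probability of an event (a set of outcomes) under equal likelihood."""
    return len(event) / len(outcomes)

E = {o for o in outcomes if o[0] == "H"}  # first flip is heads
F = {o for o in outcomes if o[1] == "H"}  # second flip is heads

p_E_given_F = p(E & F) / p(F)             # P(E | F) = P(E, F) / P(F)
# E and F are independent, so P(E | F) equals P(E)
```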
Bayes’s Theorem
Bayes’s Theorem
The event F can be split into the two mutually exclusive events “F and E” and “F and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:

P(F) = P(F, E) + P(F, ¬E)

so that:

P(E | F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | ¬E) P(¬E)]

which is how Bayes’s theorem is often stated.
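The stated form translates directly into code; a sketch (the function and argument names are my own):

```python
def bayes_theorem(p_f_given_e: float, p_e: float, p_f_given_not_e: float) -> float:
    """P(E | F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | not-E) P(not-E)]."""
    p_not_e = 1.0 - p_e
    numerator = p_f_given_e * p_e
    return numerator / (numerator + p_f_given_not_e * p_not_e)
```

For instance, for a hypothetical test that detects a condition 99% of the time, with a 1% false-positive rate, for a condition affecting 1 in 10,000 people, `bayes_theorem(0.99, 0.0001, 0.01)` is roughly 0.0098 — a positive result is still overwhelmingly likely to be a false positive.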
Random Variables
Continuous Distributions
Probability Density Function
Cumulative Distribution Function
The Normal Distribution
Normal Distribution (Contd.)
Normal Distribution (Contd.)
The Central Limit Theorem
Central Limit Theorem (Contd.)
Central Limit Theorem (Contd.)
The mean of a Bernoulli(p) variable is p, and its standard deviation is √(p(1 − p)).

The central limit theorem says that as n gets large, a Binomial(n, p) variable is approximately a normal random variable with mean µ = np and standard deviation σ = √(np(1 − p)).
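A quick sketch of this in code (the simulation approach is my illustration): draw many Binomial(n, p) samples as sums of Bernoulli trials and compare the sample mean with np:

```python
import random

def bernoulli_trial(p: float) -> int:
    """1 with probability p, 0 with probability 1 - p."""
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    """Sum of n Bernoulli(p) trials."""
    return sum(bernoulli_trial(p) for _ in range(n))

random.seed(0)                               # fixed seed for reproducibility
n, p = 100, 0.5
samples = [binomial(n, p) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)    # should be close to mu = n * p = 50
```

A histogram of `samples` would look approximately normal, with µ = 50 and σ = √(100 · 0.5 · 0.5) = 5.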
References
[1] Data Science from Scratch: First Principles with Python by Joel Grus
Thank You
Any Questions?