DV Stat

Contents

1 Introduction

2 Single set of data

3 Histogram

4 Dispersion

5 Correlation

Introduction

Statistics refers to the mathematics and techniques with which we understand data.
Using statistics, we will try to understand the distribution of data, which in turn can be applied to form good machine learning models.

Describing a Single Set of Data

One obvious description of any dataset is simply the data itself:
E.g.: num_friends = [100, 49, 41, 40, 25, ...]
For a small enough dataset, this might even be the best description.
But for a larger dataset, this is unwieldy and probably opaque.

Histogram

As a first approach, we can create a histogram using Counter and plt.bar.
As an example, we put the friend counts taken above into a histogram using the following code:

from collections import Counter
import matplotlib.pyplot as plt

friend_counts = Counter(num_friends)
xs = range(101)                       # largest value is 100
ys = [friend_counts[x] for x in xs]   # height is # of people with that many friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])             # x-axis and y-axis limits
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()

Some statistics on data

Probably the simplest statistic is the number of data points:

num_points = len(num_friends)

We might also be interested in the largest and smallest values:

largest_value = max(num_friends)
smallest_value = min(num_friends)

To find the values in specific positions, we can sort the data first:

sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]
second_smallest_value = sorted_values[1]
second_largest_value = sorted_values[-2]

Central Tendencies

Usually, we’ll want some notion of where our data is centered.

Mean
Most commonly we’ll use the mean (or average), which is just the sum of the data divided by its count:

Mean = (x1 + ... + xn) / n

We can implement it as follows:

from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

Central Tendencies (Contd.)

Median
We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).

Note
The data points should be sorted.

Note
Notice that, unlike the mean, the median doesn’t fully depend on every value
in your data. For example, if you make the largest point larger (or the smallest
point smaller), the middle points remain unchanged, which means so does
the median.

Central Tendencies (Contd.)

Defining the median for an odd number of elements:

def median_odd(xs: List[float]) -> float:
    # with an odd count, the middle element is at index len // 2
    return sorted(xs)[len(xs) // 2]

Defining the median for an even number of elements:

def median_even(xs: List[float]) -> float:
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2   # e.g., length 4 => hi_midpoint 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

Combining both to form a common function:

def median(v: List[float]) -> float:
    """Finds the middle-most value of v."""
    return median_even(v) if len(v) % 2 == 0 else median_odd(v)
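A couple of quick hand-checkable tests (small illustrative lists of our own, not the num_friends data):

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2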

Quantile

A generalization of the median is the quantile, which represents the value under which a certain percentile of the data lies (the median represents the value under which 50% of the data lies):

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in xs."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]
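For instance, on a hypothetical ten-element list (our own example):

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert quantile(data, 0.10) == 2    # int(0.10 * 10) = 1, the second-smallest value
assert quantile(data, 0.90) == 10   # int(0.90 * 10) = 9, the largest value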

Mode

The mode is the most commonly occurring value (or values) in a dataset:

def mode(x: List[float]) -> List[float]:
    """Returns a list, since there might be more than one mode."""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]
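A quick check (our own example): a list with two equally common values has two modes.

assert set(mode([1, 2, 2, 3, 3])) == {2, 3}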

Dispersion

Dispersion refers to measures of how spread out our data is.
Typically these are statistics for which values near zero signify not spread out at all and for which large values (whatever that means) signify very spread out.

Range
A very simple measure is the range, which is just the difference between the largest and smallest elements:

def data_range(xs: List[float]) -> float:
    return max(xs) - min(xs)

Variance

Variance
A more complex measure of dispersion is the variance, which is computed as:

S² = Σ (x_i − x̄)² / (n − 1)

(Here, x̄ is the mean of the dataset.)
This can be implemented as:

from scratch.linear_algebra import sum_of_squares

def de_mean(xs: List[float]) -> List[float]:
    """Translate xs by subtracting its mean (so the result has mean 0)."""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def variance(xs: List[float]) -> float:
    """Almost the average squared deviation from the mean."""
    assert len(xs) >= 2, "variance requires at least two elements"
    n = len(xs)
    deviations = de_mean(xs)
    return sum_of_squares(deviations) / (n - 1)
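The import assumes the book's scratch package is on the path; if it isn't, a one-line stand-in behaves the same way:

def sum_of_squares(xs: List[float]) -> float:
    """Returns x_1 * x_1 + ... + x_n * x_n."""
    return sum(x * x for x in xs)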

Standard Deviation

The range is in the same units as the data itself. The variance, on the other hand, has units that are the square of the original units.

Standard Deviation
We often look instead at the standard deviation, which is simply the square root of the variance (and so is back in the original units):

import math

def standard_deviation(xs: List[float]) -> float:
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance(xs))

Inter-quartile Range

A more robust alternative (one less sensitive to outliers) computes the difference between the 75th-percentile value and the 25th-percentile value:

def interquartile_range(xs: List[float]) -> float:
    """Returns the difference between the 75%-ile and the 25%-ile."""
    return quantile(xs, 0.75) - quantile(xs, 0.25)

Covariance

Covariance
Variance measures how a single variable deviates from its mean, whereas covariance measures how two variables vary in tandem from their means:

from scratch.linear_algebra import dot

def covariance(xs: List[float], ys: List[float]) -> float:
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)
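As with sum_of_squares, if the scratch package isn't available, dot is easy to define:

def dot(xs: List[float], ys: List[float]) -> float:
    """Returns x_1 * y_1 + ... + x_n * y_n."""
    return sum(x * y for x, y in zip(xs, ys))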

Covariance (Contd.)

A “large” positive covariance means that x tends to be large when y is large and small when y is small.
A “large” negative covariance means the opposite: that x tends to be small when y is large and vice versa.

Correlation

Correlation
It’s more common to look at the correlation, which divides the covariance by the standard deviations of both variables:

def correlation(xs: List[float], ys: List[float]) -> float:
    """Measures how much xs and ys vary in tandem about their means."""
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    else:
        return 0    # if there's no variation, correlation is zero

Note
The correlation is unitless and always lies between −1 (perfect anti-correlation) and 1 (perfect correlation).

More about Correlation

A correlation of zero indicates that there is no linear relationship between the two variables.
However, there may be other sorts of relationships. For example, if:

x = [-2, -1, 0, 1, 2]
y = [ 2,  1, 0, 1, 2]

then x and y have zero correlation.
But they certainly have a relationship: each element of y equals the absolute value of the corresponding element of x.
Correlation looks for a relationship in which knowing how x_i compares to mean(x) gives us information about how y_i compares to mean(y).
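We can verify this directly with the functions defined earlier (allowing a tolerance for floating-point noise):

x = [-2, -1, 0, 1, 2]
y = [2, 1, 0, 1, 2]   # y[i] == abs(x[i])

# de_mean(x) is symmetric about zero, so the positive and negative
# products inside the covariance cancel out.
assert abs(correlation(x, y)) < 1e-9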

More about Correlation (Contd.)

Correlation tells you nothing about how large the relationship is. For example, the variables:

x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]

are perfectly correlated, but (depending on what you’re measuring) it’s quite possible that this relationship isn’t all that interesting.

Correlation and Causation

If x and y are strongly correlated, that might mean that x causes y, that y causes x, that each causes the other, that some third factor causes both, or nothing at all.
One way to feel more confident about causality is by conducting randomized trials.

Contents

1 Introduction
2 Dependence and independence of events
3 Conditional Probability
4 Bayes’s Theorem
5 Random Variables
6 Continuous Distributions
7 Probability Density Function
8 Cumulative Distribution Function
9 The Normal Distribution
10 The Central Limit Theorem

Introduction

Probability is a way of quantifying the uncertainty associated with events chosen from some universe of events.
Notationally, we write P(E) to mean “the probability of the event E.”

Dependence and independence of events

Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:

P(E, F) = P(E) P(F)

For instance, if we flip a fair coin twice, knowing whether the first flip is heads gives us no information about whether the second flip is heads. These events are independent.
On the other hand, knowing whether the first flip is heads certainly gives us information about whether both flips are tails. (If the first flip is heads, then it's definitely not the case that both flips are tails.) These two events are dependent.
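A quick simulation makes this concrete (a sketch using only the standard library; the variable names are ours):

import random

random.seed(0)
trials = 100_000
flips = [(random.random() < 0.5, random.random() < 0.5)
         for _ in range(trials)]

p_first = sum(first for first, _ in flips) / trials
p_second = sum(second for _, second in flips) / trials
p_both = sum(first and second for first, second in flips) / trials

# For independent events, P(E, F) should be close to P(E) * P(F).
print(p_both, p_first * p_second)   # both approximately 0.25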

Conditional Probability

If two events E and F are not necessarily independent (and if the probability of F is not zero), then we define the probability of E “conditional on F” as:

P(E | F) = P(E, F) / P(F)

We can say that this is the probability that E happens, given that we know that F happens.
We often rewrite this as:

P(E, F) = P(E | F) P(F)

Conditional Probability (Contd.)

When E and F are independent, you can check that this gives:
P(E | F) = P(E)
which is the mathematical way of expressing that knowing F
occurred gives us no additional information about whether E
occurred.

Bayes’s Theorem

Bayes’s theorem is a way of “reversing” conditional probabilities.
Let’s say we need to know the probability of some event E conditional on some other event F occurring, but we only have information about the probability of F conditional on E occurring.
Using the definition of conditional probability twice tells us that:

P(E | F) = P(E, F) / P(F) = P(F | E) P(E) / P(F)

Bayes’s Theorem

The event F can be split into the two mutually exclusive events “F and E” and “F and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:

P(F) = P(F, E) + P(F, ¬E)

so that:

P(E | F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | ¬E) P(¬E)]

which is how Bayes’s theorem is often stated.
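As a worked illustration, consider the classic rare-disease setup (the numbers here are hypothetical):

# P(D) = 0.0001: 1 in 10,000 people has the disease.
# P(T | D) = 0.99: the test is positive for 99% of the sick,
# P(T | ¬D) = 0.01: and falsely positive for 1% of the healthy.
p_d = 0.0001
p_t_given_d = 0.99
p_t_given_not_d = 0.01

# Bayes's theorem: P(D | T) = P(T | D) P(D) / [P(T | D) P(D) + P(T | ¬D) P(¬D)]
p_d_given_t = (p_t_given_d * p_d) / (
    p_t_given_d * p_d + p_t_given_not_d * (1 - p_d))

print(p_d_given_t)   # ~0.0098: even after a positive test, under 1% chance of disease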

Random Variables

A random variable is a variable whose possible values have an associated probability distribution.
E.g.: A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails.
We often talk about the expected value of a random variable, which is the average of its values weighted by their probabilities.
E.g.: The coin flip variable has an expected value of 1/2 (= 0 * 1/2 + 1 * 1/2), and a random variable that is equally likely to take any value in range(10) has an expected value of 4.5.
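A minimal sketch of that computation (the helper name expected_value is ours, not from the slides):

from typing import Dict

def expected_value(distribution: Dict[float, float]) -> float:
    """Average of the values weighted by their probabilities."""
    return sum(value * prob for value, prob in distribution.items())

coin_flip = {0: 0.5, 1: 0.5}
uniform_0_to_9 = {v: 1 / 10 for v in range(10)}

print(expected_value(coin_flip))        # 0.5
print(expected_value(uniform_0_to_9))   # ~4.5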

Continuous Distributions

A coin flip corresponds to a discrete distribution: one that associates positive probability with discrete outcomes.
A continuous distribution describes the probabilities of the possible values of a continuous random variable, i.e., a random variable whose set of possible values is infinite and uncountable.
E.g.: The uniform distribution puts equal weight on all the numbers between 0 and 1.

Probability Density Function

Because there are infinitely many numbers between 0 and 1, the weight the uniform distribution assigns to individual points must necessarily be zero.
For this reason, we represent a continuous distribution with a probability density function (PDF) such that the probability of seeing a value in a certain interval equals the integral of the density function over that interval.
The density function for the uniform distribution is just:

def uniform_pdf(x: float) -> float:
    return 1 if 0 <= x < 1 else 0

Cumulative Distribution Function

We will often be more interested in the cumulative distribution function (CDF), which gives the probability that a random variable is less than or equal to a certain value.
The CDF for the uniform distribution is:

def uniform_cdf(x: float) -> float:
    """Returns the probability that a uniform random variable is <= x."""
    if x < 0:   return 0   # uniform random is never less than 0
    elif x < 1: return x   # e.g., P(X <= 0.4) = 0.4
    else:       return 1   # uniform random is always less than 1

The Normal Distribution

The normal distribution is the classic bell-curve-shaped distribution and is completely determined by two parameters: its mean µ (mu) and its standard deviation σ (sigma).
The mean indicates where the bell is centered, and the standard deviation how “wide” it is.
It has the PDF:

f(x | µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))

Normal Distribution (Contd.)

It can be implemented as:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) /
            (SQRT_TWO_PI * sigma))

When µ = 0 and σ = 1, it’s called the standard normal distribution.
If Z is a standard normal random variable, then X = σZ + µ is also normal, but with mean µ and standard deviation σ.
Conversely, if X is a normal random variable with mean µ and standard deviation σ, then Z = (X − µ)/σ is a standard normal variable.
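In code, that transformation is a one-liner in each direction (a sketch using random.gauss to draw Z):

import random

mu, sigma = 10, 2
z = random.gauss(0, 1)        # a standard normal draw
x = sigma * z + mu            # normal with mean 10 and standard deviation 2
z_again = (x - mu) / sigma    # recovers (essentially) the original draw
assert abs(z_again - z) < 1e-12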

Normal Distribution (Contd.)

The CDF for the normal distribution cannot be written in an “elementary” manner, but we can write it using Python’s math.erf error function:

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2
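Two quick sanity checks on this implementation (our own, not from the slides):

assert normal_cdf(0) == 0.5                            # half the mass lies below the mean
assert 0.68 < normal_cdf(1) - normal_cdf(-1) < 0.69    # ~68% within one sigma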

The Central Limit Theorem

If x1, ..., xn are independent, identically distributed random variables with mean µ and standard deviation σ, and if n is large, then:

(1/n)(x1 + x2 + ... + xn)

is approximately normally distributed with mean µ and standard deviation σ/√n.
Equivalently (but often more usefully),

((x1 + x2 + ... + xn) − nµ) / (σ√n)

is approximately normally distributed with mean 0 and standard deviation 1.

Central Limit Theorem (Contd.)

A Binomial(n, p) random variable is simply the sum of n independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 − p:

import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1 - p."""
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials."""
    return sum(bernoulli_trial(p) for _ in range(n))

Central Limit Theorem (Contd.)

The mean of a Bernoulli(p) variable is p, and its standard deviation is √(p(1 − p)).
The central limit theorem says that as n gets large, a Binomial(n, p) variable is approximately a normal random variable with mean µ = np and standard deviation σ = √(np(1 − p)).
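A small simulation shows the approximation at work (a sketch reusing binomial and normal_cdf from above; n, p, and the trial count are arbitrary choices of ours):

random.seed(0)
n, p = 100, 0.5
mu = n * p                            # 50
sigma = math.sqrt(n * p * (1 - p))    # 5

samples = [binomial(n, p) for _ in range(10_000)]
empirical = sum(s <= 55 for s in samples) / len(samples)
predicted = normal_cdf(55, mu, sigma)

# The values should be close (roughly 0.86 vs. 0.84 here),
# and get closer as n grows.
print(empirical, predicted)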

References

[1] Data Science from Scratch: First Principles with Python by Joel Grus

Thank You
Any Questions?

