DV Stat

Contents

1 Introduction

2 Single set of data

3 Histogram

4 Dispersion

5 Correlation

Introduction

Statistics refers to the mathematics and techniques with which we understand data.
Using statistics, we will try to understand the distribution of data, which in turn can be applied to form good machine learning models.

Describing a Single Set of Data

One obvious description of any dataset is simply the data itself:
E.g.: num_friends = [100, 49, 41, 40, 25, ...]
For a small enough dataset, this might even be the best description.
But for a larger dataset, this is unwieldy and probably opaque.

Histogram

As a first approach, we can create a histogram using Counter and plt.bar.
As an example, we put the friend counts taken above into a histogram using the following code:

from collections import Counter
import matplotlib.pyplot as plt

friend_counts = Counter(num_friends)
xs = range(101)                       # largest value is 100
ys = [friend_counts[x] for x in xs]   # height is # of people with that many friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])             # x-axis and y-axis limits
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()

Some statistics on data

Probably the simplest statistic is the number of data points:

num_points = len(num_friends)

We might also be interested in the largest and smallest values:

largest_value = max(num_friends)
smallest_value = min(num_friends)

To find the values in specific positions, we can sort the data first:

sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]
second_smallest_value = sorted_values[1]
second_largest_value = sorted_values[-2]

Central Tendencies

Usually, we’ll want some notion of where our data is centered.

Mean
Most commonly we’ll use the mean (or average), which is just the sum of the data divided by its count:

Mean = (x1 + ... + xn) / n

We can implement it as follows:

from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

Central Tendencies (Contd.)

Median
We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).

Note
The data points should be sorted.

Note
Notice that, unlike the mean, the median doesn’t fully depend on every value
in your data. For example, if you make the largest point larger (or the smallest
point smaller), the middle points remain unchanged, which means so does
the median.

Central Tendencies (Contd.)

Defining the median for an odd number of elements:

def median_odd(xs: List[float]) -> float:
    # with an odd count, the middle element is at index len // 2
    return sorted(xs)[len(xs) // 2]

Defining the median for an even number of elements:

def median_even(xs: List[float]) -> float:
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2   # e.g., length 4 => hi_midpoint 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

Combining both to form a common function:

def median(v: List[float]) -> float:
    """Finds the middle-most value of v."""
    return median_even(v) if len(v) % 2 == 0 else median_odd(v)
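A couple of quick hand-checkable tests (small illustrative lists of our own, not the num_friends data):

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2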

Quantile

A generalization of the median is the quantile, which represents the value under which a certain percentile of the data lies (the median represents the value under which 50% of the data lies):

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in xs."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]
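For instance, on a hypothetical ten-element list (our own example):

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert quantile(data, 0.10) == 2    # int(0.10 * 10) = 1, the second-smallest value
assert quantile(data, 0.90) == 10   # int(0.90 * 10) = 9, the largest value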

Mode

The mode is the most commonly occurring value (or values) in a dataset:

def mode(x: List[float]) -> List[float]:
    """Returns a list, since there might be more than one mode."""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]
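A quick check (our own example): a list with two equally common values has two modes.

assert set(mode([1, 2, 2, 3, 3])) == {2, 3}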

Dispersion

Dispersion refers to measures of how spread out our data is.
Typically these are statistics for which values near zero signify not spread out at all and for which large values (whatever that means) signify very spread out.

Range
A very simple measure is the range, which is just the difference between the largest and smallest elements:

def data_range(xs: List[float]) -> float:
    return max(xs) - min(xs)

Variance

Variance
A more complex measure of dispersion is the variance, which is computed as:

S² = Σ (x_i − x̄)² / (n − 1)

(Here, x̄ is the mean of the dataset.)
This can be implemented as:

from scratch.linear_algebra import sum_of_squares

def de_mean(xs: List[float]) -> List[float]:
    """Translate xs by subtracting its mean (so the result has mean 0)."""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def variance(xs: List[float]) -> float:
    """Almost the average squared deviation from the mean."""
    assert len(xs) >= 2, "variance requires at least two elements"
    n = len(xs)
    deviations = de_mean(xs)
    return sum_of_squares(deviations) / (n - 1)
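The import assumes the book's scratch package is on the path; if it isn't, a one-line stand-in behaves the same way:

def sum_of_squares(xs: List[float]) -> float:
    """Returns x_1 * x_1 + ... + x_n * x_n."""
    return sum(x * x for x in xs)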

Standard Deviation

The range is in the same units as the data itself. The variance, on the other hand, has units that are the square of the original units.

Standard Deviation
We often look instead at the standard deviation, which is simply the square root of the variance (and so is back in the original units):

import math

def standard_deviation(xs: List[float]) -> float:
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance(xs))

Inter-quartile Range

A more robust alternative (one less sensitive to outliers) computes the difference between the 75th-percentile value and the 25th-percentile value:

def interquartile_range(xs: List[float]) -> float:
    """Returns the difference between the 75%-ile and the 25%-ile."""
    return quantile(xs, 0.75) - quantile(xs, 0.25)

Covariance

Covariance
Variance measures how a single variable deviates from its mean, whereas covariance measures how two variables vary in tandem from their means:

from scratch.linear_algebra import dot

def covariance(xs: List[float], ys: List[float]) -> float:
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)
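As with sum_of_squares, if the scratch package isn't available, dot is easy to define:

def dot(xs: List[float], ys: List[float]) -> float:
    """Returns x_1 * y_1 + ... + x_n * y_n."""
    return sum(x * y for x, y in zip(xs, ys))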

Covariance (Contd.)

A “large” positive covariance means that x tends to be large when y is large and small when y is small.
A “large” negative covariance means the opposite: that x tends to be small when y is large and vice versa.

Correlation

Correlation
It’s more common to look at the correlation, which divides the covariance by the standard deviations of both variables:

def correlation(xs: List[float], ys: List[float]) -> float:
    """Measures how much xs and ys vary in tandem about their means."""
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    else:
        return 0    # if there's no variation, correlation is zero

Note
The correlation is unitless and always lies between −1 (perfect anti-correlation) and 1 (perfect correlation).

More about Correlation

A correlation of zero indicates that there is no linear relationship between the two variables.
However, there may be other sorts of relationships. For example, if:

x = [-2, -1, 0, 1, 2]
y = [ 2,  1, 0, 1, 2]

then x and y have zero correlation.
But they certainly have a relationship: each element of y equals the absolute value of the corresponding element of x.
Correlation looks for a relationship in which knowing how x_i compares to mean(x) gives us information about how y_i compares to mean(y).
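We can verify this directly with the functions defined earlier (allowing a tolerance for floating-point noise):

x = [-2, -1, 0, 1, 2]
y = [2, 1, 0, 1, 2]   # y[i] == abs(x[i])

# de_mean(x) is symmetric about zero, so the positive and negative
# products inside the covariance cancel out.
assert abs(correlation(x, y)) < 1e-9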

More about Correlation (Contd.)

Correlation tells you nothing about how large the relationship is. For example, the variables:

x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]

are perfectly correlated, but (depending on what you’re measuring) it’s quite possible that this relationship isn’t all that interesting.

Correlation and Causation

If x and y are strongly correlated, that might mean that x causes y, that y causes x, that each causes the other, that some third factor causes both, or nothing at all.
One way to feel more confident about causality is by conducting randomized trials.

Contents

1 Introduction
2 Dependence and independence of events
3 Conditional Probability
4 Bayes’s Theorem
5 Random Variables
6 Continuous Distributions
7 Probability Density Function
8 Cumulative Distribution Function
9 The Normal Distribution
10 The Central Limit Theorem

Introduction

Probability is a way of quantifying the uncertainty associated with events chosen from some universe of events.
Notationally, we write P(E) to mean “the probability of the event E.”

Dependence and independence of events

Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:

P(E, F) = P(E) P(F)

For instance, if we flip a fair coin twice, knowing whether the first flip is heads gives us no information about whether the second flip is heads. These events are independent.
On the other hand, knowing whether the first flip is heads certainly gives us information about whether both flips are tails. (If the first flip is heads, then it's definitely not the case that both flips are tails.) These two events are dependent.
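A quick simulation makes this concrete (a sketch using only the standard library; the variable names are ours):

import random

random.seed(0)
trials = 100_000
flips = [(random.random() < 0.5, random.random() < 0.5)
         for _ in range(trials)]

p_first = sum(first for first, _ in flips) / trials
p_second = sum(second for _, second in flips) / trials
p_both = sum(first and second for first, second in flips) / trials

# For independent events, P(E, F) should be close to P(E) * P(F).
print(p_both, p_first * p_second)   # both approximately 0.25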

Conditional Probability

If two events E and F are not necessarily independent (and if the probability of F is not zero), then we define the probability of E “conditional on F” as:

P(E | F) = P(E, F) / P(F)

We can say that this is the probability that E happens, given that we know that F happens.
We often rewrite this as:

P(E, F) = P(E | F) P(F)

Conditional Probability (Contd.)

When E and F are independent, you can check that this gives:
P(E | F) = P(E)
which is the mathematical way of expressing that knowing F
occurred gives us no additional information about whether E
occurred.

Bayes’s Theorem

Bayes’s theorem is a way of “reversing” conditional probabilities.
Let’s say we need to know the probability of some event E conditional on some other event F occurring, but we only have information about the probability of F conditional on E occurring.
Using the definition of conditional probability twice tells us that:

P(E | F) = P(E, F) / P(F) = P(F | E) P(E) / P(F)

Bayes’s Theorem

The event F can be split into the two mutually exclusive events “F and E” and “F and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:

P(F) = P(F, E) + P(F, ¬E)

so that:

P(E | F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | ¬E) P(¬E)]

which is how Bayes’s theorem is often stated.
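As a worked illustration, consider the classic rare-disease setup (the numbers here are hypothetical):

# P(D) = 0.0001: 1 in 10,000 people has the disease.
# P(T | D) = 0.99: the test is positive for 99% of the sick,
# P(T | ¬D) = 0.01: and falsely positive for 1% of the healthy.
p_d = 0.0001
p_t_given_d = 0.99
p_t_given_not_d = 0.01

# Bayes's theorem: P(D | T) = P(T | D) P(D) / [P(T | D) P(D) + P(T | ¬D) P(¬D)]
p_d_given_t = (p_t_given_d * p_d) / (
    p_t_given_d * p_d + p_t_given_not_d * (1 - p_d))

print(p_d_given_t)   # ~0.0098: even after a positive test, under 1% chance of disease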

Random Variables

A random variable is a variable whose possible values have an associated probability distribution.
E.g.: A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails.
We often talk about the expected value of a random variable, which is the average of its values weighted by their probabilities.
E.g.: The coin flip variable has an expected value of 1/2 (= 0 * 1/2 + 1 * 1/2), and a random variable that is equally likely to take any value in range(10) has an expected value of 4.5.
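A minimal sketch of that computation (the helper name expected_value is ours, not from the slides):

from typing import Dict

def expected_value(distribution: Dict[float, float]) -> float:
    """Average of the values weighted by their probabilities."""
    return sum(value * prob for value, prob in distribution.items())

coin_flip = {0: 0.5, 1: 0.5}
uniform_0_to_9 = {v: 1 / 10 for v in range(10)}

print(expected_value(coin_flip))        # 0.5
print(expected_value(uniform_0_to_9))   # ~4.5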

Continuous Distributions

A coin flip corresponds to a discrete distribution: one that associates positive probability with discrete outcomes.
A continuous distribution describes the probabilities of the possible values of a continuous random variable, i.e., a random variable whose set of possible values is infinite and uncountable.
E.g.: The uniform distribution puts equal weight on all the numbers between 0 and 1.

Probability Density Function

Because there are infinitely many numbers between 0 and 1, the weight the uniform distribution assigns to individual points must necessarily be zero.
For this reason, we represent a continuous distribution with a probability density function (PDF) such that the probability of seeing a value in a certain interval equals the integral of the density function over that interval.
The density function for the uniform distribution is just:

def uniform_pdf(x: float) -> float:
    return 1 if 0 <= x < 1 else 0

Cumulative Distribution Function

We will often be more interested in the cumulative distribution function (CDF), which gives the probability that a random variable is less than or equal to a certain value.
The CDF for the uniform distribution is:

def uniform_cdf(x: float) -> float:
    """Returns the probability that a uniform random variable is <= x."""
    if x < 0:   return 0   # uniform random is never less than 0
    elif x < 1: return x   # e.g., P(X <= 0.4) = 0.4
    else:       return 1   # uniform random is always less than 1

The Normal Distribution

The normal distribution is the classic bell-curve-shaped distribution and is completely determined by two parameters: its mean µ (mu) and its standard deviation σ (sigma).
The mean indicates where the bell is centered, and the standard deviation how “wide” it is.
It has the PDF:

f(x | µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))

Normal Distribution (Contd.)

It can be implemented as:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) /
            (SQRT_TWO_PI * sigma))

When µ = 0 and σ = 1, it’s called the standard normal distribution.
If Z is a standard normal random variable, then X = σZ + µ is also normal, but with mean µ and standard deviation σ.
Conversely, if X is a normal random variable with mean µ and standard deviation σ, then Z = (X − µ)/σ is a standard normal variable.
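In code, that transformation is a one-liner in each direction (a sketch using random.gauss to draw Z):

import random

mu, sigma = 10, 2
z = random.gauss(0, 1)        # a standard normal draw
x = sigma * z + mu            # normal with mean 10 and standard deviation 2
z_again = (x - mu) / sigma    # recovers (essentially) the original draw
assert abs(z_again - z) < 1e-12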

Normal Distribution (Contd.)

The CDF for the normal distribution cannot be written in an “elementary” manner, but we can write it using Python’s math.erf error function:

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2
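Two quick sanity checks on this implementation (our own, not from the slides):

assert normal_cdf(0) == 0.5                            # half the mass lies below the mean
assert 0.68 < normal_cdf(1) - normal_cdf(-1) < 0.69    # ~68% within one sigma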

The Central Limit Theorem

If x1, ..., xn are independent, identically distributed random variables with mean µ and standard deviation σ, and if n is large, then:

(1/n)(x1 + x2 + ... + xn)

is approximately normally distributed with mean µ and standard deviation σ/√n.
Equivalently (but often more usefully),

((x1 + x2 + ... + xn) − nµ) / (σ√n)

is approximately normally distributed with mean 0 and standard deviation 1.

Central Limit Theorem (Contd.)

A Binomial(n, p) random variable is simply the sum of n independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 − p:

import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1 - p."""
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials."""
    return sum(bernoulli_trial(p) for _ in range(n))

Central Limit Theorem (Contd.)

The mean of a Bernoulli(p) variable is p, and its standard deviation is √(p(1 − p)).
The central limit theorem says that as n gets large, a Binomial(n, p) variable is approximately a normal random variable with mean µ = np and standard deviation σ = √(np(1 − p)).
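A small simulation shows the approximation at work (a sketch reusing binomial and normal_cdf from above; n, p, and the trial count are arbitrary choices of ours):

random.seed(0)
n, p = 100, 0.5
mu = n * p                            # 50
sigma = math.sqrt(n * p * (1 - p))    # 5

samples = [binomial(n, p) for _ in range(10_000)]
empirical = sum(s <= 55 for s in samples) / len(samples)
predicted = normal_cdf(55, mu, sigma)

# The values should be close (roughly 0.86 vs. 0.84 here),
# and get closer as n grows.
print(empirical, predicted)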

References

[1] Data Science from Scratch: First Principles with Python by Joel Grus

Thank You
Any Questions?

