Probability Mass Functions: Allen Downey

This document discusses probability mass functions (PMFs), cumulative distribution functions (CDFs), and kernel density estimation (KDE) for exploratory data analysis in Python. It uses survey data from the General Social Survey (GSS) to create PMFs and CDFs of variables like age, income, and education. PMFs show the probability of getting each unique value, while CDFs show the cumulative probability. KDE can model continuous distributions and compare them to theoretical distributions like the normal curve. Together these techniques provide insights into a dataset's distributions.

Probability mass functions
EXPLORATORY DATA ANALYSIS IN PYTHON

Allen Downey
Professor, Olin College
GSS (General Social Survey)
Annual sample of U.S. population.

Asks about demographics, social and political beliefs.

Widely used by policy makers and researchers.



Read the data
import pandas as pd

gss = pd.read_hdf('gss.hdf5', 'gss')

gss.head()

   year  sex   age  cohort  race  educ  realinc  wtssall
0  1972    1  26.0  1946.0     1  18.0  13537.0   0.8893
1  1972    2  38.0  1934.0     1  12.0  18951.0   0.4446
2  1972    1  57.0  1915.0     1  12.0  30458.0   1.3339
3  1972    2  61.0  1911.0     1  14.0  37226.0   0.8893
4  1972    1  59.0  1913.0     1  12.0  30458.0   0.8893



import matplotlib.pyplot as plt

educ = gss['educ']
plt.hist(educ.dropna(), label='educ')
plt.show()



PMF
pmf_educ = Pmf(educ, normalize=False)
pmf_educ.head()

0.0 566
1.0 118
2.0 292
3.0 686
4.0 746
Name: educ, dtype: int64



PMF
pmf_educ[12]

47689



pmf_educ = Pmf(educ, normalize=True)

pmf_educ.head()

0.0 0.003663
1.0 0.000764
2.0 0.001890
3.0 0.004440
4.0 0.004828
Name: educ, dtype: float64

pmf_educ[12]

0.30863869940587907
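
The two versions of the PMF above differ only in normalization. As a rough sketch of the underlying computation, the block below uses the standard library instead of the course's `Pmf` class, and a small made-up list `years` stands in for the `educ` column (it is not GSS data).

```python
from collections import Counter

# Hypothetical stand-in for gss['educ']; these are not real GSS values.
years = [12, 12, 16, 12, 14, 16, 12, 18]

# Unnormalized PMF: how many times each value appears.
counts = Counter(years)

# Normalized PMF: divide each count by the total so the values sum to 1.
total = sum(counts.values())
pmf = {value: count / total for value, count in sorted(counts.items())}

print(counts[12])  # 4
print(pmf[12])     # 0.5
```

Looking up a value in the unnormalized version gives a count; looking it up in the normalized version gives a fraction of the whole sample.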



pmf_educ.bar(label='educ')
plt.xlabel('Years of education')
plt.ylabel('PMF')
plt.show()



Histogram vs. PMF



Let's make some PMFs!

Cumulative distribution functions

From PMF to CDF
If you draw a random element from a distribution:

PMF (Probability Mass Function) is the probability that you get exactly x

CDF (Cumulative Distribution Function) is the probability that you get a value <= x

for a given value of x.



Example
PMF of {1, 2, 2, 3, 5}. The CDF is the cumulative sum of the PMF.

x   PMF(x)   CDF(x)
1   1/5      1/5
2   2/5      3/5
3   1/5      4/5
5   1/5      1
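
The example above can be reproduced in a few lines. This is a minimal sketch using exact fractions from the standard library, not the course's `Pmf` and `Cdf` classes; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

data = [1, 2, 2, 3, 5]

# PMF: share of the sample taken by each distinct value, in sorted order.
counts = Counter(data)
xs = sorted(counts)
pmf = [Fraction(counts[x], len(data)) for x in xs]

# CDF: running (cumulative) sum of the PMF.
cdf = list(accumulate(pmf))

for x, p, c in zip(xs, pmf, cdf):
    print(f"x={x}  PMF={p}  CDF={c}")
```

The last CDF value is always 1, because every element of the sample is less than or equal to the largest value.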



cdf = Cdf(gss['age'])
cdf.plot()
plt.xlabel('Age')
plt.ylabel('CDF')
plt.show()



Evaluating the CDF
q = 51
p = cdf(q)
print(p)

0.66



Evaluating the inverse CDF
p = 0.25
q = cdf.inverse(p)
print(q)

30

p = 0.75
q = cdf.inverse(p)
print(q)

57
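
Evaluating a CDF maps a quantity to a probability; the inverse maps a probability back to a quantity. The sketch below shows one simple way to do both on a sorted sample. The helper names `cdf_eval` and `cdf_inverse` and the list `ages` are hypothetical, and the course's `cdf.inverse` may handle ties or interpolation differently.

```python
from bisect import bisect_right
from math import ceil

def cdf_eval(sorted_values, q):
    """Fraction of values less than or equal to q."""
    return bisect_right(sorted_values, q) / len(sorted_values)

def cdf_inverse(sorted_values, p):
    """Smallest value whose CDF is at least p (for 0 < p <= 1)."""
    index = max(ceil(p * len(sorted_values)) - 1, 0)
    return sorted_values[index]

# Hypothetical ages chosen for illustration; not GSS data.
ages = sorted([22, 25, 30, 30, 41, 47, 51, 57, 63, 70])

p = cdf_eval(ages, 51)          # probability of a value <= 51
q25 = cdf_inverse(ages, 0.25)   # 25th percentile
q75 = cdf_inverse(ages, 0.75)   # 75th percentile
print(p, q25, q75, q75 - q25)
```

The difference between the 75th and 25th percentiles is the interquartile range, a measure of spread that is less sensitive to extreme values than the full range.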



Let's practice!

Comparing distributions

Multiple PMFs
male = gss['sex'] == 1
age = gss['age']
male_age = age[male]
female_age = age[~male]
Pmf(male_age).plot(label='Male')
Pmf(female_age).plot(label='Female')
plt.xlabel('Age (years)')
plt.ylabel('PMF')
plt.show()

Multiple CDFs
Cdf(male_age).plot(label='Male')
Cdf(female_age).plot(label='Female')

plt.xlabel('Age (years)')
plt.ylabel('CDF')
plt.show()
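
Two CDFs plotted together can also be compared numerically by reading both at the same value. The sketch below splits a handful of made-up `(sex, age)` records the same way the code above does (1 for male, everything else treated as female) and compares the groups at a few ages. The records and the helper `cdf_at` are assumptions for illustration only.

```python
from bisect import bisect_right

# Hypothetical (sex, age) records; not GSS data.
records = [(1, 23), (2, 31), (1, 45), (2, 52),
           (1, 38), (2, 67), (2, 29), (1, 71)]

# Split by group, mirroring age[male] and age[~male].
male_age = sorted(age for sex, age in records if sex == 1)
female_age = sorted(age for sex, age in records if sex != 1)

def cdf_at(sorted_values, q):
    """Fraction of values less than or equal to q."""
    return bisect_right(sorted_values, q) / len(sorted_values)

# Read both CDFs at the same ages to see where the groups differ.
for q in (30, 50, 70):
    print(q, cdf_at(male_age, q), cdf_at(female_age, q))
```

Where one group's CDF sits higher at a given age, a larger share of that group is at or below that age.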

Income distribution
income = gss['realinc']
pre95 = gss['year'] < 1995
Pmf(income[pre95]).plot(label='Before 1995')
Pmf(income[~pre95]).plot(label='After 1995')
plt.xlabel('Income (1986 USD)')
plt.ylabel('PMF')
plt.show()

Income CDFs
Cdf(income[pre95]).plot(label='Before 1995')
Cdf(income[~pre95]).plot(label='After 1995')

Let's practice!

Modeling distributions

The normal distribution
import numpy as np

sample = np.random.normal(size=1000)
Cdf(sample).plot()



The normal CDF
from scipy.stats import norm

xs = np.linspace(-3, 3)
ys = norm(0, 1).cdf(xs)

plt.plot(xs, ys, color='gray')

Cdf(sample).plot()
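
Overlaying the sample CDF on the model CDF is a visual check of how well the normal distribution fits. A numeric version of the same comparison might look like the sketch below, which uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm` and measures the largest gap between the two curves on a grid. The seed, the grid, and the helper `empirical_cdf` are arbitrary choices for illustration.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

random.seed(0)  # arbitrary seed so the run is repeatable
sample = sorted(random.gauss(0, 1) for _ in range(1000))

model = NormalDist(mu=0, sigma=1)

def empirical_cdf(q):
    """Fraction of the sample less than or equal to q."""
    return bisect_right(sample, q) / len(sample)

# 50 evenly spaced points from -3 to 3, like np.linspace(-3, 3).
grid = [-3 + 6 * i / 49 for i in range(50)]

# Largest vertical distance between the sample CDF and the model CDF.
max_gap = max(abs(empirical_cdf(x) - model.cdf(x)) for x in grid)
print(round(max_gap, 3))
```

A small gap means the two curves nearly coincide, which is what the overlaid plot shows for a sample drawn from the model itself.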

The bell curve
xs = np.linspace(-3, 3)
ys = norm(0,1).pdf(xs)
plt.plot(xs, ys, color='gray')

KDE plot
import seaborn as sns
sns.kdeplot(sample)



KDE and PDF
xs = np.linspace(-3, 3)
ys = norm.pdf(xs)
plt.plot(xs, ys, color='gray')
sns.kdeplot(sample)
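
Kernel density estimation builds a smooth curve by placing a small bump on each data point and averaging the bumps. The sketch below does this by hand with Gaussian bumps and a rule-of-thumb bandwidth, then compares the result with the normal PDF at a few points. It is an illustration of the idea only; the bandwidth rule here is an assumption and is not necessarily what `sns.kdeplot` uses.

```python
import random
from math import exp, pi, sqrt
from statistics import NormalDist, stdev

random.seed(0)  # arbitrary seed so the run is repeatable
sample = [random.gauss(0, 1) for _ in range(1000)]

# Rule-of-thumb bandwidth (assumed here): sample std times n ** (-1/5).
bandwidth = stdev(sample) * len(sample) ** (-1 / 5)

def kde(x):
    """Average of Gaussian bumps of width `bandwidth` centered on each point."""
    total = sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return total / (len(sample) * bandwidth * sqrt(2 * pi))

# Compare the estimated density with the model PDF at a few points.
model = NormalDist(0, 1)
for x in (-2, -1, 0, 1, 2):
    print(x, round(kde(x), 3), round(model.pdf(x), 3))
```

A smaller bandwidth follows the sample more closely and looks bumpier; a larger one gives a smoother but flatter curve.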



PMF, CDF, KDE
Use CDFs for exploration.

Use PMFs if there are a small number of unique values.

Use KDE if there are a lot of values.



Let's practice!