Unit 4 - Statistical Thinking

This document discusses statistical distributions and methods for visualizing and summarizing them. It covers histograms, probability mass functions, and descriptive statistics like mean, variance, and standard deviation. Histograms map values to frequencies while probability mass functions map values to probabilities. The document shows examples of plotting histograms and PMFs in Python using libraries like Pandas and Seaborn. It also discusses identifying outliers and comparing multiple distributions to analyze phenomena like the class size paradox.


Statistical Thinking

Unit 4

By
Dr. G. Sunitha
Professor
Department of AI & ML

School of Computing

Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102


Distribution of a Variable

• One of the best ways to describe a variable is to report the distribution of the
variable.

– The values that appear in a sample and the frequency of each.

• The most common representation of a distribution is a histogram, which is a
graph that shows the frequency of each value.

– Graph that shows mapping from values to frequencies.

• A histogram is a complete description of the distribution of a sample; that is,
given a histogram, we could reconstruct the values in the sample (although not
their order).

2
Representing Histograms
# In Python, an efficient way to compute frequencies is with a dictionary.

# initializing the list


lst = ['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B']

# creating an empty dictionary
hist = {}

# looping through the list to find the frequency of each element
for x in lst:
    hist[x] = hist.get(x, 0) + 1

# displaying the frequencies
print(hist)

Output:
{'A': 3, 'B': 3, 'C': 1, 'D': 2}

3
Representing Histograms . . .
# Alternatively, Counter class from the collections module can be used.

# importing the module


from collections import Counter

# initializing the list
lst = ['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B']

# using Counter to find the frequency of each element
hist = Counter(lst)

# displaying the frequencies
print(hist)

Output:
Counter({'A': 3, 'B': 3, 'C': 1, 'D': 2})

4
Representing Histograms . . .
# Alternatively, value_counts() method from Pandas library can be used.

import pandas as pd

# initializing the Series object


lst = pd.Series(['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B'])

# displaying the frequencies
print(lst.value_counts())

Output:
A    3
B    3
D    2
C    1

5
Plotting Histograms
National Survey of Family Growth (NSFG) Data - This dataset contains information
on marriage, pregnancy, infertility, contraception use, reproductive health and
family life.

6
Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Birth Weight(Pounds)')
plt.ylabel('Frequency')

# Visualizing histogram
sns.histplot(data = df ['birthwgt_lb1'] , bins = [0,1,2,3,4,5,6,7,8,9,10,11,12,13])
7
Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Birth Weight(Ounces)')
plt.ylabel('Frequency')

# Visualizing Histogram
sns.histplot(data = df ['birthwgt_oz1'] , bins = [0,1,2,3,4,5,6,7,8,9,10,11,12,13] )

8
Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels and title


plt.xlabel('Pregnancy Length')
plt.ylabel('Frequency')

# Visualizing Histogram (the mode of the pregnancy-length distribution is 40 weeks)
sns.histplot(data = df ['prglngth'] )
9
Outliers
• Looking at histograms, it is easy to identify the most common values and the
shape of the distribution, but rare values are not always visible.

• Histograms are useful because they make the most frequent values immediately
apparent. But they are not the best choice for comparing two distributions.

• It is a good idea to check for outliers, which are extreme values that might be
errors in measurement or recording, or might be accurate reports of rare
events.

• The best way to handle outliers depends on “domain knowledge”; that is,
information about where the data come from and what they mean. It also
depends on what analysis you are planning to perform; a small percentile-based
check is sketched below.
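A minimal sketch of one such check (not part of the NSFG analysis itself): flag values
outside the 1st–99th percentile range as candidate outliers. The column and the cutoffs
are only illustrative assumptions.

import pandas as pd

# assumed: the NSFG data loaded as in the earlier slides
df = pd.read_csv("D://nsfg.csv")
lengths = df['prglngth'].dropna()

# flag values below the 1st or above the 99th percentile as candidate outliers
low, high = lengths.quantile([0.01, 0.99])
outliers = lengths[(lengths < low) | (lengths > high)]
print(outliers.value_counts().sort_index())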

10
Outliers

[Figure: a distribution with its tails (the low and high extremes) marked.]

11
Summarizing Distributions
• A histogram is a complete description of the distribution of a sample; that is,
given a histogram, we could reconstruct the values in the sample (although not
their order).
• Often we want to summarize the distribution with a few descriptive statistics.
• Central Tendency
– A characteristic of a sample or population; intuitively, it is an average or
typical value.
– Do the values tend to cluster around a particular point?
• Mode
– The most frequent value in a sample, or one of the most frequent values.
– Is there more than one cluster?
• Spread
– A measure of how spread out the values in a distribution are.
– How much variability is there in the values?

12
Summarizing Distributions . . .
• Tails
– The part of a distribution at the high and low extremes.
– How quickly do the probabilities drop off as we move away from the
modes?


• Outliers
– A value far from the central tendency.
– Are there extreme values far from the modes?

13
Summarizing Distributions . . .
• Summary Statistic
– A statistic that quantifies some aspect of a distribution, like central tendency
or spread.
• Variance
– A summary statistic often used to quantify spread.
• Standard Deviation
– The square root of variance, also used as a measure of spread.
• Effect Size
– A summary statistic intended to quantify the size of an effect like a
difference between groups.
• Normal Distribution
– An idealization of a bell-shaped distribution; also known as a Gaussian
distribution.
• Uniform Distribution
– A distribution in which all values have the same frequency.

14
Summarizing Distributions . . .
• Statistics designed to answer these questions are called Summary Statistics.

• By far the most common summary statistic is the Mean, which is meant to
describe the central tendency of the distribution.

15
Variance
• If there is no single number that summarizes a variable, we can do a little better
with two numbers: Mean and Variance.

• Variance is a summary statistic intended to describe the variability or spread of a
distribution.

• The variance of a set of values is

        S² = (1/n) Σ (xᵢ − x̄)²

• The term (xᵢ − x̄) is called the “deviation from the mean,” so variance is the
mean squared deviation.

• The square root of variance, S, is the Standard Deviation.

• Pandas data structures provide methods to compute the mean, variance and
standard deviation, as sketched below.
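A minimal sketch of these Pandas methods on a small made-up dataset. Note that Pandas
divides by n − 1 by default (sample variance), so ddof = 0 is passed here to match the
mean-squared-deviation formula above.

import pandas as pd

D = pd.Series([30, 33, 43, 55, 69, 72, 85, 88, 91, 93])

print("Mean:", D.mean())
print("Variance:", D.var(ddof = 0))              # mean squared deviation, as defined above
print("Standard Deviation:", D.std(ddof = 0))    # square root of the variance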

16
Probability Mass Functions (PMF)

• Probability Mass Function (PMF) maps each value to its probability.


• Probability is a frequency expressed as a fraction of the sample size, n.
• Normalization is the process of dividing a frequency by a sample size to get a
probability.
• The biggest difference is that a histogram maps values to integer counts, while a
PMF maps values to floating-point probabilities (see the sketch below).
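A minimal sketch of normalization, reusing the list from the earlier histogram slides:
dividing each frequency by the sample size n turns the histogram into a PMF.

from collections import Counter

lst = ['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B']

# histogram: value -> integer count
hist = Counter(lst)

# normalization: divide each frequency by the sample size n
n = len(lst)
pmf = {value: count / n for value, count in hist.items()}
print(pmf)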

17
Plotting PMFs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels and title


plt.xlabel('Pregnancy Length')
plt.ylabel('Probability')

# Visualizing PMF
sns.histplot(data = df['prglngth'] , stat = 'probability' )

18
The Class Size Paradox - Plotting Multiple PMFs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Actual vs Predicted Marks')
plt.ylabel('Probability')

# creating data
Actual = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5 ]
Predicted = [7, 24, 25, 36, 34, 13, 15, 45, 45, 44, 43, 43, 34, 13 ]

# Visualizing Multiple PMFs


sns.histplot(data = Actual , stat = 'probability', element = 'step' )
sns.histplot(data = Predicted , stat = 'probability', element = 'step' )

19
The Class Size Paradox - Plotting Multiple PMFs . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Actual vs Predicted Marks')
plt.ylabel('Probability')

# creating data
Actual = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5 ]
Predicted = [7, 24, 25, 36, 34, 13, 15, 45, 45, 44, 43, 43, 34, 13 ]

# Visualizing Multiple PMFs


sns.histplot(data = Actual , stat = 'probability', element = 'step' , fill = False)
sns.histplot(data = Predicted , stat = 'probability', element = 'step' , fill = False)

20
DataFrame Indexing
import numpy as np
import pandas

# Generating a 4x2 dataset of numbers


array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]

# Creating dataframe
df = pandas.DataFrame(array)

# displaying dataframe
df

21
DataFrame Indexing . . .
• By default, the rows and columns are numbered starting at zero, but names can be
provided to columns.

import numpy as np
import pandas

# Generating a 4x2 dataset of numbers


array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]

# Providing names to 2 columns


columns = ['A', 'B']

# Creating dataframe with column names


df = pandas.DataFrame(array, columns=columns)

# displaying dataframe
df

22
DataFrame Indexing . . .
• Names can be provided for rows also. The row names themselves are called
labels. The set of row names is called the index.
import numpy as np
import pandas

# Generating a 4x2 dataset of numbers


array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]

# Providing names to 2 columns


columns = ['A', 'B']

# Providing names to 4 rows


rows = ['a', 'b', 'c', 'd']

# Creating dataframe with column names


df = pandas.DataFrame(array, columns=columns, index=rows)

# displaying dataframe
df
23
DataFrame Indexing . . .
• Simple indexing selects a column, returning a Series.

• If the integer position of a row is known, the iloc attribute can be used. It
returns a Series object. For example:
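A small sketch of both kinds of selection, rebuilding the DataFrame from the previous
slide so the snippet is self-contained.

import pandas as pd

array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]
df = pd.DataFrame(array, columns = ['A', 'B'], index = ['a', 'b', 'c', 'd'])

# simple indexing selects column 'A' as a Series
print(df['A'])

# iloc selects the row at integer position 0 (the first row) as a Series
print(df.iloc[0])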

24
DataFrame Indexing . . .
• To select a row by label, the loc attribute can be used. It returns a Series
object.

• The loc attribute can also take a list of labels; in that case, the result is a
DataFrame. For example:
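A sketch of loc, assuming the same df built on the previous slides:

# selecting the row with label 'a' returns a Series
print(df.loc['a'])

# a list of labels returns a DataFrame with those rows
print(df.loc[['a', 'c']])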

25
DataFrame Indexing . . .
• DataFrame can be sliced to select a range of rows by label or by integer
position.
• The result in either case is a DataFrame, but notice that slicing by label includes
the end of the slice, while slicing by integer position does not, as shown below.
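A sketch of both kinds of slicing, assuming the same df built on the previous slides:

# slicing by label includes the end of the slice: rows 'a', 'b' and 'c'
print(df.loc['a':'c'])

# slicing by integer position excludes it: rows at positions 0 and 1 only
print(df.iloc[0:2])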

26
Limits of PMFs
• PMFs work well if the number of values is small. But as the number of values
increases, the probability associated with each value gets smaller and the effect
of random noise increases.

27
Limits of PMFs . . .
• Overall, these distributions resemble the bell
shape of a normal distribution, with many
values near the mean and a few values
much higher and lower.
• But parts of this figure are hard to interpret.
• There are many spikes and valleys, and
some apparent differences between the
distributions. It is hard to tell which of these
features are meaningful.
• Also, it is hard to see overall patterns; for
example, which distribution do you think has
the higher mean?
• These problems can be mitigated by binning
the data; that is, dividing the range of values
into non-overlapping intervals and counting
the number of values in each bin.
• Binning can be useful, but it is tricky to get the size of the bins right. If they are
big enough to smooth out noise, they might also smooth out useful information.
[Figure: the same distributions binned, with a bin size of 10.]
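A minimal sketch of binning with Pandas; the column and the choice of 10 bins are only
illustrative assumptions.

import pandas as pd

# assumed: the NSFG data loaded as in the earlier slides
df = pd.read_csv("D://nsfg.csv")
weights = df['birthwgt_lb1'].dropna()

# divide the range of values into 10 non-overlapping bins and count the values in each
bins = pd.cut(weights, bins = 10)
print(bins.value_counts().sort_index())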

28
Percentiles
• A percentile is a term that describes how a score compares to other scores from
the same set.
• It shows the percentage of scores that fall below the given number.
Ex: if a student is in the 90th percentile, it means that he/she is better
than 90% of the people who took the exam.
• It is a measure used in statistics indicating the value below which a given
percentage of observations in a group of observations fall.

29
Percentiles . . .

Ex: Calculating Percentiles (percentiles are always in the range 0 – 100)

Ex: 30 33 43 55 69 72 85 88 91 93
    Here 30 is the 10th percentile, 33 the 20th, 88 the 80th, and 93 the 100th percentile.

Ex: 30 33 43 55
    Here 30 is the 25th percentile, 33 the 50th, 43 the 75th, and 55 the 100th percentile.

In general, for a sorted dataset of N values, the value at (1-based) Index has percentile

        K = (Index / N) × 100
30
Percentiles . . .
Procedure to Calculate Percentile from Index
• Given Index of an object in dataset, the following are the steps to find its
Percentile ( 0 ≤ K ≤ 100).
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
2) Compute Percentile K
        K = (Index / N) × 100
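A one-line sketch of this formula (the function name is only illustrative), assuming a
1-based index into the sorted dataset:

def percentile_from_index(index, n):
    return index / n * 100

D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
print(percentile_from_index(8, len(D)))    # 88 is at index 8, so K = 80.0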

31
Percentiles . . .
Procedure to Calculate Kth Percentile on Dataset
• The Kth percentile is a value in a dataset that divides the data into two parts.
The lower part contains K% of data, and the upper part contains rest of the
data.
• The following are the steps to calculate the Kth Percentile ( 0 ≤ K ≤ 100).
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
2) Compute Index
        Index = (K / 100) × N
3) If Index is not a whole number, round it to the nearest whole number; then
        Kth Percentile = D[Index]
4) If Index is a whole number
        Kth Percentile = ( D[Index] + D[Index + 1] ) / 2
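A direct translation of these steps into Python (the function name is only illustrative).
Note that numpy.percentile, used on a later slide, interpolates between values and can
give slightly different answers for some K.

def kth_percentile(D, K):
    D = sorted(D)
    N = len(D)
    index = K / 100 * N
    if index != int(index):
        # Index is not a whole number: round it and take that value (1-based index)
        return D[round(index) - 1]
    # Index is a whole number: average D[Index] and D[Index + 1]
    index = int(index)
    return (D[index - 1] + D[index]) / 2

D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
print(kth_percentile(D, 50))    # (69 + 72) / 2 = 70.5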

32
Percentiles . . .

Ex: Calculating the 50th Percentile (data must be sorted in ascending order)

Ex: 30 33 43 55 69 72 85 88 91 93

K = 50, N = 10

Index = K/100 × N = 50/100 × 10 = 5

Index is a whole number, so
Kth Percentile = ( D[5] + D[6] ) / 2 = (69 + 72) / 2 = 70.5

33
Percentiles . . .
# importing numpy module
import numpy

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]

# set K value
K = 50

# computing the Kth percentile of the dataset
P = numpy.percentile(D, K)
print(str(K) + "th Percentile is", P)

Output:
50th Percentile is 70.5

34
Percentiles . . .
Procedure to Calculate Percentile Rank of Score “X”
• The following are the steps to calculate the Percentile Rank.
1) Let “D” be the dataset sorted in ascending order. Let dataset contain “N”
objects.
2) Compute Percentile Rank of value “X”

        Percentile_Rank = ( #Values < X in D ) / N × 100

Ex: D = 30 33 43 55 69 72 85 88 91 93

Calculating the Percentile Rank of 90:

        Percentile_Rank = ( #Values < 90 ) / 10 × 100 = 8 / 10 × 100 = 80.0

35
Percentiles . . .
# Function to calculate the Percentile Rank of a score
def PercentileRank(scores, my_score):
    scores.sort()
    count = 0
    for score in scores:
        if score <= my_score:
            count += 1
    P_Rank = 100.0 * count / len(scores)
    return P_Rank

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
my_score = 90

# computing the Percentile Rank
P_Rank = PercentileRank(D, my_score)
print("Your Percentile Rank is", P_Rank)
print("You performed better than", P_Rank, "% of the candidates who attempted the exam")

Output:
Your Percentile Rank is 80.0
You performed better than 80.0 % of the candidates who attempted the exam

36
Percentiles . . .
Procedure to Calculate Percentile from Percentile Rank
• The following are the steps to calculate the Percentile from Percentile Rank.
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
2) Compute Index
        Index = P_Rank × (N − 1) / 100

3) Percentile = D[Index]

37
Percentiles . . .
# Function to compute the Percentile from a Percentile Rank
def Percentile(scores, P_Rank):
    scores.sort()
    index = int(P_Rank * (len(scores) - 1) / 100)
    return scores[index]

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
P_Rank = 80

# computing the Percentile from the Percentile Rank
percentile = Percentile(D, P_Rank)
print("Your Percentile is", percentile)

Output:
Your Percentile is 88

38
Cumulative Distribution Functions - CDFs
• CDF is the function that maps from a value to its percentile rank.
• CDF is a function of X, where X is any value that might appear in the
distribution.
• To evaluate CDF(X) for a particular value of X, we compute the fraction of values
in the distribution ≤X.
• This function is almost identical to Percentile Rank, except that the result is a
probability in the range 0–1 rather than a percentile rank in the range 0–100.

• Ex: 30 33 43 55

        CDF(30) = 0.25    CDF(33) = 0.5    CDF(43) = 0.75    CDF(55) = 1.0

  CDF values are always in the range 0 – 1.

39
Cumulative Distribution Functions – CDFs . . .

• Ex: 30 33 43 55

        CDF(30) = 0.25    CDF(33) = 0.5    CDF(43) = 0.75    CDF(55) = 1.0

        For X < 30, CDF(X) = 0        For X ≥ 55, CDF(X) = 1.0
        CDF(32) = CDF(30) = 0.25      CDF(49) = CDF(43) = 0.75

• The CDF can be evaluated for any value of X, whether or not X appears in the
dataset.
• If X is less than the smallest value, then CDF(X) = 0.
• If X is greater than or equal to the largest value, then CDF(X) = 1.
• If D[I] ≤ X < D[I+1], then CDF(X) = CDF(D[I]).

40
Cumulative Distribution Functions – CDFs . . .
# Function to evaluate the CDF
def EvalCDF(D, X):
    count = 0.0
    for value in D:
        if value <= X:
            count += 1
    prob = count / len(D)
    return prob

# creating data
D = [30, 33, 43, 55]
X = 33

# computing the CDF
cdf = EvalCDF(D, X)
print("Your CDF is", cdf)

Output:
Your CDF is 0.5

41
Representing CDFs
import numpy as np
import matplotlib.pyplot as plt

# creating data
D = [30, 33, 43, 55]
N = len(D)

# sorting data
X = np.sort(D)

# calculating cumulative probabilities: CDF of the i-th smallest value is (i + 1) / N
Y = np.arange(1, N + 1) / N

# plotting CDF
plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X, Y)
plt.show()
42
Comparing CDFs
import numpy as np
import matplotlib.pyplot as plt

# creating multiple datasets


D1 = [30, 33, 43, 55]
D2 = [20, 28, 37, 45]

N1 = len(D1)
N2 = len(D2)

# sorting data
X1 = np.sort(D1)
X2 = np.sort(D2)

# calculating cumulative probabilities
Y1 = np.arange(1, N1 + 1) / N1
Y2 = np.arange(1, N2 + 1) / N2

43
Comparing CDFs . . .
# plotting multiple CDFs
plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X1, Y1)
plt.plot(X2, Y2)
plt.show()

44
Percentile-Based Statistics
• Percentile
• Percentile Rank
• Percentile can be used to compute percentile-based summary statistics.
– For example, the 50th percentile is the value that divides the distribution in
half, also known as the median. Like the mean, the median is a measure of
the central tendency of a distribution.
– Another percentile-based statistic is the InterQuartile Range (IQR), which
is a measure of the spread of a distribution. The IQR is the difference
between the 75th and 25th percentiles.
• Percentiles are often used to summarize the shape of a distribution.
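A minimal sketch of both statistics with NumPy, using the small dataset from the earlier
slides:

import numpy as np

D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]

median = np.percentile(D, 50)            # 50th percentile
q75, q25 = np.percentile(D, [75, 25])
iqr = q75 - q25                          # difference between the 75th and 25th percentiles
print("Median:", median, "IQR:", iqr)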

45
Random Numbers
import numpy as np
import matplotlib.pyplot as plt
import random

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
sample = random.sample(D , 4)
N = len(sample)

# sorting data
X = np.sort(sample)

# calculating cumulative probabilities
Y = np.arange(1, N + 1) / N

# plotting CDF of random sample


plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X, Y)
plt.show()
46
Comparing Percentile Ranks
• Percentile ranks are useful for comparing measurements across different groups.
• Ex: People who compete in foot races are usually grouped by age and gender.
To compare people in different age groups, you can convert race times to
percentile ranks.

47
Comparing Percentile Ranks . . .
Ex: A few years ago I ran the James Joyce Ramble 10K in Dedham, MA. I finished in
42:44, which was 97th in a field of 1633. I beat or tied 1537 runners out of 1633
(my percentile rank in the field is about 94%).

# Function to find a runner's Percentile Rank from finishing position and group size
def PositionToPercentile(position, group_size):
    beat = group_size - position + 1
    P_Rank = 100.0 * beat / group_size
    return P_Rank

P_Rank = PositionToPercentile(97, 1633)
print("Your Percentile Rank is", round(P_Rank, 2))

Output:
Your Percentile Rank is 94.12

48
Comparing Percentile Ranks . . .
• Ex: In my age group, “males between 40 and 49 years of age”, I came in 26th
out of 256 (my percentile rank in my age group is about 90%).
• If I am still running in 10 years, I will be in age group of “50 to 59 years”.
Assuming that my percentile rank in my division is the same and the field size of
“50 to 59 years” age group is 171, in which position should I finish the
marathon?

# Function to find finishing position from Percentile Rank and group size
def PercentileToPosition(percentile, group_size):
    beat = percentile * group_size / 100.0
    position = group_size - beat + 1
    return position

position = PercentileToPosition(90, 171)
print("Your Position is", round(position, 2))

Output:
Your Position is 18.1

49
Modeling Distributions
• Empirical Distributions are based on empirical observations, which are
necessarily finite samples.
• An Analytic Distribution is characterized by a CDF that is a mathematical
function. Analytic distributions can be used to model empirical distributions. In
this context, a model is a simplification that leaves out unneeded details.
• Some of the common analytic distributions are:
– Exponential Distribution
– Normal Distribution
– Lognormal Distribution

50
Exponential Distribution
• In the real world, exponential distributions come up when we look at a series of
events and measure the times between events, called interarrival times. If
the events are equally likely to occur at any time, the distribution of interarrival
times tends to look like an exponential distribution.
• The exponential distribution is a continuous probability distribution that often
concerns the amount of time until some specific event happens. It is a process
in which events happen continuously and independently at a constant average
rate.

51
Exponential Distribution
• The Exponential distribution is a continuous distribution bounded on the lower
side. Its shape is always the same, starting at a finite value at the minimum and
continuously decreasing at larger x. The Exponential distribution decreases
rapidly as x increases. The accompanying figure shows its PDF.

52
Exponential Distribution . . .
• The CDF of the exponential distribution is

        CDF(x) = 1 − e^(−λx)

  where the parameter λ determines the rate of events.
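A minimal sketch of this CDF, plotted in the same style as the earlier examples; the value
of λ is chosen only for illustration.

import numpy as np
import matplotlib.pyplot as plt

lam = 2.0                            # rate parameter (illustrative)
x = np.linspace(0, 3, 100)
cdf = 1 - np.exp(-lam * x)           # CDF(x) = 1 - e^(-lambda * x)

plt.xlabel('x')
plt.ylabel('CDF')
plt.title('CDF of the Exponential Distribution')
plt.plot(x, cdf)
plt.show()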
53
Normal (Gaussian) Distribution

• The normal distribution, also called Gaussian, is commonly used because it


describes many phenomena, at least approximately.

• It is a type of continuous probability distribution for a real-valued random
variable. The general form of its probability density function is

        f(x) = (1 / (σ √(2π))) · e^( −(x − μ)² / (2σ²) )

• The normal distribution is characterized by two parameters: the mean μ, and


standard deviation σ.

• The normal distribution with μ = 0 and σ = 1 is called the standard normal


distribution. Its CDF is defined by an integral that does not have a closed form
solution, but there are algorithms that evaluate it efficiently.
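A minimal sketch of the standard normal PDF and CDF; it assumes SciPy (scipy.stats.norm)
is available, which is not otherwise used in this unit.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 200)

plt.xlabel('x')
plt.ylabel('PDF / CDF')
plt.plot(x, norm.pdf(x, loc = 0, scale = 1), label = 'PDF (standard normal)')
plt.plot(x, norm.cdf(x, loc = 0, scale = 1), label = 'CDF (standard normal)')
plt.legend()
plt.show()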

54
Normal (Gaussian) Distribution . . .
• A normal distribution is sometimes informally called a bell curve. The red curve
is the standard normal distribution.

55
Normal (Gaussian) Distribution . . .
• Figure shows CDFs for normal distributions with a range of parameters. The
sigmoid shape of these curves is a recognizable characteristic of a normal
distribution.

56
Lognormal Distribution
• If the logarithms of a set of values have a normal distribution, the values have a
lognormal distribution.
• The CDF of the lognormal distribution is the same as the CDF of the normal
distribution, with log x substituted for x.
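A small sketch of that substitution, again assuming SciPy; the parameters and the value
of x are only illustrative.

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0         # parameters of the underlying normal distribution (illustrative)
x = 2.5                      # any positive value

# CDF of the lognormal distribution: the normal CDF evaluated at log x
lognormal_cdf = norm.cdf(np.log(x), loc = mu, scale = sigma)
print(lognormal_cdf)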

57
Lognormal Distribution . . .
• If a sample is approximately lognormal and you plot its CDF on a log-x scale, it
will have the characteristic shape of a normal distribution.
• Left figure shows the distribution of adult weights on a linear scale with a
normal model. Right figure shows the same distribution on a log scale with a
lognormal model. The lognormal model is a better fit.

58
Lognormal Distribution . . .
• The figure shows normal probability plots for adult weights, w, and for their
logarithms, log10 w. The untransformed weights deviate substantially from the
normal model, while their logarithms follow it closely.
• The lognormal model is a good match for the data within a few standard
deviations of the mean, deviating only in the tails. It can be concluded that the
lognormal distribution is a good model for this data.

59
