Unit 4 - Statistical Thinking
By
Dr. G. Sunitha
Professor
Department of AI & ML
School of Computing
• One of the best ways to describe a variable is to report the distribution of the
variable.
Representing Histograms
# In Python, an efficient way to compute frequencies is with a dictionary.
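A minimal sketch of the dictionary approach, assuming a small illustrative list `t` (the values are not from the slides). Each value maps to the number of times it appears.

```python
# Illustrative data (assumed for this example)
t = [1, 2, 2, 3, 5]

# Build the histogram: value -> frequency
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

print(hist)  # {1: 1, 2: 2, 3: 1, 5: 1}
```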
Representing Histograms . . .
# Alternatively, Counter class from the collections module can be used.
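A minimal sketch using `collections.Counter`, again with an illustrative list `t` that is assumed for this example. A `Counter` is a dictionary subclass, so it supports the same lookups plus helpers such as `most_common`.

```python
from collections import Counter

# Illustrative data (assumed for this example)
t = [1, 2, 2, 3, 5]

# Counter builds the value -> frequency mapping in one call
counter = Counter(t)

print(counter[2])              # frequency of the value 2
print(counter.most_common(1))  # the most frequent value and its count
```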
Representing Histograms . . .
# Alternatively, value_counts() method from Pandas library can be used.
import pandas as pd
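A minimal sketch of the Pandas approach, assuming the same kind of illustrative list `t`. `value_counts()` returns a Series whose index holds the values and whose entries hold the frequencies; `sort_index()` orders the result by value.

```python
import pandas as pd

# Illustrative data (assumed for this example)
t = [1, 2, 2, 3, 5]

# value -> frequency, ordered by value
counts = pd.Series(t).value_counts().sort_index()

print(counts)
```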
Plotting Histograms
National Survey of Family Growth (NSFG) Data - This dataset contains information
on marriage, pregnancy, infertility, contraception use, reproductive health and
family life.
Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# reading data
df = pd.read_csv("D://nsfg.csv")
# Creating a figure
fig = plt.figure(dpi = 100)
# Set labels
plt.xlabel('Birth Weight(Pounds)')
plt.ylabel('Frequency')
# Visualizing histogram
sns.histplot(data = df ['birthwgt_lb1'] , bins = [0,1,2,3,4,5,6,7,8,9,10,11,12,13])
Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# reading data
df = pd.read_csv("D://nsfg.csv")
# Creating a figure
fig = plt.figure(dpi = 100)
# Set labels
plt.xlabel('Birth Weight(Ounces)')
plt.ylabel('Frequency')
# Visualizing Histogram
# Ounces run from 0 to 15, so the bin edges extend to 16
sns.histplot(data = df['birthwgt_oz1'] , bins = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] )
Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Figure annotation: Mode = 40
# reading data
df = pd.read_csv("D://nsfg.csv")
# Creating a figure
fig = plt.figure(dpi = 100)
# Visualizing Histogram
sns.histplot(data = df ['prglngth'] )
Outliers
• Looking at histograms, it is easy to identify the most common values and the
shape of the distribution, but rare values are not always visible.
• Histograms are useful because they make the most frequent values immediately
apparent. But they are not the best choice for comparing two distributions.
• It is a good idea to check for outliers, which are extreme values that might be
errors in measurement and recording, or might be accurate reports of rare
events.
• The best way to handle outliers depends on “domain knowledge”; that is,
information about where the data come from and what they mean. And it
depends on what analysis you are planning to perform.
Outliers
Tails
Summarizing Distributions
• A histogram is a complete description of the distribution of a sample; that is,
given a histogram, we could reconstruct the values in the sample (although not
their order).
• Often we want to summarize the distribution with a few descriptive statistics.
• Central Tendency
– A characteristic of a sample or population; intuitively, it is an average or
typical value.
– Do the values tend to cluster around a particular point?
• Mode
– The most frequent value in a sample, or one of the most frequent values.
– Is there more than one cluster?
• Spread
– A measure of how spread out the values in a distribution are.
– How much variability is there in the values?
Summarizing Distributions . . .
• Tails
– The part of a distribution at the high and low extremes.
– How quickly do the probabilities drop off as we move away from the
modes?
• Outliers
– A value far from the central tendency.
– Are there extreme values far from the modes?
Summarizing Distributions . . .
• Summary Statistic
– A statistic that quantifies some aspect of a distribution, like central tendency
or spread.
• Variance
– A summary statistic often used to quantify spread.
• Standard Deviation
– The square root of variance, also used as a measure of spread.
• Effect Size
– A summary statistic intended to quantify the size of an effect like a
difference between groups.
• Normal Distribution
– An idealization of a bell-shaped distribution; also known as a Gaussian
distribution.
• Uniform Distribution
– A distribution in which all values have the same frequency.
Summarizing Distributions . . .
• Statistics designed to answer these questions are called Summary Statistics.
• By far the most common summary statistic is the Mean, which is meant to
describe the central tendency of the distribution.
Variance
• If there is no single number that summarizes a variable, we can do a little better
with two numbers: Mean and Variance.
$S^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2$
• The term $x_i - \bar{x}$ is called the “deviation from the mean,” so variance is the
mean squared deviation.
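A minimal sketch of computing the mean, variance and standard deviation by hand. It assumes variance is the mean squared deviation (dividing by n, not n - 1), and the data values are illustrative.

```python
import math

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def variance(values):
    """Mean squared deviation from the mean (divides by n)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

# Illustrative data (assumed for this example)
values = [1, 2, 2, 3, 5]

m = mean(values)
var = variance(values)
std = math.sqrt(var)  # standard deviation is the square root of variance

print("Mean:", m)
print("Variance:", round(var, 2))
print("Standard Deviation:", round(std, 2))
```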
Probability Mass Functions (PMF)
Plotting PMFs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# reading data
df = pd.read_csv("D://nsfg.csv")
# Creating a figure
fig = plt.figure(dpi = 100)
# Visualizing PMF
sns.histplot(data = df['prglngth'] , stat = 'probability' )
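A minimal sketch of computing a PMF directly, without plotting: each frequency is divided by the sample size so that the probabilities add up to 1. The helper name `make_pmf` and the data are illustrative assumptions.

```python
from collections import Counter

def make_pmf(values):
    """Map each value to its probability (frequency / sample size)."""
    n = len(values)
    counts = Counter(values)
    return {x: c / n for x, c in sorted(counts.items())}

# Illustrative data (assumed for this example)
pmf = make_pmf([1, 2, 2, 3, 5])

print(pmf)  # {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}
```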
The Class Size Paradox - Plotting Multiple PMFs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Creating a figure
fig = plt.figure(dpi = 100)
# Set labels
plt.xlabel('Actual vs Predicted Marks')
plt.ylabel('Probability')
# creating data
Actual = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5 ]
Predicted = [7, 24, 25, 36, 34, 13, 15, 45, 45, 44, 43, 43, 34, 13 ]
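The plotting call for this slide is not preserved, so the following is only a sketch of one way to overlay the two PMFs. The helpers `make_pmf` and `plot_pmfs` are hypothetical names, and the bar-chart styling is an assumption; matplotlib is imported only when the plot is requested.

```python
from collections import Counter

Actual = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5]
Predicted = [7, 24, 25, 36, 34, 13, 15, 45, 45, 44, 43, 43, 34, 13]

def make_pmf(values):
    """Map each value to its probability (frequency / sample size)."""
    n = len(values)
    return {x: c / n for x, c in sorted(Counter(values).items())}

pmf_actual = make_pmf(Actual)
pmf_predicted = make_pmf(Predicted)

def plot_pmfs(pmfs, labels):
    """Overlay several PMFs as semi-transparent bar charts."""
    import matplotlib.pyplot as plt  # imported lazily, only for plotting
    for pmf, label in zip(pmfs, labels):
        plt.bar(list(pmf.keys()), list(pmf.values()), alpha=0.5, label=label)
    plt.xlabel('Actual vs Predicted Marks')
    plt.ylabel('Probability')
    plt.legend()
    plt.show()

# To draw the figure:
# plot_pmfs([pmf_actual, pmf_predicted], ['Actual', 'Predicted'])
```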
DataFrame Indexing
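import numpy as np
import pandas
# Creating a 4x2 array of random values (illustrative; the slide does not define `array`)
array = np.random.randn(4, 2)
# Creating dataframe
df = pandas.DataFrame(array)
# displaying dataframe
df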
DataFrame Indexing . . .
• By default, the rows and columns are numbered starting at zero, but names can be
provided to columns.
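import numpy as np
import pandas
# Illustrative data and column names (assumed; not shown on the slide)
array = np.random.randn(4, 2)
columns = ['A', 'B']
df = pandas.DataFrame(array, columns=columns)
# displaying dataframe
df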
DataFrame Indexing . . .
• Names can be provided for rows also. The row names themselves are called
labels. The set of row names is called the index.
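import numpy as np
import pandas
# Illustrative data, column names and row labels (assumed; not shown on the slide)
array = np.random.randn(4, 2)
df = pandas.DataFrame(array, columns=['A', 'B'], index=['a', 'b', 'c', 'd'])
# displaying dataframe
df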
DataFrame Indexing . . .
• Simple indexing selects a column, returning a Series.
• If the integer position of a row is known, the iloc attribute can be used. It
returns a Series object.
DataFrame Indexing . . .
• To select a row by label, the loc attribute can be used. It returns a Series
object.
• loc attribute can also take a list of labels; in that case, the result is a
DataFrame.
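A minimal sketch of column selection, `iloc` and `loc` on a small DataFrame. The values, column names ('A', 'B') and row labels ('a' to 'd') are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x2 frame with named columns and labelled rows
array = np.arange(8).reshape(4, 2)
df = pd.DataFrame(array, columns=['A', 'B'], index=['a', 'b', 'c', 'd'])

col = df['A']              # simple indexing selects a column -> Series
row_pos = df.iloc[0]       # row by integer position -> Series
row_lab = df.loc['a']      # row by label -> Series
rows = df.loc[['a', 'c']]  # list of labels -> DataFrame

print(type(col).__name__, type(row_pos).__name__,
      type(row_lab).__name__, type(rows).__name__)
```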
DataFrame Indexing . . .
• A DataFrame can be sliced to select a range of rows by label or by integer
position.
• The result in either case is a DataFrame, but notice that the label-based slice includes
the end of the slice; the position-based slice doesn’t.
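A minimal sketch contrasting the two kinds of slice, using `loc` for labels and `iloc` for integer positions. The frame contents and labels are illustrative assumptions.

```python
import pandas as pd

# Hypothetical frame with row labels 'a' to 'd'
df = pd.DataFrame({'A': [0, 2, 4, 6], 'B': [1, 3, 5, 7]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']    # label slice: the end label 'c' is included
by_position = df.iloc[0:2]    # position slice: position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```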
Limits of PMFs
• PMFs work well if the number of values is small. But as the number of values
increases, the probability associated with each value gets smaller and the effect
of random noise increases.
Limits of PMFs . . .
• Overall, these distributions resemble the bell
shape of a normal distribution, with many
values near the mean and a few values
much higher and lower.
• But parts of this figure are hard to interpret.
• There are many spikes and valleys, and
some apparent differences between the
distributions. It is hard to tell which of these
features are meaningful.
• Also, it is hard to see overall patterns; for
example, which distribution do you think has
the higher mean?
• These problems can be mitigated by binning
the data; that is, dividing the range of values
into non-overlapping intervals and counting
the number of values in each bin.
• Binning can be useful, but it is tricky to get
the size of the bins right. If they are big
enough to smooth out noise, they might also
smooth out useful information.
[Figure: binned distributions, bin size = 10]
Percentiles
• A percentile is a term that describes how a score compares to other scores from
the same set.
• It shows the percentage of scores that fall below the given number.
Ex: If a student is in the 90th percentile, it means that he/she scored higher
than 90% of the people who took the exam.
• It is a measure used in statistics indicating the value below which a given
percentage of observations in a group of observations fall.
Percentiles . . .
Procedure to Calculate Percentile from Index
• Given Index of an object in dataset, the following are the steps to find its
Percentile ( 0 ≤ K ≤ 100).
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
2) Compute Percentile K
$K = \frac{\text{Index}}{N} \times 100$
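A minimal sketch of the formula above. It assumes Index is the 1-based position of the object in the sorted dataset, so the largest object maps to K = 100; the function name and data are illustrative.

```python
def percentile_from_index(index, n):
    """Percentile K of the object at 1-based position `index` among n objects."""
    return 100.0 * index / n

# Illustrative data (assumed for this example), sorted in ascending order
D = sorted([30, 33, 43, 55, 69, 72, 85, 88, 91, 93])

# 88 sits at 1-based position 8 of 10
index = D.index(88) + 1
K = percentile_from_index(index, len(D))

print("Percentile K =", K)  # Percentile K = 80.0
```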
Percentiles . . .
Procedure to Calculate Kth Percentile on Dataset
• The Kth percentile is a value in a dataset that divides the data into two parts.
The lower part contains K% of data, and the upper part contains rest of the
data.
• The following are the steps to calculate the Kth Percentile ( 0 ≤ K ≤ 100).
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
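2) Compute Index (positions in D are counted from 1)
$\text{Index} = \frac{K}{100} \times N$
3) If Index is not a whole number, round it up to the next whole number.
$K^{th}\ \text{Percentile} = D[\text{Index}]$
4) If Index is a whole number
$K^{th}\ \text{Percentile} = \frac{D[\text{Index}] + D[\text{Index}+1]}{2}$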
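A minimal sketch of this procedure. It assumes positions in D are counted from 1 and that a fractional Index is rounded up, which is a common textbook convention; the handling of K = 0 and K = 100 (returning the smallest and largest value) is also an assumption, since the steps do not cover those edges.

```python
import math

def kth_percentile(data, k):
    """Kth percentile using 1-based positions in the sorted data."""
    d = sorted(data)
    n = len(d)
    index = k * n / 100
    if not float(index).is_integer():
        # Fractional index: round up and take the value at that position
        return d[math.ceil(index) - 1]
    pos = int(index)
    if pos == 0:   # K = 0: assumed to be the smallest value
        return d[0]
    if pos == n:   # K = 100: assumed to be the largest value
        return d[-1]
    # Whole index: average the values at positions pos and pos + 1
    return (d[pos - 1] + d[pos]) / 2

# Illustrative data (assumed for this example)
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]

print(kth_percentile(D, 50))  # 70.5
print(kth_percentile(D, 25))  # 43
```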
Percentiles . . .
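# importing numpy module
import numpy
# creating data (illustrative values; the slide's data is not preserved)
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
# set K value
K = 50
# computing the Kth percentile (NumPy interpolates linearly by default)
print("The percentile for K =", K, "is", numpy.percentile(D, K))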
Percentiles . . .
Procedure to Calculate Percentile Rank of Score “X”
• The following are the steps to calculate the Percentile Rank.
1) Let “D” be the dataset sorted in ascending order. Let dataset contain “N”
objects.
2) Compute Percentile Rank of value “X”
$\text{Percentile\_Rank} = \frac{\#\,\text{Values} < X \text{ in } D}{N} \times 100$
Ex: D = 30 33 43 55 69 72 85 88 91 93, X = 90
$\text{Percentile\_Rank} = \frac{\#\,\text{Values} < 90}{10} \times 100 = \frac{8}{10} \times 100 = 80.0$
Percentiles . . .
# Function to calculate Percentile Rank from Score
def PercentileRank(scores, my_score):
    scores.sort()
    count = 0
    for score in scores:
        if score <= my_score:
            count += 1
    P_Rank = 100.0 * count / len(scores)
    return P_Rank

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
my_score = 90
# computing Percentile Rank
P_Rank = PercentileRank(D, my_score)
print("Your Percentile Rank is", P_Rank)
print(f"You performed better than the {P_Rank}% of Candidates who Attempted Exam")

Output:
Your Percentile Rank is 80.0
You performed better than the 80.0% of Candidates who Attempted Exam
Percentiles . . .
Procedure to Calculate Percentile from Percentile Rank
• The following are the steps to calculate the Percentile from Percentile Rank.
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
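2) Compute Index (truncated to a whole number; positions in D are counted from 0)
$\text{Index} = \frac{P\_Rank \times (N - 1)}{100}$
3) $\text{Percentile} = D[\text{Index}]$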
Percentiles . . .
# Function to compute Percentile from Percentile Rank
def Percentile(scores, P_Rank):
    scores.sort()
    index = int(P_Rank * (len(scores)-1) / 100)
    return scores[index]

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
P_Rank = 80
# computing Percentile
print("Your Percentile is", Percentile(D, P_Rank))

Output:
Your Percentile is 88
Cumulative Distribution Functions - CDFs
• CDF is the function that maps from a value to its percentile rank.
• CDF is a function of X, where X is any value that might appear in the
distribution.
• To evaluate CDF(X) for a particular value of X, we compute the fraction of values
in the distribution ≤X.
• This function is almost identical to Percentile Rank, except that the result is a
probability in the range 0–1 rather than a percentile rank in the range 0–100.
Cumulative Distribution Functions – CDFs . . .
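• Ex: D = 30 33 43 55, so CDF(30) = 0.25, CDF(33) = 0.5, CDF(43) = 0.75 and
CDF(55) = 1.
• CDF can be evaluated for any value of X, whether or not it is in the
dataset (D sorted in ascending order, positions counted from 0).
• If X < D[0] , then CDF(X) = 0.
• If X ≥ D[N-1] , then CDF(X) = 1.
• If D[I] ≤ X < D[I+1] , then CDF(X) = CDF(D[I]).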
Cumulative Distribution Functions – CDFs . . .
# Function to Evaluate CDF
def EvalCDF(D, X):
    count = 0.0
    for value in D:
        if value <= X:
            count += 1
    prob = count / len(D)
    return prob

# creating data
D = [30, 33, 43, 55]
X = 33
# computing CDF
cdf = EvalCDF( D , X)
print("Your CDF is", cdf)

Output:
Your CDF is 0.5
Representing CDFs
import numpy as np
import matplotlib.pyplot as plt
# creating data
D = [30, 33, 43, 55]
N = len(D)
# sorting data
X = np.sort(D)
# calculating probabilities: fraction of values <= each sorted value
Y = np.arange(1, N + 1) / N
# plotting CDF as a step function
plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.step(X, Y, where='post')
plt.show()
Comparing CDFs
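import numpy as np
import matplotlib.pyplot as plt
# creating data (illustrative; the slide does not define D1 and D2)
D1 = [30, 33, 43, 55]
D2 = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
N1 = len(D1)
N2 = len(D2)
# sorting data
X1 = np.sort(D1)
X2 = np.sort(D2)
# calculating probabilities: fraction of values <= each sorted value
Y1 = np.arange(1, N1 + 1) / N1
Y2 = np.arange(1, N2 + 1) / N2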
Comparing CDFs . . .
# plotting multiple CDFs
plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X1, Y1)
plt.plot(X2, Y2)
plt.show()
Percentile-Based Statistics
• Percentile
• Percentile Rank
• Percentile can be used to compute percentile-based summary statistics.
– For example, the 50th percentile is the value that divides the distribution in
half, also known as the median. Like the mean, the median is a measure of
the central tendency of a distribution.
– Another percentile-based statistic is the InterQuartile Range (IQR), which
is a measure of the spread of a distribution. The IQR is the difference
between the 75th and 25th percentiles.
• Percentiles are often used to summarize the shape of a distribution.
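A minimal sketch of the median and IQR using Python's `statistics` module. The data is illustrative, and `statistics.quantiles` uses its own interpolation convention, so the quartiles may differ slightly from other percentile procedures.

```python
import statistics

# Illustrative data (assumed for this example)
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]

# 50th percentile: divides the distribution in half
median = statistics.median(D)

# Quartiles: 25th, 50th and 75th percentiles
q1, q2, q3 = statistics.quantiles(D, n=4)

# Interquartile range: spread of the middle half of the data
iqr = q3 - q1

print("Median:", median)  # Median: 70.5
print("IQR:", iqr)
```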
Random Numbers
import numpy as np
import matplotlib.pyplot as plt
import random
# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
sample = random.sample(D , 4)
N = len(sample)
# sorting data
X = np.sort(sample)
# calculating probabilities: fraction of values <= each sorted value
Y = np.arange(1, N + 1) / N
Comparing Percentile Ranks . . .
Ex: A few years ago I ran the James Joyce Ramble 10K in Dedham MA. I finished in
42:44, which was 97th in a group of 1633. I beat or tied 1537 runners out of 1633,
so my percentile rank in the field is 94%.
# Function to find Percentile Rank of a runner from Finishing Position and Group Size
def PositionToPercentile(position, group_size):
    beat = group_size - position + 1
    P_Rank = 100.0 * beat / group_size
    return P_Rank

# computing Percentile Rank
P_Rank = PositionToPercentile(97, 1633)
print("Your Percentile Rank is", round(P_Rank, 2))

Output:
Your Percentile Rank is 94.12
Comparing Percentile Ranks . . .
• Ex: In my age group, “males between 40 and 49 years of age”, I came in 26th
out of 256, so my percentile rank in my age group is 90%.
• If I am still running in 10 years, I will be in the “50 to 59 years” age group.
Assuming that my percentile rank in my division stays the same and the field size of
the “50 to 59 years” age group is 171, in which position should I finish the
race?
# Function to find Finishing Position from Percentile Rank and Group Size
def PercentileToPosition(percentile, group_size):
    beat = percentile * group_size / 100.0
    position = group_size - beat + 1
    return position
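A minimal sketch that chains the two conversions for the age-group example: 26th out of 256 today, then the equivalent position in a field of 171. It uses the unrounded percentile rank, which is an assumption; the lowercase variable names are illustrative.

```python
def PositionToPercentile(position, group_size):
    """Percentile rank of a finisher (ties count as beaten)."""
    beat = group_size - position + 1
    return 100.0 * beat / group_size

def PercentileToPosition(percentile, group_size):
    """Finishing position that corresponds to a percentile rank."""
    beat = percentile * group_size / 100.0
    return group_size - beat + 1

# 26th out of 256 in the 40 to 49 age group
p_rank = PositionToPercentile(26, 256)

# Same percentile rank in a field of 171 (50 to 59 age group)
position = PercentileToPosition(p_rank, 171)

print("Percentile Rank:", round(p_rank, 2))      # 90.23
print("Finishing Position:", round(position, 1)) # 17.7
```

A position of about 17.7 means finishing 17th or 18th would keep roughly the same percentile rank.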
Modeling Distributions
• Empirical Distributions are based on empirical observations, which are
necessarily finite samples.
• An Analytic Distribution is characterized by a CDF that is a mathematical
function. Analytic distributions can be used to model empirical distributions. In
this context, a model is a simplification that leaves out unneeded details.
• Some of the common analytic distributions are:
– Exponential Distribution
– Normal Distribution
– Lognormal Distribution
Exponential Distribution
• In the real world, exponential distributions come up when we look at a series of
events and measure the times between events, called interarrival times. If
the events are equally likely to occur at any time, the distribution of interarrival
times tends to look like an exponential distribution.
• The exponential distribution is a continuous probability distribution that often
concerns the amount of time until some specific event happens. It is a process
in which events happen continuously and independently at a constant average
rate.
Exponential Distribution
• The Exponential distribution is a continuous distribution bounded on the lower
side. Its shape is always the same, starting at a finite value at the minimum and
continuously decreasing at larger x. The Exponential distribution decreases
rapidly for increasing x. The following figure shows the PDF of x.
Exponential Distribution . . .
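• The CDF of the exponential distribution is $CDF(x) = 1 - e^{-\lambda x}$, where the
parameter $\lambda$ determines the shape of the distribution.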
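A minimal sketch of evaluating an exponential CDF, assuming the standard form 1 - e^(-λx) for x ≥ 0 with rate parameter λ. The function name and the rate value are illustrative.

```python
import math

def exponential_cdf(x, lam):
    """CDF of an exponential distribution with rate `lam` (0 for x < 0)."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-lam * x)

lam = 2.0  # illustrative rate: on average 2 events per unit time

# The median interarrival time is ln(2) / lam, where the CDF reaches 0.5
median = math.log(2) / lam

print(exponential_cdf(0, lam))                 # 0.0
print(round(exponential_cdf(median, lam), 3))  # 0.5
```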
Normal (Gaussian) Distribution
Normal (Gaussian) Distribution . . .
• A normal distribution is sometimes informally called a bell curve. The red curve
is the standard normal distribution.
Normal (Gaussian) Distribution . . .
• Figure shows CDFs for normal distributions with a range of parameters. The
sigmoid shape of these curves is a recognizable characteristic of a normal
distribution.
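A minimal sketch of a normal CDF written with the error function from Python's `math` module. The function name and the parameter values (mean `mu`, standard deviation `sigma`) are illustrative assumptions.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution with mean `mu` and std. deviation `sigma`."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Evaluate the standard normal CDF at a few points to see the sigmoid shape
for x in [-3, -1, 0, 1, 3]:
    print(x, round(normal_cdf(x), 4))
```

Half of the probability lies below the mean, so `normal_cdf(mu, mu, sigma)` is 0.5 for any parameters.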
Lognormal Distribution
• If the logarithms of a set of values have a normal distribution, the values have a
lognormal distribution.
• The CDF of the lognormal distribution is the same as the CDF of the normal
distribution, with log x substituted for x.
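A minimal sketch of the substitution described above: the normal CDF is evaluated at log x. It assumes the natural logarithm, with `mu` and `sigma` describing the distribution of the logarithms; the names and values are illustrative.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Normal CDF evaluated at log(x); zero for non-positive x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

mu, sigma = 4.0, 0.5  # illustrative parameters of the log values

# The median of a lognormal distribution is exp(mu)
print(lognormal_cdf(math.exp(mu), mu, sigma))  # 0.5
```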
Lognormal Distribution . . .
• If a sample is approximately lognormal and you plot its CDF on a log-x scale, it
will have the characteristic shape of a normal distribution.
• Left figure shows the distribution of adult weights on a linear scale with a
normal model. Right figure shows the same distribution on a log scale with a
lognormal model. The lognormal model is a better fit.
Lognormal Distribution . . .
• Figure shows normal probability plots for adult weights, w, and for their
logarithms, log10 w. Now it is apparent that the data deviate substantially from
the normal model.
• The lognormal model is a good match for the data within a few standard
deviations of the mean, but it deviates in the tails. It can be concluded that the
lognormal distribution is a good model for this data.