Unit 4 - Statistical Thinking

This document discusses statistical distributions and methods for visualizing and summarizing them. It covers histograms, probability mass functions, and descriptive statistics like mean, variance, and standard deviation. Histograms map values to frequencies while probability mass functions map values to probabilities. The document shows examples of plotting histograms and PMFs in Python using libraries like Pandas and Seaborn. It also discusses identifying outliers and comparing multiple distributions to analyze phenomena like the class size paradox.

Uploaded by

VivuEtukuru

Statistical Thinking

Unit 4

By
Dr. G. Sunitha
Professor
Department of AI & ML

School of Computing

Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102

Distribution of a Variable

• One of the best ways to describe a variable is to report the distribution of the
variable.

– The values that appear in a sample and the frequency of each.

• The most common representation of a distribution is a histogram, which is a
graph that shows the frequency of each value.

– Graph that shows mapping from values to frequencies.

• A histogram is a complete description of the distribution of a sample; that is,
given a histogram, we could reconstruct the values in the sample (although not
their order).

Representing Histograms
# In Python, an efficient way to compute frequencies is with a dictionary.

# initializing the list
lst = ['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B']

# creating an empty dictionary
hist = {}

# looping through the list to find the frequency of each element
for x in lst:
    hist[x] = hist.get(x, 0) + 1

# displaying the frequency
print(hist)

Output:
{'A': 3, 'B': 3, 'C': 1, 'D': 2}

Representing Histograms . . .
# Alternatively, the Counter class from the collections module can be used.

# importing the module
from collections import Counter

# initializing the list
lst = ['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B']

# using Counter to find frequency of elements
hist = Counter(lst)

# displaying the frequency (Counter prints the most frequent elements first)
print(hist)

Output:
Counter({'A': 3, 'B': 3, 'D': 2, 'C': 1})

Representing Histograms . . .
# Alternatively, the value_counts() method from the Pandas library can be used.

import pandas as pd

# initializing the Series object
lst = pd.Series(['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B'])

# displaying the frequency (sorted from most to least frequent)
print(lst.value_counts())

Output:
A    3
B    3
D    2
C    1

Plotting Histograms
National Survey of Family Growth (NSFG) Data - This dataset contains information
on marriage, pregnancy, infertility, contraception use, reproductive health and
family life.

Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Birth Weight(Pounds)')
plt.ylabel('Frequency')

# Visualizing histogram
sns.histplot(data = df['birthwgt_lb1'], bins = [0,1,2,3,4,5,6,7,8,9,10,11,12,13])

Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Birth Weight(Ounces)')
plt.ylabel('Frequency')

# Visualizing Histogram
# the ounces part of a weight runs from 0 to 15, so the bin edges run from 0 to 16
sns.histplot(data = df['birthwgt_oz1'], bins = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])

Plotting Histograms . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Pregnancy Length')
plt.ylabel('Frequency')

# Visualizing Histogram
sns.histplot(data = df['prglngth'])

[Figure annotation: Mode = 40]

Outliers
• Looking at histograms, it is easy to identify the most common values and the
shape of the distribution, but rare values are not always visible.

• Histograms are useful because they make the most frequent values immediately
apparent. But they are not the best choice for comparing two distributions.

• It is a good idea to check for outliers, which are extreme values that might be
errors in measurement and recording, or might be accurate reports of rare
events.

• The best way to handle outliers depends on “domain knowledge”; that is,
information about where the data come from and what they mean. And it
depends on what analysis you are planning to perform.

Outliers . . .

[Figure: distribution with its tails labeled]

Summarizing Distributions
• A histogram is a complete description of the distribution of a sample; that is,
given a histogram, we could reconstruct the values in the sample (although not
their order).
• Often we want to summarize the distribution with a few descriptive statistics.
• Central Tendency
– A characteristic of a sample or population; intuitively, it is an average or
typical value.
– Do the values tend to cluster around a particular point?
• Mode
– The most frequent value in a sample, or one of the most frequent values.
– Is there more than one cluster?
• Spread
– A measure of how spread out the values in a distribution are.
– How much variability is there in the values?

Summarizing Distributions . . .
• Tails
– The part of a distribution at the high and low extremes.
– How quickly do the probabilities drop off as we move away from the
modes?

• Outliers
– A value far from the central tendency.
– Are there extreme values far from the modes?

Summarizing Distributions . . .
• Summary Statistic
– A statistic that quantifies some aspect of a distribution, like central tendency
or spread.
• Variance
– A summary statistic often used to quantify spread.
• Standard Deviation
– The square root of variance, also used as a measure of spread.
• Effect Size
– A summary statistic intended to quantify the size of an effect like a
difference between groups.
• Normal Distribution
– An idealization of a bell-shaped distribution; also known as a Gaussian
distribution.
• Uniform Distribution
– A distribution in which all values have the same frequency.

Summarizing Distributions . . .
• Statistics designed to answer these questions are called Summary Statistics.

• By far the most common summary statistic is the Mean, which is meant to
describe the central tendency of the distribution.

Variance
• If there is no single number that summarizes a variable, we can do a little better
with two numbers: Mean and Variance.

• Variance is a summary statistic intended to describe the variability or spread of a
distribution.

• The variance of a set of n values x_i with mean x̄ is

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

• The term $x_i - \bar{x}$ is called the “deviation from the mean,” so variance is the
mean squared deviation.

• The square root of variance, S, is the Standard Deviation.

• Pandas data structures provide methods to compute mean, variance and
standard deviation.
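As a rough sketch of the last point, the snippet below computes the three statistics on a small made-up Series (the values are illustrative only and are not taken from the NSFG data). One caveat worth flagging: the pandas `var()` and `std()` methods default to `ddof=1`, which divides by n - 1, so `ddof=0` is passed here to get the mean squared deviation described on this slide.

```python
import pandas as pd

# illustrative values only (an assumption, not from the NSFG dataset)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()
# ddof=0 divides by n, i.e. the mean squared deviation
variance = values.var(ddof=0)
std = values.std(ddof=0)

print("Mean:", mean)
print("Variance:", variance)
print("Standard Deviation:", std)
```

Leaving `ddof` at its default gives the sample variance instead, which is slightly larger for the same data.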

Probability Mass Functions (PMF)

• Probability Mass Function (PMF) maps each value to its probability.

• Probability is a frequency expressed as a fraction of the sample size, n.
• Normalization is the process of dividing a frequency by a sample size to get a
probability.
• The biggest difference between a histogram and a PMF is that a histogram maps
values to integer counts, while a PMF maps values to floating-point probabilities.
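Normalization can be sketched in a few lines, reusing the letter list from the earlier histogram slides. The variable names `n` and `pmf` are chosen here for illustration.

```python
from collections import Counter

lst = ['A', 'A', 'B', 'C', 'B', 'D', 'D', 'A', 'B']

# histogram: value -> integer count
hist = Counter(lst)

# sample size
n = len(lst)

# PMF: value -> count divided by sample size
pmf = {value: freq / n for value, freq in hist.items()}

print(pmf)
print("Total probability:", sum(pmf.values()))
```

Because every count is divided by the same n, the probabilities add up to 1 (up to floating-point rounding).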

Plotting PMFs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading data
df = pd.read_csv("D://nsfg.csv")

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels and title

plt.xlabel('Pregnancy Length')
plt.ylabel('Probability')

# Visualizing PMF
sns.histplot(data = df['prglngth'], stat = 'probability')

The Class Size Paradox - Plotting Multiple PMFs
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Actual vs Predicted Marks')
plt.ylabel('Probability')

# creating data
Actual = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5]
Predicted = [7, 24, 25, 36, 34, 13, 15, 45, 45, 44, 43, 43, 34, 13]

# Visualizing Multiple PMFs
sns.histplot(data = Actual, stat = 'probability', element = 'step')
sns.histplot(data = Predicted, stat = 'probability', element = 'step')

The Class Size Paradox - Plotting Multiple PMFs . . .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Creating a figure
fig = plt.figure(dpi = 100)

# Set labels
plt.xlabel('Actual vs Predicted Marks')
plt.ylabel('Probability')

# creating data
Actual = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5]
Predicted = [7, 24, 25, 36, 34, 13, 15, 45, 45, 44, 43, 43, 34, 13]

# Visualizing Multiple PMFs as unfilled outlines so both remain visible
sns.histplot(data = Actual, stat = 'probability', element = 'step', fill = False)
sns.histplot(data = Predicted, stat = 'probability', element = 'step', fill = False)

DataFrame Indexing
import numpy as np
import pandas

# Generating a 4x2 dataset of numbers

array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]

# Creating dataframe
df = pandas.DataFrame(array)

# displaying dataframe
df

DataFrame Indexing . . .
• By default, the rows and columns are numbered starting at zero, but names can be
provided to columns.

import numpy as np
import pandas

# Generating a 4x2 dataset of numbers

array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]

# Providing names to 2 columns

columns = ['A', 'B']

# Creating dataframe with column names

df = pandas.DataFrame(array, columns=columns)

# displaying dataframe
df

DataFrame Indexing . . .
• Names can be provided for rows also. The row names themselves are called
labels. The set of row names is called the index.
import numpy as np
import pandas

# Generating a 4x2 dataset of numbers

array = [ [1, 2] , [3, 4] , [5, 6] , [7, 8] ]

# Providing names to 2 columns

columns = ['A', 'B']

# Providing names to 4 rows

rows = ['a', 'b', 'c', 'd']

# Creating dataframe with column names and row labels

df = pandas.DataFrame(array, columns=columns, index=rows)

# displaying dataframe
df

DataFrame Indexing . . .
• Simple indexing selects a column, returning a Series.

• If the integer position of a row is known, the iloc attribute can be used. It
returns a Series object.
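A small sketch of both selections, assuming the 4x2 DataFrame with columns 'A', 'B' and row labels 'a' to 'd' built on the earlier slides. The names `col_a` and `first_row` are illustrative.

```python
import pandas as pd

array = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(array, columns=['A', 'B'], index=['a', 'b', 'c', 'd'])

# simple indexing selects a column and returns a Series
col_a = df['A']

# iloc selects a row by integer position and also returns a Series
first_row = df.iloc[0]

print(col_a)
print(first_row)
```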

DataFrame Indexing . . .
• To select a row by label, the loc attribute can be used. It returns a Series
object.

• loc attribute can also take a list of labels; in that case, the result is a
DataFrame.
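The two uses of loc can be sketched as follows, again assuming the labeled 4x2 DataFrame from the earlier slides. The names `row_a` and `rows_ac` are illustrative.

```python
import pandas as pd

array = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(array, columns=['A', 'B'], index=['a', 'b', 'c', 'd'])

# a single label returns a Series
row_a = df.loc['a']

# a list of labels returns a DataFrame
rows_ac = df.loc[['a', 'c']]

print(row_a)
print(rows_ac)
```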

25
DataFrame Indexing . . .
• DataFrame can be sliced to select a range of rows by label or by integer
position.
• The result in either case is a DataFrame, but notice that the first result includes
the end of the slice; the second doesn’t.
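The difference between the two kinds of slice can be seen in a short sketch on the same labeled DataFrame. Here the slices are written with loc and iloc to make the label and position cases explicit, which is a choice made for this example.

```python
import pandas as pd

array = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(array, columns=['A', 'B'], index=['a', 'b', 'c', 'd'])

# slicing by label: the end label 'c' is included
by_label = df.loc['a':'c']

# slicing by integer position: the end position 2 is excluded
by_position = df.iloc[0:2]

print(by_label)
print(by_position)
```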

Limits of PMFs
• PMFs work well if the number of values is small. But as the number of values
increases, the probability associated with each value gets smaller and the effect
of random noise increases.

Limits of PMFs . . .
• Overall, these distributions resemble the bell shape of a normal distribution,
with many values near the mean and a few values much higher and lower.
• But parts of this figure are hard to interpret.
• There are many spikes and valleys, and some apparent differences between the
distributions. It is hard to tell which of these features are meaningful.
• Also, it is hard to see overall patterns; for example, which distribution do you
think has the higher mean?
• These problems can be mitigated by binning the data; that is, dividing the range
of values into non-overlapping intervals and counting the number of values in each
bin.
• Binning can be useful, but it is tricky to get the size of the bins right. If they
are big enough to smooth out noise, they might also smooth out useful information.
[Figure annotation: Bins, bin size = 10]
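Binning can be sketched in plain Python. The example below reuses the "Actual" marks from the earlier PMF slides and a bin size of 10; labeling each bin by its lower edge is a convention chosen here for illustration.

```python
from collections import Counter

# the "Actual" marks from the earlier PMF slides
values = [5, 14, 16, 46, 34, 5, 6, 46, 46, 56, 54, 54, 34, 5]
bin_size = 10

# map each value to the lower edge of its bin, e.g. 46 -> 40
binned = Counter((v // bin_size) * bin_size for v in values)

# normalize the bin counts to get a binned PMF
n = len(values)
binned_pmf = {edge: count / n for edge, count in sorted(binned.items())}

print(binned_pmf)
```

Trying a few values of `bin_size` shows the trade-off: larger bins give a smoother picture but hide detail inside each bin.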

Percentiles
• A percentile is a term that describes how a score compares to other scores from
the same set.
• It shows the percentage of scores that fall below the given number.
Ex: If a student is in the 90th percentile, it means that he/she scored higher
than 90% of the people who took the exam.
• It is a measure used in statistics indicating the value below which a given
percentage of observations in a group of observations fall.

Percentiles . . .

Ex: Calculating Percentile (a percentile will be in the range 0 – 100)

$$K = \frac{\text{Index}}{N} \times 100$$

Ex: 30 33 43 55 69 72 85 88 91 93
30 is the 10th percentile, 33 is the 20th percentile, 88 is the 80th percentile,
and 93 is the 100th percentile.

• Ex: 30 33 43 55
30 is the 25th percentile, 33 is the 50th percentile, 43 is the 75th percentile,
and 55 is the 100th percentile.

Percentiles . . .
Procedure to Calculate Percentile from Index
• Given the Index of an object in the dataset (its position in sorted order, counting
from 1), the following are the steps to find its Percentile K ( 0 ≤ K ≤ 100).
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
2) Compute Percentile K
$$K = \frac{\text{Index}}{N} \times 100$$

Percentiles . . .
Procedure to Calculate Kth Percentile on Dataset
• The Kth percentile is a value in a dataset that divides the data into two parts.
The lower part contains K% of data, and the upper part contains rest of the
data.
• The following are the steps to calculate the Kth Percentile ( 0 ≤ K ≤ 100).
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
2) Compute Index
$$\text{Index} = \frac{K}{100} \times N$$
3) If Index is not a whole number, round to the nearest whole number.
$$K^{th}\ \text{Percentile} = D[\text{Index}]$$
4) If Index is a whole number
$$K^{th}\ \text{Percentile} = \frac{D[\text{Index}] + D[\text{Index}+1]}{2}$$
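The steps above can be sketched as a small function. This is one possible reading of the procedure, with several assumptions labeled in the comments: D is indexed from 1 as in the slides, a fractional Index of exactly .5 rounds up, and Index values that fall outside the dataset (for K = 0 or K = 100) are clamped to the first or last element. The name `kth_percentile` is hypothetical.

```python
import math

def kth_percentile(data, k):
    # step 1: sort ascending
    d = sorted(data)
    n = len(d)

    # step 2: Index = K/100 * N (a 1-based position, as in the slides)
    index = k * n / 100

    if index != int(index):
        # step 3: round to the nearest whole number
        # (assumption: a fraction of exactly .5 rounds up)
        i = math.floor(index + 0.5)
        # assumption: clamp to a valid 1-based position
        i = min(max(i, 1), n)
        return d[i - 1]

    i = int(index)
    # assumption: at the extremes there is no neighbor to average with
    if i == 0:
        return d[0]
    if i >= n:
        return d[n - 1]
    # step 4: average D[Index] and D[Index + 1] (1-based)
    return (d[i - 1] + d[i]) / 2

D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
print("50th Percentile is", kth_percentile(D, 50))
```

Library routines such as `numpy.percentile` interpolate between neighboring values by default, so they may return slightly different numbers than this procedure for some values of K.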

Percentiles . . .

Ex: Calculating 50th Percentile (Data should be ordered ascending)

Ex: 30 33 43 55 69 72 85 88 91 93

K = 50
N = 10

Index = K/100 * N
= 50/100 * 10
= 5

Index is a whole number, so
Kth Percentile = ( D[5] + D[6] ) / 2
= (69 + 72)/2
= 70.5

Percentiles . . .
# importing numpy module
import numpy

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]

# set K value
K = 50

# computing Kth percentile of Dataset
P = numpy.percentile(D, K)
print(f"{K}th Percentile is {P}")

Output:
50th Percentile is 70.5

Percentiles . . .
Procedure to Calculate Percentile Rank of Score “X”
• The following are the steps to calculate the Percentile Rank.
1) Let “D” be the dataset sorted in ascending order. Let dataset contain “N”
objects.
2) Compute Percentile Rank of value “X”

$$\text{Percentile\_Rank} = \frac{\#\,\text{Values} < X \text{ in } D}{N} \times 100$$

Ex: D = 30 33 43 55 69 72 85 88 91 93

Calculating Percentile Rank of 90

$$\text{Percentile\_Rank} = \frac{\#\,\text{Values} < 90}{10} \times 100 = \frac{8}{10} \times 100 = 80.0$$

Percentiles . . .
# Function to calculate Percentile Rank from Score
def PercentileRank(scores, my_score):
    scores.sort()
    count = 0
    for score in scores:
        if score <= my_score:
            count += 1
    P_Rank = 100.0 * count / len(scores)
    return P_Rank

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
my_score = 90

# computing Percentile Rank
P_Rank = PercentileRank(D, my_score)
print("Your Percentile Rank is", P_Rank)
print(f"You performed better than {P_Rank}% of the Candidates who Attempted the Exam")

Output:
Your Percentile Rank is 80.0
You performed better than 80.0% of the Candidates who Attempted the Exam

Percentiles . . .
Procedure to Calculate Percentile from Percentile Rank
• The following are the steps to calculate the Percentile from Percentile Rank.
1) Let ‘D’ be the dataset sorted in ascending order. Let dataset contain ‘N’
objects.
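2) Compute Index (here D is indexed from 0, and any fractional part of Index is
dropped)
$$\text{Index} = \frac{\text{P\_Rank} \times (N - 1)}{100}$$

3) $\text{Percentile} = D[\text{Index}]$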

Percentiles . . .
# Function to compute Percentile from Percentile Rank
def Percentile(scores, P_Rank):
    scores.sort()
    index = int(P_Rank * (len(scores) - 1) / 100)
    return scores[index]

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
P_Rank = 80

# computing Percentile from Percentile Rank
percentile = Percentile(D, P_Rank)
print("Your Percentile is", percentile)

Output:
Your Percentile is 88

Cumulative Distribution Functions - CDFs
• CDF is the function that maps from a value to its percentile rank.
• CDF is a function of X, where X is any value that might appear in the
distribution.
• To evaluate CDF(X) for a particular value of X, we compute the fraction of values
in the distribution ≤X.
• This function is almost identical to Percentile Rank, except that the result is a
probability in the range 0–1 rather than a percentile rank in the range 0–100.

• Ex: 30 33 43 55 (CDF will be in the range 0 – 1)

CDF(30) = 0.25
CDF(33) = 0.5
CDF(43) = 0.75
CDF(55) = 1.0

Cumulative Distribution Functions – CDFs . . .

• Ex: 30 33 43 55

CDF(30) = 0.25, CDF(33) = 0.5, CDF(43) = 0.75, CDF(55) = 1.0
X < 30: CDF(X) = 0
X ≥ 55: CDF(X) = 1.0
CDF(32) = CDF(30) = 0.25
CDF(49) = CDF(43) = 0.75

• CDF can be evaluated for any value of X, whether or not it appears in the
dataset. With D sorted in ascending order and indexed from 0:
• If X < D[0] , then CDF(X) = 0.
• If X ≥ D[N−1] , then CDF(X) = 1.
• If D[I] ≤ X < D[I+1] , then CDF(X) = CDF(D[I]).

Cumulative Distribution Functions – CDFs . . .
# Function to Evaluate CDF
def EvalCDF(D, X):
    count = 0.0
    for value in D:
        if value <= X:
            count += 1
    prob = count / len(D)
    return prob

# creating data
D = [30, 33, 43, 55]
X = 33

# computing CDF
cdf = EvalCDF(D, X)
print("Your CDF is", cdf)

Output:
Your CDF is 0.5

Representing CDFs
import numpy as np
import matplotlib.pyplot as plt

# creating data
D = [30, 33, 43, 55]
N = len(D)

# sorting data
X = np.sort(D)

# calculating cumulative probabilities: the i-th smallest of N values has CDF i/N
Y = np.arange(1, N + 1) / N

# plotting CDF
plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X, Y)
plt.show()

Comparing CDFs
import numpy as np
import matplotlib.pyplot as plt

# creating multiple datasets

D1 = [30, 33, 43, 55]
D2 = [20, 28, 37, 45]

N1 = len(D1)
N2 = len(D2)

# sorting data
X1 = np.sort(D1)
X2 = np.sort(D2)

# calculating cumulative probabilities: the i-th smallest of N values has CDF i/N
Y1 = np.arange(1, N1 + 1) / N1
Y2 = np.arange(1, N2 + 1) / N2

Comparing CDFs . . .
# plotting multiple CDFs
plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X1, Y1)
plt.plot(X2, Y2)
plt.show()

Percentile-Based Statistics
• Percentile
• Percentile Rank
• Percentile can be used to compute percentile-based summary statistics.
– For example, the 50th percentile is the value that divides the distribution in
half, also known as the median. Like the mean, the median is a measure of
the central tendency of a distribution.
– Another percentile-based statistic is the InterQuartile Range (IQR), which
is a measure of the spread of a distribution. The IQR is the difference
between the 75th and 25th percentiles.
• Percentiles are often used to summarize the shape of a distribution.
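Both statistics can be sketched with `numpy.percentile`, the same function used on the earlier percentile slide, applied to the same ten scores. The names `q1`, `q3` and `iqr` are illustrative; note that NumPy interpolates between neighboring values by default, so the quartiles need not be values that appear in the data.

```python
import numpy as np

D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]

# the median is the 50th percentile
median = np.percentile(D, 50)

# the IQR is the difference between the 75th and 25th percentiles
q1 = np.percentile(D, 25)
q3 = np.percentile(D, 75)
iqr = q3 - q1

print("Median:", median)
print("IQR:", iqr)
```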

Random Numbers
import numpy as np
import matplotlib.pyplot as plt
import random

# creating data
D = [30, 33, 43, 55, 69, 72, 85, 88, 91, 93]
sample = random.sample(D , 4)
N = len(sample)

# sorting data
X = np.sort(sample)

# calculating cumulative probabilities: the i-th smallest of N values has CDF i/N
Y = np.arange(1, N + 1) / N

# plotting CDF of random sample

plt.xlabel('Data')
plt.ylabel('CDF')
plt.title('CDF of Data')
plt.plot(X, Y)
plt.show()

Comparing Percentile Ranks
• Percentile ranks are useful for comparing measurements across different groups.
• Ex: People who compete in foot races are usually grouped by age and gender.
To compare people in different age groups, you can convert race times to
percentile ranks.

Comparing Percentile Ranks . . .
Ex: A few years ago I ran the James Joyce Ramble 10K in Dedham MA. I finished in
42:44, which was 97th in a group of 1633. I beat or tied 1537 runners out of 1633,
so my percentile rank in the field is 94%.

# Function to find the Percentile Rank of a runner from the Finishing Position and
# Group Size of the race
def PositionToPercentile(position, group_size):
    beat = group_size - position + 1
    P_Rank = 100.0 * beat / group_size
    return P_Rank

P_Rank = PositionToPercentile(97, 1633)

print("Your Percentile Rank is", round(P_Rank, 2))

Output:
Your Percentile Rank is 94.12

Comparing Percentile Ranks . . .
• Ex: In my age group, “males between 40 and 49 years of age”, I came in 26th
out of 256, so my percentile rank in my age group is 90%.
• If I am still running in 10 years, I will be in the age group of “50 to 59 years”.
Assuming that my percentile rank in my division is the same and the field size of
the “50 to 59 years” age group is 171, in which position should I finish the
race?

# Function to find the Finishing Position from the Percentile Rank and Group Size
# of the race
def PercentileToPosition(percentile, group_size):
    beat = percentile * group_size / 100.0
    position = group_size - beat + 1
    return position

position = PercentileToPosition(90, 171)

print("Your Position is", round(position, 2))

Output:
Your Position is 18.1

Modeling Distributions
• Empirical Distributions are based on empirical observations, which are
necessarily finite samples.
• An Analytic Distribution is characterized by a CDF that is a mathematical
function. Analytic distributions can be used to model empirical distributions. In
this context, a model is a simplification that leaves out unneeded details.
• Some of the common analytic distributions are:
– Exponential Distribution
– Normal Distribution
– Lognormal Distribution

Exponential Distribution
• In the real world, exponential distributions come up when we look at a series of
events and measure the times between events, called interarrival times. If
the events are equally likely to occur at any time, the distribution of interarrival
times tends to look like an exponential distribution.
• The exponential distribution is a continuous probability distribution that often
concerns the amount of time until some specific event happens. It is a process
in which events happen continuously and independently at a constant average
rate.

Exponential Distribution
• The Exponential distribution is a continuous distribution bounded on the lower
side. Its shape is always the same, starting at a finite value at the minimum and
continuously decreasing at larger x. The Exponential distribution decreases
rapidly for increasing x. Following figure shows PDF of x.

Exponential Distribution . . .
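• The CDF of the exponential distribution is

$$CDF(x) = 1 - e^{-\lambda x}$$

for x ≥ 0, where λ is the rate parameter that determines the shape of the
distribution.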
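A minimal sketch of evaluating this CDF, assuming the standard form CDF(x) = 1 - e^(-λx) for x ≥ 0 with rate parameter λ. The function name `exponential_cdf` and the rate value used below are illustrative choices.

```python
import math

def exponential_cdf(x, lam):
    # the distribution is bounded below at 0
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-lam * x)

# illustrative rate: on average 2 events per unit of time
lam = 2.0

for x in [0.0, 0.5, 1.0, 2.0]:
    print("CDF(", x, ") =", exponential_cdf(x, lam))
```

The CDF reaches 0.5 at x = ln(2) / λ, which is the median interarrival time under this model.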

Normal (Gaussian) Distribution

• The normal distribution, also called Gaussian, is commonly used because it
describes many phenomena, at least approximately.

• It is a type of continuous probability distribution for a real-valued random
variable. The general form of its probability density function is

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$

• The normal distribution is characterized by two parameters: the mean μ, and
standard deviation σ.

• The normal distribution with μ = 0 and σ = 1 is called the standard normal
distribution. Its CDF is defined by an integral that does not have a closed form
solution, but there are algorithms that evaluate it efficiently.
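One such evaluation can be sketched with the error function from the Python standard library, using the identity CDF(x) = (1 + erf((x - μ) / (σ√2))) / 2. The function name `normal_cdf` and its default arguments are illustrative.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # standardize, then apply the error-function identity
    z = (x - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# standard normal: mu = 0, sigma = 1
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print("CDF(", x, ") =", normal_cdf(x))
```

Half of the probability lies below the mean, so the CDF evaluated at μ is 0.5 for any σ.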

Normal (Gaussian) Distribution . . .
• A normal distribution is sometimes informally called a bell curve. The red curve
is the standard normal distribution.

Normal (Gaussian) Distribution . . .
• Figure shows CDFs for normal distributions with a range of parameters. The
sigmoid shape of these curves is a recognizable characteristic of a normal
distribution.

Lognormal Distribution
• If the logarithms of a set of values have a normal distribution, the values have a
lognormal distribution.
• The CDF of the lognormal distribution is the same as the CDF of the normal
distribution, with log x substituted for x.
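The substitution can be sketched directly: take the normal CDF and pass it log x in place of x. In this sketch `mu` and `sigma` are assumed to be the mean and standard deviation of the logarithms (natural log), and the function name `lognormal_cdf` is illustrative.

```python
import math

def lognormal_cdf(x, mu=0.0, sigma=1.0):
    # the lognormal distribution has no probability at or below 0
    if x <= 0:
        return 0.0
    # normal CDF with log x substituted for x
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

for x in [0.5, 1.0, 2.0, 5.0]:
    print("CDF(", x, ") =", lognormal_cdf(x))
```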

Lognormal Distribution . . .
• If a sample is approximately lognormal and you plot its CDF on a log-x scale, it
will have the characteristic shape of a normal distribution.
• Left figure shows the distribution of adult weights on a linear scale with a
normal model. Right figure shows the same distribution on a log scale with a
lognormal model. The lognormal model is a better fit.

Lognormal Distribution . . .
• Figure shows normal probability plots for adult weights, w, and for their
logarithms, log10 w. Now it is apparent that the data deviate substantially from
the normal model.
• The lognormal model is a good match for the data within a few standard
deviations of the mean, but it deviates in the tails. It can be concluded that the
lognormal distribution is a good model for this data.
