CE-613 - DOC - 02 Descriptive Stat, Frequency Plot
https://stats.stackexchange.com/posts/comments/180175
Histograms Types
• Primary objective
• To visualize the shape of the distribution: skewness, multimodality
• To identify gaps, outliers, impossible or suspicious values
• Constructed by binning the data and counting total observations in
each bin
• The number of bins needs to be
• Large enough to reveal interesting features
• Small enough not to be too noisy
• A very small bin width can be used to look for rounding or heaping
Finding the Total Cases Within a Bin
• Bin count or Frequency Histogram
• Height of the Bar = Count within a Bin
• Like a bar chart for categorical variable
• Height (already expressed as proportion/count) sums
to 100% (or the total)
• More easily interpreted
• Density (Counts per unit Width) Histogram
• Height of the Bar = Density
• Count within a Bin = (Density)*(Bin width) = Area
• Areas sum to the total count (or to 100% if density is expressed as proportion per unit width)
• More suited for comparison to mathematical density
models
• Possible to show the density curve of the population
using the same vertical scale
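The relationship between bar height and bar area can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `histogram` helper, the `scores` sample, and the bin `edges` are all hypothetical, and density is taken here as proportion per unit width so that the areas sum to 1.

```python
from bisect import bisect_right

def histogram(data, edges):
    """Return (counts, densities) for bins [edges[i], edges[i+1]).

    Density is computed as proportion per unit width (an assumption),
    so density * bin width gives the proportion of data in the bin.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
        elif x == edges[-1]:
            counts[-1] += 1  # put the right-most edge in the last bin
    total = sum(counts)
    densities = [c / (total * (hi - lo))
                 for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, densities

scores = [25, 35, 45, 50, 55, 65, 70, 72, 78, 90]  # hypothetical data
edges = [20, 40, 60, 80, 100]                      # hypothetical bin edges

counts, densities = histogram(scores, edges)
# Area of each bar = density * bin width
areas = [d * (hi - lo) for d, lo, hi in zip(densities, edges, edges[1:])]

print(counts)      # heights of the frequency histogram
print(densities)   # heights of the density histogram
print(sum(areas))  # total area of the density histogram
```

With equal bin widths the two histograms have the same shape; only the vertical scale differs.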
Same Data = Four Representations
• Number of observations or
counts can also be represented as
% counts of the total
• Absolute frequency is just the
natural count of occurrences
in each bin,
• Relative frequency is the
proportion (or %) of
occurrences in each bin
• The labels on y-axis will tell us
whether it is a histogram or
density plot
Prob-1
Given that total count = 1000
Find using all four graphs
(1 A, 1 B, 2 A, 2 B):
(1) How many people scored
between 60-80?
(2) What proportion of people
scored ≤ 60?
(3) How many more people
scored between 60-80 than
between 20-40?
Solution to Prob-1
Given that total count = 1000 Find using all four
graphs(1 A, 1 B, 2 A, 2 B):
(1) how many people scored between 60-80
1 A) 400
1 B) 20*20 = 400
2 A) (40/100)*1000 = 400
2 B) (0.02)*20*1000 = 400
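The four readings for the 60-80 bin are conversions of one count. The sketch below assumes the graphs show, in order, count, count per unit width, relative frequency, and relative frequency per unit width; that mapping is inferred from the arithmetic in the solution, and the function name is hypothetical.

```python
def four_representations(count, total, width):
    """Express one bin's count in the four vertical scales."""
    return {
        "count": count,                             # absolute frequency
        "count_density": count / width,             # count per unit width
        "relative_frequency": count / total,        # proportion of the total
        "relative_density": count / total / width,  # proportion per unit width
    }

# Bin 60-80 from Prob-1: 400 of 1000 observations, bin width 20
rep = four_representations(count=400, total=1000, width=20)

# Going back from each scale to the count
from_count_density = rep["count_density"] * 20
from_relative_frequency = rep["relative_frequency"] * 1000
from_relative_density = rep["relative_density"] * 20 * 1000

print(rep)
print(from_count_density, from_relative_frequency, from_relative_density)
```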
Solution: Prob2) Estimating from a histogram
(a) Which block represents the people who scored between 60 and 80?
• Third block
(b) 10% scored between 20 and 40. About what % scored between 40 and 60?
• 0.5%*20 = 10%. Hence there were 1.0%*20 or 20% of cases between 40 and 60
(c) What percentage scored exactly 70%?
• Approximately 0?
(d) What % scored below 60?
• 30%
Accuracy Depends on the Sample Size
Histogram and Cumulative Density
Histogram of Linearly Transformed Data
$x_{\text{transformed}} = \dfrac{x - \text{mean}(x)}{\text{sd}(x)}$
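A minimal sketch of this transformation, assuming that sd is the sample standard deviation (divisor n − 1); the slide does not say which version is meant. The function name and the sample values are illustrative only.

```python
from statistics import mean, stdev

def standardize(xs):
    """Subtract the mean and divide by the standard deviation."""
    m = mean(xs)
    s = stdev(xs)  # assumption: sample sd with divisor (n - 1)
    return [(x - m) / s for x in xs]

x = [2.5, 3.5, 2.2, 3.2, 2.9]  # illustrative values
z = standardize(x)

print(z)
print(mean(z), stdev(z))  # close to 0 and 1, up to rounding error
```

Because the transformation is linear, it only shifts and rescales the horizontal axis; the shape of the histogram does not change if the bin edges are transformed the same way.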
Histogram and Density Plot
Histogram and Approximation/Flaw
• The histogram makes it appear as if the frequency in each class is
uniformly distributed over that class
• Please comment
(Line) Density plot
• See any problem/approximation?
Outlier Introduced
• Why?
Median is Robust to outliers (Reason)
Original data
• 76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94, 98, 101, 103, 103, 103, 104,104, 106,
108, 109, 112, 113, 114, 116, 116, 118, 118, 119
• 76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94
• 94, 98, 101, 103, 103, 103, 104, 104, 106, 108, 109, 112, 113, 114, 116, 116, 118, 118, 119
• Median = (94+94)/2 = 94
• Mean = 97.58
With outlier
• 76, 78, 81, 82, 1000, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94, 98, 101, 103, 103, 103, 104,104,
106, 108, 109, 112, 113, 114, 116, 116, 118, 118, 119
• 76, 78, 81, 82, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94
• 98, 101, 103, 103, 103, 104, 104, 106, 108, 109, 112, 113, 114, 116, 116, 118, 118, 119, 1000
• Median = (94+98)/2 = 96
• Mean = 121.68
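The comparison can be reproduced with Python's `statistics` module. The data are those listed above; replacing the first 84 with 1000 mirrors the "With outlier" list.

```python
from statistics import mean, median

original = [76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91,
            92, 92, 92, 94, 94, 98, 101, 103, 103, 103, 104, 104, 106,
            108, 109, 112, 113, 114, 116, 116, 118, 118, 119]

# Replace the first 84 with the outlier 1000, as in the slide
with_outlier = original.copy()
with_outlier[with_outlier.index(84)] = 1000

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(label, "median =", median(data), "mean =", round(mean(data), 2))
```

The median moves only from one middle pair to the next, while the mean absorbs the full size of the outlier divided by n.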
Truncated/Trimmed Mean
• Combines the advantages of the mean and the median
• Steps:
• Remove points with values below the $x$-th percentile and above the $(100 - x)$-th percentile
• Compute the mean of the remaining values
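The steps above can be sketched as follows. This version approximates the percentile cut by dropping the same whole number of points, floor(n·x/100), from each end of the sorted data. That is one common convention and an assumption here, not necessarily the one intended in the slides. The sample is hypothetical.

```python
from math import floor

def trimmed_mean(xs, x_percent):
    """Mean after dropping the lowest and highest x_percent of the points."""
    if not 0 <= x_percent < 50:
        raise ValueError("x_percent must be in [0, 50)")
    s = sorted(xs)
    k = floor(len(s) * x_percent / 100)  # points dropped from each end
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]  # hypothetical, one outlier

print(sum(sample) / len(sample))  # ordinary mean, pulled up by the outlier
print(trimmed_mean(sample, 10))   # 10% trimmed mean
```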
$\bar{X}$, $X_m$, $X_d$ for Categorical/Ordinal data
• Can we estimate the three central tendencies?
• $\bar{X} = \dfrac{4 \cdot 5 + 3 \cdot 11 + 2 \cdot 18 + 1 \cdot 10 + 0 \cdot 6}{50} = \dfrac{99}{50} = 1.98 \rightarrow$ C
• Cumulative values = 5, 16, 34, 44, 50
• $X_m$ = C, since 32% of the grades are A or B and 32% are D or F
• $X_d$ = C (the highest frequency)
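A short sketch of the same three calculations. The grade counts are taken from the cumulative values (5, 11, 18, 10, 6), and the grade-point mapping A = 4 through F = 0 is inferred from the weights in the mean; the variable names are illustrative.

```python
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}  # inferred mapping
counts = {"A": 5, "B": 11, "C": 18, "D": 10, "F": 6}

n = sum(counts.values())

# Mean: weighted average of grade points
mean_points = sum(grade_points[g] * c for g, c in counts.items()) / n

# Median: first grade at which the cumulative count reaches half the sample
cumulative = 0
median_grade = None
for grade, c in counts.items():
    cumulative += c
    if cumulative >= n / 2:
        median_grade = grade
        break

# Mode: grade with the highest frequency
mode_grade = max(counts, key=counts.get)

print(n, mean_points, median_grade, mode_grade)
```

The mean is only meaningful here because the ordered grades have been given numeric points; for purely nominal categories only the mode applies.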
Problems with single numbers
• It is important to show “variation” in data
Dispersion Measure
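• Variance (Just remember “variation”)
$S^2 = \dfrac{1}{n-1} \sum (x_i - \bar{X})^2$
• Also, $S^2 = \dfrac{1}{n-1} \left[ \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right]$
• Standard Deviation $S = \sqrt{\dfrac{1}{n-1} \left[ \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right]}$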
• Since S uses the sample mean, the result has “lost” one degree of freedom
• Only (n − 1) independent observations remain to calculate the variance
• For small samples, dividing by n tends to underestimate the variance; dividing by (n − 1) corrects for this
• Coefficient of variation, $\text{COV} = S / \bar{X}$
• Expresses the standard deviation as a % of the average value
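The two variance formulas are equivalent. The short derivation below expands the squared deviations and uses $\bar{X} = \frac{1}{n}\sum x_i$; it is added as a worked step and is not part of the original slides.

```latex
\begin{align*}
\sum (x_i - \bar{X})^2
  &= \sum x_i^2 - 2\bar{X} \sum x_i + n\bar{X}^2 \\
  &= \sum x_i^2 - \frac{2}{n}\Big(\sum x_i\Big)^2 + \frac{1}{n}\Big(\sum x_i\Big)^2 \\
  &= \sum x_i^2 - \frac{1}{n}\Big(\sum x_i\Big)^2
\end{align*}
```

Dividing both sides by $(n - 1)$ gives the second form of $S^2$.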
Prob-04) Dispersion Measures of Concrete Strength
A sample of five tests was taken to determine the compression strength (in ksi) of concrete. Test results are 2.5, 3.5, 2.2, 3.2, and 2.9 ksi. Compute the variance, standard deviation, and COV of concrete strength.
• $\bar{X} = \dfrac{2.5 + 3.5 + 2.2 + 3.2 + 2.9}{5} = 2.86$ ksi
• $S^2 = \dfrac{2.5^2 + 3.5^2 + 2.2^2 + 3.2^2 + 2.9^2 - \frac{1}{5}(2.5 + 3.5 + 2.2 + 3.2 + 2.9)^2}{5 - 1} = 0.273$ ksi$^2$
• $S = \sqrt{0.273} = 0.522$ ksi
• $\text{COV}(X) = \dfrac{S}{\bar{X}} = \dfrac{0.522}{2.86} = 0.183 = 18.3\%$
• COV is large ⇒ average value is not reliable
• Additional measurements might be needed
• If, with additional measurements, the COV remains large, relatively large factors of safety
should be used for projects that use the concrete
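The hand calculation can be checked with Python's `statistics` module, whose `variance` and `stdev` functions use the (n − 1) divisor. The shortcut formula is evaluated alongside as a cross-check; variable names are illustrative.

```python
from statistics import mean, stdev, variance

strengths = [2.5, 3.5, 2.2, 3.2, 2.9]  # ksi, from Prob-04

x_bar = mean(strengths)
s2 = variance(strengths)  # sample variance, divisor (n - 1)
s = stdev(strengths)      # sample standard deviation
cov = s / x_bar           # coefficient of variation

# Shortcut form: [sum(x^2) - (sum x)^2 / n] / (n - 1)
n = len(strengths)
s2_shortcut = (sum(x * x for x in strengths) - sum(strengths) ** 2 / n) / (n - 1)

print(round(x_bar, 2), round(s2, 3), round(s2_shortcut, 3))
print(round(s, 3), round(cov, 3))
```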
Percentiles
• The $p$-th percentile value ($x_p$) of a parameter or variable, based on a sample, is the value such that $p$% of the data are less than or equal to $x_p$
• The median is the 50th percentile value
• It is common in engineering to be interested in the 10th, 25th, 50th, 75th, and 90th percentile values of a variable
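A direct reading of this definition is sketched below: the function returns the smallest sample value with at least p% of the data at or below it. This is only one of several conventions; statistical software often interpolates between neighbouring values, so results can differ slightly. The function name and sample are hypothetical.

```python
from math import ceil

def percentile(xs, p):
    """Smallest sample value with at least p% of the data <= it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    s = sorted(xs)
    k = ceil(p * len(s) / 100)  # number of points that must lie at or below x_p
    return s[k - 1]

sample = [12, 15, 11, 19, 14, 18, 13, 17, 16, 20]  # hypothetical data

for p in (10, 25, 50, 75, 90):
    print(p, percentile(sample, p))
```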
Example
Data pertaining to hours of annual operations by toll booth operators was collected. An analytical model (normal probability density function) was fit to the data (as shown in the figure below). The solid line represents the assumed frequency model for the population.
• The calculation of the percentile values, in this example, requires knowledge of the equation of the model and probabilistic analysis
• The area under this model is a measure of likelihood or probability. The total area under the model is 1
• The 25 percentile value, such that the area under the model up to this value is 0.25, is 702 h
• Likewise, the 75, 90, and 10 percentile values are 1850, 2244, and 578 h, respectively
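When the model is a normal density, percentile values come from inverting its cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library. The mean and standard deviation are hypothetical placeholders, not the parameters of the fitted model in this example, so the printed values are not expected to match those quoted above.

```python
from statistics import NormalDist

# Hypothetical parameters (hours); NOT the fitted model from the example
model = NormalDist(mu=1200, sigma=800)

levels = [10, 25, 50, 75, 90]
# x_p is the value with area p/100 under the density to its left
percentiles = {p: model.inv_cdf(p / 100) for p in levels}

for p, x_p in percentiles.items():
    print(f"{p}th percentile: {x_p:.0f} h, area up to it: {model.cdf(x_p):.2f}")
```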
Boxplot
Box and Whisker Plot (or just Box plot)
• Graphical method for showing the distribution of sampled data, including the central tendency (mean and median), dispersion, percentiles (i.e., 10, 25, 75, and 90 percentiles), and the extremes (minimum and maximum)
• Can show bias relative to a standard (reference) value
• When comparing multiple plots, the relative sample size can be depicted by the width of each box
• A box plot requires the following to be computed
1. Mean and median of the sample
2. Minimum and maximum of the sample
3. 75 percentile (upper boundary of the box) and 25 percentile (lower boundary of the box)
4. 90 and 10 percentile values, where bars one half the width of the box are placed perpendicular to the whiskers
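The quantities listed above can be gathered in one pass, as in this sketch. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates between data points; other percentile conventions give slightly different values. The function name and the evenly spaced sample (chosen so that the percentiles are easy to check by eye) are hypothetical.

```python
from statistics import mean, median, quantiles

def box_plot_summary(xs):
    """Values needed to draw the box plot described above."""
    # 99 cut points; cuts[p - 1] is the p-th percentile
    cuts = quantiles(xs, n=100, method="inclusive")
    return {
        "mean": mean(xs),
        "median": median(xs),
        "min": min(xs),
        "max": max(xs),
        "p10": cuts[9],   # lower bar on the whisker
        "p25": cuts[24],  # lower boundary of the box
        "p75": cuts[74],  # upper boundary of the box
        "p90": cuts[89],  # upper bar on the whisker
    }

sample = list(range(101))  # hypothetical data: 0, 1, ..., 100
summary = box_plot_summary(sample)

for name, value in summary.items():
    print(name, value)
```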
Example
• The figure shows box-and-whisker plot for the
maximum daily ozone concentration
• The mean and median values are 59 and 52 ppb
(parts per billion), respectively
• The 10, 25, 75, and 90 percentile points are 24,
36, 79, and 97 ppb, respectively
Box Plot versus Histogram
• Compare X and Y
Matching
References
• TEXT BOOKS
• David Freedman, Robert Pisani, Roger Purves (2004). Statistics, 4th edition. Norton and Company.
• Ayyub, Bilal M., McCuen, Richard H. (2012). Probability, statistics, and reliability for engineers and scientists, 3rd edition.
• URLS
• https://stats.stackexchange.com/q/135737/9583