CE-613 - DOC - 02 Descriptive Stat, Frequency Plot

Uploaded by

PRATHAM SINHA
Copyright © All Rights Reserved

Topics

• Visualize data distribution


• Descriptive Statistics
• Central tendency
• Dispersion measure
• Percentile measure
• Graphical representation of frequency
• Categorical (Nominal, Ordinal): Count for each case
• Numeric (Interval, Ratio): Count within each bin
• Analysis of simulated data
More Charts to Summarize Data
• Lollipop Chart
• Hybrid/combination
• Pareto chart: contains both bars and a line graph, where the line depicts the cumulative frequency
• Stacked Dot plot
• Stem-and-Leaf plot
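The cumulative line of a Pareto chart can be sketched as below. This is a minimal illustration only: the category names and counts are hypothetical, and `pareto_table` is a helper name introduced here, not something from the slides.

```python
def pareto_table(counts):
    """Sort categories by descending count and attach the cumulative percentage."""
    total = sum(counts.values())
    rows, running = [], 0
    for label, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((label, count, 100 * running / total))
    return rows

# Hypothetical defect counts, for illustration only
defects = {"crack": 50, "rust": 30, "dent": 15, "other": 5}

for label, count, cumulative_pct in pareto_table(defects):
    # The bars would show `count`; the line would show `cumulative_pct`
    print(f"{label:6s} {count:3d} {cumulative_pct:6.1f}%")
```

The bars are drawn in the sorted order, and the line always ends at 100%.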
Visualize Distribution
A. Histogram
B. Kernel Density
C. Boxplot
D. Violin
Ridge plots (frequency labels omitted)
Stem and Leaf Plot
• Data: 12, 15, 16, 21, 24, 29, 30, 31, 32, 33, 45, 46, 49, 50, 52, 58, 60, 63, 64, 65
• The decimal point is 1 digit(s) to the right of the |
1 | 256 # <-- 12, 15, 16
2 | 149 # <-- 21, 24, 29
3 | 0123 # <-- 30, 31, 32, 33
4 | 569 # <-- 45, 46, 49
5 | 028 # <-- 50, 52, 58
6 | 0345 # <-- 60, 63, 64, 65

• The same data with two stems merged per row (the decimal point is still 1 digit to the right of the |)
0 | 256 # stems 0 and 1
2 | 1490123 # stems 2 and 3
4 | 569028 # stems 4 and 5
6 | 0345 # stem 6
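The one-stem-per-row display can be reproduced with a short sketch. The function name `stem_and_leaf` and the `stem_unit` parameter are assumptions made for illustration; the sketch handles only non-negative integers with single-digit leaves.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group integers by stem (value // stem_unit); leaves are the remainders."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // stem_unit].append(v % stem_unit)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

data = [12, 15, 16, 21, 24, 29, 30, 31, 32, 33,
        45, 46, 49, 50, 52, 58, 60, 63, 64, 65]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {leaves}")
```

Unlike a histogram, the display keeps every original value readable.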
Pareto chart
Color/Shape/Size to replace multi-dimensionality
Lollipop Chart
Dot Plot – Step by Step
Dot Plot – Final
Dot Plot → Histogram
Histogram for Continuous (Interval/Ratio) Data
• Thickness of corroded steel plate
• DATA = {7.807, 8.886, 8.694, 8.185, 9.235, 8.526, 6.890, 8.953, 6.284, 6.533, 8.953, 8.112, 7.372, 9.640, 7.344, 8.837, 8.900, 9.048, 7.253, and 8.588}
Relative Frequency = Sample Probability

Effect of Bin Width on the Utility of Histogram
Effect of Initial Bin Boundary
• (A) = Histogram with bin width of 10, starting at 75
• (B) = Histogram with bin width of 10, starting at 67
• (C) = Overlaid plots
What can we do to avoid or mitigate such problems?
• Nick Cox: Details robust to variations in bin width and bin origin are likely to be
genuine; details fragile to such are likely to be spurious or trivial.
• At the least, you should always do histograms at several different binwidths or
bin-origins, or preferably both.
• Alternatively, check a kernel density estimate at not-too-wide a bandwidth.
• One other approach that reduces the arbitrariness of histograms is averaged
shifted histograms

https://stats.stackexchange.com/posts/comments/180175
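A quick way to follow this advice is to tabulate the same data under more than one bin origin. The sketch below is an assumption-laden illustration: `bin_counts` is a helper introduced here, and the 38 values are borrowed from the list used in the median example of these slides, not from the figure on this slide. The width of 10 and the origins 75 and 67 mirror panels (A) and (B).

```python
import math

def bin_counts(values, width, origin):
    """Count values in bins [origin + k*width, origin + (k+1)*width)."""
    counts = {}
    for v in values:
        k = math.floor((v - origin) / width)
        left_edge = origin + k * width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

scores = [76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92,
          92, 92, 94, 94, 98, 101, 103, 103, 103, 104, 104, 106, 108, 109,
          112, 113, 114, 116, 116, 118, 118, 119]

for origin in (75, 67):
    print(f"origin {origin}:", bin_counts(scores, width=10, origin=origin))
```

The same 38 values give five bins from one origin and six from the other, with a differently shaped profile, which is why features that change with the origin deserve suspicion.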
Histogram Types
• Primary objective
• To visualize the shape of the distribution: skewness, multimodality
• To identify gaps, outliers, impossible or suspicious values
• Constructed by binning the data and counting total observations in
each bin
• The number of bins needs to be
• Large enough to reveal interesting features
• Small enough not to be too noisy
• A very small bin width can be used to look for rounding or heaping
Finding the Total Cases Within a Bin
• Bin count or Frequency Histogram
• Height of the Bar = Count within a Bin
• Like a bar chart for categorical variable
• Heights (expressed as counts or proportions) sum to the total count (or to 100%)
• More easily interpreted
• Density (Counts per unit Width) Histogram
• Height of the Bar = Density
• Count within a Bin = (Density)*(Bin width) = Area
• Areas sum to the total count (or to 100% when density is a proportion per unit width)
• More suited for comparison to mathematical density models
• Possible to show the density curve of the population using the same vertical scale
Same Data = Four Representations
• Number of observations or
counts can also be represented as
% counts of the total
• Absolute frequency is just the
natural count of occurrences
in each bin,
• Relative frequency is the
proportion (or %) of
occurrences in each bin
• The labels on y-axis will tell us
whether it is a histogram or
density plot
Prob-1
Given that total count = 1000
Find using all four graphs
(1 A, 1 B, 2 A, 2 B):
(1) How many people scored
between 60-80?
(2) What proportion of people
scored ≤ 60?
(3) How many more people
scored between 60-80 than
between 20-40?
Solution to Prob-1
Given that total count = 1000, find using all four graphs (1 A, 1 B, 2 A, 2 B):
(1) How many people scored between 60-80?
1 A) 400
1 B) 20*20 = 400
2 A) (40/100)*1000 = 400
2 B) (0.02)*20*1000 = 400

(2) What proportion of people scored ≤ 60?
1 A) 100 + 200 = 300
1 B) (5 + 10)*20 = 300
2 A) (10% + 20%)*1000 = 300
2 B) (0.5% + 1%)*20*1000 = 300
In every case the count is 300, so the proportion is 300/1000 = 30%

(3) How many more people scored between 60-80 than between 20-40?
1 A) 400 – 100 = 300
1 B) (20 - 5)*20 = 300
2 A) (40% - 10%)*1000 = 300
2 B) (2% - 0.5%)*20*1000 = 300
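The four conversions in the solution can be checked with a short sketch. The bar heights are the ones used in the worked solution; the helper `to_counts` and the dictionary names are introduced here for illustration only.

```python
TOTAL = 1000   # total count given in the problem
WIDTH = 20     # every bin spans 20 score points

# Bar heights for the bins 20-40, 40-60 and 60-80 on each vertical scale
graph_1a = {"20-40": 100, "40-60": 200, "60-80": 400}      # counts
graph_1b = {"20-40": 5, "40-60": 10, "60-80": 20}          # counts per score point
graph_2a = {"20-40": 0.10, "40-60": 0.20, "60-80": 0.40}   # proportions
graph_2b = {"20-40": 0.005, "40-60": 0.01, "60-80": 0.02}  # proportion per score point

def to_counts(heights, per_unit_width, as_proportion):
    """Convert bar heights from any of the four scales back to counts."""
    counts = {}
    for label, height in heights.items():
        value = height * WIDTH if per_unit_width else height
        counts[label] = value * TOTAL if as_proportion else value
    return counts

all_counts = {
    "1A": to_counts(graph_1a, per_unit_width=False, as_proportion=False),
    "1B": to_counts(graph_1b, per_unit_width=True, as_proportion=False),
    "2A": to_counts(graph_2a, per_unit_width=False, as_proportion=True),
    "2B": to_counts(graph_2b, per_unit_width=True, as_proportion=True),
}

for name, counts in all_counts.items():
    q1 = counts["60-80"]                                # count in 60-80
    q2 = (counts["20-40"] + counts["40-60"]) / TOTAL    # proportion <= 60
    q3 = counts["60-80"] - counts["20-40"]              # difference in counts
    print(name, round(q1), round(q2, 2), round(q3))
```

All four scales lead to the same three answers; only the arithmetic needed to read them differs.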
Histograms with unequal bin widths
• Possible but rarely a good idea unless it is a density histogram and not a frequency histogram
• In a frequency histogram, doing so would distort the perception of how many points are in each bin, since increasing a bin’s size will only make it look bigger
• Consistent bin sizes make measuring bar area and height equivalent
• A histogram sometimes appears to have different bin sizes if the borders are omitted
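One common way to keep unequal bins honest is to plot density (proportion per unit width), so that bar area rather than bar height carries the proportion. The sketch below uses hypothetical bin edges and counts, and `density_heights` is a name introduced here.

```python
def density_heights(edges, counts):
    """Bar height = proportion per unit width, so bar area equals the proportion."""
    total = sum(counts)
    return [c / total / (hi - lo)
            for c, lo, hi in zip(counts, edges[:-1], edges[1:])]

edges = [0, 10, 20, 40, 100]   # hypothetical, unequal bin widths
counts = [10, 20, 40, 30]      # hypothetical counts per bin

heights = density_heights(edges, counts)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:])]

for lo, hi, c, h in zip(edges[:-1], edges[1:], counts, heights):
    print(f"[{lo:3d}, {hi:3d})  count={c:2d}  density={h:.4f}")
print("total area:", sum(areas))
```

Here the 20-40 bin holds twice as many points as the 10-20 bin but is also twice as wide, so both bars have the same density height.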
Prob-2) Estimating from a histogram
The histogram below shows the distribution of final scores in a certain class.
(a) Which block represents the people who scored between 60 and 80?
(b) 10 % scored between 20 and 40. About what % scored between 40 and 60?
(c) What percentage scored exactly 70? Answer based on this histogram.
(d) What % scored below 60?

[Figure: histogram of final scores; horizontal axis from 20 to 100]
Solution: Prob-2) Estimating from a histogram
(a) Which block represents the people who scored between 60 and 80?
• Third block
(b) 10% scored between 20 and 40. About what % scored between 40 and 60?
• The 20-40 block has height 0.5% per point: 0.5%*20 = 10%. The 40-60 block has height 1.0% per point, hence 1.0%*20 = 20% of cases lie between 40 and 60
(c) What percentage scored exactly 70?
• Approximately 0?
(d) What % scored below 60?
• 10% + 20% = 30%
Accuracy Depends on the Sample Size
Histogram and Cumulative Density
Histogram of Linearly Transformed Data

$x_{transformed} = \dfrac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}$
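The transformation above is a standardization: it shifts the data to mean 0 and rescales it to standard deviation 1 without changing the shape of the histogram. A minimal sketch follows, assuming the sample standard deviation (n − 1 divisor) is intended; the five values are the concrete strengths from Prob-04 and `standardize` is a name introduced here.

```python
import statistics

def standardize(xs):
    """Return (x - mean) / sd for every x, using the sample standard deviation."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)   # assumption: n - 1 divisor
    return [(x - m) / s for x in xs]

x = [2.5, 3.5, 2.2, 3.2, 2.9]
z = standardize(x)

print([round(v, 3) for v in z])
print("mean of z:", round(statistics.mean(z), 10))
print("sd of z:  ", round(statistics.stdev(z), 10))
```

Because the transformation is linear with a positive slope, the ordering of the observations and the relative spacing between them are preserved.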
Histogram and Density Plot
Histogram and Approximation/Flaw
• The histogram makes it appear as if the frequency in each class is
uniformly distributed over that class
• Please comment
(Line) Density plot
• See any problem/approximation?

• Ans: Interpolation for values that do not exist in observations


Density Plot with Various binwidths (2 and 10)
Different Kernels (gaussian (red), epanechnikov (blue))
Percentage Scale
Density of Sample Data
Prob-3)
The following histograms depict the test scores in three different courses. The scores ranged from 0 to 100, with 50 as the passing score.
For which course was the percent who passed about
(1) 50%,
(2) well over 50%,
(3) well under 50%?
Central Tendency
• Mean
• $\bar{X} = \dfrac{1}{n} \sum x_i$
• Most commonly used (misused)
• Median $X_m$
• $X_m$ = point that divides the number of observations into two equal parts
• Robust to outliers
• Mode $X_d$
• Point or points with highest frequency (or relative frequency)
• Can be determined using a histogram
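The three measures can be computed with Python's standard `statistics` module. The seven values below are hypothetical and chosen so that the data has two modes.

```python
import statistics

values = [2, 3, 3, 5, 7, 7, 9]   # hypothetical sample

mean = statistics.mean(values)        # sum of the values divided by n
median = statistics.median(values)    # middle value of the sorted data
modes = statistics.multimode(values)  # every value tied for the highest frequency

print("mean:  ", round(mean, 3))
print("median:", median)
print("modes: ", modes)
```

`multimode` returns a list because, as noted above, more than one point can share the highest frequency.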
Properties of Mean and Variance
• $\mathrm{Mean} = \mu = E[X]$
• $\mathrm{Variance} = \mathrm{Var}(X) = \sigma^2 = E[X^2] - (E[X])^2$
• $E[aX + b] = aE[X] + b$
• $\mathrm{Var}(aX + b) = a^2 \mathrm{Var}(X)$
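A short derivation of the two linear-transformation properties, written out as a sketch using only linearity of expectation, with $\mu = E[X]$ and constants $a$, $b$:

```latex
\begin{aligned}
E[aX + b] &= a\,E[X] + b = a\mu + b \\
\mathrm{Var}(aX + b) &= E\!\left[\big(aX + b - (a\mu + b)\big)^2\right]
                      = E\!\left[a^2 (X - \mu)^2\right]
                      = a^2\,\mathrm{Var}(X)
\end{aligned}
```

The shift $b$ moves the mean but cancels out of the variance, while the scale $a$ enters the variance squared.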
Median is Robust to outliers
• 76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94, 98,
101, 103, 103, 103, 104,104, 106, 108, 109, 112, 113, 114, 116, 116, 118, 118,
119
• 76, 78, 81, 82, 1000, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94,
98, 101, 103, 103, 103, 104,104, 106, 108, 109, 112, 113, 114, 116, 116, 118,
118, 119

Outlier Introduced

• Why?
Median is Robust to outliers (Reason)
Original data
• 76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94, 98, 101, 103, 103, 103, 104,104, 106,
108, 109, 112, 113, 114, 116, 116, 118, 118, 119
• 76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94
• 94, 98, 101, 103, 103, 103, 104, 104, 106, 108, 109, 112, 113, 114, 116, 116, 118, 118, 119
• Median = (94+94)/2 = 94
• Mean = 97.58
With outlier
• 76, 78, 81, 82, 1000, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94, 98, 101, 103, 103, 103, 104,104,
106, 108, 109, 112, 113, 114, 116, 116, 118, 118, 119
• 76, 78, 81, 82, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92, 92, 92, 94, 94
• 98, 101, 103, 103, 103, 104, 104, 106, 108, 109, 112, 113, 114, 116, 116, 118, 118, 119, 1000
• Median = (94+98)/2 = 96
• Mean = 121.68
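The comparison above can be reproduced directly. The sketch replaces the first 84 in the original list with 1000, as on the slide, and uses only the standard `statistics` module.

```python
import statistics

original = [76, 78, 81, 82, 84, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91, 92,
            92, 92, 94, 94, 98, 101, 103, 103, 103, 104, 104, 106, 108, 109,
            112, 113, 114, 116, 116, 118, 118, 119]

with_outlier = list(original)
with_outlier[4] = 1000   # the fifth value (84) becomes the outlier

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(f"{label:13s} mean = {statistics.mean(data):7.2f}   "
          f"median = {statistics.median(data):5.1f}")
```

One extreme value moves the mean by about 24 units but the median by only 2, because the median depends on the rank of the middle observations rather than on the magnitude of every value.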
Truncated/Trimmed Mean
• Combines the advantages of both the mean and the median
• Steps:
• Remove points with values below the 𝑥-th percentile and above the (100 − 𝑥)-th percentile
• Estimate the mean of the remaining values
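The steps above can be sketched as follows. This version trims by count (the lowest and highest fraction of the sorted points), which is one simple way to approximate the percentile cut-offs; `trimmed_mean` and `trim_fraction` are names introduced here, and the data is the 38-value list with the outlier from the previous slides.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Drop the lowest and highest trim_fraction of the sorted values, then average."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * trim_fraction)   # points removed from each tail
    kept = xs[k:len(xs) - k]
    return statistics.mean(kept)

with_outlier = [76, 78, 81, 82, 1000, 84, 86, 86, 87, 88, 88, 88, 89, 91, 91,
                92, 92, 92, 94, 94, 98, 101, 103, 103, 103, 104, 104, 106,
                108, 109, 112, 113, 114, 116, 116, 118, 118, 119]

print("plain mean:      ", round(statistics.mean(with_outlier), 2))
print("10% trimmed mean:", trimmed_mean(with_outlier, 0.10))
print("median:          ", statistics.median(with_outlier))
```

With 10% trimming, three points are dropped from each tail, including the value 1000, and the result lands close to the median while still averaging most of the data.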
$\bar{X}$, $X_m$, $X_d$ for Categorical/Ordinal data
• Can we estimate the three central tendencies?
• $\bar{X} = \dfrac{4 \cdot 5 + 3 \cdot 11 + 2 \cdot 18 + 1 \cdot 10 + 0 \cdot 6}{50} = 1.98 \rightarrow$ C
• Cumulative values = 5, 16, 34, 44, 50
• $X_m$ = C, since 32% of grades are in A, B and 32% are in D, F
• $X_d$ = C (the highest frequency)
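The grade calculation can be checked with a short sketch. The mapping of counts to grades (A = 5, B = 11, C = 18, D = 10, F = 6) and the 4-to-0 point scale are read off the formula and the percentages above, so treat them as an inferred assumption.

```python
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
counts = {"A": 5, "B": 11, "C": 18, "D": 10, "F": 6}   # inferred from the slide

n = sum(counts.values())

# Mean: weight each grade's points by its count
mean_points = sum(grade_points[g] * c for g, c in counts.items()) / n

# Median: first grade at which the cumulative count reaches half the sample
cumulative = 0
median_grade = None
for grade in ["A", "B", "C", "D", "F"]:
    cumulative += counts[grade]
    if cumulative >= n / 2:
        median_grade = grade
        break

# Mode: grade with the highest frequency
mode_grade = max(counts, key=counts.get)

print("mean points:", mean_points)
print("median:", median_grade, " mode:", mode_grade)
```

The mean requires treating the grades as numbers, which is a modelling choice for ordinal data; the median and mode need only the ordering and the counts.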
Problems with single numbers
• It is important to show “variation” in data
Dispersion Measure
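• Variance (Just remember “variation”)
$S^2 = \dfrac{1}{n-1} \sum (x_i - \bar{X})^2$
• Also, $S^2 = \dfrac{1}{n-1} \left[ \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right]$
• Standard Deviation $S = \sqrt{\dfrac{1}{n-1} \left[ \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right]}$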

• Since s utilizes the sample mean, the result has “lost” one degree of freedom
• (n – 1) independent observations remaining to calculate the variance
• For small samples, there is a tendency for the resultant s to be underestimated and (n –
1) accounts for this
• Coefficient of variation, $COV = S/\bar{X}$
• Expresses the standard deviation as a % of the average value
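The two forms of the sample variance are algebraically the same. A brief sketch of the step, using $\sum x_i = n\bar{X}$:

```latex
\begin{aligned}
\sum (x_i - \bar{X})^2
  &= \sum x_i^2 - 2\bar{X} \sum x_i + n\bar{X}^2 \\
  &= \sum x_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
  &= \sum x_i^2 - n\bar{X}^2
   = \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2
\end{aligned}
```

Dividing both sides by $n - 1$ gives the second expression for $S^2$, which needs only the running sums $\sum x_i$ and $\sum x_i^2$.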
Prob-04) Dispersion Measures of Concrete Strength
A sample of five tests was taken to determine the compression strength (in ksi) of concrete. Test results are 2.5, 3.5, 2.2, 3.2, and 2.9 ksi. Compute the variance, standard deviation, and COV of concrete strength.
• $\bar{X} = \dfrac{2.5 + 3.5 + 2.2 + 3.2 + 2.9}{5} = 2.86$ ksi
• $S^2 = \dfrac{2.5^2 + 3.5^2 + 2.2^2 + 3.2^2 + 2.9^2 - \frac{1}{5}(2.5 + 3.5 + 2.2 + 3.2 + 2.9)^2}{5 - 1} = 0.273$ ksi$^2$
• $S = \sqrt{0.273} = 0.522$ ksi
• $COV(X) = \dfrac{S}{\bar{X}} = \dfrac{0.522}{2.86} = 0.183 = 18.3\%$
• COV is large ⇒ average value is not reliable
• Additional measurements might be needed
• If, with additional measurements, the COV remains large, relatively large factors of safety
should be used for projects that use the concrete
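The numbers in Prob-04 can be verified with a few lines. The sketch computes the variance both from the definition and from the shortcut form; the variable names are introduced here for illustration.

```python
import math

strengths = [2.5, 3.5, 2.2, 3.2, 2.9]   # ksi, from the problem statement
n = len(strengths)

mean = sum(strengths) / n

# Definition: squared deviations from the mean, divided by n - 1
var_definition = sum((x - mean) ** 2 for x in strengths) / (n - 1)

# Shortcut: sum of squares minus (sum)^2 / n, divided by n - 1
var_shortcut = (sum(x * x for x in strengths) - sum(strengths) ** 2 / n) / (n - 1)

sd = math.sqrt(var_definition)
cov = sd / mean

print(f"mean     = {mean:.2f} ksi")
print(f"variance = {var_definition:.3f} ksi^2 (shortcut: {var_shortcut:.3f})")
print(f"sd       = {sd:.3f} ksi")
print(f"COV      = {cov:.3f} ({100 * cov:.1f}%)")
```

Both variance formulas agree up to floating-point rounding.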
Percentiles
• A 𝑝 percentile value (𝑥𝑝) for a parameter or variable based on a sample is the value of the parameter such that 𝑝% of the data is less than or equal to 𝑥𝑝
• The median value is considered to be the 50th percentile value
• It is common in engineering to be interested in the 10th, 25th, 50th, 75th, and 90th percentile values for a variable
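The definition above can be applied literally with the nearest-rank rule: take the smallest data value with at least 𝑝% of the data at or below it. This is only one of several percentile conventions (others interpolate between data points), and `percentile_nearest_rank` is a name introduced here. The data is the 20-value list from the stem-and-leaf example.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest data value x_p such that at least p% of the data is <= x_p."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    xs = sorted(values)
    rank = math.ceil(p * len(xs) / 100)   # 1-based rank
    return xs[rank - 1]

data = [12, 15, 16, 21, 24, 29, 30, 31, 32, 33,
        45, 46, 49, 50, 52, 58, 60, 63, 64, 65]

for p in (10, 25, 50, 75, 90):
    print(f"{p}th percentile: {percentile_nearest_rank(data, p)}")
```

With an even sample size this rule returns the lower of the two middle values for 𝑝 = 50, whereas the usual median averages the two, so results can differ slightly between conventions.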
Example
Data pertaining to hours of annual operations by toll booth operators was collected. An analytical model (normal probability density function) was fit to the data (as shown in the figure below). The solid line represents the assumed frequency model for the population.
• The calculation of the percentile values, in this example, requires knowledge of the equation of the model and probabilistic analysis
• The area under this model is a measure of likelihood or probability. The total area under the model is 1
• The 25th percentile value, such that the area under the model up to this value is 0.25, is 702 h
• Likewise, the 75th, 90th, and 10th percentile values are 1850, 2244, and 578 h, respectively
Boxplot
Box and Whisker Plot (or just Box plot)
• Graphical method for showing the distribution of sampled data, including the central tendency (mean and median), dispersion, percentiles (i.e., 10, 25, 75, and 90 percentiles), and the extremes (minimum and maximum)
• Can show bias about the standard value
• The relative sample size to compare multiple plots can be depicted as the width of the box plots
• A box plot requires the following to be computed
1. Mean and median of the sample
2. Minimum and maximum of the sample
3. 75th percentile (upper boundary of the box) and 25th percentile (lower boundary of the box)
4. 90th and 10th percentile values, where bars one half of the width of the box are placed perpendicular to the whiskers
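The quantities in the list above can be gathered in one pass. The sketch uses `statistics.quantiles` with the "inclusive" interpolation method, which is an assumption: the slides do not say which percentile convention the plots use. `box_plot_ingredients` is a name introduced here, and the data is the 20-value list from the stem-and-leaf example.

```python
import statistics

def box_plot_ingredients(values):
    """Collect the numbers needed to draw the box plot described above."""
    xs = sorted(values)
    # 99 cut points; index p - 1 holds the p-th percentile
    pct = statistics.quantiles(xs, n=100, method="inclusive")
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "min": xs[0],
        "max": xs[-1],
        "p10": pct[9],
        "p25": pct[24],
        "p75": pct[74],
        "p90": pct[89],
    }

data = [12, 15, 16, 21, 24, 29, 30, 31, 32, 33,
        45, 46, 49, 50, 52, 58, 60, 63, 64, 65]

summary = box_plot_ingredients(data)
for name, value in summary.items():
    print(f"{name:6s} {value}")
```

The box spans p25 to p75, the short bars sit at p10 and p90, and the whiskers extend to the minimum and maximum.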
Example
• The figure shows box-and-whisker plot for the
maximum daily ozone concentration
• The mean and median values are 59 and 52 ppb
(parts per billion), respectively
• The 10, 25, 75, and 90 percentile points are 24,
36, 79, and 97 ppb, respectively
Box Plot versus Histogram
• Compare X and Y

Variation in Y is much different from X


Boxplot
• Can’t detect multimodality in the data
• This is because it is based only on a five-point summary (10, 25, 50, 75 and 90 percentile)
Boxplot and 5-point Summary
• Both the red and black CDFs share the same
minimum, maximum, and quartiles,
• But are clearly different distributions.
• Any number of CDFs could be specified that pass
through the same five points.

• All we need to do is restrict our distribution function to lie within four boxes
• Identical five-number summaries do not imply identical distributions
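A small numeric illustration of the same point: the two hypothetical samples below share minimum, quartiles, and maximum, yet one is evenly spread and the other is clumped. The helper `five_number_summary` is introduced here and uses the "inclusive" quartile convention of `statistics.quantiles`.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    xs = sorted(values)
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return (xs[0], q1, q2, q3, xs[-1])

evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical
clumped = [0, 2, 2, 2, 4, 6, 6, 6, 8]         # hypothetical

print(five_number_summary(evenly_spread))
print(five_number_summary(clumped))
```

A box plot drawn from either sample would look the same, while a histogram or dot plot would show the clumps at 2 and 6.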
Analysis of Simulated Data
• Discharge rate from a river
• Histogram Suggests Exponential Distribution
• Average of raw values = 54.82
• Assuming an exponential probability density model
$f(x) = \dfrac{1}{b} e^{-x/b}$
• Corresponding cumulative distribution model
$F(x) = 1 - e^{-x/b}$
• Algebraic rearrangement: $x = -b \ln(1 - F(x))$
• $x_i = -54.82 \ln(u_i)$, where $u_i \in [0, 1]$ is a uniform random variable (if $u$ is uniform on [0, 1], so is $1 - u$)
• This $x_i$ is a model for discharge rate
Modelling Discharge Rate Using Excel (or others)
• Model: 𝑥𝑖 = −𝟓𝟒. 𝟖𝟐 ∗ 𝑙𝑜𝑔 𝑢𝑖
• Simulate a large number of 𝑥𝑖 using 𝑢𝑖 (Uniform Random)
• Compare raw and modelled data
• (Separate Excel Sheet is provided to illustrate this)
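For readers without the Excel sheet, the same simulation can be sketched in Python. This is an illustration of the inverse-transform step only, not a reproduction of the provided sheet; the function name, sample size, and seed are assumptions made here. The uniform draw is taken as `1 - random()` so that it lies in (0, 1] and the logarithm is always defined.

```python
import math
import random
import statistics

B = 54.82   # average of the raw discharge values, used as the scale parameter

def simulate_discharge(n, seed=0):
    """Draw n values from x = -B * ln(u), with u uniform on (0, 1]."""
    rng = random.Random(seed)
    return [-B * math.log(1.0 - rng.random()) for _ in range(n)]

sample = simulate_discharge(100_000)

print("simulated mean:", round(statistics.mean(sample), 2))
print("simulated sd:  ", round(statistics.stdev(sample), 2))
print("simulated median:", round(statistics.median(sample), 2))
```

For an exponential model the mean and standard deviation are both equal to the scale parameter, so comparing these two statistics with the raw data is a quick check on whether the model is plausible.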
Descriptive Statistics of Raw and Modelled Data

Matching
References
• TEXT BOOKS
• David Freedman, Robert Pisani, Roger Purves (2004). Statistics, 4th edition. Norton and Company
• Ayyub, Bilal M., McCuen, Richard H. (2012). Probability, Statistics, and Reliability for Engineers and Scientists, 3rd edition
• URLS
• https://stats.stackexchange.com/q/135737/9583
