Statistics in Traffic Engineering-1
Because traffic engineering involves the collection and analysis of large amounts of data for
performing all types of traffic studies, it follows that statistics is also an important element in traffic
engineering.
Statistics helps us determine how much data will be required, as well as what meaningful
inferences can confidently be made based on that data.
Because of this, traffic engineers often observe and measure the characteristics of a finite sample
of vehicles in a population that is effectively infinite.
Statistical analysis is used to address the following questions:
How many samples are required (i.e., how many individual measurements must be
made)?
What confidence should I have in this estimate (i.e., how sure can I be that this sample
measurement has the same characteristics as the population)?
What statistical distribution best describes the observed data mathematically?
Has a traffic engineering design resulted in a change in characteristics of the population?
(For example, has a new speed limit resulted in reduced speeds?).
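Consider the following example: Estimate the mean from the following sample speeds in mi/h:
(53, 41, 63, 52, 41, 39, 55, and 34). Using the equation for the sample mean, where x_i is the ith
observation and N is the sample size:
$$\bar{x} = \frac{\sum_i x_i}{N} = \frac{53 + 41 + 63 + 52 + 41 + 39 + 55 + 34}{8} = \frac{378}{8} = 47.25 \text{ mi/h}$$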
Because the original data had only two significant digits, the more correct answer is 47 mi/h.
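For grouped data, the average value of all observations in a given group is considered to be the
midpoint value of the group. The overall average of the entire sample may then be found as:
$$\bar{x} = \frac{\sum_j f_j m_j}{N}$$
Where:
f_j: number of observations in group j.
m_j: midpoint value of group j.
N: total sample size.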
The median is the middle value of all data when arranged in an array (ascending or descending
order). The median divides a distribution in half: Half of all observed values are higher than the
median, and half are lower. For non-grouped data, it is the middle value; for example, for the set
of numbers (3, 4, 5, 5, 6, 7, 7, 7, 8), the median is 6. It is the fifth value (in ascending or descending
order) in an array of 9 numbers.
For grouped data, the easiest way to get the median is to read the 50th percentile point off a
cumulative frequency distribution.
The mode is the value that occurs most frequently, that is, the most common single value.
For example, in non-grouped data, for the set of numbers (3, 4, 5, 5, 6, 7, 7, 7, 8), the mode is 7.
For the set of numbers (3, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 9), both 5 and 8 are modes, and the data are said
to be bimodal.
For grouped data, the mode is estimated as the peak of the frequency distribution curve.
For a perfectly symmetrical distribution, the mean, median, and mode are the same.
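The three measures of central tendency can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only, and the data are the sample sets quoted above.

```python
import statistics

# Sample spot speeds from the worked example, in mi/h
speeds = [53, 41, 63, 52, 41, 39, 55, 34]
mean_speed = statistics.mean(speeds)          # 378 / 8

# Non-grouped data set used for the median and mode examples
values = [3, 4, 5, 5, 6, 7, 7, 7, 8]
median_value = statistics.median(values)      # middle (fifth) value of nine
modes = statistics.multimode(values)          # list of all most-frequent values

# Second data set, which has two equally frequent values
bimodal = [3, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 9]
bimodal_modes = statistics.multimode(bimodal)

print("mean speed:", mean_speed)
print("median:", median_value)
print("mode(s):", modes)
print("modes of second set:", bimodal_modes)
```

`statistics.multimode` is used rather than `statistics.mode` because it returns every most-frequent value, which makes bimodal data visible.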
Measures of Dispersion
Measures of dispersion are measures that describe how far the data spread from the center.
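The statistical values that describe the magnitude of variation around the mean are the variance
and the standard deviation:
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{N - 1}$$
$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{N - 1}}$$
Where s² is the sample variance, s is the sample standard deviation, x_i is the ith observation,
x̄ is the sample mean, and N is the sample size. The standard deviation (STD) may also be estimated
as:
$$s_{est} = \frac{P_{85} - P_{15}}{2}$$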
Where:
P85: 85th percentile value of the distribution (i.e., 85% of all data is at this value or less).
P15: 15th percentile value of the distribution (i.e., 15% of all data is at this value or less).
The xth percentile is defined as that value below which x% of the outcomes fall. P85 is the 85th
percentile, often used in traffic speed studies; it is the speed that encompasses 85% of vehicles.
P50 is the 50th percentile speed, or the median speed.
The coefficient of variation is the ratio of the standard deviation to the mean and is an indicator of
the spread of outcomes relative to the mean.
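The dispersion measures can be computed the same way. The sketch below assumes the eight sample speeds from the earlier mean example. The percentile-based estimate uses `statistics.quantiles` with linear interpolation, which is only one of several percentile conventions, and with so few observations it should not be expected to match the computed standard deviation closely.

```python
import statistics

# Sample spot speeds, in mi/h (same data as the mean example)
speeds = [53, 41, 63, 52, 41, 39, 55, 34]

mean_speed = statistics.mean(speeds)
variance = statistics.variance(speeds)   # sample variance, divides by N - 1
std_dev = statistics.stdev(speeds)       # square root of the sample variance
coeff_var = std_dev / mean_speed         # coefficient of variation

# Percentile-based estimate of the standard deviation: (P85 - P15) / 2.
# quantiles(..., n=100) returns 99 cut points; index 14 is P15, index 84 is P85.
cuts = statistics.quantiles(speeds, n=100, method="inclusive")
p15, p85 = cuts[14], cuts[84]
std_est = (p85 - p15) / 2

print(f"variance = {variance:.2f}, std dev = {std_dev:.2f}")
print(f"coefficient of variation = {coeff_var:.3f}")
print(f"P15 = {p15:.1f}, P85 = {p85:.1f}, estimated std dev = {std_est:.2f}")
```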
The distribution or the underlying shape of the data is of great interest. Is it normal? Exponential?
But the engineer is also interested in anomalies in the shape of the distribution (e.g., skewness or
bimodality).
Skewness is defined as (mean - mode)/standard deviation.
If a distribution is negatively skewed, it means that the data are concentrated to the left of
the most frequent value (i.e., the mode).
When a distribution is positively skewed, the data are concentrated to the right of the mode.
The engineer should look for the underlying reasons for skewness in a distribution. For
instance, a negatively skewed speed distribution may indicate a problem such as sight distance
or pavement condition that is inhibiting drivers from selecting higher travel speeds.
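The skewness measure defined above, (mean - mode)/standard deviation, is simple to evaluate. The function name below is hypothetical, and the sketch assumes non-grouped data with a single mode.

```python
import statistics

def mode_skewness(data):
    """Return (mean - mode) / sample standard deviation.

    Assumes the data have one mode; statistics.mode returns the
    first mode encountered if there are several.
    """
    return (statistics.mean(data) - statistics.mode(data)) / statistics.stdev(data)

values = [3, 4, 5, 5, 6, 7, 7, 7, 8]
skew = mode_skewness(values)
print(f"skewness = {skew:.2f}")  # negative here because the mean is below the mode
```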
The Normal Distribution and Its Applications
One of the most common statistical distributions is the normal distribution, known by its
characteristic bell-shaped curve (Fig. 1). The normal distribution is a continuous distribution.
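Probability is indicated by the area under the probability density function f(x) between specified
values, such as P(40 < x < 50). The probability density function is:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Where: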
x :a normally distributed statistic.
µ: true mean of the distribution.
𝜎: true standard deviation of the distribution.
The probability of any occurrence between values x1 and x2 is given by the area under the
distribution function between the two values. The area may be found by integration between the
two limits. Likewise, the mean, µ, and the variance, 𝜎², can be found through integration.
The normal distribution is the most common distribution because any process that is the sum of
many parts tends to be normally distributed.
All other values in the equation for f(x), including π, are constants. The notation for a normal
distribution is x: N[µ, 𝜎²], which means that the variable x is normally distributed with a mean
of µ and a variance of 𝜎².
Fig. 1: The Normal Distribution.
For the normal distribution, the integration cannot be done in closed form due to the complexity
of the equation for f(x); thus tables for a "standard normal" distribution, with zero mean (µ = 0)
and unit variance (𝜎² = 1), are constructed. Table 1 presents tabulated values of the standard
normal distribution.
The standard normal is denoted z: N[0, 1]. Any value of x on any normal distribution, denoted
x: N[µ, 𝜎²], can be converted to an equivalent value of z on the standard normal distribution.
This can also be done in reverse when needed. The translation of values of x on an arbitrary normal
distribution to equivalent values of z on the standard normal distribution is accomplished as:
$$z = \frac{x - \mu}{\sigma}$$
Where:
z = equivalent statistic on the standard normal distribution, z: N[0, 1]
x = statistic on any arbitrary normal distribution, x: N[µ, 𝜎²]
Other variables are as previously defined.
The Standard Normal Distribution
Figure 2 shows the translation for a distribution of spot speeds that has a mean of 55 mi/h and
standard deviation of 7 mi/h to equivalent values of z.
Determine the probability that the next observed speed will be less than 65 mi/h. The equivalent
value of z is:
$$z = \frac{65 - 55}{7} = 1.43$$
Entering the table of the standard normal distribution (Table 1) on the vertical scale at 1.4 and on
the horizontal scale at 0.03, the probability of having a value of z less than 1.43 is 0.9236, or 92.36%.
Another type of application frequently occurs: For the case just stated, what is the probability that
the speed of the next vehicle is between 55 and 65 mi/h?
The probability that the speed is less than 65 mi/h has already been computed. We can now find
the probability that the speed is less than 55 mi/h, which is equivalent to z = (55 - 55)/7 = 0.00, so
that the probability is 0.50, or 50%, exactly.
The probability of being between 55 and 65 mi/h is just the difference of the two probabilities:
(0.9236 - 0.5000) = 0.4236, or 42.36%.
For the case just stated, find the probability that the next vehicle's speed is less than 50 mi/h.
Translating to the z-axis, we wish to find the probability of a value being less than z = (50 - 55)/7
= -0.71.
Negative values of z are not given in the table of the standard normal distribution, but by symmetry
it can be seen that the desired shaded area is the same size as the area greater than +0.71. The table,
however, gives only the area less than +0.71 (it is 0.7611).
However, knowing that the total probability under the curve is 1.00, the remaining area (i.e., the
desired quantity) is therefore (1.0000 - 0.7611) = 0.2389, or 23.89%.
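The three table lookups above can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of Table 1; the variable names are illustrative. Because the table method rounds z to two decimals (1.43 and -0.71), the computed probabilities differ from the tabulated ones in the third or fourth decimal place.

```python
from statistics import NormalDist

# Spot speed distribution from the example: mean 55 mi/h, standard deviation 7 mi/h
speed = NormalDist(mu=55, sigma=7)

# Translation to the standard normal: z = (x - mu) / sigma
z_65 = (65 - speed.mean) / speed.stdev
z_50 = (50 - speed.mean) / speed.stdev

p_below_65 = speed.cdf(65)                  # P(x < 65)
p_55_to_65 = speed.cdf(65) - speed.cdf(55)  # P(55 < x < 65)
p_below_50 = speed.cdf(50)                  # P(x < 50), no symmetry step needed

print(f"z for 65 mi/h = {z_65:.2f}, z for 50 mi/h = {z_50:.2f}")
print(f"P(x < 65)      = {p_below_65:.4f}")
print(f"P(55 < x < 65) = {p_55_to_65:.4f}")
print(f"P(x < 50)      = {p_below_50:.4f}")
```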
The preceding exercises allow us to compute relevant areas under the normal curve. Some numbers
occur frequently in practice, and it is useful to have those in mind. For instance, what is the
probability that the next observation will be within one standard deviation of the mean, given that
the distribution is normal? That is, what is the probability that x is in the range (µ ± 1.00𝜎)? By a
similar process to those just illustrated, we can find that this probability is 68.3%.
The following ranges have frequent use in statistical analysis involving normal distributions:
68.3% of the observations are within µ ± 1.00𝜎.
95.0% of the observations are within µ ± 1.96𝜎.
95.5% of the observations are within µ ± 2.00𝜎.
99.7% of the observations are within µ ± 3.00𝜎.
The total probability under the normal curve is 1.00, and the normal curve is symmetric around
the mean. It is also useful to note that the normal distribution is asymptotic to the x-axis and
extends to values of ± ∞. These critical characteristics will prove to be useful throughout the text.
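The frequently used ranges listed above follow directly from the standard normal distribution. The sketch below recomputes them; `prob_within` is a hypothetical helper name, and the symmetry of the curve is used implicitly by evaluating the area between -k and +k.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def prob_within(k):
    """Probability that an observation lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1.00, 1.96, 2.00, 3.00):
    print(f"within mu +/- {k:.2f} sigma: {100 * prob_within(k):.1f}%")
```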
Confidence Bounds
What would happen if we asked everyone in class (70 people) to collect 50 samples of speed data
and to compute their own estimate of the mean? How many estimates would there be? What
distribution would they have? There would be 70 estimates and the histogram of these 70 means
would look normally distributed. Thus the "estimate of the mean" is itself a random variable that
is normally distributed.
Usually we compute only one estimate of the mean (or any other quantity), but in this class exercise
we are confronted with the reality that there is a range of outcomes. We may therefore ask how
good our estimate of the mean is. How confident are we that our estimate is correct? Consider that
the standard deviation of this distribution of the means is called the standard error of the
mean (E):
$$E = \frac{s}{\sqrt{n}}$$
Where the sample standard deviation, s, is used to estimate 𝜎, n is the sample size, and all other
variables are as previously defined. The same characteristics of any normal distribution apply to
this distribution of means as well.
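In other words, the single value of the estimate of the mean, 𝑥̅𝑛, approximates the true mean of the
population, µ, as follows:
$$\mu = \bar{x}_n \pm E \quad (68.3\% \text{ confidence})$$
$$\mu = \bar{x}_n \pm 1.96E \quad (95\% \text{ confidence})$$
$$\mu = \bar{x}_n \pm 3.00E \quad (99.7\% \text{ confidence})$$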
The ± term (E, 1.96E, or 3.00E, depending on the confidence level) in the preceding equation is
also called the tolerance and is given the symbol e.
Consider the following: 54 speeds are observed, and the mean is computed as 47.8 mi/h, with a
standard deviation of 7.80 mi/h. What are the 95% confidence bounds?
$$\mu = 47.8 \pm 1.96 \left( \frac{7.80}{\sqrt{54}} \right) = 47.8 \pm 2.08 \text{ mi/h}$$
Thus it is said there is a 95% probability that the true mean lies between 45.7 and 49.9 mi/h.
Further, although not proven here, any random variable consisting of sample means tends to be
normally distributed for reasonably large n, regardless of the original distribution of individual
values.
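The 95% confidence bounds for the 54-speed example can be reproduced with a few lines. This is a sketch only: `confidence_bounds` is a hypothetical helper, and the default multiplier of 1.96 corresponds to 95% confidence (use 3.00 for 99.7%).

```python
import math

def confidence_bounds(mean, std_dev, n, z=1.96):
    """Return (lower, upper) bounds on the true mean.

    The standard error of the mean is s / sqrt(n); the tolerance is z times that.
    """
    std_error = std_dev / math.sqrt(n)
    tolerance = z * std_error
    return mean - tolerance, mean + tolerance

# 54 observed speeds, mean 47.8 mi/h, standard deviation 7.80 mi/h
low, high = confidence_bounds(47.8, 7.80, 54)
print(f"95% confidence bounds: {low:.1f} to {high:.1f} mi/h")
```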
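Rearranging the 95% tolerance, e = 1.96s/√N, gives the sample size, N, required to achieve a desired
tolerance:
$$N \geq \frac{1.96^2 s^2}{e^2}$$
Where 1.96² is used only for 95% confidence. If 99.7% confidence is desired, then the 1.96² would
be replaced by 3².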
Consider another example: With 99.7% and 95% confidence, estimate the true mean of the speed
on a highway, plus or minus 1 mi/h. We know from previous work that the standard deviation is
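7.2 mi/h. How many samples do we need to collect?
$$N \geq \frac{3^2 (7.2)^2}{1^2} = 466.6 \text{, or 467 samples (99.7\% confidence)}$$
$$N \geq \frac{1.96^2 (7.2)^2}{1^2} = 199.1 \text{, or 200 samples (95\% confidence)}$$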
Consider further that a spot speed study is needed at a location with unknown speed characteristics.
A tolerance of ±0.2 mi/h and a confidence of 95% is desired. What sample size is required?
Because the speed characteristics are unknown, a standard deviation of 5 mi/h (a common
result in speed studies) is assumed. Then for 95% confidence,
$$N \geq \frac{1.96^2 (5)^2}{(0.2)^2} = 2{,}401 \text{ samples}$$
This number is unreasonably high. It would be too expensive to collect such a large amount of
data. Thus the choices are to either reduce the confidence or increase the tolerance.
A 95% confidence level is considered the minimum that is acceptable; thus, in this case, the
tolerance would be increased. With a tolerance of 0.5 mi/h:
$$N = \frac{1.96^2 (5)^2}{(0.5)^2} = 384.16 \approx 384 \text{ samples}$$
Thus the increase of just 0.3 mi/h in tolerance resulted in a decrease of 2,017 samples required.
Note that the sample size required depends on s, which was assumed at the beginning. After the
study is completed and the mean and standard deviation are computed, N should be rechecked. If N
is greater (i.e., the actual s is greater than the assumed s), then more samples may need to be taken.
Another example: An arterial is to be studied, and it is desired to estimate the mean travel time to
a tolerance of ±5 seconds with 95% confidence. Based on prior knowledge and experience, it is
estimated that the standard deviation of the travel times is about 15 seconds. How many samples
are required?
$$N \geq \frac{1.96^2 (15)^2}{5^2} = 34.6 \text{, or 35 samples}$$
As the data are collected, the s computed is 22 seconds, not 15 seconds. If the sample size is kept at
N = 35, the confidence bounds will be ±1.96(22)/√35, or about ±7.3 seconds.
If the confidence bounds must be kept at ±5 seconds, then the sample size must be increased so
that
N ≥ 1.96² (22²/5²) = 74.4, or 75 samples. Additional data will have to be collected to meet the
desired tolerance and confidence level.
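The arterial travel time example can be worked through in code as a check on the planning and rechecking steps. The function names are hypothetical, and the sketch assumes the required sample size is always rounded up to the next whole observation.

```python
import math

def required_sample_size(std_dev, tolerance, z=1.96):
    """Smallest whole N satisfying N >= z^2 * s^2 / e^2."""
    raw = (z * std_dev / tolerance) ** 2
    # Round off floating-point noise first so an exact result is not bumped up by one
    return math.ceil(round(raw, 6))

def achieved_tolerance(std_dev, n, z=1.96):
    """Tolerance actually obtained from n samples: z * s / sqrt(n)."""
    return z * std_dev / math.sqrt(n)

# Planning stage: assumed s = 15 s, desired tolerance = 5 s, 95% confidence
n_planned = required_sample_size(15, 5)

# After data collection: observed s = 22 s
e_actual = achieved_tolerance(22, n_planned)
n_revised = required_sample_size(22, 5)

print(f"planned sample size: {n_planned}")
print(f"tolerance with s = 22 s: +/- {e_actual:.1f} s")
print(f"revised sample size: {n_revised}")
```

Passing `z=3` gives the 99.7% confidence case, for example `required_sample_size(7.2, 1, z=3)` for the highway speed example.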