Percentile
Data can be "skewed", meaning it tends to have a long tail on one side or the other:
[Figure: three distribution curves labelled Negative Skew, No Skew and Positive Skew]
Negative Skew?
Why is it called negative skew? Because the long "tail" is on the negative side of the peak.
People sometimes say it is "skewed to the left" (the long tail is on the left hand side).
The mean is also on the left of the peak.
No Skew
The Normal Distribution has no skew. It is perfectly symmetrical.
And the mean is exactly at the peak.
Positive Skew
And positive skew is when the long tail is on the
positive side of the peak, and some people say it
is "skewed to the right".
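The link between the tail and the position of the mean can be checked numerically. The following is a minimal sketch, not a formal skewness test: it treats the most frequent value (the mode) as the "peak" and compares it with the mean. The function name skew_direction and the three small data sets are made up for illustration.

```python
from statistics import mean, mode

def skew_direction(data):
    """Guess the skew direction by comparing the mean with the peak (mode)."""
    peak = mode(data)    # most frequent value, standing in for the "peak"
    centre = mean(data)
    if centre < peak:
        return "negative"  # a long left tail pulls the mean left of the peak
    if centre > peak:
        return "positive"  # a long right tail pulls the mean right of the peak
    return "none"

# Hypothetical samples chosen only to illustrate the three cases.
left_tailed = [1, 7, 12, 13, 13, 14, 14, 14, 15]
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tailed = [1, 2, 2, 2, 3, 3, 4, 9, 15]

for name, sample in [("left_tailed", left_tailed),
                     ("symmetric", symmetric),
                     ("right_tailed", right_tailed)]:
    print(name, skew_direction(sample))
```

This comparison is only a rough heuristic for small, single-peaked data sets.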
Percentiles
A percentile marks a certain percentage of a set of data. Percentiles are
used to see how much of a given set of data falls at or below a certain
percentage mark; for example, the thirtieth percentile is the value
that lies at the 30% mark of the entire data set.
Calculating Percentiles
Let us designate a percentile as Pm, where m represents the percentile
we're finding; for example, for the tenth percentile, m would be 10.
Given that the total number of elements in the data set is N
Example: Shopping Percentile
A total of 10,000 people visited the shopping mall over 12 hours:
a) Estimate the 30th percentile (when 30% of the visitors had arrived)
Draw a line horizontally across from 3,000 until you hit the curve, then draw a line vertically
downwards to read off the time on the horizontal axis:
Time (hours)    People
0               0
2               350
4               1,100
6               2,400
8               6,500
10              8,850
12              10,000
So the 30th percentile occurs after about 6.5 hours.
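The graphical reading can be approximated numerically. The sketch below assumes straight-line segments between the rows of the table, which is only an approximation to a smooth hand-drawn curve; the function name time_at_percentile is illustrative.

```python
hours = [0, 2, 4, 6, 8, 10, 12]
people = [0, 350, 1100, 2400, 6500, 8850, 10000]  # cumulative visitors

def time_at_percentile(pct, times, cumulative):
    """Linearly interpolate the time at which pct% of the final total is reached."""
    target = cumulative[-1] * pct / 100
    for i in range(1, len(times)):
        lo, hi = cumulative[i - 1], cumulative[i]
        if lo <= target <= hi:
            if hi == lo:
                return times[i - 1]
            fraction = (target - lo) / (hi - lo)
            return times[i - 1] + fraction * (times[i] - times[i - 1])
    raise ValueError("pct must be between 0 and 100")

t30 = time_at_percentile(30, hours, people)
print(round(t30, 2))
```

With straight-line segments the 3,000th visitor arrives at roughly 6.3 hours. A smooth curve drawn through the same points can give a somewhat different reading, such as the 6.5 hours estimated above.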
Quartiles
The term quartile is derived from the word quarter, which means one fourth of something.
Thus a quartile is a certain fourth of a data set. When you arrange a data set in increasing order
from the lowest to the highest and then divide this data into four groups, you end up with
quartiles. There are three quartiles that are studied in statistics.
• First Quartile (Q1)
When you arrange a data set in increasing order from the lowest to the highest and then
divide this data into four groups, the data value at the lower fourth (1⁄4) mark of the
data is referred to as the First Quartile.
The First Quartile is equal to the data at the 25th percentile of the data.
• Second Quartile (Q2)
When you arrange a given data set in increasing order from the lowest to the highest and
then divide this data into four groups, the data value at the second fourth (2⁄4) mark of the
data is referred to as the Second Quartile.
The Second Quartile is equal to the data at the 50th percentile of the data, which is the median.
• Third Quartile (Q3)
When you arrange a given data set in increasing order from the lowest to the highest and
then divide this data into four groups, the data value at the third fourth (3⁄4) mark of the data
is referred to as the Third Quartile.
The Third Quartile is equal to the data at the 75th percentile of the data.
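The correspondence between quartiles and percentiles can be illustrated with Python's standard statistics module. This is a sketch under one assumption: it uses the default "exclusive" method of statistics.quantiles, and other quartile conventions can give slightly different values for the same data.

```python
from statistics import median, quantiles

# Small ordered data set used for illustration.
data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]

q1, q2, q3 = quantiles(data, n=4)     # three cut points splitting the data into four groups
percentiles = quantiles(data, n=100)  # 99 cut points; index 24 is the 25th percentile

print("Q1, Q2, Q3:", q1, q2, q3)
print("25th, 50th, 75th percentiles:",
      percentiles[24], percentiles[49], percentiles[74])
print("median:", median(data))
```

For this data set the quartiles and the 25th, 50th and 75th percentiles coincide, and Q2 equals the median.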
Find the First, Second and Third Quartiles of the data set below using the cumulative frequency curve.
Age (years)    Frequency
10             5
11             10
12             27
13             18
14             6
15             16
16             38
17             9
Solution:
Age (years)    Frequency    Cumulative Frequency
10             5            5
11             10           15
12             27           42
13             18           60
14             6            66
15             16           82
16             38           120
17             9            129
From the Ogive, we can see the positions where the quartiles lie and thus can approximate them as follows
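As a rough cross-check on values read from an ogive, the quartile positions can be located directly in the cumulative frequency table. The sketch below makes an assumption: it takes the k-th quartile position to be k × N / 4 and reports the first age whose cumulative frequency reaches that position. Reading from a smooth curve interpolates between ages, so it may give fractional values that differ from these whole ages. The name quartile_age is illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

ages = [10, 11, 12, 13, 14, 15, 16, 17]
frequencies = [5, 10, 27, 18, 6, 16, 38, 9]

cumulative = list(accumulate(frequencies))  # running total of the frequencies
total = cumulative[-1]                      # N, the number of observations

def quartile_age(k):
    """Return the first age whose cumulative frequency reaches k/4 of the total."""
    position = k * total / 4
    return ages[bisect_left(cumulative, position)]

q1, q2, q3 = (quartile_age(k) for k in (1, 2, 3))
print(cumulative)
print("Q1, Q2, Q3:", q1, q2, q3)
```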
Interquartile Range
The interquartile range is the difference between the third
quartile and the first quartile.
Step 2: Find the minimum and maximum for your data set. Now that your
numbers are in order, this should be easy to spot.
In the ordered example data 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27, the minimum
(the smallest number) is 1 and the maximum (the largest number) is 27.
Step 3: Find the median. The median is the middle number; in the example it is 9.
Step 4: Place parentheses around the numbers above and below the median.
(This is not technically necessary, but it makes Q1 and Q3 easier to find.)
(1,2,5,6,7),9,(12,15,18,19,27).
Step 5: Find Q1 and Q3. Q1 is the median of the lower group of numbers (5)
and Q3 is the median of the upper group of numbers (18).
Step 6: Write down your five number summary found in the above 5 steps.
minimum=1, Q1=5, median=9, Q3=18, and maximum=27.
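The steps can be sketched in code. This is a minimal illustration, assuming the median-of-halves convention used in the example: for an odd number of values the median itself is left out of both halves. The handling of an even number of values (splitting straight down the middle) is an assumption not covered by the example, and the function name five_number_summary is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum (needs at least two values)."""
    ordered = sorted(values)                 # put the numbers in order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                   # numbers below the median
    # For odd n skip the median itself; for even n split down the middle.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "minimum": ordered[0],
        "Q1": median(lower),
        "median": median(ordered),
        "Q3": median(upper),
        "maximum": ordered[-1],
    }

summary = five_number_summary([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27])
iqr = summary["Q3"] - summary["Q1"]          # interquartile range
print(summary, "IQR =", iqr)
```

For the example data this reproduces the summary above and gives an interquartile range of 18 − 5 = 13.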
Box Plot
A box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data
based on the five number summary: minimum, first quartile, median, third quartile, and maximum.
In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile
range or IQR). A segment inside the rectangle shows the median, and "whiskers" above
and below the box show the locations of the minimum and maximum.
• Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the
first quartile.
• Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or
more above the third quartile or 1.5×IQR or more below the first quartile.
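The two rules above can be written as a small classifier. The sketch below assumes Q1 = 5 and Q3 = 18 (so IQR = 13) purely for illustration; the test values 40 and 60 are hypothetical, and the function name classify is made up.

```python
def classify(value, q1, q3):
    """Label a value as 'outlier', 'suspected outlier' or 'typical' using IQR fences."""
    iqr = q3 - q1
    # Check the stricter 3 x IQR rule first, since every outlier also
    # satisfies the 1.5 x IQR rule.
    if value >= q3 + 3 * iqr or value <= q1 - 3 * iqr:
        return "outlier"
    if value >= q3 + 1.5 * iqr or value <= q1 - 1.5 * iqr:
        return "suspected outlier"
    return "typical"

q1, q3 = 5, 18  # assumed quartiles, giving IQR = 13
for value in (27, 40, 60):
    print(value, classify(value, q1, q3))
```

With these quartiles the suspected-outlier fences sit at −14.5 and 37.5, and the outlier fences at −34 and 57.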
Variance and Standard Deviation
The variance of N observations, $x_1, x_2, \ldots, x_N$, is
$$ s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\left[\sum x_i^2 - \frac{1}{N}\left(\sum x_i\right)^2\right], \qquad (2.6) $$
where $\bar{x}$ is the mean value of the observations, as defined in Equation (2.1). The
standard deviation, $s$, of the observations is the square root of the variance, $s^2$.
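Both forms of the variance formula can be checked against each other numerically. The following is a minimal sketch using the divide-by-N definition given here; the data values and function names are made up for illustration.

```python
from math import sqrt

def variance_definition(xs):
    """Mean of the squared deviations from the mean (divide by N)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def variance_shortcut(xs):
    """Equivalent form: (sum of squares - (sum)^2 / N) / N."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / n

# Hypothetical observations with mean 5.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

var_a = variance_definition(observations)
var_b = variance_shortcut(observations)
sd = sqrt(var_a)  # standard deviation is the square root of the variance
print(var_a, var_b, sd)
```

For these observations both forms give a variance of 4 and a standard deviation of 2. On real data the shortcut form can lose precision when the values are large relative to their spread.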
The basic properties of the standard deviation, s, as a measure of spread are:
• s measures spread about the mean and should be used only when the mean is
chosen as the measure of center.
• s = 0 only when there is no spread, that is, when all observations have the same
value. Otherwise s > 0.
Dispersion: the action or process of distributing things or people over a wide area.