Descriptive Statistics
Some internet sources that may be useful for various aspects of statistics:
https://www.amstat.org/
http://www.stats.gla.ac.uk/steps/glossary/index.html
Some Terminologies
Statistics: Science of learning from data. It deals with collection, organization, analysis, and
interpretation of data.
Data: A list of observations for a variable (may be numerical or non-numerical).
Ex. (a) Observe infant birth weights in a hospital: 7 lbs, 6 lbs 2 oz, 9 lbs, etc.
(b) Observe the gender variable among all students in this class: Male/Female.
Ex. A study records the parking times of patrons at a mall; 105 patrons who parked on July 6, 2012 were observed.
a) Variable: Parking times of patrons who park at the mall
b) Population: All patrons who parked at the mall.
c) Sample: 105 patrons who parked at the mall on July 6, 2012
Ex. Organize the following data by constructing a frequency distribution and relative frequency
distribution for the colors of M & Ms:
Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y G Br Y O R O R Y Br R Br Br Y Y
R Br R Br Br Y Y Br
Table 1: Frequency or relative frequency table
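The tabulation can be done programmatically; a minimal Python sketch using the 45 observations listed above:

```python
from collections import Counter

# the 45 M&M color observations listed above
colors = ("Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O "
          "R Br Y G Br Y O R O R Y Br R Br Br Y Y "
          "R Br R Br Br Y Y Br").split()

n = len(colors)
freq = Counter(colors)                           # frequency of each color
rel_freq = {c: f / n for c, f in freq.items()}   # relative frequency

for color, f in freq.most_common():
    print(f"{color:>2}  {f:3d}  {rel_freq[color]:.3f}")
```

The relative frequencies always sum to 1, which is a quick sanity check on any frequency table.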
2. Pareto graph: A bar graph in which the bars are drawn in decreasing order of frequency or
relative frequency.
3. Pie chart: A pie chart is a circle divided into sectors. The sectors represent the categories, and
the area of a sector is proportional to the relative frequency of the category.
Ex. The following data represent the marital status of US residents (in millions) 18 years of age or older in 2006. Calculate the relative frequencies in percentage.
Organization and visual representation of discrete data
For a discrete variable, if the possible values are relatively few then consider each value as a
category.
Ex. The following discrete data represent the number of cars in a household based on a random
sample of 50 households. Construct a frequency and relative frequency distribution.
3 0 1 2 1 1 1 2 0 2 4 2 2 2 1 2 2 0 2 4 1 1 3 2 4 1 2 1 2 2 3 3 2 1 2 2 0 3 2 2 2 3 2 1 2 2 1 1 3 5
Table 2: Frequency table
cars = as.table(cars)
cars = as.data.frame(cars)  # reading as data frame
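The same frequency table can also be produced outside R; a minimal Python sketch for the 50 household observations listed above:

```python
from collections import Counter

# number of cars per household, 50 observations as listed above
cars = [int(d) for d in
        "30121112024222122024113241212233212203222321221135"]

n = len(cars)
freq = Counter(cars)
for value in sorted(freq):
    print(value, freq[value], freq[value] / n)
```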
For continuous data, or for discrete data with a relatively large number of possible values, first construct a grouped frequency distribution to draw a histogram.
Ex. The following data represent integer scores for a statistics exam. Construct a histogram for the
given data.
60 47 82 95 88 72 67 66 68 98 90 77 86 58 64 95 74 72 88 74 77 39 90 63 68 97 70 64 70 70 58 78
89 44 55 85 82 83 72 77 72 86 50 94 92 80 91 75
Steps:
1. Range of the data = highest data value - lowest data value = 59.
2. Select the number of classes and the class width (for the first table): number of classes = 7 and class width = 10. (7 × 10 = 70 works because 70 is larger than the range.)
3. Pick a starting point smaller than the lowest data value 39; here 30 is the lower limit of the 1st class.
4. Add the class width 10 to 30 to get the lower limit 40 of the 2nd class.
Important: For a grouped frequency distribution, classes must be disjoint and must cover the entire
range.
• Let's draw a histogram of the same data with 13 classes and class width 5.
Remark: The shape of a histogram (distribution) for a dataset changes as the class width d or the number of classes k changes. If n = sample size, then one way to choose k is Sturges' formula: k = ⌈log₂ n⌉ + 1 (derived from the binomial distribution, assuming approximate normality). For the exam score data n = 50, and hence k = 7 by Sturges' formula.
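The steps above can be sketched in Python for the scores as transcribed (48 values are listed; the notes quote n = 50, and Sturges' formula gives k = 7 in either case):

```python
import math

scores = [60, 47, 82, 95, 88, 72, 67, 66, 68, 98, 90, 77, 86, 58, 64, 95,
          74, 72, 88, 74, 77, 39, 90, 63, 68, 97, 70, 64, 70, 70, 58, 78,
          89, 44, 55, 85, 82, 83, 72, 77, 72, 86, 50, 94, 92, 80, 91, 75]

data_range = max(scores) - min(scores)       # step 1: 98 - 39 = 59
k = math.ceil(math.log2(len(scores))) + 1    # Sturges' formula

# steps 2-4: classes [30, 40), [40, 50), ..., [90, 100), width 10
start, width = 30, 10
bins = [0] * 7
for s in scores:
    bins[(s - start) // width] += 1
print(data_range, k, bins)
```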
2. Stem and Leaf plot: This is a sorting or graphing technique sometimes used in computer
applications when the datasets are small.
In this plot, each data value is split into a “stem” and a “leaf” where the “leaf” is usually the
last/rightmost digit of the number and the other digits to the left of the “leaf” form the “stem”.
Sorting the data first helps in drawing this plot.
Ex. Construct a stem and leaf plot of the following data (ages of people).
12 20 23 32 35 38 38 39 41 43 43 50 51 52 53 53 55 58 59 59 85
Stem | Leaf
1 | 2
2 | 0 3
3 | 2 5 8 8 9
4 | 1 3 3
5 | 0 1 2 3 3 5 8 9 9
6 |
7 |
8 | 5
Remark: The shape of the stem and leaf plot, if rotated 90 degrees anticlockwise, resembles the
shape of a histogram. Usually there is no need to sort the leaves, although computer packages
typically do.
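The construction can be sketched in Python; each value is split into a stem (the leading digits) and a leaf (the last digit):

```python
from collections import defaultdict

ages = [12, 20, 23, 32, 35, 38, 38, 39, 41, 43, 43, 50, 51,
        52, 53, 53, 55, 58, 59, 59, 85]

# stem = all digits but the last, leaf = last digit
plot = defaultdict(list)
for a in sorted(ages):
    plot[a // 10].append(a % 10)

for stem in range(min(plot), max(plot) + 1):
    print(stem, "|", " ".join(map(str, plot.get(stem, []))))
```

Note that empty stems (6 and 7 here) are still printed, so the rotated shape matches a histogram with no gaps in the class scale.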
Measures of Central Tendency
Measures of central tendency are numerical values that locate, in some sense, the center of the data.
1. Sample Mean (x̄): Let x_1, x_2, . . . , x_n be a set of sample values from a population. The sample mean is the arithmetic mean of the values:
x̄ = (1/n) ∑_{i=1}^n x_i.
Similarly, one may consider the geometric mean x̄_g = (∏_{i=1}^n x_i)^{1/n} for nonnegative values x_1, . . . , x_n, or the harmonic mean x̄_h = n (∑_{i=1}^n 1/x_i)^{−1}, as an alternative measure of central tendency. However, the arithmetic mean is typically referred to as the sample mean.
Note: E[X] is referred to as the population mean corresponding to a random variable X, and the sample mean x̄ gives an approximate value of E[X] when x_1, . . . , x_n are realizations of X.
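A short Python sketch comparing the three means on a small hypothetical sample; for positive data the harmonic-geometric-arithmetic mean inequality x̄_h ≤ x̄_g ≤ x̄ always holds:

```python
import math

x = [2, 4, 8]   # a small hypothetical sample of positive values

n = len(x)
arith = sum(x) / n                   # arithmetic (sample) mean
geom = math.prod(x) ** (1 / n)       # geometric mean
harm = n / sum(1 / v for v in x)     # harmonic mean
print(arith, geom, harm)
```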
2. Sample Median (x̃): The value that lies in the middle of a dataset after arranging the data in ascending or descending order.
For odd sample size n: median x̃ = the (n + 1)/2-th value in the sorted data.
For even sample size n: median x̃ = the average of the (n/2)-th and (n/2 + 1)-th values in the sorted data.
Remark 2: The R functions for computing the sample mean and median are mean(x) and median(x), respectively.
3. Sample Mode (x̂): The value(s) of the data that occur most frequently in the dataset.
Ex. Find the mode(s) of the following data sets.
(a) {0, 1, 2, 3, 3, −3, 3, 6}, mode = 3.
(b) {−1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6}, mode = 1.5 and 4.5.
(c) {−1, 1, 2, 3, 4}, mode = no mode or each data point is a mode.
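A small Python sketch (helper names are ours) that checks the definitions of the sample median and mode against examples (a) and (b) above:

```python
from collections import Counter

def sample_median(x):
    s = sorted(x)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                    # the (n+1)/2-th sorted value
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of the two middle values

def sample_modes(x):
    freq = Counter(x)
    top = max(freq.values())
    return sorted(v for v, f in freq.items() if f == top)

a = [0, 1, 2, 3, 3, -3, 3, 6]
b = [-1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6]
print(sample_modes(a), sample_modes(b), sample_median(a))
```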
Remark:
• Relation between mean, median, and skewness of a distribution: for a right-skewed distribution, typically mean > median; for a left-skewed distribution, typically mean < median; for a (roughly) symmetric distribution, mean ≈ median.
Sample Quartiles: After arranging a dataset in ascending order, 3 quartiles: first quartile Q1 ,
second quartile Q2 , and third quartile Q3 divide the data into 4 equal parts. Clearly, Q2 = median.
Thus: Q1 = P25, Q2 = P50, and Q3 = P75. There are alternative ways to find sample quartiles. For example, first order the dataset and find the median, which divides the dataset into 2 halves. Then Q1 is the median of the left half (with smaller values) and Q3 is the median of the right half (with larger values).
Quantiles: Percentiles and quartiles are special cases of quantiles. Population quantiles are points
partitioning the range of a probability distribution into disjoint intervals with equal probabilities,
and sample quantiles are points partitioning the observations in a sample in the same way.
For 0 < α < 1, the α-th population quantile qα of a random variable X is a point satisfying:
P (X ≤ qα ) ≥ α and P (X ≥ qα ) ≥ 1 − α.
Probability density of a normal distribution, with population quartiles shown. The area below the
red curve is the same in the intervals (−∞, Q1 ), (Q1 , Q2 ), (Q2 , Q3 ), and (Q3 , ∞).
An α-th sample quantile q̂_α is a point such that at least α × 100% of the sorted data (arranged in ascending order) is less than or equal to q̂_α and at least (1 − α) × 100% of the sorted data is greater than or equal to q̂_α.
Remark: From the definition, note that quantiles may not be unique.
Ex. Find the 75th percentile, the 2nd quartile, and the 0.4-th quantile of the following dataset, using the rule: compute r = nα; if r is not an integer, take the ⌈r⌉-th value in the sorted data; if r is an integer, average the r-th and (r + 1)-th values.
Ordered data: 1.1, 1.8, 2.5, 2.7, 3.4, 3.9, 4.0, 4.0, 4.1, 4.7, 5.2, 5.9, 6.1, 6.2, 9.6.
(a) r = 15 ∗ 0.75 = 11.25. Thus, dre = 12. Therefore, 75th percentile = 5.9.
(b) r = 15 ∗ 0.5 = 7.5. Thus, dre = 8. Therefore, 2nd quartile = 4.
(c) r = 15 ∗ 0.4 = 6. Thus, 0.4-th quantile is the average of 3.9 and 4, which is 3.95.
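The rule applied in (a)-(c) (take the ⌈r⌉-th sorted value when r = nα is not an integer; otherwise average the r-th and (r + 1)-th values) can be sketched in Python:

```python
import math

def quantile(sorted_x, alpha):
    # hypothetical helper implementing the rule used in (a)-(c) above
    n = len(sorted_x)
    r = n * alpha
    if r == int(r):                    # r integer: average r-th and (r+1)-th values
        r = int(r)
        return (sorted_x[r - 1] + sorted_x[r]) / 2
    return sorted_x[math.ceil(r) - 1]  # otherwise: the ceil(r)-th value

x = [1.1, 1.8, 2.5, 2.7, 3.4, 3.9, 4.0, 4.0, 4.1, 4.7, 5.2, 5.9, 6.1, 6.2, 9.6]
print(quantile(x, 0.75), quantile(x, 0.5), quantile(x, 0.4))
```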
Note that R output is: qb0.75 = 5.55, qb0.5 = 4, and qb0.4 = 3.96 which are different from the values
obtained by the above algorithm. There are, in fact, 9 different algorithms available in R to compute
sample quantiles. The default is type=7.
Trimmed or Truncated Mean calculates the sample mean after discarding given parts of a sample
at the high and low end (typically discarding an equal amount of both). For some 0 < α < 0.5,
α × 100% trimmed mean is the mean computed by excluding α × 100% largest and α × 100%
smallest values from the sample and taking mean of remaining (1 − 2α) × 100% of sample values.
For example, the 10% trimmed mean of the dataset
is x̄₀.₁ = 3.375 (R code: mean(x, trim=0.1)), which is obtained by discarding −2 and 20 from the sample and computing the mean of the remaining 8 observations. Note that the sample mean and median are 4.5 and 3, respectively. The trimmed mean is more robust against outliers than the usual mean, and the median can be regarded as a fully trimmed mean, which is most robust. Alternatively, the trimmed mean can also be defined by omitting a fixed number of sample values. That is, if x_1, . . . , x_n are the ordered (smallest to largest) sample values, the g-times trimmed mean is (x_{g+1} + · · · + x_{n−g})/(n − 2g) for some positive integer g.
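The α-trimmed mean is easy to compute directly. The dataset below is hypothetical (the notes' original dataset is not reproduced here) but is consistent with every statistic quoted above: sample mean 4.5, median 3, and 10% trimmed mean 3.375 after discarding −2 and 20.

```python
def trimmed_mean(x, alpha):
    # alpha*100% trimmed mean: drop the alpha*n smallest and largest values
    s = sorted(x)
    g = int(alpha * len(s))               # number trimmed from each end
    return sum(s[g:len(s) - g]) / (len(s) - 2 * g)

x = [-2, 1, 2, 2, 3, 3, 4, 5, 7, 20]     # hypothetical sample of size 10
print(sum(x) / len(x), trimmed_mean(x, 0.1))
```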
• Based on IQR, outliers of a dataset may be defined. Let us define:
• Outliers: Data points outside the lower or the upper fences can be defined as outliers of the
dataset.
• If outliers don’t affect a statistic substantially, it is considered resistant or robust.
• Among the measures of central tendency discussed, which statistic is least affected by outliers and which is most affected?
• Among the measures of dispersion discussed, which statistic is robust and which is not?
Ex. Determine the fences and any outliers of the following (interest rate) dataset.
Sorted data: 6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5
Steps:
1. Interquartile range (IQR) = Q3 − Q1 = 14.4 − 12 = 2.4.
2. Lower fence = Q1 − 1.5 ∗ IQR = 12 − 3.6 = 8.4.
3. Upper fence = Q3 + 1.5 ∗ IQR = 14.4 + 3.6 = 18.
4. Smallest value larger than the lower fence = 9.9
5. Largest value smaller than the upper fence = 14.5
6. Data value less than lower fence: 6.5 Data value greater than upper fence: NONE.
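Steps 1-6 above can be verified with a short Python sketch (Q1 and Q3 computed by the median-of-halves method described earlier):

```python
def median(s):
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

rates = [6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5]

half = len(rates) // 2
q1 = median(rates[:half])            # median of the lower half
q3 = median(rates[-half:])           # median of the upper half
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [r for r in rates if r < lower_fence or r > upper_fence]
print(iqr, lower_fence, upper_fence, outliers)
```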
• Using Box plots to describe shapes of distributions:
Q. Based on the boxplot of interest rate data, what type of skewness does the distribution have?
R code for boxplot of interest rate data
interest = c(6.5, 12, 14.4, 14.4, 14.3, 13, 13.3, 13.9, 9.9, 14.5)
boxplot(interest, main="boxplot of interest rate data",
        horizontal=TRUE)
axis(1, at=c(6:15), tck=-.025, las=0)
Let x_1, . . . , x_n be n realizations (i.e., a sample) from X. The r-th sample raw moment and the r-th sample central moment are
m′_r = (1/n) ∑_{i=1}^n x_i^r and m_r = (1/n) ∑_{i=1}^n (x_i − x̄)^r, for r = 1, 2, . . . .
Measure of Skewness:
Recall: if random variable X is symmetric around mean µ = E[X], then E[(X − µ)r ] = 0 for
r = 1, 3, 5, . . . . (H.W.)
Similarly, for a sample whose distribution is symmetric about the mean x̄, the positive terms in m_3 will cancel with the negative terms, resulting in m_3 ≈ 0. Therefore, a small value of m_3 indicates a roughly symmetric distribution. Also, if m_3 is positive, then the positive terms in m_3 dominate the negative terms; that is, there are more large observations to the right of x̄, stretching further out than those to the left of x̄. We call such a data distribution positively skewed, or skewed to the right. Similarly, when m_3 is negative, the data distribution is called negatively skewed, or skewed to the left. Formally, a measure of skewness (or asymmetry) is defined as
γ₁ = m_3 / (√m_2)³ = [(1/n) ∑_{i=1}^n (x_i − x̄)³] / [(1/n) ∑_{i=1}^n (x_i − x̄)²]^{3/2}.
Note that, for X ∼ N(µ, σ²), Kurt[X] = 3. So, in order to compare the kurtosis of a r.v. X with that of the N(µ, σ²), an alternative measure of kurtosis, called excess kurtosis, is defined as γ₂ = E[(X − µ)⁴]/σ⁴ − 3. Some authors refer to excess kurtosis simply as kurtosis.
See https://en.wikipedia.org/wiki/Kurtosis#Excess_kurtosis for details about the distributions.
Sometimes, Kurt[X] is said to measure the “peakedness” of a probability distribution about its central value (say, the mean). However, this interpretation is disputed. Westfall (Ref: Westfall, Peter H. (2014), “Kurtosis as Peakedness, 1905-2014. R.I.P.”, The American Statistician, 68(3): 191-195) noted that its unambiguous interpretation relates to tail extremity: it reflects either the presence of existing outliers (for sample kurtosis) or the tendency to produce outliers (for the kurtosis of a probability distribution). The underlying logic is straightforward: kurtosis is the average (or expected value) of standardized data raised to the fourth power. Standardized values less than 1, corresponding to data within one standard deviation of the mean (where the “peak” occurs), contribute minimally to kurtosis, because raising a number less than 1 to the fourth power brings it closer to zero. The meaningful contributors to kurtosis are data values outside the peak region, i.e., the outliers. Therefore, kurtosis primarily measures outliers and provides no information about the central “peak”.
Probability density functions for selected distributions with mean 0, variance 1, and different excess kurtosis: D: Laplace, S: hyperbolic secant, L: logistic, N: normal, C: raised cosine, W: Wigner semicircle, U: uniform distributions.
The sample excess kurtosis for data x_1, x_2, . . . , x_n is naturally given by:
sample excess kurtosis = m_4 / m_2² − 3.
Normal Datasets
Sometimes real datasets have bell-shaped histograms. A “bell-shaped” histogram, in some cases, may closely follow the ideal bell-shaped curve, that is, the normal curve that defines a normal distribution (studied in your probability class). That is why datasets that have bell-shaped histograms are often called normal datasets. For normal datasets we have the empirical rule, which follows from the properties of the normal distribution.
Ex. SAT math scores have a bell-shaped distribution with mean 515 and standard deviation 114.
(a) What percentage of SAT scores is less than 401 or greater than 629?
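For part (a), note that 401 = 515 − 114 and 629 = 515 + 114 are exactly one standard deviation from the mean, so by the empirical rule about 68% of scores lie between them and about 32% lie outside. The arithmetic as a sketch:

```python
mean, sd = 515, 114

lo, hi = 401, 629
k_lo = (mean - lo) / sd   # 401 is k_lo standard deviations below the mean
k_hi = (hi - mean) / sd   # 629 is k_hi standard deviations above the mean

# empirical rule: about 68% of a bell-shaped dataset lies within 1 sd of
# the mean, so about 100 - 68 = 32% is below 401 or above 629
pct_outside = 100 - 68
print(k_lo, k_hi, pct_outside)
```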
Empirical rule for a bell-shaped distribution
Scatter diagram or plot: A two-dimensional graph where x-axis and y-axis represent data values
from x-variable and y-variable respectively.
Points on the scatter plot may create some patterns: (a) Linear: Points may follow an imaginary
line. (b) Nonlinear: Points may follow a curve.
Notice that warmer weather leads to more sales; there seems to be a linear relationship between the temperature and sales variables.
R code to draw scatter plot of ice cream data
data = read.table("icecream.txt")
plot(data[,1], data[,2], xlab="temperature", ylab="sales", type="p", col="red", lwd=2)
abline(lm(data[,2] ~ data[,1]))
Note: you need to keep the data file "icecream.txt" in the current R directory.
By the Cauchy–Schwarz inequality,
|∑_{i=1}^n (x_i − x̄)(y_i − ȳ)| ≤ √(∑_{i=1}^n (x_i − x̄)²) · √(∑_{i=1}^n (y_i − ȳ)²).
Properties of r:
1. r is a unit free measure such that −1 ≤ r ≤ 1.
2. r only measures strength of linear relationship between x and y variables.
3. r > 0: positive linear relation, r < 0: negative linear relation. If r = ±1, two variables have
perfect linear relationship.
4. r = 0 implies no linear relationship between two variables. But, there may be some “nonlinear”
relationship between x and y variables.
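Properties 1-4 can be checked numerically; a minimal Python implementation of the Pearson correlation coefficient (helper name is ours):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # centered cross sum
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

For example, y = x² on x = −2, . . . , 2 gives r = 0 even though y is a perfect (nonlinear) function of x, illustrating property 4; rescaling x leaves r unchanged, illustrating property 1.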
Ex. What is wrong with these statements?
The correlation between height and weight of Computer Science students
(a) is 2.61
(b) is 0.61 inches per pound
(c) is 0.61, so the correlation between weight and height is −0.61
(d) is 0.61 using inches and pounds, but converting inches to centimeters would make r > 0.61
(since an inch equals about 2.54 centimeters)
Rank Correlations
There are correlation measures other than Pearson's correlation coefficient which can detect non-linear relationships. These measures are based on ranks, thereby depending on the relative positions of the x and y components of the n pairs instead of their actual values; hence the name rank correlation. For this reason, one could use such measures for ordinal variables as well, but ties would then occur too often, creating some complications. In fact, the following two rank correlation coefficients measure the degree of a monotone relationship between two categorical/numerical variables.
General Properties:
A higher rank correlation coefficient implies more agreement between rankings. The coefficient is
inside the interval [-1, 1] and assumes the value:
• 1 if the agreement between the two rankings is perfect; the two rankings are the same.
• 0, on average, if the rankings are independent.
• −1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the
other.
So, this sum, ∑_{i=1}^n d_i, is of no use. Let us consider
∑_{i=1}^n d_i² = ∑_{i=1}^n (r_i − s_i)²,
which is equal to 0 if r_i = s_i for all i. For the other extreme, when s_i = n − r_i + 1 for all i,
∑_{i=1}^n d_i² = ∑_{i=1}^n (2r_i − n − 1)² = n(n² − 1)/3. (verify)
Spearman's rank correlation coefficient is then defined as
r_SP = 1 − [6 ∑_{i=1}^n d_i²] / [n(n² − 1)].
• Thus: −1 ≤ r_SP ≤ 1, where r_SP = 1 if there is a perfect positive association (i.e., if x increases, then y increases, and if x decreases, then y decreases), and r_SP = −1 if there is a perfect negative association (i.e., if x increases, then y decreases, and if x decreases, then y increases).
• While Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic
relationships. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs
when each of the variables is a perfect monotone function of the other.
Exercise: Check that r_SP is the same as the Pearson's correlation coefficient r based on the rank variables {(r_i, s_i) : i = 1, . . . , n} if all ranks are distinct integers (i.e., no ties occur).
If ties occur: Any tie is broken in the usual manner by assigning the same (average) rank to the units with tied observations. For example, if the 5th and 6th observations in the sorted data are tied, we assign the rank 5.5 to both of them. Also, if the 5th, 6th, and 7th observations in the sorted data are tied, we assign the rank 6 to all three of them.
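The mid-rank tie-breaking rule and the ∑ d_i² formula can be sketched in Python; note that this form of r_SP is exact only when all ranks are distinct (with ties, one would compute Pearson's r on the mid-ranks instead):

```python
def mid_ranks(x):
    # tied observations share the average of their 1-based sorted positions
    s = sorted(x)
    return [(2 * s.index(v) + s.count(v) + 1) / 2 for v in x]

def spearman(x, y):
    # Spearman's r_SP via the sum-of-squared-rank-differences formula
    rx, ry = mid_ranks(x), mid_ranks(y)
    n = len(x)
    d2 = sum((r - s) ** 2 for r, s in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# a perfect monotone (but nonlinear) relationship gives r_SP = 1
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```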
Kendall’s Tau τ :
Note that the observation in each unit, say the i-th unit, is a pair (xi , yi ) with corresponding ranks
(r_i, s_i). Again, for the time being, assume there are no ties.
• For each pair of units, say (i, j) with i < j, we assign a score of +1 if their ranks w.r.t.
to both x and y variables are in the same direction, that is, if {ri < rj and si < sj } or if
{ri > rj and si > sj }.
• We assign a score of −1 to a pair (i, j) with i < j if their ranks w.r.t. to both x and y variables
are in the reverse direction, that is, if {ri < rj and si > sj } or if {ri > rj and si < sj }.
• Let P be the number of pairs (i.e., (i, j) with i < j) with +1 scores and Q denote the number of pairs with −1 scores. Clearly, P + Q = n(n − 1)/2.
Kendall's τ is defined as
τ = (P − Q) / (n(n − 1)/2) = (no. of concordant pairs − no. of discordant pairs) / (n(n − 1)/2).
Properties: The denominator is the total number of pair combinations, so the coefficient must be
in the range −1 ≤ τ ≤ 1.
• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the
coefficient has value +1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the
other) the coefficient has value −1.
• If there is no monotonic relationship between x and y variables, then we would expect the
coefficient to be approximately zero.
If ties occur: The ties are broken as before, and a score of 0 is assigned to any pair (i, j), with i < j, for which the above two conditions (with strict inequalities) for scores of +1 and −1 are not met.
Remark: For each pair (i, j), define
a_ij = +1 if r_i < r_j;  −1 if r_i > r_j;  0 if r_i = r_j.
Similarly, define
b_ij = +1 if s_i < s_j;  −1 if s_i > s_j;  0 if s_i = s_j.
(H.W.) We can show that the Pearson's correlation coefficient r between the a_ij and b_ij variables, i.e., using {(a_ij, b_ij) : i, j = 1, . . . , n}, gives the Kendall's τ.
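A direct Python sketch of Kendall's τ from the pairwise scores described above (pairs tied in x or y receive a score of 0):

```python
from itertools import combinations

def kendall_tau(x, y):
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1    # ranks move in the same direction
        elif s < 0:
            discordant += 1    # ranks move in opposite directions
        # s == 0 means a tie in x or y: score 0
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Since the raw values and their ranks order pairs identically, comparing the values directly gives the same concordant/discordant counts as comparing the ranks.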
Remark: Association doesn’t imply causation, i.e., two variables are associated does not necessarily
mean that one variable causes the other.
Ex 1. “The number of firemen and the severity of the fire are highly correlated.”
Wrong conclusion: the high correlation implies that the firemen cause the fire.
Ex 2. “The faster windmills are observed to rotate, the more wind is observed to be.”
Wrong conclusion: wind is caused by the rotation of windmills. In practice, it is rather the other
way around.
Ex 3. “Children that watch a lot of TV are the most violent. So, TV makes children more violent.”
Not necessarily true. This could easily be the other way round; that is, violent children like watching
more TV than less violent ones.
Ex 4. Study shows that there is a significant correlation (0.791) between a country’s consumption of
chocolate and the number of Nobel prizes (averaged per person).