Descriptive Statistics

Some internet sources that may be useful for various aspects of statistics:
https://www.amstat.org/
http://www.stats.gla.ac.uk/steps/glossary/index.html

Some Terminologies
Statistics: Science of learning from data. It deals with collection, organization, analysis, and
interpretation of data.
Data: A list of observations for a variable (may be numerical or non-numerical).

Ex. (a) Observe infant birth weights in a hospital: 7 lbs., 6 lbs 2 oz, 9 lbs etc.
(b) Observe the gender variable among all students in this class: male/female.

The process of statistics:


1. Identify the research objective.
2. Collect data.
3. Organize, summarize, and analyze data.
4. Draw conclusion/inference.
Population: The entire group of individuals that is of our interest.
Sample: A subset of the population being studied.
Ex. To investigate: “what percentage of students at ISI is female?” randomly select 50 students
near ISI canteen.
Population: all ISI students.
Sample: those 50 students selected for the study.
Parameter: Numerical characteristic of a population or a model (typically unknown).
Ex. In the ex above, the true proportion of female students at ISI.
Statistic: Numerical summary of data (typically estimates an unknown parameter).
Ex. Proportion of female students selected in the sample (sample proportion).
Variable: Examples are gender, infant birth weight, amount of rainfall in a day, etc.
Define the following terms in the study/ poll described below:
To determine average parking time of patrons who park at the mall, a statistician monitors and
records parking times of 105 patrons on July 6, 2012.

a) Variable: Parking times of patrons who park at the mall
b) Population: All patrons who parked at the mall.
c) Sample: 105 patrons who parked at the mall on July 6, 2012

Describing and summarizing datasets


Any dataset is a list of observations for a variable (numerical or non-numerical). A variable can be
of two types:
1. Qualitative/Categorical variable: Mathematical operations are NOT meaningful for this type
of variable. A categorical variable can be of two types:
(a) Nominal variable: Ordering is NOT meaningful for this type of variable.
Examples: (i) color: “red”, “blue”, “green”, etc (ii) nationality: “American”, “Canadian”,
“Australian” etc, (iii) Gender: “male”, “female”, (iv) zip codes, (v) race, (vi) religion.
(b) Ordinal variable: Ordering is meaningful for this type.
Examples: (i) letter grades: “A”, “B”, “C”, “D”, and “F”, (ii) “completely agree”,
“mostly agree”, “mostly disagree”, “completely disagree” when measuring opinion, (iii)
educational level.
2. Quantitative/Numerical variable: Mathematical operations are meaningful for this type of
variable. A numerical variable can be of two types:
(a) Discrete variable: Number of objects (cars, books, pens, players, committee members,
coins, students, cellular phones, etc.)
(b) Continuous variable: Height, weight, distance, volume, temperature, cholesterol level,
age, etc.
Data can be summarized: Graphically and Numerically.

Organization of categorical data


Frequency of a category: Number of occurrences of that category in the data.
Relative Frequency of a category = Frequency of a category/sum of all frequencies.
Frequency distribution or table: Lists categories of data and the corresponding frequencies of each
category.
Relative frequency distribution or table: Lists categories of data and the corresponding relative
frequencies of each category.

Ex. Organize the following data by constructing a frequency distribution and relative frequency
distribution for the colors of M & Ms:
Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y G Br Y O R O R Y Br R Br Br Y Y
R Br R Br Br Y Y Br
Table 1: Frequency or relative frequency table

Category (color) Tally Frequency Relative frequency


Brown ||||| ||||| || 12 12/45
Yellow ||||| ||||| 10 10/45
Red ||||| |||| 9 9/45
Orange ||||| | 6 6/45
Blue ||| 3 3/45
Green ||||| 5 5/45

Note that total frequency = 45 and total relative frequency = 1.
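The tally above can be checked mechanically. The notes use R for all plots; purely as a cross-check of the counts, here is a short Python sketch that rebuilds the same frequency and relative frequency table from the raw color list:

```python
from collections import Counter

# Raw M & M color observations from the example
# (Br = Brown, Y = Yellow, R = Red, O = Orange, Bl = Blue, G = Green)
data = ("Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y G Br Y O R O R Y Br "
        "R Br Br Y Y R Br R Br Br Y Y Br").split()

n = len(data)                                   # total frequency = 45
freq = Counter(data)                            # frequency of each category
rel_freq = {c: f / n for c, f in freq.items()}  # relative frequency of each category
# freq: Br 12, Y 10, R 9, O 6, Bl 3, G 5; relative frequencies sum to 1
```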

Visual representation of categorical data


1. Bar graph: A graph constructed by labeling each category of data on the x-axis or the y-axis
and frequency or relative frequency on the other axis.
• Bars are rectangles separate from one another and of equal width.
• Bars represent categories and their heights represent frequencies or relative frequencies.
Ex. Using the same data for M & Ms construct a bar graph of the frequency distribution for M & M
colors:

R code for Bar graph of M&M data


freq = c(12, 10, 9, 6, 3, 5)
Gems = matrix(freq, ncol = 1, byrow = TRUE)
rownames(Gems) = c("Brown", "Yellow", "Red", "Orange", "Blue", "Green")
colnames(Gems) = c("Frequency")
Gems = as.table(Gems)        # reading as data table
Gems = as.data.frame(Gems)   # reading as data frame

barplot(Gems$Freq, main = "bar graph of M & M colors", xlab = "Colors",
        ylab = "Frequency", names.arg = Gems$Var1, col = rainbow(length(freq)))

2. Pareto graph: A bar graph in which the bars are drawn in decreasing order of frequency or
relative frequency.

R code for Pareto graph of M&M data


Gems_sorted = Gems[order(Gems[, 3], decreasing = TRUE), ]   # sorting data frame by frequency
barplot(Gems_sorted$Freq, main = "pareto graph of M & M colors", xlab = "Colors",
        ylab = "Frequency", names.arg = Gems_sorted$Var1, col = rainbow(length(freq)))

3. Pie chart: A pie chart is a circle divided into sectors. The sectors represent the categories, and
the area of a sector is proportional to the relative frequency of the category.
Ex. Following data represents the marital status of US residents (in millions) 18 years of age or
older in 2006. Calculate the relative frequencies in percentage.

Marital status Frequency Relative frequency


Never married 125 125/500
Married 290 290/500
Widowed 30 30/500
Divorced 55 55/500

Note that total frequency = 500.

R code for pie chart of marital status data

freq = c(125, 290, 30, 55)
MaritalStatus = matrix(freq, ncol = 1, byrow = TRUE)
rownames(MaritalStatus) = c("Never Married", "Married", "Widowed", "Divorced")
colnames(MaritalStatus) = c("Frequency")
MaritalStatus = as.table(MaritalStatus)
MaritalStatus = as.data.frame(MaritalStatus)
rel.freq = round(freq / sum(freq) * 100, digits = 1)
lbls = paste(MaritalStatus$Var1, " (", rel.freq, "%)", sep = "")
pie(freq, labels = lbls, col = rainbow(length(freq)))

Organization and visual representation of discrete data
For a discrete variable, if the possible values are relatively few then consider each value as a
category.

Ex. The following discrete data represent the number of cars in a household based on a random
sample of 50 households. Construct a frequency and relative frequency distribution.
3 0 1 2 1 1 1 2 0 2 4 2 2 2 1 2 2 0 2 4 1 1 3 2 4 1 2 1 2 2 3 3 2 1 2 2 0 3 2 2 2 3 2 1 2 2 1 1 3 5
Table 2: Frequency table

Category (No. of Cars) Frequency


0 4
1 13
2 22
3 7
4 3
5 1

R codes for barplot of car data

d = c(3,0,1,2,1,1,1,2,0,2,4,2,2,2,1,2,2,0,2,4,1,1,3,2,4,1,2,1,2,2,3,3,2,1,
      2,2,0,3,2,2,2,3,2,1,2,2,1,1,3,5)
freq = c(sum(d == 0), sum(d == 1), sum(d == 2), sum(d == 3), sum(d == 4), sum(d == 5))
cars = matrix(freq, ncol = 1, byrow = TRUE)
rownames(cars) = c("no car", "1 car", "2 cars", "3 cars", "4 cars", "5 cars")
colnames(cars) = c("Frequency")
cars = as.table(cars)
cars = as.data.frame(cars)   # reading as data frame

barplot(cars$Freq, main = "bar graph of number of cars", xlab = "number of cars",
        ylab = "number of households", names.arg = cars$Var1, col = rainbow(length(freq)))

Visual representation of numerical data


1. Histogram: A graph representing the frequency distribution of numerical data. A histogram
consists of rectangles of equal width that touch one another. In a histogram, classes are marked on
the horizontal axis and frequencies are represented by heights on the vertical axis.
Ex. (discrete data) Using the car data given above, construct a histogram.

For continuous data, or for discrete data with a relatively large number of possible values, first
construct a grouped frequency distribution to draw a histogram.

Steps to construct a grouped frequency distribution and draw a histogram


1. Find the range of the data, where range = (highest data value − lowest data value).
2. Select the number of classes (k) and create classes/bins/intervals of equal width.
3. The common class length d is called the class width or bin width. Choose the class width
and the number of classes such that their product is slightly larger than the range.
4. Pick a starting point (lower limit) smaller than the lowest data value and add the class width
to get the lower limit of the next class. Ensure the upper limit of the first class and the lower limit
of the second class do not overlap (classes are disjoint) and that the classes cover the entire range
of the data.

Ex. The following data represent integer scores for a statistics exam. Construct a histogram for the
given data.
60 47 82 95 88 72 67 66 68 98 90 77 86 58 64 95 74 72 88 74 77 39 90 63 68 97 70 64 70 70 58 78
89 44 55 85 82 83 72 77 72 86 50 94 92 80 91 75 76 78

Steps:
1. Range of the data = highest data value − lowest data value = 98 − 39 = 59.
2. Select the number of classes and the class width (for the first table): number of classes = 7 and
class width = 10. (7 × 10 = 70 works because 70 is larger than the range 59.)
3. Pick a starting point smaller than the lowest data value 39 (here 30, which is the 1st lower limit).
4. Add the class width 10 to 30 to get the lower limit 40 of the 2nd class.
Important: For a grouped frequency distribution, classes must be disjoint and must cover the entire
range.

R code for drawing a histogram of exam score data


score = c(60,47,82,95,88,72,67,66,68,98,90,77,86,58,64,95,74,72,88,74,77,39,90,
          63,68,97,70,64,70,70,58,78,89,44,55,85,82,83,72,77,72,86,50,94,92,80,91,75,76,78)

hist(score, nclass = 7, main = "Histogram of scores with class width = 10", xlab = "scores",
     ylab = "frequency", col = "blue")

• Let's draw a histogram of the same data with 13 classes and class width 5.

Remark: The shape of a histogram (distribution) for a dataset changes as the class width d or the
number of classes k changes. If n is the sample size, then one way to find k is to use Sturges'
formula: $k = \lceil \log_2 n \rceil + 1$ (from the binomial, assuming normality). For the exam
score data n = 50, and hence k = 7 using Sturges' formula.
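The Sturges computation is a one-liner; as a quick check (the notes otherwise use R), a Python sketch:

```python
import math

def sturges(n):
    """Number of classes k = ceil(log2(n)) + 1 (Sturges' formula)."""
    return math.ceil(math.log2(n)) + 1

k = sturges(50)   # exam score data: n = 50 gives k = 7
```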

2. Stem and Leaf plot: This is a sorting or graphing technique sometimes used in computer
applications when the datasets are small.
In this plot, each data value is split into a “stem” and a “leaf” where the “leaf” is usually the
last/rightmost digit of the number and the other digits to the left of the “leaf” form the “stem”.
Sorting the data first helps in drawing this plot.
Ex. Construct a stem and leaf plot of the following data (ages of people).
12 20 23 32 35 38 38 39 41 43 43 50 51 52 53 53 55 58 59 59 85

Stem Leaf
1 2
2 0 3
3 2 5 8 8 9
4 1 3 3
5 0 1 2 3 3 5 8 9 9
6
7
8 5

Remark: The shape of the stem and leaf plot, if rotated 90 degrees anticlockwise, resembles the
shape of a histogram. Usually there is no need to sort the leaves, although computer packages
typically do.

R code for stem and leaf plot of age data


age = c(12,20,23,32,35,38,38,39,41,43,43,50,51,52,53,53,55,58,59,59,85)
stem(age, scale = 2)

Measures of Central Tendency
Measures of central tendency are numerical values that locate, in some sense, the center of the data.
1. Sample Mean (x̄): Let x1, x2, ..., xn be a set of sample values from a population. The sample
mean is the arithmetic mean of the values:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$

Similarly, one may consider the geometric mean $\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$ for nonnegative values x1, ..., xn,
or the harmonic mean $\bar{x}_h = n\left(\sum_{i=1}^{n} \frac{1}{x_i}\right)^{-1}$ as an alternative measure of central tendency. However,
the arithmetic mean is typically referred to as the sample mean.
Note: E[X] is referred to as the population mean corresponding to a random variable X, and the
sample mean x̄ gives an approximate value of E[X] when x1, ..., xn are realizations of X.
2. Sample Median (x̃): The value that lies in the middle of a dataset after arranging the data
in ascending or descending order.
For odd sample size n: median x̃ = the ((n + 1)/2)-th value in the sorted data.
For even sample size n: median x̃ = the average of the (n/2)-th and (n/2 + 1)-th values in the
sorted data.

Remark 1: The population median m corresponding to a random variable X is such that P(X ≤
m) ≥ 0.5 and P(X ≥ m) ≥ 0.5. If the CDF of X, denoted by F, is continuous, then m is the
solution of F(m) = 0.5. If x1, ..., xn are realizations of X, then x̃ is an approximate value of m.

Remark 2: The R codes for computing the sample mean and median are mean(x) and median(x)
respectively.

3. Sample Mode (x̂): The value(s) of the data that occurs most frequently in the dataset.


Ex. Find the mode(s) of the following data sets.
(a) {0, 1, 2, 3, 3, −3, 3, 6}, mode = 3.
(b) {−1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6}, mode = 1.5 and 4.5.
(c) {−1, 1, 2, 3, 4}, mode = no mode or each data point is a mode.
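A mode helper is easy to write (note that base R's own mode() reports the storage type, not the statistical mode). A Python sketch reproducing examples (a) and (b):

```python
from collections import Counter

def sample_modes(data):
    """Return all values attaining the maximum frequency.
    If every value occurs equally often, every point is returned,
    matching case (c) above ('no mode or each data point is a mode')."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

modes_a = sample_modes([0, 1, 2, 3, 3, -3, 3, 6])             # -> [3]
modes_b = sample_modes([-1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6])  # -> [1.5, 4.5]
```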

Remark:
• Relation between mean, median, and skewness of a distribution: for a right-skewed (positively
skewed) distribution, typically mean > median; for a left-skewed distribution, typically mean <
median; for a symmetric distribution, mean ≈ median.
Other measures of location: percentiles, quartiles and quantiles


Sample Percentiles: After arranging a dataset in ascending order, percentiles divide the data
into 100 equal parts.
For k = 1, 2, . . . , 100, k-th percentile Pk of a dataset is a value such that at least k% of the
observations are less than or equal to Pk and at least (100 − k)% of the observations are greater
than or equal to Pk .

Sample Quartiles: After arranging a dataset in ascending order, 3 quartiles: first quartile Q1 ,
second quartile Q2 , and third quartile Q3 divide the data into 4 equal parts. Clearly, Q2 = median.

Thus: Q1 = P25, Q2 = P50, and Q3 = P75. There are alternative ways to find sample quartiles. For
example, first order the dataset to find the median, which divides the dataset into 2 halves. Then Q1
is the median of the left half (with smaller values) and Q3 is the median of the right half (with larger values).
Quantiles: Percentiles and quartiles are special cases of quantiles. Population quantiles are points
partitioning the range of a probability distribution into disjoint intervals with equal probabilities,
and sample quantiles are points partitioning the observations in a sample in the same way.

For 0 < α < 1, the α-th population quantile qα of a random variable X is a point satisfying:

P (X ≤ qα ) ≥ α and P (X ≥ qα ) ≥ 1 − α.

If X is a continuous random variable, then qα is a solution of F (qα ) = P (X ≤ qα ) = α.

Probability density of a normal distribution, with population quartiles shown. The area below the
red curve is the same in the intervals (−∞, Q1 ), (Q1 , Q2 ), (Q2 , Q3 ), and (Q3 , ∞).

An α-th sample quantile q̂α is a point such that at least α × 100% of the sorted data (arranged in
ascending order) is less than or equal to q̂α and at least (1 − α) × 100% of the sorted data is greater
than or equal to q̂α.
Remark: From the definition, note that quantiles may not be unique.

An algorithm to compute α-th sample quantile:


1. Arrange the data in increasing order and compute r = nα
2. If r is not an integer: Round it up to the nearest integer greater than r, i.e., dre = ceiling of
r. The rounded integer dre is the position (index) of a sample quantile qbα in the ordered list
of data points.
3. If r is an integer: The sample quantile qbα is the average of data values in positions r and
r + 1.
Ex. Find (a) 75th percentile, (b) 2nd quartile, and (c) 0.4-th quantile of rainfall (inches) data in
Boston during the month of April for 15 years:
9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2

Ordered data: 1.1, 1.8, 2.5, 2.7, 3.4, 3.9, 4.0, 4.0, 4.1, 4.7, 5.2, 5.9, 6.1, 6.2, 9.6.
(a) r = 15 ∗ 0.75 = 11.25. Thus, dre = 12. Therefore, 75th percentile = 5.9.
(b) r = 15 ∗ 0.5 = 7.5. Thus, dre = 8. Therefore, 2nd quartile = 4.
(c) r = 15 ∗ 0.4 = 6. Thus, 0.4-th quantile is the average of 3.9 and 4, which is 3.95.
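The three answers above can be reproduced by coding the algorithm directly. The notes use R; as a cross-check, a Python sketch mirroring the ceiling rule:

```python
import math

def quantile_by_algorithm(data, alpha):
    """alpha-th sample quantile by the algorithm described above."""
    x = sorted(data)
    n = len(x)
    r = n * alpha
    if abs(r - round(r)) > 1e-9:      # r not an integer: take position ceil(r)
        return x[math.ceil(r) - 1]    # -1 because Python lists are 0-indexed
    r = int(round(r))                 # r integer: average positions r and r + 1
    return (x[r - 1] + x[r]) / 2

rain = [9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2]
q75 = quantile_by_algorithm(rain, 0.75)  # -> 5.9
q50 = quantile_by_algorithm(rain, 0.50)  # -> 4 (2nd quartile)
q40 = quantile_by_algorithm(rain, 0.40)  # -> 3.95 (average of 3.9 and 4)
```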

R code for finding sample quantiles of rainfall data


rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
quantile(rain, probs = c(0.75, 0.5, 0.4))

Note that the R output is: q̂0.75 = 5.55, q̂0.5 = 4, and q̂0.4 = 3.96, which differ from the values
obtained by the above algorithm. There are, in fact, 9 different algorithms available in R to compute
sample quantiles. The default is type=7.
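R's default rule (type=7, one of the Hyndman-Fan types) interpolates linearly at position h = (n − 1)α + 1 of the sorted data, which explains the R output above. A Python sketch of that rule:

```python
import math

def quantile_type7(data, p):
    """R's default sample quantile (type = 7): linear interpolation
    at 0-indexed position h = (n - 1) * p of the sorted data."""
    x = sorted(data)
    h = (len(x) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(x) - 1)          # guard for p = 1
    return x[lo] + (h - lo) * (x[hi] - x[lo])

rain = [9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2]
# quantile(rain, c(0.75, 0.5, 0.4)) in R gives 5.55, 4, 3.96, matching this rule
```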
Trimmed or Truncated Mean calculates the sample mean after discarding given parts of a sample
at the high and low end (typically discarding an equal amount of both). For some 0 < α < 0.5,
α × 100% trimmed mean is the mean computed by excluding α × 100% largest and α × 100%
smallest values from the sample and taking mean of remaining (1 − 2α) × 100% of sample values.
For example, 10% trimmed mean of the dataset

x = {2, 4, −1, −2, 0, 9, 20, 7, 1, 5}

is x0.1 = 3.375 (R code: mean(x, trim=0.1)) which is obtained by discarding −2 and 20 from the
sample and computing mean of remaining 8 observations. Note that sample mean and median are
4.5 and 3 respectively. The trimmed mean is more robust against outliers than the usual mean, and
the median can be regarded as a fully trimmed mean, which is the most robust. Alternatively, the
trimmed mean can also be defined by omitting a fixed number of sample values: if x1, ..., xn are the
ordered (smallest to largest) sample values, the g-times trimmed mean is (x_{g+1} + ... + x_{n−g})/(n − 2g)
for some positive integer g.
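The 10% trimmed mean of the example dataset can be checked by hand or in code. A Python sketch of the proportion-based definition (assuming, as in the example, that n·α is a whole number):

```python
def trimmed_mean(data, alpha):
    """alpha*100% trimmed mean: drop the alpha*100% smallest and largest
    values, then average the remaining (1 - 2*alpha)*100% of the sample.
    Assumes n * alpha is (close to) an integer, as in the example."""
    x = sorted(data)
    g = int(round(len(x) * alpha))     # number of values trimmed from each end
    kept = x[g:len(x) - g]
    return sum(kept) / len(kept)

x = [2, 4, -1, -2, 0, 9, 20, 7, 1, 5]
tm = trimmed_mean(x, 0.1)   # drops -2 and 20, mean of remaining 8 values -> 3.375
```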

Measures of dispersion (range, variance, standard deviation, IQR)


Measures of dispersion are numerical values that describe the spread or variability in the data.
Suppose we observe n data points x1 , x2 , . . . , xn .
1. Range (R) = max{x1, ..., xn} − min{x1, ..., xn} = the difference between the largest and the
smallest data value. R is neither very informative nor robust.
2. Sample variance (S²): measures the average squared distance of the observations from their mean
x̄. That is,

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$ [Why divide by (n − 1) instead of n?]

Sample standard deviation $S = \sqrt{S^2}$, which has the same unit of measurement as the xi's. The
greater the values of S² or S, the greater the spread of the data, and vice versa.
Note: R code for sample variance is var(x), and for sample standard deviation is sd(x) where
x = c(x1 , . . . , xn ).
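As a numerical check of the (n − 1) convention, here is a short Python sketch (Python's statistics.variance also divides by n − 1, so the two should agree):

```python
import statistics

def sample_variance(data):
    """S^2 = sum((x_i - xbar)^2) / (n - 1)."""
    n = len(data)
    xbar = sum(data) / n
    return sum((xi - xbar) ** 2 for xi in data) / (n - 1)

x = [1, 2, 3, 4, 5]
s2 = sample_variance(x)               # -> 2.5
assert s2 == statistics.variance(x)   # matches the library's (n - 1) convention
```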
3. Mean Absolute Deviation: measures the average absolute distance of the observations from a
center of the data, say mx. That is,

$M = \frac{1}{n}\sum_{i=1}^{n} |x_i - m_x|,$ where mx = mean (x̄), median (x̃), or mode (x̂).

When mx is either the mean or the median, M is sometimes referred to as MAD.


4. Interquartile Range (IQR) = Q3 − Q1 where Q3 and Q1 are 3rd and 1st sample quartiles.
IQR represents the spread of the data between P25 and P75 , i.e., the middle 50% of the data. High
value of IQR indicates the data is more spread out and vice versa. Note that (Q3 − Q1 )/2 is
known as Quartile Deviation or Semi Interquartile Range.

• Based on the IQR, outliers of a dataset may be defined. Let us define the fences:

Lower fence = Q1 − 1.5 × IQR,
Upper fence = Q3 + 1.5 × IQR.

• Outliers: Data points outside the lower or the upper fences can be defined as outliers of the
dataset.
• If outliers don’t affect a statistic substantially, it is considered resistant or robust.
• Among the measure of central tendencies discussed, which statistic is least affected by outliers
and which statistic is most affected by outliers?
• Among the measure of dispersion discussed, which statistic is robust and which statistic is not
robust?

Five number summary and boxplot


Five number summary: Min value, Q1 , median (Q2 ), Q3 , and max value are called the five
number summary.
Boxplot: Graphical representation of the five number summary, the upper and lower fences, and
outliers (if they exist).
Ex. The following data are interest rates charged by ten credit card companies.
6.5%, 12%, 14.4%, 14.4%, 14.3%, 13%, 13.3%, 13.9%, 9.9%, 14.5%

Sorted data: 6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5
Steps:
1. Interquartile range (IQR) = Q3 − Q1 = 14.4 − 12 = 2.4.
2. Lower fence = Q1 − 1.5 ∗ IQR = 12 − 3.6 = 8.4.
3. Upper fence = Q3 + 1.5 ∗ IQR = 14.4 + 3.6 = 18.

4. Smallest value larger than the lower fence = 9.9.
5. Largest value smaller than the upper fence = 14.5.
6. Data value less than the lower fence: 6.5. Data value greater than the upper fence: NONE.
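Steps 1-6 can be scripted. The notes use R; as a cross-check, a Python sketch using the median-of-halves rule for the quartiles described earlier, applied to the interest-rate data:

```python
def median(x):
    """Median of an already-sorted list."""
    n = len(x)
    mid = n // 2
    return x[mid] if n % 2 else (x[mid - 1] + x[mid]) / 2

def five_number_fences(data):
    """Q1/Q3 as medians of the lower/upper halves of the sorted data
    (excluding the overall median when n is odd), plus 1.5*IQR fences."""
    x = sorted(data)
    n = len(x)
    q1 = median(x[: n // 2])
    q3 = median(x[(n + 1) // 2 :])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in x if v < lower or v > upper]
    return q1, q3, iqr, lower, upper, outliers

interest = [6.5, 12, 14.4, 14.4, 14.3, 13, 13.3, 13.9, 9.9, 14.5]
q1, q3, iqr, lower, upper, outliers = five_number_fences(interest)
# q1 = 12, q3 = 14.4, IQR = 2.4, fences 8.4 and 18, outliers = [6.5]
```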
• Using Box plots to describe shapes of distributions:

Q. Based on the boxplot of interest rate data, what type of skewness does the distribution have?
R code for boxplot of interest rate data
interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)
boxplot(interest, main = "boxplot of interest rate data", horizontal = TRUE)
axis(1, at = c(6:15), tck = -0.025, las = 0)

Moments, measures of skewness and kurtosis


For a random variable X, the r-th population raw moment is µ′r = E[X^r] for r = 1, 2, ....
If µ = E[X], then the r-th population central moment is µr = E[(X − µ)^r] for r = 1, 2, ....

Let x1, ..., xn be n realizations (i.e., a sample) from X. The r-th sample raw moment and the r-th
sample central moment are

$m'_r = \frac{1}{n}\sum_{i=1}^{n} x_i^r \quad \text{and} \quad m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r \quad \text{for } r = 1, 2, \ldots.$

Note: Sample mean x̄ = m′1. Sample variance $S^2 = \frac{n}{n-1}\left[m'_2 - (m'_1)^2\right]$. For central moments,
m1 = 0 and $m_2 = \frac{n-1}{n}S^2$. Clearly, different moments give some measure about the distribution of the data.

Measure of Skewness:
Recall: if random variable X is symmetric around mean µ = E[X], then E[(X − µ)r ] = 0 for
r = 1, 3, 5, . . . . (H.W.)
Similarly, for a dataset distributed symmetrically about the mean x̄, the positive terms cancel with
the negative terms, resulting in m3 = 0. Therefore, a small value of m3 indicates a somewhat
symmetric distribution. Also, if m3 is positive, then the positive terms in m3 dominate the negative
terms, that is, there are more large observations to the right of x̄ stretching further than those to
the left of x̄. We call this kind of data distribution positively skewed, or skewed to the right.
Similarly, when m3 is negative, the data distribution is called negatively skewed, or skewed to the
left. Formally, a measure of skewness (or asymmetry) is defined as

$\gamma_1 = \frac{m_3}{(\sqrt{m_2})^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}},$

known as the (Fisher-Pearson) coefficient of skewness.


Note that the population coefficient of skewness is E[(X − µ)³]/σ³, where σ² = Var(X). Therefore,
this measure is scale or unit invariant, in the sense that if the observations are measured in some
other unit, the value of the skewness coefficient will not change.
Kurtosis:
The population kurtosis of a random variable X is the fourth standardized central moment:

$\mathrm{Kurt}[X] = \frac{E[(X - \mu)^4]}{\sigma^4} = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right].$

Note that, for X ∼ N(µ, σ²), Kurt[X] = 3. So, in order to compare the kurtosis of a r.v. X
with that of the N(µ, σ²), an alternative measure of kurtosis, called excess kurtosis, is defined as
γ2 = E[(X − µ)⁴]/σ⁴ − 3. Some authors refer to excess kurtosis simply as kurtosis.
See https://en.wikipedia.org/wiki/Kurtosis#Excess_kurtosis for details about the distributions.
Sometimes, Kurt[X] is said to measure the "peakedness" of a probability distribution about its
central value (say, the mean). However, this interpretation is disputed. Westfall (Ref: Westfall,
Peter H. (2014), "Kurtosis as Peakedness, 1905 - 2014. R.I.P.", The American Statistician, 68 (3):
191-195) noted that "...its unambiguous interpretation relates to tail extremity. Specifically, it
reflects either the presence of existing outliers (for sample kurtosis) or the tendency to produce
outliers (for the kurtosis of a probability distribution)." The underlying logic is straightforward:
kurtosis represents the average (or expected value) of standardized data raised to the fourth power.
Standardized values less than 1, corresponding to data within one standard deviation of the mean
(where the "peak" occurs), contribute minimally to kurtosis, because raising a number less than 1
to the fourth power brings it closer to zero. The meaningful contributors to kurtosis are data values
outside the peak region, i.e., the outliers. Therefore, kurtosis primarily measures outliers and
provides no information about the central "peak".

Figure: Probability density functions for selected distributions with mean 0, variance 1, and
different excess kurtosis. Here, D: Laplace, S: hyperbolic secant, L: logistic, N: normal, C: raised
cosine, W: Wigner semicircle, U: uniform distributions.
The sample excess kurtosis for data x1, x2, ..., xn is naturally given by

$\text{sample excess kurtosis} = \frac{m_4}{m_2^2} - 3.$
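Both γ1 and the sample excess kurtosis are simple ratios of central moments. The notes use R; as an illustration, a Python sketch computed directly from the definitions:

```python
def central_moment(data, r):
    """m_r = (1/n) * sum((x_i - xbar)^r)."""
    n = len(data)
    xbar = sum(data) / n
    return sum((xi - xbar) ** r for xi in data) / n

def skewness(data):
    """Fisher-Pearson coefficient gamma_1 = m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Sample excess kurtosis = m4 / m2^2 - 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

x = [1, 2, 3, 4, 5]   # symmetric about its mean 3, so skewness(x) = 0
# here m2 = 2 and m4 = 6.8, so excess_kurtosis(x) = 6.8 / 4 - 3 = -1.3
```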

Q. Suppose data values x1, x2, ..., xn are linearly transformed as y1, y2, ..., yn, where yi = a + bxi
for i = 1, ..., n, and a, b ∈ ℝ. A statistic T(x1, ..., xn) is called invariant under linear transformation
if T(x1, ..., xn) = T(y1, ..., yn). Can you comment on the invariance property of the statistics studied
so far, i.e., sample mean, median, variance, IQR, moments, coefficient of skewness, and kurtosis?

Normal Datasets
Sometimes real datasets have bell-shaped histograms. A "bell-shaped" histogram, in some cases,
may be referred to as the ideal bell-shaped curve, that is, the normal curve that defines a normal
distribution (studied in your probability class). That is why datasets with bell-shaped histograms
are often called normal datasets. For normal datasets we have the empirical rule, which follows
from the properties of the normal distribution.
Ex. SAT math scores have a bell-shaped distribution with mean 515 and standard deviation 114.
(a) What percentage of SAT scores is less than 401 or greater than 629?

Empirical rule for a bell-shaped distribution

(b) What percentage of SAT scores is greater than 743?


(c) What percentage of SAT scores fall between 287 and 515?
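Parts (a)-(c) follow from the 68-95-99.7 empirical rule once each bound is converted to standard-deviation units. A Python sketch of that arithmetic (using the rule's rounded percentages):

```python
# Empirical (68-95-99.7) rule applied to SAT scores: mean 515, sd 114.
MEAN, SD = 515, 114
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}   # % of data within k standard deviations

def z(score):
    """Distance of a score from the mean, in standard deviations."""
    return (score - MEAN) / SD

# (a) below 401 or above 629: both bounds are 1 sd away -> 100 - 68 = 32%
a = 100 - WITHIN[round(abs(z(401)))]
# (b) above 743 = mean + 2 sd: half of the remaining 5% -> 2.5%
b = (100 - WITHIN[round(z(743))]) / 2
# (c) between 287 (= mean - 2 sd) and the mean 515: half of 95% -> 47.5%
c = WITHIN[round(abs(z(287)))] / 2
```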

Paired datasets and the sample correlation coefficient


Two numerical variables are often studied together to investigate any possible relationship between
the two variables. Examples: (a) shoe size and weight; (b) height and weight of a student; (c) the
amount of time spent studying for statistics exam and the score on the exam.
Such datasets consist of pairs of observations {(xi, yi) : i = 1, 2, ..., n}. Suppose, for the time being,
that the x and y variables are numerical.
Ex. Ice cream sales data: The local ice cream shop keeps track of how much ice cream they sell
in a day and the noon temperature on that day. Observations for the last 12 days are recorded in
the table. In this example, x1, ..., x12 are the 12 temperatures in Celsius and y1, ..., y12 are the 12
sales in dollars.

Scatter diagram or plot: A two-dimensional graph where x-axis and y-axis represent data values
from x-variable and y-variable respectively.
Points on the scatter plot may create some patterns: (a) Linear: Points may follow an imaginary
line. (b) Nonlinear: Points may follow a curve.

Notice that warmer weather leads to more sales; there seems to be a linear relationship between
the temperature and sales variables.
R code to draw scatter plot of ice cream data
data = read.table("icecream.txt")
plot(data[, 1], data[, 2], xlab = "temperature", ylab = "sales", type = "p", col = "red", lwd = 2)
abline(lm(data[, 2] ~ data[, 1]))
Note: you need to keep the data file "icecream.txt" in the current R directory.

• Q. Is there a numerical measure that determines the linear relationship between two variables?


Sample correlation coefficient (r) measures the strength and direction of the linear relationship
between two numerical variables.
Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ be the sample means of the x and y variables, and let the
corresponding sample standard deviations be $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ and
$S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$. Let the sample covariance between the x and y variables be

$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$

The sample correlation coefficient, also called Pearson's correlation coefficient, is

$r = \frac{S_{xy}}{S_x S_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}.$

By the Cauchy-Schwarz inequality,

$\left|\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right| \le \sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2},$

where equality holds iff yi = a + bxi for i = 1, ..., n and some a, b ∈ ℝ.

Properties of r:
1. r is a unit free measure such that −1 ≤ r ≤ 1.
2. r only measures strength of linear relationship between x and y variables.
3. r > 0: positive linear relation, r < 0: negative linear relation. If r = ±1, two variables have
perfect linear relationship.

4. r = 0 implies no linear relationship between two variables. But, there may be some “nonlinear”
relationship between x and y variables.
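The formula for r can be coded directly; the (n − 1) factors in Sxy, Sx, and Sy cancel, so deviation sums suffice. A Python sketch (the notes use R), which also illustrates property 3 with a perfectly linear pair:

```python
import math

def pearson_r(x, y):
    """r = Sxy / (Sx * Sy), computed from the deviation sums
    (the 1/(n-1) factors cancel in the ratio)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2 * v + 1 for v in x]   # perfect positive linear relation
r = pearson_r(x, y)          # -> 1.0
```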

• If we interchange x and y variables, does the correlation change?


• Is the correlation coefficient r affected by a linear transformation of the data?
The population version of Pearson's correlation coefficient between random variables X and Y is

$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$

where $\mu_X = E[X]$, $\mu_Y = E[Y]$, $\sigma_X = \sqrt{\mathrm{Var}(X)}$, and $\sigma_Y = \sqrt{\mathrm{Var}(Y)}$.

Ex. What is wrong with these statements?
The correlation between height and weight of Computer Science students
(a) is 2.61
(b) is 0.61 inches per pound
(c) is 0.61, so the correlation between weight and height is −0.61
(d) is 0.61 using inches and pounds, but converting inches to centimeters would make r > 0.61
(since an inch equals about 2.54 centimeters)

Rank Correlations
There are correlation measures other than the Pearson’s correlation coefficient which can detect
non-linear relationships. These measures are based on ranks thereby depending on the relative
positions of the x and y components of the n pairs instead of their actual values. This is why
the name rank correlation. For this reason, one could use such measures for ordinal variables
as well, but the occurrence of ties will be too often creating some complications. Actually, the
following two rank correlation coefficients measures the degree of a monotone relationship between
two categorical/numerical variables.
General Properties:
A higher rank correlation coefficient implies more agreement between rankings. The coefficient is
inside the interval [-1, 1] and assumes the value:
• 1 if the agreement between the two rankings is perfect; the two rankings are the same.
• 0 if the rankings are completely independent.
• −1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the
other.

Spearman’s Rank Correlation:


It is a measure of statistical dependence between the rankings of two variables. Recall that we have
observations of the form {(xi , yi ), i = 1, . . . , n}.
• Let ri be the rank of the i-th observation (i.e., xi ) w.r.t. the x-observations in ascending order.
Similarly, let si be the rank of the i-th observation (i.e., yi ) w.r.t. the y-observations.
• For the time being, suppose there are no ties in these ranks.
• If there is perfect positive correlation, then si = ri for all i = 1, . . . , n.
• If there is perfect negative correlation, then we should have si = n − ri + 1 for all i = 1, . . . , n.
So, writing $d_i = r_i - s_i$, note that
$$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} r_i - \sum_{i=1}^{n} s_i = \frac{n(n+1)}{2} - \frac{n(n+1)}{2} = 0.$$

So, the sum $\sum_{i=1}^{n} d_i$ is of no use. Let us instead consider
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (r_i - s_i)^2,$$
which is equal to 0 if $r_i = s_i$ for all $i$. For the other extreme, when $s_i = n - r_i + 1$ for all $i$,
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (2r_i - n - 1)^2 = \frac{n(n^2 - 1)}{3}. \quad \text{(verify)}$$
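For the “(verify)” step: since the ranks $r_1, \ldots, r_n$ are a permutation of $1, \ldots, n$, using the power sums $\sum_{i=1}^{n} i = n(n+1)/2$ and $\sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6$,
$$\sum_{i=1}^{n}(2r_i - n - 1)^2 = \sum_{i=1}^{n}(2i - n - 1)^2 = 4\sum_{i=1}^{n} i^2 - 4(n+1)\sum_{i=1}^{n} i + n(n+1)^2$$
$$= \frac{2n(n+1)(2n+1)}{3} - 2n(n+1)^2 + n(n+1)^2 = n(n+1)\,\frac{2(2n+1) - 3(n+1)}{3} = \frac{n(n^2-1)}{3}.$$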

Spearman’s rank correlation coefficient $r_{SP}$ is defined as
$$r_{SP} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}.$$

• Thus: −1 ≤ rSP ≤ 1, where rSP = 1 if there is a perfect positive association (i.e., if x increases,
then y increases and if x decreases, then y decreases) and rSP = −1 if there is a perfect negative
association (i.e., if x increases, then y decreases and if x decreases, then y increases).
• While Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic
relationships. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs
when each of the variables is a perfect monotone function of the other.
Exercise: Check that rSP is the same as the Pearson’s correlation coefficient r based on the rank
variables {(ri , si ) : i = 1, . . . , n} if all ranks are distinct integers (i.e., no tie occurs).
If tie occurs: If there is any tie, that is broken in the usual manner by assigning the same rank to
the units with tied observations. For example, if the 5-th and 6-th observations in the sorted data
are tied, we assign the rank 5.5 to both of them. Also, if the 5-th, 6-th and the 7-th observations in
the sorted data are tied, we assign the rank 6 to all three of them.

Kendall’s Tau τ :
Note that the observation in each unit, say the i-th unit, is a pair (xi , yi ) with corresponding ranks
(ri , si ). Again, assume no tie for the time.
• For each pair of units, say (i, j) with i < j, we assign a score of +1 if their ranks w.r.t.
both the x and y variables are in the same direction, that is, if {ri < rj and si < sj } or if
{ri > rj and si > sj }; such pairs are called concordant.
• We assign a score of −1 to a pair (i, j) with i < j if their ranks w.r.t. both the x and y variables
are in the reverse direction, that is, if {ri < rj and si > sj } or if {ri > rj and si < sj }; such
pairs are called discordant.
• Let P be the number of pairs (i.e., (i, j) with i < j) with +1 scores and let Q denote the number
of pairs with −1 scores. Clearly, $P + Q = \binom{n}{2}$.
Kendall’s τ is defined as
$$\tau = \frac{P - Q}{\binom{n}{2}} = \frac{\text{no. of concordant pairs} - \text{no. of discordant pairs}}{\binom{n}{2}}.$$

Properties: The denominator is the total number of pair combinations, so the coefficient must be
in the range −1 ≤ τ ≤ 1.

• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the
coefficient has value +1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the
other) the coefficient has value −1.
• If there is no monotonic relationship between x and y variables, then we would expect the
coefficient to be approximately zero.
If tie occurs: The ties are broken as before and a score of 0 is assigned to any pair (i, j), with
i < j, if the above two conditions for scores of +1 and −1 with strict inequality are not met.
Remark: For each pair (i, j), define
$$a_{ij} = \begin{cases} +1 & \text{if } r_i < r_j, \\ -1 & \text{if } r_i > r_j, \\ 0 & \text{if } r_i = r_j. \end{cases}$$
Similarly, define
$$b_{ij} = \begin{cases} +1 & \text{if } s_i < s_j, \\ -1 & \text{if } s_i > s_j, \\ 0 & \text{if } s_i = s_j. \end{cases}$$
(H.W.) We can show that Pearson’s correlation coefficient r between the aij and bij variables,
i.e., computed from the pairs {(aij , bij ) : i, j = 1, . . . , n}, gives Kendall’s τ .

R code for Pearson’s and rank correlations


x = rnorm(1000, 0, 1)
y = exp(sqrt(exp(x)))
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")

Remark: Association doesn’t imply causation; i.e., the fact that two variables are associated does not
necessarily mean that one variable causes the other.
Ex 1. “The number of firemen and the severity of the fire are highly correlated.”
Wrong conclusion: the high correlation implies that the firemen cause the fire.
Ex 2. “The faster windmills are observed to rotate, the more wind is observed to be.”
Wrong conclusion: wind is caused by the rotation of windmills. In practice, it is rather the other
way around.
Ex 3. “Children that watch a lot of TV are the most violent. So, TV makes children more violent.”
Not necessarily true. This could easily be the other way round; that is, violent children like watching
more TV than less violent ones.
Ex 4. Study shows that there is a significant correlation (0.791) between a country’s consumption of
chocolate and the number of Nobel prizes (averaged per person).
