7CCMMS61 Statistics For Data Analysis: Francisco Javier Rubio Department of Mathematics
I would appreciate it if you pointed out any typos you spot to me (javier.rubio [email protected]).
Disclaimer: These notes should not be distributed or used for commercial purposes.
Week 2: Exploratory Data Analysis
Definition 1. Dispersion: The amount by which a set of observations deviate from their mean (or
another measure of location). When the values of a set of observations are close to their mean, the
dispersion is less than when they are spread out widely from the mean (or another measure of location).
Definition 2. Range: The difference between the largest and smallest observations in a data set. Often
used as an easy-to-calculate measure of the dispersion in a set of observations but not recommended for
this task because of its sensitivity to outliers and the fact that its value increases with sample size.
Formally, for raw data this is defined as
$$R = x_{\max} - x_{\min} = x_{(n)} - x_{(1)},$$
where $x_{(1)}, \ldots, x_{(n)}$ are the sorted observations. For data grouped into classes,
$$R = x_k^u - x_1^l,$$
where $x_k^u$ is the upper class limit of the last class and $x_1^l$ the lower class limit of the first class.
Definition 3. Interquartile range: A measure of spread given by the difference between the first and
third quartiles of a sample.
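As a small illustration of why the range is sensitive to outliers while the interquartile range is much less so, the following sketch compares both measures before and after appending one extreme observation. The vector `x` contains made-up values chosen for illustration; it is not data from these notes.

```r
# Illustrative values (an assumption, not data from the notes)
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)
# The same sample with one extreme observation appended
x_out <- c(x, 100)

# Range: largest minus smallest observation
range_x   <- max(x) - min(x)
range_out <- max(x_out) - min(x_out)

# Interquartile range: third quartile minus first quartile
iqr_x   <- IQR(x)
iqr_out <- IQR(x_out)

print(c(range = range_x, IQR = iqr_x))
print(c(range = range_out, IQR = iqr_out))
```

The range changes drastically once the extreme value is included, whereas the interquartile range moves only slightly.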
Definition 4. Variance: In a population, the second moment about the mean. An unbiased estimator of
the population value is provided by $s^2$, given by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{n}x_i^2 - \frac{n}{n-1}\bar{x}^2,$$
$$s^2 = \frac{1}{n-1}\sum_{j=1}^{k}(x_j - \bar{x})^2\, h(x_j).$$
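The different expressions for the sample variance can be checked numerically. The sketch below uses a made-up vector and assumes that h(xj) denotes the absolute frequency of the distinct value xj, which is the reading consistent with the factor 1/(n − 1); that interpretation is an assumption, not something stated in the notes.

```r
# Illustrative values (an assumption, not data from the notes)
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)
n <- length(x)
xbar <- mean(x)

# Definition: sum of squared deviations divided by n - 1
s2_def <- sum((x - xbar)^2) / (n - 1)

# Computational form: sum of squares minus n * mean^2, both over n - 1
s2_short <- sum(x^2) / (n - 1) - n / (n - 1) * xbar^2

# Frequency form: distinct values x_j weighted by their counts h(x_j)
h  <- table(x)
xj <- as.numeric(names(h))
s2_grouped <- sum((xj - xbar)^2 * as.numeric(h)) / (n - 1)

# All three should agree with R's built-in var()
print(c(s2_def, s2_short, s2_grouped, var(x)))
```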
• Linear transformation of the data:
$$y_i = a + b \cdot x_i \quad (b \neq 0)$$
a: shift of the data
0 < b < 1: compression of the data
b > 1: dilation of the data
b < 0: mirroring at the origin with dilation or compression
• Normalization
2. of the median
$$y_{0.5} = a + b\,x_{0.5},$$
3. of the variance
$$s_y^2 = b^2 s_x^2, \qquad s_y = |b|\, s_x,$$
4. of the range
$$R_Y = |b|\, R_X.$$
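The transformation rules for the median, variance, standard deviation and range can be verified numerically. The following sketch uses made-up data and arbitrary values of a and b (with b negative, so that the absolute value matters); all of these values are assumptions for illustration.

```r
# Illustrative values (an assumption, not data from the notes)
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)
a <- 3
b <- -2
y <- a + b * x   # linear transformation of the data

# Each check compares the statistic of y with the rule applied to x
check_median <- all.equal(median(y), a + b * median(x))
check_var    <- all.equal(var(y), b^2 * var(x))
check_sd     <- all.equal(sd(y), abs(b) * sd(x))
check_range  <- all.equal(diff(range(y)), abs(b) * diff(range(x)))

print(c(median = check_median, variance = check_var,
        sd = check_sd, range = check_range))
```

The variance rule follows because the shift a cancels in the deviations from the mean, leaving only the factor b, which is then squared.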
Now, let us discuss one of the most popular visual tools: box plots.
Boxplots: Ingredients
• Calculate the median, 25th percentile (Q1), and 75th percentile (Q3).
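• Calculate the upper and lower limits of the whiskers (the fences) as Q3 + 1.5 × IQR and Q1 − 1.5 × IQR, where IQR = Q3 − Q1. The box itself spans Q1 to Q3, and observations beyond the fences are flagged as potential outliers.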
• Boxplots are used to describe and communicate the characteristics of a data set.
• You can present more than one boxplot associated to different data sets. This is often used to
compare populations.
• Example.
Now, let’s calculate all of these quantities and tools in two simulated data sets.
Figure 2.1.1: Boxplot.
# Interquartile range
IQR(x)
IQR(y)
# Variance
var(x)
var(y)
# Boxplot
boxplot(x, y, names = c("Population 1", "Population 2"), cex.lab = 1.5,
        cex.axis = 1.5, col = c("red", "blue"))
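The listing above relies on two samples, x and y. A minimal self-contained sketch is given below, assuming two normal samples; the seed, sample sizes and distribution parameters are arbitrary choices for illustration and not necessarily those used to produce Figure 2.1.1. It also computes the boxplot ingredients by hand. Note that `boxplot()` uses hinges computed by `fivenum()`, which can differ slightly from the quartiles returned by `quantile()`.

```r
set.seed(123)                        # arbitrary seed, for reproducibility
x <- rnorm(100, mean = 0, sd = 1)    # assumed sample for "Population 1"
y <- rnorm(100, mean = 1, sd = 2)    # assumed sample for "Population 2"

# Boxplot ingredients for x: Q1, median, Q3
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]

# Fences: observations outside them are flagged as potential outliers
upper_fence <- q[3] + 1.5 * iqr
lower_fence <- q[1] - 1.5 * iqr
outliers <- x[x > upper_fence | x < lower_fence]

print(q)
print(c(lower = unname(lower_fence), upper = unname(upper_fence)))
print(outliers)

# Side-by-side boxplots to compare the two samples
boxplot(x, y, names = c("Population 1", "Population 2"),
        col = c("red", "blue"))
```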
Further Reading 1. If you are interested in learning about other visual tools, check the following R
Markdown:
2.2 Lecture 2: Exploratory Data Analysis VI
Definition 5. Joint distribution: Essentially synonymous with multivariate distribution, although used
particularly as an alternative to bivariate distribution when two variables are involved.
Thus, a bivariate distribution is a particular case, where only two variables are considered.
Definition 6. Bivariate distribution: The joint distribution of two random variables, X and Y .
Pairs of discrete variables are often summarised in frequency tables (also called frequency distribution tables).
Definition 7. In statistics, a frequency distribution is a list, table or graph that displays the frequency of
various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences
of values within a particular group or interval.
The total row and total column report the marginal frequencies or marginal distribution, while the
body of the table reports the joint frequencies.
The following example illustrates the concepts of frequency table and marginal distribution.
# Define the table of values
mytable <- rbind(c(240,120,70),
c(160,90,90),
c(30,30,30),
c(37,7,6),
c(40,32,18))
# Marginal distributions:
# total
margin.table(mytable)
# X
margin.table(mytable,1)
# Y
margin.table(mytable,2)
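The same table can also be expressed in relative frequencies. The sketch below re-creates `mytable` from the listing above so that it runs on its own, and uses `prop.table()` to obtain the joint relative frequencies, from which the marginal relative frequencies follow. The variable names `joint_rel`, `marg_x` and `marg_y` are introduced here for illustration only.

```r
# Same table of joint (absolute) frequencies as in the notes
mytable <- rbind(c(240, 120, 70),
                 c(160, 90, 90),
                 c(30, 30, 30),
                 c(37, 7, 6),
                 c(40, 32, 18))

# Total number of observations
n <- sum(mytable)

# Joint relative frequencies: each cell divided by the grand total
joint_rel <- prop.table(mytable)

# Marginal relative frequencies of X (rows) and Y (columns)
marg_x <- margin.table(joint_rel, 1)
marg_y <- margin.table(joint_rel, 2)

print(n)
print(joint_rel)
print(marg_x)
print(marg_y)
```

Each set of marginal relative frequencies sums to one, as does the full table of joint relative frequencies.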
Continuous variables are often summarised in a table. However, if a large amount of data is present, the
interpretation of the table might be difficult. In such cases, it is better to produce a scatter plot or scatter
diagram.
Definition 8. Scatter diagram: A two-dimensional plot of a sample of bivariate observations. The
diagram is an important aid in assessing what type of relationship links the two variables.
The following R code illustrates how to produce scatter plots of two or more variables, as well as
showing only part of a table.
# Attach the "iris" data set
attach(iris)
help(iris)
# An alternative method
pairs(iris[,1:4], cex.axis = 1.5, cex.lab = 1.5, pch = 19)
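As a complement to the scatter plot matrix, the following sketch shows one possible way to display only part of a table and to draw a scatter plot of a single pair of variables from the built-in `iris` data set. The choice of the two variables and of the first five rows is arbitrary and made here for illustration.

```r
# Show only the first rows of the table
first_rows <- head(iris, 5)
print(first_rows)

# Show only selected rows and columns
part <- iris[1:5, c("Sepal.Length", "Petal.Length")]
print(part)

# Scatter plot of one pair of continuous variables
plot(iris$Sepal.Length, iris$Petal.Length, pch = 19,
     xlab = "Sepal length", ylab = "Petal length")
```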
When the data set contains a mixture of continuous and discrete variables, summaries of the continuous
variable are typically reported for each of the values of the corresponding discrete variable. An example
of this is presented in the following R code, where you need to install an additional package containing
the data set.
# We need an additional package that contains the data set
#install.packages("fda")
library(fda)
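The idea of reporting summaries of a continuous variable for each value of a discrete variable can also be sketched with the built-in `iris` data set, which needs no additional package. This is an illustrative substitute and not the data set from the `fda` package referred to above.

```r
# Summaries of a continuous variable (Sepal.Length)
# for each level of a discrete variable (Species)
group_summary <- tapply(iris$Sepal.Length, iris$Species, summary)
print(group_summary)

# Group means and standard deviations
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_sds   <- tapply(iris$Sepal.Length, iris$Species, sd)
print(rbind(mean = group_means, sd = group_sds))

# One boxplot per group gives a visual version of the same comparison
boxplot(Sepal.Length ~ Species, data = iris, col = "lightgrey")
```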
2.3 Lecture 3: Exploratory Data Analysis VII
The marginal distribution basically refers to focusing the analysis on a specific variable, even if we have
two or more variables (i.e. without accounting for the remaining variables).
Definition 9. Marginal distribution: The probability distribution of a single variable, or combinations
of variables, in a multivariate distribution. Obtained from the multivariate distribution by integrating (or
adding) over the other variables.
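For two variables, Definition 9 can be written out as follows. The notation is an assumption introduced here: f(x, y) denotes the joint probability mass function (discrete case) or joint density (continuous case) of X and Y.

```latex
% Discrete case: add the joint probabilities over the values of Y
f_X(x_i) = \sum_{j} f(x_i, y_j)

% Continuous case: integrate the joint density over y
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy
```

The marginal distribution of Y is obtained analogously, by adding or integrating over x.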
The following R code illustrates the use of barplots for summarising conditional distributions. Try to
reflect on what questions you can answer in each scenario.
# conditional distribution of an HIV test given HIV infection
# These results resemble the scenario in which the test is applied to a risk group
tab1 <- rbind(c(0.995,0.005),c(0.005,0.995))
colnames(tab1) <- c("present","not-present")
rownames(tab1) <- c("positive","negative")
print(tab1)
# print transposed table
print(t(tab1))
# marginal values
margin.table(tab1,2)
# Barplot
barplot(t(tab1), beside = TRUE, col = c("red","blue"),
main = "HIV test vs HIV infection", ylim = c(0,1),
ylab = "Conditional relative frequency",
cex.axis = 1.5, cex.lab = 1.5)
box()
legend("center",
c("present","not-present"),
fill = c("red","blue"))
# marginal values
margin.table(tab2,1)
# Barplot
barplot(tab2, beside = TRUE, col = c("red","green"),
main = "HIV infection vs HIV test", ylim = c(0,1),
ylab = "Conditional relative frequency",
cex.axis = 1.5, cex.lab = 1.5)
box()
legend("center",
c("positive","negative"),
fill = c("red","green"))
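To move from the conditional distribution of the test given infection to the conditional distribution of infection given the test result, one also needs the marginal distribution of infection. The sketch below shows the mechanics with a hypothetical prevalence of 1%. This value is an assumption chosen for illustration, so the resulting table is not claimed to reproduce the one used in the notes.

```r
# Conditional distribution of the test given infection (as in tab1)
cond_test <- rbind(c(0.995, 0.005),
                   c(0.005, 0.995))
dimnames(cond_test) <- list(test = c("positive", "negative"),
                            infection = c("present", "not-present"))

# Hypothetical prevalence (assumption, not from the notes)
prevalence <- 0.01
p_infection <- c("present" = prevalence, "not-present" = 1 - prevalence)

# Joint distribution: multiply each column by P(infection status)
joint <- sweep(cond_test, 2, p_infection, "*")

# Conditional distribution of infection given the test result:
# divide each row by its row total, i.e. by P(test result)
cond_infection <- prop.table(joint, 1)

print(joint)
print(cond_infection)
```

Changing `prevalence` shows how strongly the probability of infection given a positive test depends on the group to which the test is applied.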
2.4 Lecture 4: Exploratory Data Analysis VIII
Thus, we will say that two variables or events are dependent if they are not independent. This is a
slightly vague definition and, as such, there is no unique way of identifying or summarising the dependence
of two variables. For this reason, there exist a number of summary statistics for illustrating or
identifying dependence between two variables.
One of these quantities is the covariance. The covariance (or sample covariance) is a statistic calculated
from the joint dispersion of two numerical variables.
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
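The sample covariance can be computed directly from its definition and compared with R's built-in `cov()`. The two vectors below are made-up values used only for illustration.

```r
# Illustrative paired observations (an assumption, not data from the notes)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
n <- length(x)

# Sample covariance from the definition:
# products of deviations from the means, summed and divided by n - 1
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Built-in function for comparison
print(c(by_hand = sxy, built_in = cov(x, y)))
```

A positive value indicates that large values of x tend to occur together with large values of y. The magnitude depends on the units of both variables, which is one motivation for the correlation coefficient.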
Definition 12. Correlation coefficient: An index that quantifies the linear relationship between a pair of
variables. An estimator of the correlation coefficient obtained from n sample values of the two variables
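of interest, $(x_1, y_1), \ldots, (x_n, y_n)$, is Pearson's product moment correlation coefficient, $r$, given by
$$r_{xy} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{s_x \cdot s_y} = \frac{s_{xy}}{s_x \cdot s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$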
The coefficient takes values between −1 and 1, with the sign indicating the direction of the relationship
and the numerical magnitude its strength. Values of −1 or 1 indicate that the sample values fall on a
straight line. A value of zero indicates the lack of any linear relationship between the two variables.
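A value of zero rules out a linear relationship only; the variables may still be strongly related in a nonlinear way. The following sketch, with made-up values, shows an exact quadratic relationship with zero correlation, alongside an exact linear relationship with correlation −1.

```r
# Illustrative values, symmetric around zero (an assumption)
x <- -5:5

# y is completely determined by x, but not linearly
y_quad <- x^2
r_quad <- cor(x, y_quad)

# y is an exact decreasing linear function of x
y_line <- 3 - 2 * x
r_line <- cor(x, y_line)

print(c(quadratic = r_quad, linear = r_line))
```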
The following example presents the comparison between the variables height and weight measured in
15 American women aged 30-39. What do you think about the relationship between these two variables
based on the different summary statistics and plots?
# loading the "women" data set
attach(women)
help(women)
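One possible way to obtain the summary statistics and the plot for this data set is sketched below, using the built-in `women` data frame with columns `height` and `weight`. This is an illustrative continuation and not necessarily the code used in the lecture.

```r
# Load the built-in data set (15 observations of height and weight)
data(women)

# Covariance and correlation between height and weight
sxy <- cov(women$height, women$weight)
r   <- cor(women$height, women$weight)

# The correlation is the covariance scaled by both standard deviations
r_manual <- sxy / (sd(women$height) * sd(women$weight))

print(c(covariance = sxy, correlation = r, check = r_manual))

# Scatter plot of weight against height
plot(women$height, women$weight, pch = 19,
     xlab = "Height", ylab = "Weight")
```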
The following R code presents a fun example that illustrates the importance of visualising the data,
as visual tools often reveal features that cannot be understood from summary statistics. The data set
(Datasaurus_data.csv) can be downloaded from KEATS.
# Read the datasaurus file (produced by Professor Alberto Cairo)
# The data are defined as an "n x 2" matrix
data <- as.matrix(read.csv("Datasaurus_data.csv", header = TRUE))
# Extract the two columns (assumed to hold X and Y, in that order)
X <- data[, 1]
Y <- data[, 2]
# Correlation coefficient
cor(X, Y)
# Based on this correlation coefficient, how related do you think X and Y are?
# Plot Y vs X
plot(X, Y, pch = 19, cex.axis = 1.5, cex.lab = 1.5, main = "Data Saurus")
Bibliography
[1] Y. Dodge and D. Commenges. The Oxford dictionary of statistical terms. Oxford University Press
on Demand, 2006.