
7CCMMS61 Statistics for Data Analysis

Francisco Javier Rubio


Department of Mathematics
Contents

2 Week 2: Exploratory Data Analysis
  2.1 Lecture 1: Exploratory Data Analysis V
    2.1.1 Dispersion Statistics
  2.2 Lecture 2: Exploratory Data Analysis VI
    2.2.1 Bivariate statistics - Joint distribution
      2.2.1.1 Discrete Variables
      2.2.1.2 Continuous Variables
      2.2.1.3 Mix of Continuous and Discrete Variables
  2.3 Lecture 3: Exploratory Data Analysis VII
  2.4 Lecture 4: Exploratory Data Analysis VIII
    2.4.1 Dependence Statistics

I would appreciate it if you point out to me any typos you spot: (javier.rubio [email protected]).

Disclaimer: These notes should not be distributed or used for commercial purposes.
Week 2: Exploratory Data Analysis

2.1 Lecture 1: Exploratory Data Analysis V

2.1.1 Dispersion Statistics


In general, the aim of dispersion statistics is to measure deviations from a reference quantity, which can
be the mean, median, or some other quantity. That is, dispersion is the variability in the observed values
of a numerical variable.

Definition 1. Dispersion: The amount by which a set of observations deviate from their mean (or
another measure of location). When the values of a set of observations are close to their mean, the
dispersion is less than when they are spread out widely from the mean (or another measure of location).

Next, we study some measures of dispersion:

Definition 2. Range: The difference between the largest and smallest observations in a data set. It is
often used as an easy-to-calculate measure of the dispersion in a set of observations, but it is not
recommended for this task because of its sensitivity to outliers and the fact that its value tends to
increase with the sample size. Formally, this is defined as

R = x_max − x_min = x_(n) − x_(1),

where x_(1) ≤ · · · ≤ x_(n) are the sorted observations.
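The range is straightforward to compute in R. A minimal sketch with illustrative data (note that R's built-in range() returns the pair (min, max), so diff() yields the statistic defined above):

```r
# Illustrative data
x <- c(4.2, 1.7, 9.3, 5.5, 2.8)

# R = x_(n) - x_(1): largest minus smallest observation
R <- max(x) - min(x)

# Equivalent shortcut: range() returns c(min, max), diff() takes the difference
R2 <- diff(range(x))

print(R)  # 7.6
```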

For classified variables, the approximate range is

R = xuk − xl1 ,

where xuk is the upper class limit of the last class and xl1 the lower class limit of the first class.

Definition 3. Interquartile range: A measure of spread given by the difference between the third and
first quartiles of a sample, IQR = Q3 − Q1.

Definition 4. Variance: In a population, the second moment about the mean. An unbiased estimator of
the population value is provided by s^2, given by

s^2 = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)^2 = (1/(n − 1)) Σ_{i=1}^{n} x_i^2 − (n/(n − 1)) x̄^2.

For classified variables with k classes, with class representatives x_j and absolute frequencies h(x_j),
the variance is computed as

s^2 = (1/(n − 1)) Σ_{j=1}^{k} (x_j − x̄)^2 h(x_j).
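The grouped-data formula can be checked numerically in R. A minimal sketch (the values and frequencies below are illustrative), comparing the frequency-table computation with var() applied to the expanded raw data:

```r
# Distinct values x_j with absolute frequencies h(x_j) (illustrative)
xj <- c(1, 2, 3)
h  <- c(5, 3, 2)
n  <- sum(h)

# Weighted mean and grouped-data variance
xbar <- sum(xj * h) / n
s2   <- sum((xj - xbar)^2 * h) / (n - 1)

# Same result as var() on the expanded raw data
raw <- rep(xj, times = h)
all.equal(s2, var(raw))  # TRUE
```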

Next, we discuss some properties of these summary statistics:

• Linear transformation of the data:

y_i = a + b · x_i (b ≠ 0)

– a: shift of the data;
– 0 < b < 1: compression of the data;
– b > 1: dilation of the data;
– b < 0: mirroring at the origin, with dilation or compression.
• Normalization

z_i = a + b x_i with a = −x̄/s_x , b = 1/s_x ,

that is,

z_i = (x_i − x̄)/s_x ⟹ z̄ = 0, s_z^2 = 1.
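This standardisation is easy to verify numerically. A minimal sketch with simulated data (scale() is R's built-in equivalent):

```r
# Simulated data
set.seed(1)
x <- rnorm(20, mean = 5, sd = 2)

# z-scores: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # ~0 (up to floating-point error)
var(z)   # 1

# Built-in equivalent
all.equal(z, as.numeric(scale(x)))  # TRUE
```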

• Properties of location and dispersion parameters (Linear transformation)

1. of the arithmetic mean


ȳ = a + bx̄,

2. of the median
y0.5 = a + bx0.5 ,

3. of the variance and standard deviation
s_y^2 = b^2 s_x^2 , s_y = |b| s_x ,

4. of the range
RY = |b|RX ,

5. of the interquartile range


IQRY = |b| IQRX .
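All of these transformation properties can be verified numerically. A minimal sketch with simulated data and an arbitrary choice of a and b (here b < 0, so the |b| factors matter); each comparison returns TRUE up to floating-point tolerance:

```r
# Simulated data and an affine transformation y = a + b*x with b < 0
set.seed(123)
x <- rnorm(100)
a <- 2; b <- -3
y <- a + b * x

all.equal(mean(y),   a + b * mean(x))               # ybar = a + b*xbar
all.equal(median(y), a + b * median(x))             # y_0.5 = a + b*x_0.5
all.equal(var(y),    b^2 * var(x))                  # s_y^2 = b^2 s_x^2
all.equal(sd(y),     abs(b) * sd(x))                # s_y = |b| s_x
all.equal(diff(range(y)), abs(b) * diff(range(x)))  # R_Y = |b| R_X
all.equal(IQR(y),    abs(b) * IQR(x))               # IQR_Y = |b| IQR_X
```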

Now, let us discuss one of the most popular visual tools: box plots.
Boxplots: Ingredients

• Calculate the median, 25th percentile (Q1), and 75th percentile (Q3).

• Calculate the interquartile range IQR = Q3 − Q1.

• Calculate the lower and upper fences as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR; the whiskers extend to the
most extreme observations within these fences.

• Observations falling outside the fences are considered “outliers”.
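These ingredients can be computed by hand. A minimal sketch with simulated data; note that R's boxplot.stats() uses hinges rather than quantile(), so its fences can differ slightly for small samples:

```r
# Simulated data
set.seed(123)
x <- rnorm(200)

Q1  <- unname(quantile(x, 0.25))
Q3  <- unname(quantile(x, 0.75))
iqr <- Q3 - Q1

# Fences beyond which observations are flagged as outliers
lower <- Q1 - 1.5 * iqr
upper <- Q3 + 1.5 * iqr

outliers <- x[x < lower | x > upper]
length(outliers)

# Compare with R's own boxplot machinery (hinge-based, so it may differ slightly)
boxplot.stats(x)$out
```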

Boxplots: Graphical representation

• Boxplots are used to describe and communicate the characteristics of a data set.

• You can present more than one boxplot associated with different data sets. This is often used to
compare populations.

• An example is shown in Figure 2.1.1.

Now, let’s calculate all of these quantities and tools in two simulated data sets.

Figure 2.1.1: Boxplot.

n <- 500 # Sample size

# Two simulated data sets


set.seed(123)
x <- rnorm(n,0,1)
y <- rnorm(n,0,1)

# Interquartile range
IQR(x)
IQR(y)

# Variance
var(x)
var(y)

# Five number summary


summary(x)
summary(y)

# Boxplot
boxplot(x, y, names = c("Population 1","Population 2"), cex.lab = 1.5,
        cex.axis = 1.5, col = c("red","blue"))

Further Reading 1. If you are interested in learning about other visual tools, check the following R
Markdown:

[”Some visual tools for comparing two univariate samples”]

2.2 Lecture 2: Exploratory Data Analysis VI

2.2.1 Bivariate statistics - Joint distribution


In this section we will focus on the description of pairs of variables. That is, we are now interested
in describing and summarising variables that are observed simultaneously. This generalisation
motivates the concepts of joint distribution (or multivariate distribution), and bivariate distribution.

Definition 5. Joint distribution: Essentially synonymous with multivariate distribution, although used
particularly as an alternative to bivariate distribution when two variables are involved.

Thus, a bivariate distribution is a particular case, where only two variables are considered.

Definition 6. Bivariate distribution: The joint distribution of two random variables, X and Y .

2.2.1.1 Discrete Variables

Pairs of discrete variables are often summarised in frequency tables (also called frequency distribution tables).

Definition 7. In statistics, a frequency distribution is a list, table or graph that displays the frequency of
various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences
of values within a particular group or interval.
The total row and total column report the marginal frequencies or marginal distribution, while the
body of the table reports the joint frequencies.

The following example illustrates the concepts of frequency table and marginal distribution.
# Define the table of values
mytable <- rbind(c(240,120,70),
c(160,90,90),
c(30,30,30),
c(37,7,6),
c(40,32,18))

# Name the columns and rows


colnames(mytable) <- c("rare", "occasional", "regular")
rownames(mytable) <- c("Manual worker", "Non-manual worker",
"Office worker", "Farmer", "Others")

# print the table


print(mytable)

# Marginal distributions:
# total
margin.table(mytable)
# X
margin.table(mytable,1)
# Y
margin.table(mytable,2)
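Relative frequencies are often more informative than raw counts. A minimal sketch applying prop.table() to the same illustrative table: without a margin argument all cells are divided by the grand total (joint relative frequencies, summing to 1), while margin = 1 rescales each row to sum to 1.

```r
# Same illustrative table as above
mytable <- rbind(c(240, 120, 70),
                 c(160,  90, 90),
                 c( 30,  30, 30),
                 c( 37,   7,  6),
                 c( 40,  32, 18))
colnames(mytable) <- c("rare", "occasional", "regular")
rownames(mytable) <- c("Manual worker", "Non-manual worker",
                       "Office worker", "Farmer", "Others")

# Joint relative frequencies: cells divided by the grand total
round(prop.table(mytable), 3)

# Row-wise relative frequencies: each row sums to 1
round(prop.table(mytable, margin = 1), 3)
```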

2.2.1.2 Continuous Variables

Continuous variables are often summarised in a table; however, when a large amount of data is present,
the interpretation of the table can be difficult. In such cases, it is better to produce a scatter plot or
scatter diagram.

Definition 8. Scatter diagram: A two-dimensional plot of a sample of bivariate observations. The
diagram is an important aid in assessing what type of relationship links the two variables.

The following R code illustrates how to produce scatter plots of two or more variables, as well as
showing only part of a table.
# Attach the "iris" data set
attach(iris)
help(iris)

# Dimension of the data set


dim(iris)

# Let’s show part of the data


head(iris)

# Let’s now produce a scatter plot of the first two columns


plot(iris[,1:2], cex.axis = 1.5, cex.lab = 1.5, pch = 19)

# Let’s now produce a scatter plot of the 4 columns containing numeric values


plot(iris[,1:4], cex.axis = 1.5, cex.lab = 1.5, pch = 19)

# An alternative method
pairs(iris[,1:4], cex.axis = 1.5, cex.lab = 1.5, pch = 19)

2.2.1.3 Mix of Continuous and Discrete variables

When the data set contains a mixture of continuous and discrete variables, summaries of the continuous
variable are typically reported for each of the values of the corresponding discrete variable. An example
of this is presented in the following R code, where you need to install an additional package containing
the data set.
# We need an additional package that contains the data set
#install.packages("fda")
library(fda)

# growth data set


attach(growth)

# Let’s try to understand what’s inside this data set


help(growth)
head(growth)
str(growth)

# Let’s focus on the variables hgtf and hgtm


matplot(age, hgtf, col="red", pch=19, xlab = "Age(Years)", ylab="height(cm)")

matplot(age, hgtm, col="blue", pch=19, xlab = "Age(Years)", ylab="height(cm)")

# Let’s over-plot these two graphs in order to compare them


matplot(age, hgtf, col="red", pch="*", xlab = "Age(Years)",
ylab="height(cm)", ylim = c(50,200))

matplot(age, hgtm, col="blue", pch="+", xlab = "Age(Years)",


ylab="height(cm)", add = T)

2.3 Lecture 3: Exploratory Data Analysis VII
The marginal distribution basically refers to focusing the analysis on a specific variable, even if we have
two or more variables (i.e. without accounting for the remaining variables).
Definition 9. Marginal distribution: The probability distribution of a single variable, or combinations
of variables, in a multivariate distribution. Obtained from the multivariate distribution by integrating (or
adding) over the other variables.

In addition, we might be interested in the distribution of a variable restricted to particular values of
another variable. This leads to the concept of Conditional distribution.
Definition 10. Conditional distribution: The probability distribution of a random variable (or the joint
distribution of several variables) when the values of one or more other random variables are held fixed.

The following R code illustrates the use of barplots for summarising conditional distributions. Try to
reflect on what questions you can answer in each scenario.
# conditional distribution of an HIV test given HIV infection
# These results resemble a scenario where the test is applied to a risk group
tab1 <- rbind(c(0.995,0.005),c(0.005,0.995))
colnames(tab1) <- c("present","not-present")
rownames(tab1) <- c("positive","negative")
print(tab1)
# print transposed table
print(t(tab1))

# marginal values
margin.table(tab1,2)

# Barplot
barplot(t(tab1), beside = TRUE, col = c("red","blue"),
main = "HIV test vs HIV infection", ylim = c(0,1),
ylab = "Conditional relative frequency",
cex.axis = 1.5, cex.lab = 1.5)
box()
legend("center",
c("present","not-present"),
fill = c("red","blue"))

# conditional distribution of HIV infection given HIV test result


# These results resemble a scenario where the test is applied at random
tab2 <- rbind(c(0.289,0.711),c(0.001,0.999))
colnames(tab2) <- c("present","not-present")
rownames(tab2) <- c("positive","negative")
print(tab2)

# marginal values
margin.table(tab2,1)

# Barplot
barplot(tab2, beside = TRUE, col = c("red","green"),
main = "HIV infection vs HIV test", ylim = c(0,1),
ylab = "Conditional relative frequency",
cex.axis = 1.5, cex.lab = 1.5)
box()
legend("center",

c("positive","negative"),
fill = c("red","green"))

2.4 Lecture 4: Exploratory Data Analysis VIII

2.4.1 Dependence Statistics


Definition 11. Independence: Essentially, two events (or variables) are said to be independent if know-
ing the outcome of one tells us nothing about the other.

Thus, we will say that two variables or events are dependent if they are not independent. This is a
slightly vague definition and, as such, there is no unique way of identifying or summarising the depen-
dence of two variables. For this reason, there exist a number of summary statistics for illustrating or
identifying dependence between two variables.
One of these quantities is the Covariance. The covariance (or sample covariance) is a statistic calcu-
lated from the joint dispersion of two numerical variables:

s_{xy} = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ).

A related concept is that of Correlation:

Definition 12. Correlation coefficient: An index that quantifies the linear relationship between a pair of
variables. An estimator of the correlation coefficient obtained from n sample values of the two variables
of interest, (x_1, y_1), . . . , (x_n, y_n), is Pearson’s product moment correlation coefficient, r, given by

r_{xy} = s_{xy} / (s_x · s_y) = [ (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ] / (s_x · s_y)
       = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^{n} (x_i − x̄)^2 · Σ_{i=1}^{n} (y_i − ȳ)^2 ).

The coefficient takes values between −1 and 1, with the sign indicating the direction of the relationship
and the numerical magnitude its strength. Values of −1 or 1 indicate that the sample values fall on a
straight line. A value of zero indicates the lack of any linear relationship between the two variables.
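The estimator above can be computed by hand and checked against R's built-in cov() and cor(). A minimal sketch with simulated data (the data-generating model here is purely illustrative):

```r
# Simulated bivariate data with a positive linear relationship
set.seed(123)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Sample covariance and Pearson correlation, from the definitions
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
rxy <- sxy / (sd(x) * sd(y))

all.equal(sxy, cov(x, y))  # TRUE
all.equal(rxy, cor(x, y))  # TRUE
```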

The following example presents a comparison between the variables height and weight measured on
15 American women aged 30-39. What do you think about the relationship between these two variables,
based on the different summary statistics and plots?
# loading the "women" data set
attach(women)
help(women)

# Calculate the covariance between height and weight


cov(women$height,women$weight)

# Calculate the correlation between height and weight


cor(women$height,women$weight)

# Plot weight vs height


plot(women$height,women$weight,
xlab = "Height (in)", ylab = "Weight (lbs)",
pch = 19, cex.axis = 1.5, cex.lab = 1.5)

The following R code presents a fun example that illustrates the importance of visualising the data,
as visual tools often reveal features that cannot be understood from summary statistics alone. The data set
(Datasaurus data.csv) can be downloaded from KEATS.
# Read the datasaurus file (produced by Professor Alberto Cairo)
# The data are defined as a "n x 2" matrix
data <- as.matrix(read.csv("Datasaurus_data.csv",header = T))

# Define the variable X


X <- data[,1]
# Summary of X
summary(X)

# Define the variable Y


Y <- data[,2]
# Summary of Y
summary(Y)

# Calculate the covariance between X and Y


cov(X,Y)

# Calculate the correlation between X and Y


cor(X,Y)

# Based on this correlation coefficient, how related do you think X and Y are?

# Plot Y vs X
plot(X,Y, pch = 19, cex.axis =1.5, cex.lab =1.5, main = "Data Saurus" )

# Has this plot changed your mind?

