Descriptive Statistics
In order to access the variables directly in our workspace we use the command
attach(iris).
Summary Statistics
The major problem with the mean is that it is very sensitive to outliers.
We take a random vector of mean 20 and then add a random element that
comes from a distribution of much larger mean. We see that the mean is strongly
affected by noise:
> vec<-rnorm(10,20,10)
> mean(vec)
[1] 16.80036
> vec.noise<-c(vec,rnorm(1,300,100))
> mean(vec.noise)
[1] 35.36422
We can make the mean more robust by removing a fraction of the extreme values, using the
trimmed mean.
In R, the function mean accepts a second parameter, trim, which defines the
fraction of observations to discard from each end of the sorted data before averaging.
Example: We discard 10% of the values from each end in the previous example:
> mean(vec,trim=0.1)
[1] 17.78799
> mean(vec.noise,trim=0.1)
[1] 19.51609 # much more robust
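To make the trimming step concrete, here is a minimal sketch that computes a trimmed mean by hand and compares it with mean(..., trim=). The helper name trimmed_mean and the seed are assumptions introduced for illustration; they are not part of the original example, so the numbers will differ from the transcript above.

```r
# Hand-rolled trimmed mean (hypothetical helper, for illustration only).
trimmed_mean <- function(x, trim = 0.1) {
  n <- length(x)
  k <- floor(n * trim)            # observations dropped from EACH end
  sorted <- sort(x)
  mean(sorted[(k + 1):(n - k)])   # average of what remains
}

set.seed(1)                       # arbitrary seed, only for reproducibility
vec <- rnorm(10, 20, 10)
vec.noise <- c(vec, rnorm(1, 300, 100))

manual  <- trimmed_mean(vec.noise, trim = 0.1)
builtin <- mean(vec.noise, trim = 0.1)
c(manual = manual, builtin = builtin)   # the two values should agree
```

With 11 observations and trim = 0.1, one value is dropped from each end, which is enough to remove the single injected outlier.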
The median represents the central ranking position of the variable that separates
the lower half and the upper half of the observations.
Intuitively, it consists of the value where for one half of the observations all values
are greater than it, and for the other half all are less.
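\[
\operatorname{median}(x) =
\begin{cases}
x_{r+1} & \text{if } |x| \text{ (vector length) is odd, } |x| = 2r + 1 \\
\frac{1}{2}\,(x_r + x_{r+1}) & \text{if } |x| \text{ is even, } |x| = 2r
\end{cases}
\]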
For the above example, we see that the median is more robust to noise than the
mean:
> median(vec)
[1] 17.64805
> median(vec.noise)
[1] 17.64839
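The case formula for the median can be sketched directly in R. The helper median_by_formula is a hypothetical name used only here, and the sketch assumes the indices in the formula refer to the values sorted in increasing order.

```r
# Median via the case formula (hypothetical helper, for illustration only).
median_by_formula <- function(x) {
  s <- sort(x)               # the formula indexes the sorted values
  n <- length(s)
  r <- n %/% 2
  if (n %% 2 == 1) {
    s[r + 1]                 # odd length: n = 2r + 1
  } else {
    (s[r] + s[r + 1]) / 2    # even length: n = 2r
  }
}

median_by_formula(c(5, 1, 3))       # 3
median_by_formula(c(5, 1, 3, 10))   # 4
```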
Percentiles or Quantiles
In addition it is very common to talk about the quartiles which are three specific
percentiles:
The first quartile Q1 (lower quartile) is the percentile with k = 25.
The second quartile Q2 is with k = 50 which is equivalent to the median.
The third quartile Q3 (upper quartile) is with k = 75.
# The minimum, the three quartiles and the maximum.
> quantile(Sepal.Length,seq(0,1,0.25))
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Exercise
Using the command tapply analyze the mean, median and quartiles for the
three species of Iris for the four variables.
Do you notice any differences in the different species?
tapply(iris$Petal.Length,iris$Species,summary)
tapply(iris$Petal.Width,iris$Species,summary)
tapply(iris$Sepal.Length,iris$Species,summary)
tapply(iris$Sepal.Width,iris$Species,summary)
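As a possible complement to the four tapply calls, the per-species statistics can also be put side by side in one table. This sketch uses aggregate from base R, a swapped-in alternative to tapply; the object names are assumptions introduced here.

```r
# One row per species, one column per numeric variable.
species_means   <- aggregate(. ~ Species, data = iris, FUN = mean)
species_medians <- aggregate(. ~ Species, data = iris, FUN = median)

# First quartile per species; extra arguments are passed on to quantile().
species_q1 <- aggregate(. ~ Species, data = iris, FUN = quantile, probs = 0.25)

species_means
species_medians
species_q1
```

Reading down a column (for example Petal.Length) makes the differences between species easy to compare.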
Variability Measures
Variability measures, or dispersion measures, tell us how different or similar the
observations tend to be with respect to a particular value. Usually this value
refers to some measure of central tendency.
The range is the difference between the maximum and minimum value:
> max(Sepal.Length)-min(Sepal.Length)
[1] 3.6
The standard deviation is the square root of the variance, which measures the
mean squared difference of the observations from the mean.
\[
\operatorname{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
\[
\operatorname{sd}(x) = \sqrt{\operatorname{var}(x)}
\]
> var(Sepal.Length)
[1] 0.6856935
> sd(Sepal.Length)
[1] 0.8280661
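The variance and standard deviation formulas can be checked against the built-in functions with a short sketch. The variable names var_manual and sd_manual are introduced here for illustration.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance with the n - 1 denominator, then its square root.
var_manual <- sum((x - mean(x))^2) / (n - 1)
sd_manual  <- sqrt(var_manual)

c(var_manual, var(x))   # both values should match
c(sd_manual, sd(x))     # both values should match
```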
\[
\operatorname{AAD}(x) = \frac{1}{n} \sum_{i=1}^{n} |x_i - m(x)|
\]
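A minimal sketch of the AAD formula follows. The text here does not state what m(x) is, so the sketch assumes it is a measure of central tendency passed in as a function (the mean by default); the helper name aad and its center argument are hypothetical.

```r
# Average absolute deviation around a chosen centre (hypothetical helper).
# ASSUMPTION: m(x) is a central-tendency function such as mean or median.
aad <- function(x, center = mean) {
  mean(abs(x - center(x)))
}

x <- c(1, 2, 3, 10)
aad(x)                    # around the mean (4):     3
aad(x, center = median)   # around the median (2.5): 2.5
```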
Finally, the interquartile range (IQR) is defined as the difference between the
third and the first quartile (Q3 − Q1 ).
> IQR(Sepal.Length)
[1] 1.3
\[
r(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x)\,\operatorname{sd}(y)}
\]
A value close to 1 indicates that as one variable grows the other also grows in a
linear proportion.
A value close to -1 indicates an inverse relationship (one is growing and the
other is decreasing).
If the correlation is close to zero, there is no linear relationship between the variables.
Note that a correlation of zero does not imply that there cannot be a non-linear
relationship between the variables.
In R, the correlation is calculated with the command cor.
> cor(iris[,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
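The following sketch recomputes one entry of the correlation matrix from the covariance and standard deviations, and then illustrates the remark about non-linear relationships with a small constructed example. The vectors u and v are made up for illustration.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# Correlation as covariance divided by the product of standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))
c(r_manual, cor(x, y))    # both about 0.963

# Zero correlation does not rule out a non-linear relationship:
u <- -5:5
v <- u^2                  # v is fully determined by u
r_uv <- cor(u, v)         # 0 up to floating-point error
r_uv
```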
Contingency Tables
> table(gender,studies)
        studies
gender   college high school postgraduate
  Female       0           1            2
  Male         2           1            0
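The vectors gender and studies are not defined in this section. As a sketch, the following hypothetical data (chosen only so that the counts match the table above) shows how such a contingency table is built.

```r
# HYPOTHETICAL data: six people, constructed to reproduce the counts above.
gender  <- c("Female", "Female", "Female", "Male", "Male", "Male")
studies <- c("high school", "postgraduate", "postgraduate",
             "college", "college", "high school")

tab <- table(gender, studies)   # cross-tabulate the two categorical variables
tab

# Row-wise proportions: distribution of studies within each gender.
prop.table(tab, margin = 1)
```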
Skewness
Skewness is a measure of the asymmetry of the probability distribution.
\[
\operatorname{skewness}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)\,\operatorname{sd}(x)^3}
\]
It indicates how asymmetric our underlying distribution is compared with the normal
distribution, which is symmetric and has skewness 0.
Generally, we have three types of skewness:
1 Symmetrical: the skewness is close to 0 and the mean is almost the
same as the median.
2 Negative skew: the majority of the observations are concentrated on the
right and the longer tail is on the left (the median is typically greater than the mean).
3 Positive skew: the majority of the observations are concentrated on the
left and the longer tail is on the right (the median is typically less than the mean).
In R, we can calculate skewness with the function skewness from the library
moments:
> library(moments)
#positive skew
> skewness(c(1,1,2,3))
[1] 0.4933822
#symmetrical
> skewness(c(1,2,3))
[1] 0
#negative skew
> skewness(c(1,2,3,3))
[1] -0.4933822
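The value 0.4933822 can be reproduced in base R. As far as we can tell, the moments package uses the moment-based estimator (third central moment divided by the second central moment to the power 3/2, both with denominator n), which differs slightly from a formula with n − 1 and sd(x). The sketch below computes both; the variable names are introduced here for illustration.

```r
x <- c(1, 1, 2, 3)
n <- length(x)
d <- x - mean(x)

# Moment-based estimator: m3 / m2^(3/2), with both moments divided by n.
skew_moments <- (sum(d^3) / n) / (sum(d^2) / n)^(3 / 2)

# Formula with the n - 1 denominator and the sample standard deviation.
skew_text <- sum(d^3) / ((n - 1) * sd(x)^3)

skew_moments   # about 0.4934, matching the transcript
skew_text      # somewhat smaller, because of the n - 1 denominators
```

For large n the two versions are nearly identical; the gap is only noticeable on tiny samples like this one.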
Kurtosis
The Kurtosis describes the “tailedness” of a distribution [Westfall, 2014].
\[
\operatorname{kurtosis}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n-1)\,\operatorname{sd}(x)^4}
\]
Let’s see the three main types of kurtosis.
1 Mesokurtic: This is the normal distribution (kurtosis ≈ 3)
2 Leptokurtic: This distribution has fatter tails than a normal distribution
(kurtosis > 3). Consequently, outliers are more likely to occur than in a
normal distribution.
3 Platykurtic: The distribution has thinner tails than a normal distribution
(kurtosis < 3). Consequently, outliers are less likely to occur than in a
normal distribution.
In R, we can calculate kurtosis with the function kurtosis from the library
moments.
Let’s calculate kurtosis for normal data:
> x <- rnorm(1000, 0,1)
> plot(density(x))
> kurtosis(x)
[1] 2.946869
[Figure: density.default(x = x); y-axis: Density]
Now using random data generated with an exponential long-tailed distribution:
> x<-rexp(1000)
> plot(density(x))
> kurtosis(x)
[1] 6.839485
As expected, we get a positive excess kurtosis (i.e., kurtosis greater than 3), since the
distribution has fatter tails (leptokurtic).
[Figure: density.default(x = x); y-axis: Density]
Now using data generated from a thinner-tailed Beta distribution with
shape parameters 2, 2:
> x <-rbeta(1000,2,2)
> plot(density(x))
> kurtosis(x)
[1] 2.1046
As expected, we get a negative excess kurtosis (i.e., kurtosis less than 3), since the
distribution has thinner tails (platykurtic).
[Figure: density.default(x = x); y-axis: Density]
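The three cases can be compared side by side through the excess kurtosis (kurtosis minus 3). The sketch below uses a moment-based estimator written in base R; the helper name excess_kurtosis, the seed and the sample size are assumptions made for illustration, so the values will not match the transcripts exactly.

```r
# Moment-based excess kurtosis (hypothetical helper): m4 / m2^2 - 3.
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(42)   # arbitrary seed, only for reproducibility
n <- 5000

ek_norm <- excess_kurtosis(rnorm(n))         # mesokurtic: close to 0
ek_exp  <- excess_kurtosis(rexp(n))          # leptokurtic: positive
ek_beta <- excess_kurtosis(rbeta(n, 2, 2))   # platykurtic: negative

c(normal = ek_norm, exponential = ek_exp, beta = ek_beta)
```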
Data Visualization
Representation
Plotting in R
Example
# Three random series: red as a line, blue as points, green as both,
# so that the drawing types match the legend labels
plot(rnorm(15,10,5),col="red",type="l")
lines(rnorm(15,10,5),col="blue",type="p",pch=1)
lines(rnorm(15,10,5),col="green",type="b",pch=2)
title(main="My Plot")
legend('topright', c("lines","dots","both"),
lty=1:3, col=c("red", "blue","green"), bty='n', cex=.75)
[Figure: "My Plot"; x-axis: Index, y-axis: rnorm(15, 10, 5); legend: lines, dots, both]
Histograms
Histograms show the distribution of the values of a variable.
The values are divided into bins, and a bar is drawn for each bin.
The height of each bar indicates the number of observations in the corresponding
bin.
In R they are created with the command hist.
> hist(Sepal.Length)
[Figure: Histogram of Sepal.Length; x-axis: Sepal.Length, y-axis: Frequency]
Felipe Bravo Márquez Descriptive Statistics
Histograms (2)
[Figure: Histogram of Sepal.Length; y-axis: Frequency]
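The shape of a histogram depends on how the bins are chosen. As a sketch, the breaks argument of hist sets the bin edges explicitly; the bin width of 0.1 below is an arbitrary choice for illustration and not necessarily the one used in the figure.

```r
x <- iris$Sepal.Length

# plot = FALSE returns the binning without drawing, so it can be inspected.
h_default <- hist(x, plot = FALSE)                               # default bins
h_fine    <- hist(x, breaks = seq(4, 8, by = 0.1), plot = FALSE) # narrow bins

length(h_default$counts)   # number of default bins
length(h_fine$counts)      # many more, narrower bins
sum(h_fine$counts)         # every observation falls in exactly one bin

# Draw the finer histogram.
hist(x, breaks = seq(4, 8, by = 0.1), main = "Histogram of Sepal.Length")
```

Narrower bins show more detail but also more noise; wider bins smooth the picture.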
Histograms (3)
A very popular library for making visualizations in R, which is part of tidyverse, is
ggplot2.
It is based on the idea of decomposing the plot into semantic components such
as scales and layers.
> install.packages("ggplot2")
> library(ggplot2)
> ggplot(iris, aes(x=Sepal.Length)) +
    geom_histogram(bins = 10, color="black", fill="white")
[Figure: ggplot2 histogram; x-axis: Sepal.Length, y-axis: count]
Density
Another way to visualize how the data are distributed is to estimate a density.
Densities are estimated with a nonparametric statistical technique called kernel
density estimation.
The density is a smoothed version of the histogram and allows us to determine
more clearly if the observed data behaves like a known density e.g., Gaussian.
In R they are created with the command density, and then visualized with the
command plot.
plot(density(iris$Sepal.Length),main="Density of Sepal.Length")
[Figure: Density of Sepal.Length; y-axis: Density]
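To give an idea of what kernel density estimation does, the sketch below evaluates a Gaussian kernel estimate by hand at one point and compares it with the output of density. The helper kde_at and the evaluation point x0 = 6 are assumptions chosen for illustration; density works on a grid with a binned approximation, so the two numbers should be close but not identical.

```r
x <- iris$Sepal.Length
h <- bw.nrd0(x)   # default bandwidth rule used by density()

# Gaussian kernel estimate at a single point: the average of normal
# bumps of width h centred on each observation (hypothetical helper).
kde_at <- function(x0) mean(dnorm((x0 - x) / h)) / h

d  <- density(x)   # Gaussian kernel by default
x0 <- 6
manual  <- kde_at(x0)
builtin <- approx(d$x, d$y, xout = x0)$y   # interpolate the grid at x0

c(manual = manual, builtin = builtin)
```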
Pie Charts
[Figure: pie chart of Species: setosa, versicolor, virginica]
Many people consider pie charts to be misleading and recommend using
histograms as a better alternative [Poldrack, 2019].
Boxplots
Boxplots (2)
Figure: https://www.laboneconsultoria.com.br/wp-content/uploads/2018/07/Boxplot-04.png
Boxplots (3)
The length of the arms (whiskers), as well as the criterion for identifying outliers, is based on
the behavior of a normal distribution.
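As a sketch of the outlier criterion, R's boxplot.stats by default flags points lying more than 1.5 times the box length beyond the hinges. The code below reproduces that rule by hand for Sepal.Width and computes how rarely a normal distribution would produce such points. The variable names are introduced here for illustration.

```r
x <- iris$Sepal.Width
stats <- boxplot.stats(x)          # default coef = 1.5

hinges  <- stats$stats[c(2, 4)]    # lower and upper hinge (approx. Q1 and Q3)
box_len <- diff(hinges)
lower_fence <- hinges[1] - 1.5 * box_len
upper_fence <- hinges[2] + 1.5 * box_len

manual_out <- x[x < lower_fence | x > upper_fence]
sort(manual_out)
sort(stats$out)                    # same points flagged by boxplot.stats

# Under a normal distribution, the share of points beyond the fences:
p_out <- 2 * pnorm(qnorm(0.25) - 1.5 * (qnorm(0.75) - qnorm(0.25)))
p_out                              # roughly 0.007
```

So for normally distributed data, fewer than 1 in 100 observations would be drawn as outliers under this rule.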
Boxplots (4)
[Figure: Boxplot Sepal.Length]
If we have a factor variable we can create a boxplot for each category as follows:
> boxplot(Sepal.Length~Species,ylab="Sepal.Length")
[Figure: boxplots of Sepal.Length by Species; y-axis: Sepal.Length]
Boxplots (5)
[Figure: Boxplots Iris]
Boxplots (6)
Now using ggplot2:
> ggplot(iris, aes(x = Species, y = Sepal.Length,
fill = Species)) + geom_boxplot()
[Figure: ggplot2 boxplots; y-axis: Sepal.Length; legend (Species): setosa, versicolor, virginica]
Scatter Plots
Scatter plots use Cartesian coordinates to display the values of two numerical
variables of the same length.
The values of the attributes determine the position of the elements.
Other attributes can be used to define the size, shape or color of objects.
In R we can plot a scatterplot of two numeric variables using the command
plot(x,y), which would be y vs x.
We can also define formulas f(x) = y using the notation y~x.
Thus, the command plot(y~x) is equivalent to plot(x,y).
If we have a data.frame or a numerical matrix, we can see the scatterplots of all
pairs using the command pairs(x).
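The commands described above can be sketched on the iris data as follows. The colour palette and the variable species_cols are arbitrary choices made for illustration.

```r
# plot(x, y) and the formula form plot(y ~ x) draw the same scatter plot.
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
plot(Sepal.Width ~ Sepal.Length, data = iris)

# A third attribute (Species) mapped to colour; the palette is arbitrary.
palette3 <- c("red", "green3", "blue")
species_cols <- palette3[as.integer(iris$Species)]
plot(Sepal.Width ~ Sepal.Length, data = iris, col = species_cols, pch = 19)
legend("topright", legend = levels(iris$Species),
       col = palette3, pch = 19, bty = "n")

# All pairwise scatter plots of the four numeric columns.
pairs(iris[, 1:4], col = species_cols)
```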
[Figure: scatter plot; y-axis: Sepal.Width; legend: setosa, versicolor, virginica]
[Figure: scatter plot; x-axis: Sepal.Length, y-axis: Sepal.Width; legend (Species): setosa, versicolor, virginica]
[Figure: scatter plot; x-axis: Petal.Width, y-axis: Petal.Length; additional label: Sepal.Width]
[Figure: scatter plot; x-axis: Petal.Width; additional label: Sepal.Length]
Conclusions
References
Pipis, G. (2020). Skewness and kurtosis in statistics. R-bloggers.
Poldrack, R. A. (2019). Statistical thinking for the 21st century. https://statsthinking21.org/.
Tan, P.-N., Steinbach, M., and Kumar, V. (2016). Introduction to data mining. Pearson Education India.
Westfall, P. H. (2014). Kurtosis as peakedness, 1905–2014. R.I.P. The American Statistician, 68(3):191–195.