Data Exploration and Visualisation With R: Yanchang Zhao
Yanchang Zhao
http://www.RDataMining.com
July 2019
∗ Chapter 3: Data Exploration, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
Contents
Introduction
More Explorations
Data Exploration and Visualisation with R
Quiz: What’s the Name of This Flower?
The Iris Dataset
† https://archive.ics.uci.edu/ml/datasets/Iris
Size and Variable Names of Data
# number of rows
nrow(iris)
## [1] 150
# number of columns
ncol(iris)
## [1] 5
# dimensionality
dim(iris)
## [1] 150 5
# column names
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid...
## [5] "Species"
Structure of Data
Below we have a look at the structure of the dataset with str().
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
## $ Species : Factor w/ 3 levels "setosa","versicolor",....
Attributes of Data
attributes(iris)
## $names
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid...
## [5] "Species"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 ...
## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 ...
## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 ...
## [46] 46 47 48 49 50 51 52 53 54 55 56 57 58 ...
## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 ...
## [76] 76 77 78 79 80 81 82 83 84 85 86 87 88 ...
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 1...
## [106] 106 107 108 109 110 111 112 113 114 115 116 117 118 1...
## [121] 121 122 123 124 125 126 127 128 129 130 131 132 133 1...
## [136] 136 137 138 139 140 141 142 143 144 145 146 147 148 1...
First/Last Rows of Data
iris[1:3, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
tail(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Spe...
## 148 6.5 3.0 5.2 2.0 virgi...
## 149 6.2 3.4 5.4 2.3 virgi...
## 150 5.9 3.0 5.1 1.8 virgi...
A Single Column
iris[1:10, "Sepal.Length"]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris$Sepal.Length[1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
Summary of Data
Function summary()
- numeric variables: minimum, maximum, mean, median, and the first (25%) and third (75%) quartiles
- categorical variables (i.e., factors): frequency of every level
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Wid...
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0....
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0....
## Median :5.800 Median :3.000 Median :4.350 Median :1....
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1....
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1....
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2....
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library(Hmisc)
# describe(iris) # check all columns
describe(iris[, c(1, 5)]) # check columns 1 and 5
## iris[, c(1, 5)]
##
## 2 Variables 150 Observations
## -----------------------------------------------------------...
## Sepal.Length
## n missing distinct Info Mean Gmd ...
## 150 0 35 0.998 5.843 0.9462 4....
## .10 .25 .50 .75 .90 .95
## 4.800 5.100 5.800 6.400 6.900 7.255
##
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## -----------------------------------------------------------...
## Species
## n missing distinct
## 150 0 3
##
## Value setosa versicolor virginica
## Frequency 50 50 50
## Proportion 0.333 0.333 0.333
## -----------------------------------------------------------...
Mean, Median, Range and Quartiles
range(iris$Sepal.Length)
## [1] 4.3 7.9
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
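The slide title also names the mean and median, which the code above does not compute. A minimal sketch with base R's mean() and median() on the same column follows; the variable names are illustrative, and the values agree with the summary(iris) output shown earlier.

```r
# Mean and median of Sepal.Length (base R, no extra packages)
sepal_mean <- mean(iris$Sepal.Length)
sepal_median <- median(iris$Sepal.Length)
round(sepal_mean, 3)
## [1] 5.843
sepal_median
## [1] 5.8

# Specific percentiles can be requested through the probs argument
quantile(iris$Sepal.Length, probs = c(0.1, 0.9))
```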
Variance and Histogram
var(iris$Sepal.Length)
## [1] 0.6856935
hist(iris$Sepal.Length)
[Figure: Histogram of iris$Sepal.Length; x-axis: iris$Sepal.Length, y-axis: Frequency]
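As a cross-check on var(), the sample variance can be recomputed from its definition (sum of squared deviations divided by n - 1), and sd() is its square root. This is only a sketch; the variable names are illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance from the definition, using the n - 1 denominator
manual_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(manual_var, var(x))
## [1] TRUE

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE
```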
Density
library(magrittr) ## for pipe operations
iris$Sepal.Length %>% density() %>%
plot(main='Density of Sepal.Length')
[Figure: Density of Sepal.Length; y-axis: Density]
Pie Chart
Frequency of factors: table()
library(dplyr)
iris2 <- iris %>% sample_n(50)
iris2$Species %>% table() %>% pie()
# add percentages
tab <- iris2$Species %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), '\n', percentages, '%')
pie(tab, labels=txt)
[Figure: pie chart of Species in iris2 with percentage labels: setosa 38%, versicolor 36%, virginica 26%]
Bar Chart
iris2$Species %>% table() %>% barplot()
[Figure: bar chart of Species frequency in iris2; y-axis: Frequency]
Correlation
Covariance and correlation: cov() and cor()
cov(iris$Sepal.Length, iris$Petal.Length)
## [1] 1.274315
cor(iris$Sepal.Length, iris$Petal.Length)
## [1] 0.8717538
cov(iris[, 1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
# cor(iris[,1:4])
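To make the link between cov() and cor() concrete, the sketch below rescales the covariance by the two standard deviations and compares the result with cor(). It also evaluates the commented-out correlation matrix, rounded for readability. Variable names are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Correlation is the covariance divided by both standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))
all.equal(manual_cor, cor(x, y))
## [1] TRUE

# Correlation matrix of the four numeric columns
cor_matrix <- round(cor(iris[, 1:4]), 2)
cor_matrix
```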
Aggregation
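A minimal sketch of aggregation in base R, assuming the goal is a per-species statistic: aggregate() with a formula applies a function to one variable for every level of a factor. The choice of mean() and of Sepal.Length is an assumption made for illustration.

```r
# Mean Sepal.Length for each Species (illustrative; base R only)
agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
agg
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
```

Passing FUN = summary instead of mean would return the full set of summary statistics per species.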
Boxplot
- The bar in the middle is the median.
- The box shows the interquartile range (IQR), i.e., the range between the 25th and 75th percentiles.
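The description above can be sketched with base R's boxplot(), assuming a per-species comparison of Sepal.Length; the choice of variable is illustrative. The median and IQR of one group are computed alongside for comparison with the drawn box. Note that boxplot() uses hinges, which can differ slightly from quantile()-based quartiles.

```r
# Distribution of Sepal.Length for each species
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Sepal.Length")

# Median and IQR of one group, for comparison with its box
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
median(setosa)
IQR(setosa)
```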
Scatter Plot
with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
pch = as.numeric(Species)))
[Figure: scatter plot of Sepal.Width against Sepal.Length, with colour and symbol by Species]
Scatter Plot with Jitter
Function jitter(): add a small amount of noise to the data
with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width),
col=Species,pch=as.numeric(Species)))
[Figure: scatter plot of jitter(Sepal.Width) against jitter(Sepal.Length), with colour and symbol by Species]
A Matrix of Scatter Plots
pairs(iris)
[Figure: matrix of scatter plots for Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
3D Scatter Plot
library(scatterplot3d)
scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
[Figure: 3D scatter plot; axes: iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width]
Interactive 3D Scatter Plot
library(rgl)
plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Heat Map
Calculate the pairwise distances (dissimilarities) between flowers in the iris data
with dist() and then plot the distance matrix as a heat map
dist.matrix <- as.matrix(dist(iris[, 1:4]))
heatmap(dist.matrix)
[Figure: heat map of the pairwise distance matrix of the 150 iris flowers; rows and columns are labelled by row number]
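To show what the distance matrix holds, the sketch below recomputes one entry by hand. By default dist() uses the Euclidean distance between rows. The helper names are illustrative.

```r
# dist() defaults to the Euclidean distance between rows
dist.matrix <- as.matrix(dist(iris[, 1:4]))
dim(dist.matrix)
## [1] 150 150

# Recompute the distance between flowers 1 and 2 by hand
row1 <- unlist(iris[1, 1:4])
row2 <- unlist(iris[2, 1:4])
manual <- sqrt(sum((row1 - row2)^2))
all.equal(manual, dist.matrix[1, 2])
## [1] TRUE
```

Small entries mean similar flowers, so the heat map is read as a map of dissimilarity rather than similarity.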
Level Plot
Function rainbow() creates a vector of contiguous colors.
rev() reverses a vector.
library(lattice)
levelplot(Petal.Width ~ Sepal.Length * Sepal.Width,
data=iris, cuts=8)
[Figure: level plot of Petal.Width; x-axis: Sepal.Length, y-axis: Sepal.Width]
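The slide mentions rainbow() and rev(), but the call shown uses the default colours. One way they could be combined with levelplot() is through its col.regions argument, sketched below. The palette size of 10 and cuts = 9 are assumptions chosen so that each region gets its own colour.

```r
library(lattice)

# Ten rainbow colours in reversed order (assumed palette size)
palette10 <- rev(rainbow(10))

p <- levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, data = iris,
               cuts = 9, col.regions = palette10)
print(p)  # lattice plots are drawn when printed
```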
Contour
contour() and filled.contour() in package graphics
contourplot() in package lattice
filled.contour(volcano, color.palette = terrain.colors, asp = 1,
               plot.axes = contour(volcano, add = TRUE))
[Figure: filled contour plot of volcano with labelled contour lines overlaid]
3D Surface
persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")
[Figure: 3D surface plot of volcano]
Parallel Coordinates
Visualising multiple dimensions
library(MASS)
parcoord(iris[1:4], col = iris$Species)
Parallel Coordinates with Package lattice
library(lattice)
parallelplot(~iris[1:4] | Species, data = iris)
[Figure: parallel coordinates plots in three panels (setosa, versicolor, virginica); axes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, each scaled from Min to Max]
Visualisation with Package ggplot2
library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)
[Figure: scatter plots of Sepal.Width against Sepal.Length in three panels: setosa, versicolor, virginica]
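qplot() is deprecated in recent ggplot2 releases, so the same faceted scatter plot may be easier to maintain when written with ggplot(). The sketch below is an equivalent under that assumption, not the author's original code.

```r
library(ggplot2)

# Faceted scatter plot: one panel (row) per species
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_grid(Species ~ .)
print(p)
```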
Save Charts to Files
- Save charts to PDF and PS files: pdf() and postscript()
- BMP, JPEG, PNG and TIFF files: bmp(), jpeg(), png() and tiff()
- Close files (or graphics devices) with graphics.off() or dev.off() after plotting
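The open, plot, close sequence described above can be sketched as follows. The file name and the use of tempdir() are illustrative choices, not taken from the slides.

```r
# Write a histogram to a PDF file in a temporary directory
out_file <- file.path(tempdir(), "myPlot.pdf")

pdf(out_file)              # open the PDF graphics device
hist(iris$Sepal.Length)    # draw on that device
dev.off()                  # close the device so the file is written

file.exists(out_file)
## [1] TRUE
```

The same pattern applies to png(), jpeg() and the other devices listed; only the opening call changes.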
Save ggplot Charts to Files
ggsave("myPlot3.png")
ggsave("myPlot4.pdf")
ggsave("myPlot5.jpg")
ggsave("myPlot6.bmp")
ggsave("myPlot7.ps")
ggsave("myPlot8.eps")
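Called with only a file name, ggsave() saves the last plot that was displayed and infers the format from the extension. A hedged sketch that passes the plot explicitly is shown below. The output path, size and resolution are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()

# Save a specific plot object; width and height are in inches by default
out_file <- file.path(tempdir(), "myPlot3.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)
```

Passing plot = p avoids relying on whichever plot happened to be drawn last.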
Further Readings
Online Resources
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
How to Cite This Work
- Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
- BibTeX
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-12-396963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}