Data Exploration and Visualisation With R: Yanchang Zhao

This document discusses exploring and visualizing data with R. It covers having a look at data, exploring individual variables with summary statistics and charts like histograms, exploring relationships between multiple variables with plots like scatterplots, and saving charts to files. The document uses the iris dataset as an example, showing how to examine attributes, structure, sample rows, and summarize variables.


Data Exploration and Visualisation with R

Yanchang Zhao
http://www.RDataMining.com

R and Data Mining Course


Beijing University of Posts and Telecommunications,
Beijing, China

July 2019


Chapter 3: Data Exploration, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf

Contents

Introduction

Have a Look at Data

Explore Individual Variables

Explore Multiple Variables

More Explorations

Save Charts to Files

Further Readings and Online Resources

Data Exploration and Visualisation with R

Data Exploration and Visualisation


- Summary and stats
- Various charts like pie charts and histograms
- Exploration of multiple variables
- Level plot, contour plot and 3D plot
- Saving charts into files

Quiz: What’s the Name of This Flower?

[Figure: photograph of a flower] Image credit: Oleg Yunakov [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons.

The Iris Dataset

The iris dataset [Frank and Asuncion, 2010] consists of 50 samples from each of three classes of iris flowers. There are five attributes in the dataset:
- sepal length in cm,
- sepal width in cm,
- petal length in cm,
- petal width in cm, and
- class: Iris Setosa, Iris Versicolour, and Iris Virginica.
A detailed description of the dataset can be found at the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Iris

Size and Variable Names of Data

# number of rows
nrow(iris)
## [1] 150

# number of columns
ncol(iris)
## [1] 5

# dimensionality
dim(iris)
## [1] 150 5

# column names
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid...
## [5] "Species"

Structure of Data
Below we have a look at the structure of the dataset with str().

str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
## $ Species : Factor w/ 3 levels "setosa","versicolor",....

- 150 observations (records, or rows) and 5 variables (or columns)
- The first four variables are numeric.
- The last one, Species, is categorical (called a “factor” in R) and has three levels.

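The points read off the str() output can also be checked programmatically. The following is a minimal base R sketch; the variable names col_classes and species_levels are illustrative and do not come from the slides.

```r
# Class of every column: four numeric columns and one factor
col_classes <- sapply(iris, class)
print(col_classes)

# The levels of the factor Species, and how many there are
species_levels <- levels(iris$Species)
print(species_levels)
print(nlevels(iris$Species))
```
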
Attributes of Data
attributes(iris)
## $names
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid...
## [5] "Species"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 ...
## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 ...
## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 ...
## [46] 46 47 48 49 50 51 52 53 54 55 56 57 58 ...
## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 ...
## [76] 76 77 78 79 80 81 82 83 84 85 86 87 88 ...
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 1...
## [106] 106 107 108 109 110 111 112 113 114 115 116 117 118 1...
## [121] 121 122 123 124 125 126 127 128 129 130 131 132 133 1...
## [136] 136 137 138 139 140 141 142 143 144 145 146 147 148 1...

First/Last Rows of Data

iris[1:3, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa

head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa

tail(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Spe...
## 148 6.5 3.0 5.2 2.0 virgi...
## 149 6.2 3.4 5.4 2.3 virgi...
## 150 5.9 3.0 5.1 1.8 virgi...

A Single Column

The first 10 values of Sepal.Length

iris[1:10, "Sepal.Length"]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

iris$Sepal.Length[1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
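Both forms above return a plain numeric vector. As a small sketch of the alternatives (the names v1, v2 and d1 are illustrative), double brackets give the same vector, while drop = FALSE keeps the result as a one-column data frame:

```r
# Both forms on the slide return a plain numeric vector
v1 <- iris[1:10, "Sepal.Length"]
v2 <- iris[["Sepal.Length"]][1:10]   # equivalent, using [[ ]]

# drop = FALSE keeps the result as a one-column data frame
d1 <- iris[1:10, "Sepal.Length", drop = FALSE]

is.numeric(v1)       # a vector
is.data.frame(d1)    # still a data frame
identical(v1, v2)    # same values either way
```
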

Summary of Data
Function summary()
- numeric variables: minimum, maximum, mean, median, and the first (25%) and third (75%) quartiles
- categorical variables (i.e., factors): frequency of every level
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Wid...
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0....
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0....
## Median :5.800 Median :3.000 Median :4.350 Median :1....
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1....
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1....
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2....
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
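The two kinds of output that summary() produces can be reproduced one column at a time. A minimal sketch in base R, with illustrative variable names s and freq:

```r
# Summary of a single numeric column: min, quartiles, median, mean, max
s <- summary(iris$Sepal.Length)
print(s)

# Frequency of every level of a factor, as summary() reports for Species
freq <- table(iris$Species)
print(freq)
```
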
library(Hmisc)
# describe(iris) # check all columns
describe(iris[, c(1, 5)]) # check columns 1 and 5
## iris[, c(1, 5)]
##
## 2 Variables 150 Observations
## -----------------------------------------------------------...
## Sepal.Length
## n missing distinct Info Mean Gmd ...
## 150 0 35 0.998 5.843 0.9462 4....
## .10 .25 .50 .75 .90 .95
## 4.800 5.100 5.800 6.400 6.900 7.255
##
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## -----------------------------------------------------------...
## Species
## n missing distinct
## 150 0 3
##
## Value setosa versicolor virginica
## Frequency 50 50 50
## Proportion 0.333 0.333 0.333
## -----------------------------------------------------------...
Mean, Median, Range and Quartiles

- Mean, median and range: mean(), median(), range()
- Quartiles and percentiles: quantile()

range(iris$Sepal.Length)
## [1] 4.3 7.9

quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9

quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))


## 10% 30% 65%
## 4.80 5.27 6.20
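The slide names mean() and median() without calling them. A short sketch of how they relate to quantile(), using base R only (the names x and q are illustrative):

```r
x <- iris$Sepal.Length

mean(x)      # arithmetic mean
median(x)    # the 50% quantile
IQR(x)       # third quartile minus first quartile

# The median and the IQR can both be read off the quartiles
q <- quantile(x, c(0.25, 0.5, 0.75))
print(q)
```
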

Variance and Histogram
var(iris$Sepal.Length)
## [1] 0.6856935

hist(iris$Sepal.Length)
[Figure: histogram titled "Histogram of iris$Sepal.Length"; x-axis iris$Sepal.Length (4 to 8), y-axis Frequency (0 to 30)]

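Two small extensions of this slide, sketched in base R: the standard deviation as the square root of the variance, and a histogram with a suggested number of bins. The value breaks = 20 and the axis labels are arbitrary choices for illustration.

```r
x <- iris$Sepal.Length

# Standard deviation is the square root of the variance
s <- sd(x)
all.equal(s, sqrt(var(x)))

# 'breaks' is only a suggestion; R picks "pretty" break points near it
h <- hist(x, breaks = 20, main = "Histogram of Sepal.Length",
          xlab = "Sepal length (cm)")

# hist() returns the bin counts invisibly; they sum to the sample size
sum(h$counts)
```
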
Density
library(magrittr) ## for pipe operations
iris$Sepal.Length %>% density() %>%
plot(main='Density of Sepal.Length')

[Figure: density curve titled "Density of Sepal.Length"; y-axis Density (0.0 to 0.4); N = 150, Bandwidth = 0.2736]

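A density estimate is often easier to read when drawn over the histogram it smooths. The sketch below uses base R without pipes; the colour, line width and labels are illustrative choices.

```r
x <- iris$Sepal.Length
d <- density(x)    # kernel density estimate with the default Gaussian kernel

# freq = FALSE puts the histogram on the density scale,
# so the bars and the curve are comparable
hist(x, freq = FALSE, main = "Sepal.Length: histogram and density",
     xlab = "Sepal length (cm)")
lines(d, col = "blue", lwd = 2)

d$bw    # the bandwidth that the density plot reports as "Bandwidth"
```
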
Pie Chart
Frequency of factors: table()

library(dplyr)
iris2 <- iris %>% sample_n(50)
iris2$Species %>% table() %>% pie()

# add percentages
tab <- iris2$Species %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), '\n', percentages, '%')
pie(tab, labels=txt)

[Figure: pie chart of Species in the sample, labelled setosa 38%, versicolor 36%, virginica 26%]

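Because the 50 rows are drawn at random, the percentages in the chart change from run to run. The sketch below fixes the random seed so the sample is reproducible; it uses base R sample() in place of dplyr's sample_n(), and the seed value 42 is an arbitrary assumption.

```r
set.seed(42)    # arbitrary seed, chosen only for illustration

# Base R counterpart of sample_n(iris, 50): pick 50 row indices at random
idx <- sample(nrow(iris), 50)
iris2 <- iris[idx, ]

tab <- table(iris2$Species)
percentages <- round(prop.table(tab), 3) * 100
txt <- paste0(names(tab), "\n", percentages, "%")
pie(tab, labels = txt)
```

Rerunning the block with the same seed should give the same sample, and therefore the same chart.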
Bar Chart
iris2$Species %>% table() %>% barplot()

# add colors and percentages
# (tab and txt are the objects created on the Pie Chart slide)
bb <- iris2$Species %>% table() %>%
  barplot(axisnames=FALSE, main='Species', ylab='Frequency',
          col=c('pink', 'lightblue', 'lightgreen'))
text(bb, tab/2, labels=txt, cex=1.5)
[Figure: bar chart titled "Species"; y-axis Frequency; bars labelled setosa 38%, versicolor 36%, virginica 26%]

Correlation
Covariance and correlation: cov() and cor()

cov(iris$Sepal.Length, iris$Petal.Length)
## [1] 1.274315

cor(iris$Sepal.Length, iris$Petal.Length)
## [1] 0.8717538

cov(iris[, 1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063

# cor(iris[,1:4])
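Covariance and correlation are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. A minimal base R check of that relationship, with illustrative names r_manual and cm:

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Pearson correlation = covariance / (sd of x * sd of y)
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual
cor(x, y)

# Correlation matrix of the four numeric columns; its diagonal is all 1
cm <- cor(iris[, 1:4])
round(cm, 2)
```
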

Aggregation

Stats of Sepal.Length for every Species with aggregate()

aggregate(Sepal.Length ~ Species, summary, data = iris)


## Species Sepal.Length.Min. Sepal.Length.1st Qu.
## 1 setosa 4.300 4.800
## 2 versicolor 4.900 5.600
## 3 virginica 4.900 6.225
## Sepal.Length.Median Sepal.Length.Mean Sepal.Length.3rd Qu.
## 1 5.000 5.006 5.200
## 2 5.900 5.936 6.300
## 3 6.500 6.588 6.900
## Sepal.Length.Max.
## 1 5.800
## 2 7.000
## 3 7.900
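aggregate() accepts any summarising function, and several columns can be aggregated at once. The sketch below is one possible variation, not the slide's own code: it computes per-species means for two columns and cross-checks one of them with tapply().

```r
# Mean of two measurements per species; cbind() puts several
# response columns on the left-hand side of the formula
means <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
                   data = iris, FUN = mean)
print(means)

# tapply() gives the per-species means of a single column
sl_means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(sl_means)
```
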

Boxplot
- The bar in the middle is the median.
- The box shows the interquartile range (IQR), i.e., the range between the 25th and 75th percentiles.

boxplot(Sepal.Length ~ Species, data = iris)


[Figure: boxplots of Sepal.Length (4.5 to 8.0) for setosa, versicolor and virginica]

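The numbers behind a boxplot can be inspected without drawing it. This sketch computes the IQR per species and reads the medians from the object that boxplot() returns; the names iqr_by_species and b are illustrative.

```r
# Interquartile range (75% quantile minus 25% quantile) per species
iqr_by_species <- tapply(iris$Sepal.Length, iris$Species, IQR)
print(iqr_by_species)

# boxplot() returns the statistics it would draw. Each column of $stats is
# lower whisker, lower hinge, median, upper hinge, upper whisker.
# Hinges are close to, but not always identical to, the quartiles.
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
b$stats[3, ]    # the medians, i.e. the bars in the middle of the boxes
```
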
Scatter Plot
with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
pch = as.numeric(Species)))

[Figure: scatter plot of Sepal.Width (2.0 to 4.0) against Sepal.Length (4.5 to 8.0), with colour and symbol set by Species]

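The plot distinguishes species by colour and symbol but has no legend. A sketch of one way to add it in base graphics follows; the position "topright" is an arbitrary choice.

```r
with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
                pch = as.numeric(Species)))

# A factor passed to 'col' is used through its integer codes, so level i
# gets colour i of the current palette and, via pch, plotting symbol i
sp <- levels(iris$Species)
legend("topright", legend = sp, col = seq_along(sp), pch = seq_along(sp))
```
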
Scatter Plot with Jitter
Function jitter(): add a small amount of noise to the data
with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width),
col=Species,pch=as.numeric(Species)))

[Figure: scatter plot of jitter(Sepal.Width) against jitter(Sepal.Length), with colour and symbol set by Species]

A Matrix of Scatter Plots
pairs(iris)

[Figure: scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]

3D Scatter Plot
library(scatterplot3d)
scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

[Figure: 3D scatter plot with axes iris$Petal.Width, iris$Sepal.Length and iris$Sepal.Width]

Interactive 3D Scatter Plot

Package rgl supports interactive 3D scatter plots with plot3d().

library(rgl)
plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

Heat Map
Calculate the distance (dissimilarity) between each pair of flowers in the iris data
with dist(), then plot the distance matrix as a heat map.
dist.matrix <- as.matrix(dist(iris[, 1:4]))
heatmap(dist.matrix)

[Figure: heat map of dist.matrix; rows and columns are labelled with the row numbers 1 to 150 of iris, in the order chosen by heatmap()]

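Before plotting, it can help to confirm what the distance matrix contains. The sketch below checks its shape and basic properties in base R, and recomputes one entry by hand; the name manual is illustrative.

```r
d <- dist(iris[, 1:4])          # Euclidean distance by default
dist.matrix <- as.matrix(d)

dim(dist.matrix)                # one row and one column per flower
isSymmetric(dist.matrix)        # distance from i to j equals j to i
all(diag(dist.matrix) == 0)     # every flower is at distance 0 from itself

# Distance between the first two flowers, recomputed by hand
manual <- sqrt(sum((unlist(iris[1, 1:4]) - unlist(iris[2, 1:4]))^2))
all.equal(dist.matrix[1, 2], manual)
```
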
Level Plot
Function rainbow() creates a vector of contiguous colors.
rev() reverses a vector.
library(lattice)
levelplot(Petal.Width ~ Sepal.Length * Sepal.Width,
data=iris, cuts=8)

[Figure: level plot of Petal.Width over Sepal.Length (x-axis) and Sepal.Width (y-axis), with a colour key from 0.0 to 2.5]

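The slide describes rainbow() and rev(), although the call shown does not use them. The sketch below shows what the two functions return and one way they might be combined with the level plot through the col.regions argument of levelplot(); the choice of 10 colours is an assumption, not taken from the slides.

```r
cols <- rainbow(10)     # 10 colours running around the colour wheel
rcols <- rev(cols)      # the same colours in reverse order
head(cols, 3)

# Hypothetical use with the level plot on this slide (needs package lattice)
if (requireNamespace("lattice", quietly = TRUE)) {
  p <- lattice::levelplot(Petal.Width ~ Sepal.Length * Sepal.Width,
                          data = iris, cuts = 8, col.regions = rcols)
  print(p)
}
```
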
Contour
contour() and filled.contour() in package graphics
contourplot() in package lattice
filled.contour(volcano, color.palette=terrain.colors, asp=1,
               plot.axes=contour(volcano, add=TRUE))

[Figure: filled contour plot of volcano with labelled contour lines (100 to 190) overlaid]

3D Surface
persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")

[Figure: 3D surface plot of volcano]

Parallel Coordinates
Visualising multiple dimensions
library(MASS)
parcoord(iris[1:4], col = iris$Species)

[Figure: parallel coordinates plot with axes Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]

Parallel Coordinates with Package lattice
library(lattice)
parallelplot(~iris[1:4] | Species, data = iris)

[Figure: parallel coordinates plots in three panels (setosa, versicolor, virginica) over Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, scaled from Min to Max]

Visualisation with Package ggplot2
library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)

[Figure: scatter plots of Sepal.Width (2.0 to 4.5) against Sepal.Length (5 to 8), one panel per row for setosa, versicolor and virginica]

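In recent ggplot2 releases qplot() is marked as deprecated, so the same chart is sketched here with the layered ggplot() interface. This is an equivalent under that assumption, not the slide's original code, and it is skipped when ggplot2 is not installed.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Same plot built layer by layer: data + aesthetics + points + facets
  p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    facet_grid(Species ~ .)    # one row of panels per species
  print(p)
}
```
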
Save Charts to Files
- Save charts to PDF and PS files: pdf() and postscript()
- BMP, JPEG, PNG and TIFF files: bmp(), jpeg(), png() and tiff()
- Close files (or graphics devices) with graphics.off() or dev.off() after plotting

# save as a PDF file


pdf("myPlot.pdf")
x <- 1:50
plot(x, log(x))
graphics.off()
# Save as a postscript file
postscript("myPlot2.ps")
x <- -20:20
plot(x, x^2)
graphics.off()
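The two closing functions differ in scope: dev.off() closes only the current device, while graphics.off() closes every open device. A minimal sketch with dev.off() follows; the temporary file and the 6 x 4 inch page size are illustrative assumptions.

```r
# Write to a temporary file so the sketch leaves nothing behind in the
# working directory; width and height of pdf() are given in inches
f <- tempfile(fileext = ".pdf")
pdf(f, width = 6, height = 4)
x <- 1:50
plot(x, log(x))
dev.off()          # closes only the device opened above

file.exists(f)     # the file is complete once the device is closed
```
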

Save ggplot Charts to Files

ggsave(): by default, saves the last plot that you displayed. It also guesses the type of graphics device from the file extension.

ggsave("myPlot3.png")
ggsave("myPlot4.pdf")
ggsave("myPlot5.jpg")
ggsave("myPlot6.bmp")
ggsave("myPlot7.ps")
ggsave("myPlot8.eps")
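Relying on "the last plot displayed" can be fragile in scripts, so the plot can also be passed explicitly. The sketch below assumes ggplot2 is installed; the file name myPlot-sketch.pdf and the 6 x 4 inch size are hypothetical choices.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()

  # Pass the plot explicitly; width and height are in inches by default
  f <- file.path(tempdir(), "myPlot-sketch.pdf")    # hypothetical file name
  ggsave(f, plot = p, width = 6, height = 4)
  file.exists(f)
}
```
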

Further Readings

- Examples of ggplot2 plotting:
  https://ggplot2.tidyverse.org/
- Package iplots: interactive scatter plot, histogram, bar plot, and parallel coordinates plot
  http://rosuda.org/software/iPlots/
- Package googleVis: interactive charts with the Google Visualisation API
  http://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html
- Package ggvis: interactive grammar of graphics
  http://ggvis.rstudio.com/
- Package rCharts: interactive JavaScript visualisations from R
  https://ramnathv.github.io/rCharts/

Online Resources

- Book titled R and Data Mining: Examples and Case Studies
  http://www.rdatamining.com/docs/RDataMining-book.pdf
- R Reference Card for Data Mining
  http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
- Free online courses and documents
  http://www.rdatamining.com/resources/
- RDataMining Group on LinkedIn (27,000+ members)
  http://group.rdatamining.com
- Twitter (3,300+ followers)
  @RDataMining

The End

Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
How to Cite This Work

- Citation
  Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
- BibTeX
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-12-396963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}

References

Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.
