
Exploratory Data Analysis using R & RStudio

Contents

1 Introduction to EDA
2 Preparing RStudio for EDA
3 Using ggplot2
   3.1 Beginning Charting
   3.2 Using Facets
   3.3 Using Multiple ggplot2 Layers
   3.4 Changing ggplot2 Themes
4 Doing Exploratory Data Analysis
   4.1 Analysis of a Single Variable
       4.1.1 Single Categorical Variable
       4.1.2 Single Continuous Variable
   4.2 Analysing Two Variables
       4.2.1 Scatter Plots
       4.2.2 Multiple Box Plot
       4.2.3 Correlogram and Heat-map
5 Dealing with Missing Data
   5.1 Deletion
   5.2 Imputation Methods
   5.3 Prediction Model
   5.4 KNN Imputation
   5.5 Using R to deal with Missing Data
6 Summary
7 References
1 Introduction to EDA
Exploratory Data Analysis (EDA) (Tukey, 1977) can be simply defined as the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values and making transformations of variables as needed. EDA is also different from statistical graphing in the sense that the latter is mainly done for presentation and publication purposes, even though EDA is done using many techniques of statistical graphing.

2 Preparing RStudio for EDA
In order to do EDA in R (R Core Team, 2021) and RStudio (RStudio Team, 2020) we require the tidyverse package, which can be installed as follows:

install.packages("tidyverse", dependencies = TRUE)

The rationale for choosing the tidyverse package (Wickham et al., 2019) is that it contains the ggplot2 package, which is necessary for doing exploratory data analysis. We shall be using the dataset mpg (available in the ggplot2 package) for understanding exploratory data analysis. Alternatively, if so desired, users can directly install the package ggplot2 using the above install.packages() command by replacing tidyverse with ggplot2. The next step is to load the library tidyverse and make all its functionality available to the R engine. This is done by the library() command as follows:

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.1
## -- Attaching packages ------------------------------------ tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts --------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

As you can see, the moment we attach the library tidyverse it loads the 8 core packages and makes them available to the R engine. We are not actually going to use the remaining packages, but they make it possible to read in almost any kind of data, clean it and arrange it for analysis with R. Our key focus will be on the ggplot2 (Wickham, 2016) package, which will be used to create the graphs and plots necessary for understanding Exploratory Data Analysis. Also, if you get the Conflicts output, do not worry about it. When we load the tidyverse environment, it also loads other packages, and some function calls conflict with functions in other packages. There is no need to worry, as R masks these functions so that there is no conflict between them.
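A masked function remains reachable through its package prefix. A minimal sketch:

stats::filter(1:10, rep(1/3, 3))   ## moving-average filter from base stats
dplyr::filter(mpg, hwy > 30)       ## the tidyverse data-frame verb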

3 Using ggplot2
ggplot2 graphs are based on a layered philosophy of data visualization. In most cases we start with the function ggplot(), supplying a dataset and an aesthetic mapping with the function aes(). We then add layers like geom_point() or geom_histogram(), scales like scale_colour_brewer(), faceting specifications like facet_wrap() and coordinate systems like coord_flip(). All the layers are additive in nature and are added on with the + sign. The only mandatory requirement is that the plot should begin with ggplot() and a dataset. In order to understand the functioning of ggplot() we shall use the dataset mpg. The dataset mpg gets attached to R and can be seen by typing the following command:
mpg

## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows
To better understand the structure of the dataset we use the str() function as follows:
str(mpg)

## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

The first row tells us that it is a tibble with 234 rows (or cases) and 11 columns (or variables). A tibble is a modern form of data.frame. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. The next lines display the names of the variables, their type and some typical values. In our case we have chr (which stands for "character"), num (which stands for "number") and int (which stands for "integer"). There are other data types which we will discover subsequently. To read up on the structure of the dataset, type the code below. This code starts the help server and displays information about the dataset. Make a note of the proper names of the variables.
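?mpg   ## opens the help page for the mpg dataset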

3.1 Beginning Charting
Once we are familiar with the variable names and the structure of the data, let us now start
learning how to plot using ggplot2. Type the following line and R will come up with the
following output:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

Figure 1: A simple ggplot2 output

As we already know, ggplot() charts are made by adding layers. ggplot(data = mpg) is the base layer of ggplot2, where we specify the data to be used for plotting purposes. geom_point() tells ggplot2 to add a layer of points to the chart to create a scatterplot. The set of arguments inside geom_point(mapping = aes(x = displ, y = hwy)) tells it which variables to plot on the x axis and y axis. As we can see, on the x-axis we have asked ggplot() to plot displ (engine displacement) and on the y-axis we have asked it to plot hwy (highway mileage). And that is exactly what ggplot() has plotted for us. The mapping argument is always paired with aes(), which is shorthand for aesthetics. We can see from the figure that as the displacement increases the mileage tends to decrease. The problem with the above chart is that the black points make it rather difficult to see. Would it be possible to add color to the chart? Well, we can do that by adding the color outside of the aes() function as follows:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Figure 2: A ggplot2 output with blue points

Not quite colorful, but for the time being this will do. Now looking at the chart we see that some cars (situated on the right hand side) have huge displacements but also good highway mileage. Are these cars of some different class? Let's try to answer this question by coloring the chart with the class attribute. We will allow ggplot() to handle the coloring, and therefore this command differs from the previous one in that the color-by-class request is passed inside the aes() function. This essentially tells ggplot2 that we want the chart colored by a nominal variable and that it should decide how to handle the coloring. Here is the code:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

Figure 3: A ggplot2 with points colored by Class

If you zoom in to the chart (figure 3) you will find that the dots which were causing us problems belong to the 2seater category. These are usually sports cars, which have large engine displacements but, due to their lower passenger loads, give better mileage on the highway.

3.2 Using Facets


Facets are used to split the chart on the basis of a "factor". A factor in R is defined as a variable used to categorize and store the data, having a limited number of different values. It stores the data as a vector of integer values. A factor in R is also known as a categorical variable that stores both string and integer data values as levels. Factors are mostly used in statistical modeling and exploratory data analysis with R. Let's add a new layer using a facet.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~ class, nrow = 2)

Figure 4: A Simple Facet Example

The sign ~ is read as 'by'; thus facet_wrap(~ class, nrow = 2) tells ggplot() that we want the graph split into two rows of panels (nrow = 2) on the basis of class. Now suppose we want to create slices by two different facets, the drive train (drv) and the number of cylinders (cyl); then we proceed as follows:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(drv ~ cyl)

Figure 5: Faceting by two categorical variables

This chart covers a lot of ground. On the extreme right it gives us the drive train types (f = front wheel drive, r = rear wheel drive, 4 = four wheel drive). On the extreme top it gives us the number of cylinders (4, 5, 6 and 8). Each individual subplot shows the plot of displ (displacement) on the x-axis and of hwy (highway mileage) on the y-axis, and the points are colored by class. In case we want to facet by only one variable we can use a dot ('.') to represent the second variable. For example:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(drv ~ .)

Try the above chunk of code. It should produce a graph similar to the previous one, but the number of cylinders on the top will be missing, as we have only requested that the faceting be done using drv. Try replacing facet_grid(drv ~ .) with facet_grid(. ~ cyl) and check the results.
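For reference, a minimal sketch of that variant:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(. ~ cyl)   ## one column of panels per cylinder count, no row split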

3.3 Using Multiple ggplot2 Layers


By now it should be clear how we can use the ggplot2 functions to create graphs. Let us now clarify a few more things and then move on to Exploratory Data Analysis. So far we have only supplied the aesthetic mapping to the added layers; but if we add the aesthetics to the ggplot() layer we have a global option which will be used by all of the layers below the topmost layer. The global option can be overwritten by adding the desired aesthetics to the secondary plotting layers. For example, the command:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

will produce a chart which looks identical to figure 1. Good enough, but can we add another layer to the plot? The answer is yes, and we will add a smoother layer to the chart as follows:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +    ## This is layer 1 with the points
  geom_smooth()     ## This is layer 2 adding the smoother

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 6: Adding Multiple Layers

The geom_smooth() function adds a loess smoother for a small dataset; for a large dataset it fits a gam (generalized additive model) instead. The function, by default, also shades the standard-error band, which makes it easier to identify the trend. Replace geom_smooth() with geom_smooth(se = FALSE) and see the output.
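For instance, a minimal variant of the earlier chunk with the ribbon switched off:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)   ## same smoother, without the standard-error band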
Let us plot another graph and see the results:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 7: Adding Smoothers based on Drive Type

In figure 7 we have plotted the points and added a smoother; but we have also instructed ggplot2 to add the layers by drive type (drv), leading to ggplot2 adding 3 smoothers, one per drive type.

3.4 Changing ggplot2 Themes
ggplot2 allows the expert user to do a lot of customization to the charts drawn. We shall not go into how we can control the individual elements, but rather focus on changing the appearance of the output. The theme used by default in ggplot2 is theme_gray(); here we shall see the use of two built-in themes, theme_classic() and theme_void(). As R is an object-oriented language, we can assign a whole series of commands to a single object.

GPLOT <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Here we have assigned the entire command to a new object GPLOT. Now we will add the theme layer and see how it works:

GPLOT + theme_void()

Figure 8: Using Theme Void

GPLOT + theme_classic()

Figure 9: Using Theme Classic

The advantage of putting a lot of commands into another object is that it saves us a lot of typing. If the procedure confuses you, then you can stick to the copy + paste + modify method of working. Secondly, the ggplot2 package gives you the following themes. Please feel free to try these themes at your leisure (a quick sketch follows the table):

Theme              Description
theme_gray()       The default theme, with grey background and white grid lines
theme_bw()         The classic dark-on-light ggplot2 theme
theme_linedraw()   A theme with only black lines of various widths on white backgrounds
theme_light()      A theme similar to theme_linedraw() but with light grey lines and axes
theme_dark()       The dark cousin of theme_light(), with a dark background
theme_minimal()    A minimalistic theme with no background annotations
theme_classic()    A classic-looking theme, with x and y axis lines and no grid lines
theme_void()       A completely empty theme

Table 1: Themes in ggplot2
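As a quick sketch, any of these themes can be layered onto the saved object from above:

GPLOT + theme_minimal()   ## swap in any theme from Table 1
GPLOT + theme_dark()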

4 Doing Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to (a) maximize insight into a data set, (b) uncover the underlying structure of the variables, (c) extract important variables, (d) detect outliers and anomalies, (e) test underlying assumptions, and (f) develop parsimonious models and determine optimal factor settings. The EDA approach is precisely that – an approach – not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out. EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques – all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favour of the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics per se.
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analysts unparalleled power to do so, enticing the data to reveal its structural secrets, and being always ready to gain some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides, of course, unparalleled power to carry this out. The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

• Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
• Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.
• Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.

In this section we shall explore some of the techniques of EDA using the mpg dataset. We saw from the structure output above that the mpg dataset has a total of 234 rows and 11 columns. Out of these there are six character variables (alternatively known as discrete or categorical): manufacturer, model, trans, drv, fl and class. displ is a numeric variable, and year, cyl, cty and hwy are integer variables.

4.1 Analysis of a Single Variable
How you visualize a variable depends on whether the variable is categorical or continuous in nature. A variable is said to be categorical if it takes a small number of discrete values. In R, categorical variables are saved as "factor" or "character" variables, and to examine them we use the bar chart.

4.1.1 Single Categorical Variable


We already know which of the variables are categorical in nature, so let's plot a bar chart:

ggplot(data = mpg) + geom_bar(mapping = aes(x = class))

Figure 10: Vertical Bar Chart of Car Types

Okay, fine. Now we want the graph to change orientation to horizontal:

ggplot(data = mpg) + geom_bar(mapping = aes(y = class))

Figure 11: Horizontal Bar Chart of Car Types
Pretty good, but the bar chart does not give us any major information. A question that naturally comes up is whether we can count the number of such cars. This requires a little trickery and the use of the dplyr package. If you have not loaded the dplyr package, please do so using the library() function. If you loaded the tidyverse package, then dplyr is already loaded by default. In any case, try the following command:

mpg %>% count(class)

## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62

The %>% is known as the piping operator and is used to redirect the output of one command to another command. The count() function is available in the dplyr package, and the command effectively says: from the dataset mpg take the class variable, run it through the count() function and return the results. (The pipe is equivalent to an ordinary function call, as the sketch below shows.)
The results are illuminating: SUVs tend to dominate the market (62), followed by compact cars (47), midsize cars (41), subcompacts (35), pickups (33), minivans (11) and two-seaters (5). Let us now go one step further and ask ggplot2 to add colour as per the drive train:
ggplot(data = mpg) + geom_bar(mapping = aes(y = class, fill = drv))

Figure 12: Horizontal Bar Chart of Car Types filled by Drive Train
The results are even more interesting. Four-wheel drives are most common in the SUV and pickup segments. Rear-wheel drive is mostly seen in the two-seater and SUV segments, while other car types generally have front-wheel drive.

4.1.2 Single Continuous Variable


In the case of a continuous variable we have a larger set of analyses available. First and foremost, we could create a histogram to check the distribution of the continuous variable as follows:

ggplot(data = mpg, mapping = aes(x = cty)) + geom_histogram(bins = 10, fill = "grey") +
  geom_freqpoly(binwidth = 3, color = "red") + theme_light()

This set of commands does several things; let us examine them one by one. geom_histogram() tells ggplot() to draw the histogram of the city mileage. The bins = 10 informs ggplot() that it should divide the data into 10 equal class intervals. Additionally, we have instructed it to fill the histogram with the colour grey. In the next layer, geom_freqpoly() is used to add the frequency polygon for the city mileage. The binwidth = 3 informs it that we want the width of the bins to be 3, and color = "red" tells ggplot() to draw the frequency polygon in red. You should note here that if you leave the default values for bins and binwidth, ggplot() may complain, and you will have to use trial and error to decide which numbers produce the best histogram and frequency polygon. Lastly, to make the diagram more visible, I have instructed ggplot() to use theme_light() to draw the diagram (see figure 13).

Figure 13: Histogram of City Mileage

The diagram tells us that the city mileage is slightly skewed to the right. To check whether
it is really so, we run the summary command and get the following results:

summary(mpg$cty)

##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   9.00   14.00   17.00   16.86   19.00   35.00

The results confirm our belief from the histogram. The distance between the 3rd quartile and the maximum value is 16.00, whereas the distance between the 1st quartile and the minimum is 5.00. To reconfirm the values, let us take the same data and plot it as a boxplot.

ggplot(data = mpg, mapping = aes(x = cty)) + geom_boxplot()

Figure 14: Boxplot of City Mileage

A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of a box that stretches from the 25th percentile (1st quartile) of the distribution to the 75th percentile (3rd quartile), a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. the 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side. Points that fall more than 1.5 times the IQR from either edge of the box are unusual, so they are plotted individually and are known as outliers. A line (or whisker) extends from each end of the box to the farthest non-outlier point in the distribution.
Another important plot for a continuous variable is the QQ (Quantile-Quantile) plot, which is used to test whether the continuous variable follows a normal distribution or not. This is achieved by the following commands:

ggplot(data = mpg, mapping = aes(sample = cty)) +
  stat_qq(color = "red") + stat_qq_line(color = "blue")

Figure 15: QQ Plot of City Mileage

The quantile-quantile plot is an interesting plot, but before explaining it please note the difference in the aes() function of this plot. The QQ plot takes sample = variable as opposed to x = variable. This difference arises because the theoretical normal quantiles are plotted on the x-axis while the data are plotted on the y-axis. The y-axis data are considered to be a sample, and the quantiles are computed from the data by the stat_qq() function. For ease of visibility, I have requested ggplot() to plot the sample quantiles in "red" and the normal distribution line (stat_qq_line()) in "blue". For interpretation purposes, if the points are far away from the normal distribution line then the sample is non-normal, and vice versa. This is an eyeballing technique; more precise tests are available in R.
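One such formal check, available in base R, is the Shapiro-Wilk normality test; a minimal sketch:

shapiro.test(mpg$cty)   ## a small p-value suggests departure from normality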

4.2 Analysing Two Variables


When it comes to analysing two variables the issues are a bit more complicated, as there can be the following combinations of variables: (a) categorical versus categorical, (b) categorical versus continuous, or (c) continuous versus continuous. Analysing these combinations requires a bit more ingenuity than analysing a single variable, but we shall restrict our analysis to simple plots which are easily doable.

4.2.1 Scatter Plots
One of the most used plots for data analysis is the scatter plot. Scatter plots help us understand the nature of the relationship between two continuous variables. In order to make a scatter plot we use the same set of commands that were used to produce figures 1–3. Let's make a new scatter plot of city versus highway mileage:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()

Figure 16: Scatter Plot of City Mileage versus Highway Mileage

Figure 16 tells us that city and highway mileage have a very strong positive relationship. The original data has 234 data points, but the chart seems to display fewer points. This is because both cty and hwy are integers and many overlapping points appear as a single dot. This can be handled by using geom_jitter(). The jitter function randomly perturbs the original points based on a threshold controlled by the width and size arguments. Let's also add a straight-line smoother to the plot while we are at it:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point() +
  geom_smooth(method = "lm") + geom_jitter(width = 0.5, size = 1)

## `geom_smooth()` using formula 'y ~ x'

Figure 17: Scatter Plot with Jitter + Smoother Line of City & Highway Mileage

4.2.2 Multiple Box Plot
For two variables, one continuous and the other categorical, it sometimes makes more sense to use multiple box plots. We have already seen how to create a boxplot (refer to figure 14); we shall use the same approach to draw multiple boxplots, but shall segregate them on the basis of class:
ggplot(data = mpg, mapping = aes(x = cty, y = class)) + geom_boxplot(fill = "plum")

Figure 18: Boxplot of City Mileage by Class
Another way of looking at the data would be a violin plot, in which the violins are effectively mirrored density plots of the particular variable. It is created by adding geom_violin(); this exercise is left to the readers of the document (a small sketch appears at the end of this section). The analysis of the boxplots throws up some interesting outputs. For example, the maximum variation in city mileage occurs in the case of subcompacts and the least variation occurs for two-seaters. The compact category, for example, ranges from about 15 miles per gallon to 23 miles per gallon, but has three outliers. SUVs and pickups provide the lowest mileage, but there are two SUVs which outperform other car categories in terms of city mileage. Another set of insights concerns the skewness of the data. Subcompact mileage is highly skewed towards the right, which in effect is a good thing, as the subcompact industry may be investing more in improving the mileage of subcompacts. On the other hand, two-seaters have one of the lowest city mileages in the data and no spread is seen in the category. This may lead one to infer that the two-seater category does not sell on the basis of mileage but on other factors. Similar analysis can be done for highway mileage, and some useful insights can be drawn from the data.
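A minimal sketch of the violin-plot exercise mentioned above:

ggplot(data = mpg, mapping = aes(x = cty, y = class)) +
  geom_violin(fill = "plum")   ## mirrored densities in place of boxes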

4.2.3 Correlogram and Heat-map


Another very important plot is the correlogram, which lets us examine the correlations of multiple continuous variables present in the same dataset. This is conveniently implemented using the ggcorrplot (Kassambara, 2019) package. As the dataset used till now is mostly categorical, I will be using the diamonds dataset, which is available with the ggplot2 package. First we shall install and load the libraries and check the dataset:

install.packages("ggcorrplot")   ## install the library ggcorrplot
library(tidyverse)               ## load all packages and datasets

library(ggcorrplot)   ## load the library ggcorrplot
str(diamonds)         ## check the structure of the diamonds dataset

## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Now from the diamonds dataset we see that carat, depth, table, price, x, y and z are the numeric variables. However, we may wish to relabel some of the variables, because "x" is the length, "y" is the width and "z" is the depth, while the original variable "depth" is actually the depth percentage. To check these, let's run the following:

?diamonds

We also need to pull out the numeric variables and discard the non-numeric variables for the correlations to be computed. We do this as follows:
attach(diamonds)   ## attach the dataset into the workspace
DIA <- cbind(price, carat, depth, table, x, y, z)
head(DIA)

##      price carat depth table    x    y    z
## [1,]   326  0.23  61.5    55 3.95 3.98 2.43
## [2,]   326  0.21  59.8    61 3.89 3.84 2.31
## [3,]   327  0.23  56.9    65 4.05 4.07 2.31
## [4,]   334  0.29  62.4    58 4.20 4.23 2.63
## [5,]   335  0.31  63.3    58 4.34 4.35 2.75
## [6,]   336  0.24  62.8    57 3.94 3.96 2.48

As can be seen, the dataset DIA has the variables x, y and z. We shall now replace the column names as follows:
colnames(DIA) <- c("Price", "Carat", "DepthPct", "Table", "Length", "Width", "Depth")
head(DIA)

##      Price Carat DepthPct Table Length Width Depth
## [1,]   326  0.23     61.5    55   3.95  3.98  2.43
## [2,]   326  0.21     59.8    61   3.89  3.84  2.31
## [3,]   327  0.23     56.9    65   4.05  4.07  2.31
## [4,]   334  0.29     62.4    58   4.20  4.23  2.63
## [5,]   335  0.31     63.3    58   4.34  4.35  2.75
## [6,]   336  0.24     62.8    57   3.94  3.96  2.48

Okay, now that we have the variables in one place, it is time to compute the correlation matrix. The function cor(DIA) computes the correlation matrix for all the variables in DIA. Let us compute the correlations and store them in a matrix named CORMAT. cor() reports correlations to many decimal places, which makes reading the result pretty exhausting, so we also apply the round() function and round the results to 3 decimal places:

CORMAT <- round(cor(DIA), 3)
CORMAT

##           Price Carat DepthPct  Table Length  Width  Depth
## Price     1.000 0.922   -0.011  0.127  0.884  0.865  0.861
## Carat     0.922 1.000    0.028  0.182  0.975  0.952  0.953
## DepthPct -0.011 0.028    1.000 -0.296 -0.025 -0.029  0.095
## Table     0.127 0.182   -0.296  1.000  0.195  0.184  0.151
## Length    0.884 0.975   -0.025  0.195  1.000  0.975  0.971
## Width     0.865 0.952   -0.029  0.184  0.975  1.000  0.952
## Depth     0.861 0.953    0.095  0.151  0.971  0.952  1.000
Now it is a simple matter to produce the heatmap as follows:

ggcorrplot(CORMAT)

Figure 19: Heatmap of the Computed Correlation Matrix

A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. There are two fundamentally different categories of heat maps: the cluster heatmap and the spatial heatmap. In a cluster heatmap, magnitudes are laid out in a matrix of fixed cell size whose rows and columns are discrete phenomena and categories, and the sorting of rows and columns is intentional and somewhat arbitrary, with the goal of suggesting clusters or portraying them as discovered via statistical analysis. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heatmap is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously.
The heatmap drawn in figure 19 is a cluster heatmap and uses three colors to reflect the correlation: blue for negative correlations, white for zero correlation and red for positive correlations. However, the problem with such a heatmap is that it is pretty difficult to figure out what each cell represents. In our case we have seven (7) variables, so it is easy to see; but suppose we had 30 variables; in such a case it becomes difficult to understand what is going on. Fortunately, ggcorrplot() provides a lot of customisations which can be used to make things easier. Let us run the following command and see the output:

ggcorrplot(CORMAT, method = "circle", type = "lower", ggtheme = ggplot2::theme_classic,
           lab = TRUE, lab_size = 3, colors = c("green", "white", "pink"))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

Figure 20: Lower Triangle Heatmap with modifications

As usual, ggcorrplot() allows us to apply a plethora of customizations to the heatmap. For example, method = can be either "square" or "circle"; the default is "square". Similarly, type = can be "full" (the default), "lower" or "upper". ggtheme = can be used to apply any of the themes provided by the ggplot2 package (for a list of themes please see table 1); the default theme used is theme_minimal. Labels are added using lab =, which can take either TRUE or FALSE; if TRUE, the heatmap shows the correlation coefficients on the graph. Related to it is lab_size =, which determines the size of the labels when lab is TRUE. colors = c() takes a vector of three colors, which are used to shade from low through mid to high correlation values. I would urge the readers to experiment with these and decide on the combination which works best for them.

5 Dealing with Missing Data


In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject"). Some items are more likely to generate nonresponse than others, for example items about private subjects such as income. Data are often missing in research in economics, sociology and political science because governments or private entities choose not to, or fail to, report critical statistics, or because the information is not available. Sometimes missing values are caused by the researcher, for example when data collection is done improperly or mistakes are made in data entry. Missing values can be treated using the following methods:

5.1 Deletion:
The deletion method is used when the probability of a value being missing is the same for all observations, i.e. each observation has an equal chance of having a missing value. Deletion can be performed in two ways: listwise deletion and pairwise deletion.
In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size. Put simply, this method deletes the whole row of observations in which any data is missing. In pairwise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One of its disadvantages is that it uses different sample sizes for different variables.
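A minimal sketch of both strategies in base R, assuming a hypothetical data frame df with columns x and y:

df_listwise <- na.omit(df)                        ## listwise: drop every row containing any NA
cor(df$x, df$y, use = "pairwise.complete.obs")    ## pairwise: use all complete (x, y) pairs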

5.2 Imputation Methods:
Imputation is a method of filling in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types. In generalized imputation, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. In similar-case imputation, we calculate the averages of the non-missing values separately for each group (for example, by gender) and then replace each missing value with the average of its group.
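A minimal sketch of both, assuming a hypothetical data frame df with a numeric income column and a categorical gender column:

## Generalized imputation: replace NAs with the overall mean
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

## Similar-case imputation: impute the group mean (here, by gender) using dplyr
df2 <- df %>%
  group_by(gender) %>%
  mutate(income = ifelse(is.na(income), mean(income, na.rm = TRUE), income)) %>%
  ungroup()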

5.3 Prediction Model:
The prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate values that will substitute for the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second data set with missing values is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set, and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to do this. There are two drawbacks to this approach. Firstly, the model-estimated values are usually more well-behaved than the true values. Secondly, if there are no relationships between the other attributes in the data set and the attribute with missing values, then the model will not be precise for estimating the missing values.
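A minimal regression-based sketch, assuming a hypothetical data frame df with target y and predictors x1 and x2:

train <- df[!is.na(df$y), ]              ## rows where the target is observed
test  <- df[ is.na(df$y), ]              ## rows where the target is missing
fit   <- lm(y ~ x1 + x2, data = train)   ## model fitted on the complete rows
df$y[is.na(df$y)] <- predict(fit, newdata = test)   ## fill in the predictions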

5.4 KNN Imputation:
In this method of imputation, the missing values of an attribute are imputed using a given number of instances that are most similar to the instance whose value is missing. The similarity of two instances is determined using a distance function. The advantage of KNN is that it can predict both qualitative and quantitative attributes. Creation of a predictive model for each attribute with missing data is not required. Attributes with multiple missing values can be easily treated, and the correlation structure of the data is taken into consideration. However, KNN also has some disadvantages. The algorithm is very time-consuming when analyzing a large database, since it searches through the whole dataset looking for the most similar instances. The choice of the k value is also very critical: a higher value of k would include neighbours which are significantly different from the instance we need, whereas a lower value of k implies missing out significant neighbours.
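One common implementation is the kNN() function from the VIM package; this is shown purely as an illustration (VIM is not used elsewhere in this document, and df is a hypothetical data frame):

install.packages("VIM")
library(VIM)
df_knn <- kNN(df, k = 5)   ## imputes every column containing NAs using the 5 nearest neighbours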

5.5 Using R to deal with Missing Data


R has various packages to deal with missing data. We shall discuss the Hmisc package for dealing with missing data. I am not saying that Hmisc is the best package, but it is one of the simplest to use.
Hmisc (Harrell Jr, with contributions from Charles Dupont, and many others, 2021) is a multi-purpose package useful for data analysis, high-level graphics, imputing missing values, advanced table making, model-fitting diagnostics (linear regression, logistic regression, Cox regression), etc. Amidst the wide range of functions contained in this package, it offers 2 powerful functions for imputing missing values: impute() and aregImpute().
The impute() function simply imputes missing values using a user-defined statistical summary (e.g. mean, max, median); its default is the median. On the other hand, aregImpute() allows imputation using additive regression, bootstrapping, and predictive mean matching. In bootstrapping, different bootstrap resamples are used for each of multiple imputations. Then, a flexible additive model (a non-parametric regression method) is fitted on samples taken with replacement from the original data, and missing values (acting as the dependent variable) are predicted using the non-missing values (independent variables). Then, it uses predictive mean matching (the default) to impute missing values. Predictive mean matching works well for continuous and categorical (binary, multi-level) variables without the need for computing residuals and maximum likelihood fits. The package assumes that there is linearity in the variables being predicted, and Fisher's optimal scoring method is used for predicting categorical variables.
Instead of creating a random dataset, we shall use the msleep dataset made available with the tidyverse package. Let's install the Hmisc package and load it:

install.packages("Hmisc", dependencies = TRUE)
library(Hmisc)

## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
##     src, summarize
## The following objects are masked from 'package:base':
##
##     format.pval, units

Let's now apply the summary() function to the dataset:

summary(msleep)

##      name              genus               vore               order
##  Length:83          Length:83          Length:83          Length:83
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##  conservation        sleep_total      sleep_rem      sleep_cycle
##  Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167
##  Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833
##  Mode  :character   Median :10.10   Median :1.500   Median :0.3333
##                     Mean   :10.43   Mean   :1.875   Mean   :0.4396
##                     3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792
##                     Max.   :19.90   Max.   :6.600   Max.   :1.5000
##                                     NA's   :22      NA's   :51
##      awake           brainwt            bodywt
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174
##  Median :13.90   Median :0.01240   Median :   1.670
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000
##                  NA's   :27

From the output we see that sleep_rem, sleep_cycle and brainwt have 22, 51 and 27 NA's respectively. Let's try to impute the missing values in sleep_rem using the arithmetic mean:

msleep$sleep_rem <- with(msleep, impute(sleep_rem, mean))
summary(msleep$sleep_rem)

##
## 22 values imputed to 1.87541
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.100   1.150   1.875   1.875   2.200   6.600

The above command requires a bit of explanation. The LHS tells R that from the dataset msleep we take the variable sleep_rem and assign to it the output generated on the RHS. On the RHS, it tells Hmisc to use the dataset msleep and fill in all the missing values in the variable sleep_rem with the mean. summary(msleep$sleep_rem) gives us the summary output for the variable, and as can be seen, 22 values were imputed with the arithmetic mean. Similarly, if we specify 'random' then the package replaces the missing values with random draws from the observed values. The default replacement value is the median. We can also use max to replace the missing values with the maximum, or min to replace them with the minimum.
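A small sketch of these options, mirroring the semantics described above (x is a hypothetical numeric vector):

x <- c(1, NA, 3, NA, 9)
impute(x)             ## default: median
impute(x, mean)       ## arithmetic mean
impute(x, 'random')   ## random draws from the observed values
impute(x, max)        ## maximum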
Let us now use a slightly upgraded imputation method. The aregImpute() function, by default, uses the predictive mean matching (PMM) method to impute the missing data. One thing to note here is that the imputation result needs to be assigned to a new object for analysis purposes. Let's first try the function:

msleep.imputed <- aregImpute(~ sleep_total + sleep_rem + sleep_cycle + awake + brainwt +
                               bodywt, data = msleep, n.impute = 10)

## Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7
## Iteration 8 Iteration 9 Iteration 10 Iteration 11 Iteration 12 Iteration 13
msleep.imputed

##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~sleep_total + sleep_rem + sleep_cycle +
##     awake + brainwt + bodywt, data = msleep, n.impute = 10)
##
## n: 83    p: 6    Imputations: 10    nk: 3
##
## Number of NAs:
##  sleep_total   sleep_rem sleep_cycle       awake     brainwt      bodywt
##            0           0          51           0          27           0
##
##             type d.f.
## sleep_total    s    2
## sleep_rem      s    2
## sleep_cycle    s    2
## awake          s    2
## brainwt        s    1
## bodywt         s    2
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
##  sleep_cycle     brainwt
##        0.843       0.993

In the above call, we have created a new object named msleep.imputed. We have instructed the aregImpute() function to carry out predictive mean matching using the variables sleep_total, sleep_rem, sleep_cycle, awake, brainwt and bodywt, filling in the missing values in the msleep dataset. The n.impute = 10 argument tells the function to perform 10 multiple imputations, each using a different bootstrap resample. The second line, msleep.imputed, prints the result of the imputation. One major advantage of using aregImpute() is that it automatically identifies the variable types and treats them accordingly. The R-squared values reported at the end indicate how good the imputation was; obviously, the higher the R-squared values, the better the imputation.
Unfortunately, the imputed values are not so easily available, nor are they easily combinable with the old dataset. The data reside in a separately created component called imputed. So if we want to extract the imputed dataset for analysis, we have to do the following:
new_msleep <- impute.transcan(msleep.imputed, imputation = 1, data = msleep, list.out = TRUE)

## Imputed missing values with the following frequencies ...