Module 2: Exploratory Data Analysis
Contents

1 Introduction to EDA
2 Preparing RStudio for EDA
3 Using ggplot2
  3.1 Beginning Charting
  3.2 Using Facets
  3.3 Using Multiple ggplot2 Layers
  3.4 Changing ggplot2 themes
4 Doing Exploratory Data Analysis
  4.1 Analysis of a Single Variable
    4.1.1 Single Categorical Variable
    4.1.2 Single Continuous Variable
  4.2 Analysing Two Variables
    4.2.1 Scatter Plots
    4.2.2 Multiple Box Plots
    4.2.3 Correlogram and Heat-map
5 Dealing with Missing Data
  5.1 Deletion
  5.2 Imputation Methods
  5.3 Prediction Model
  5.4 KNN Imputation
  5.5 Using R to deal with Missing Data
6 Summary
7 References
1 Introduction to EDA
Exploratory Data Analysis (EDA) (Tukey, 1977) can be simply defined as the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
In statistics, exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics and other data-visualization methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, on handling missing values and on making transformations of variables as needed. EDA also differs from statistical graphing in the sense that the latter is done mainly for presentation and publication purposes, even though EDA makes use of many techniques of statistical graphing.
2 Preparing RStudio for EDA
In order to do EDA in R (R Core Team, 2021) and RStudio (RStudio Team, 2020) we require the tidyverse package, which can be installed as follows:

install.packages("tidyverse")

The rationale for choosing the tidyverse package (Wickham et al., 2019) is that it contains the ggplot2 package, which is necessary for doing exploratory data analysis. We shall be using the dataset mpg, which is available in the ggplot2 package, for understanding exploratory data analysis. Alternatively, if so desired, users can directly install the package ggplot2 using the above install.packages() command by replacing tidyverse with ggplot2.
The next step is to load the library tidyverse and make all its functionality available to the R engine. This is done with the library() command as follows:
library(tidyverse)
As you can see, the moment we attach the library tidyverse it loads eight distinct packages and makes them available to the R engine. We are not actually going to use the remaining packages, but they make it possible to read in almost any kind of data, clean it and arrange it for analysis with R. Our key focus will be on the ggplot2 (Wickham, 2016) package, which will be used to create the graphs and plots necessary for understanding exploratory data analysis. Also, if you see a Conflicts section in the output, do not worry about it. When we load the tidyverse environment it loads other packages, and some function calls conflict with functions from other packages. There is no need to worry, as R masks these functions so that there is no conflict between them.
3 Using ggplot2
ggplot2 graphs are built on a layered philosophy of data visualization. In most cases we start with the function ggplot(), supplying a dataset and an aesthetic mapping with the function aes(). We then add layers like geom_point() or geom_histogram(), scales like scale_colour_brewer(), faceting specifications like facet_wrap() and coordinate systems like coord_flip(). All the layers are additive in nature and are added with the + sign. The only mandatory requirement is that the plot begins with ggplot() and a dataset. In order to understand the functioning of ggplot() we shall use the dataset mpg. The dataset mpg gets attached to R and can be seen by typing the following command:
mpg
## # A tibble: 234 x 11
##    manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4            1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4            1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4            2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4            2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4            2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4            2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4            3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro    1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro    1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro    2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows
To better understand the structure of the dataset we use the str() function as follows:

str(mpg)

The first line tells us that it is a tibble with 234 rows (or cases) and 11 columns (or variables). A tibble is a modern form of data.frame. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. The next lines display the names of the variables, their types and some typical values. In our case we have chr (which stands for "character"), num (which stands for "number") and int (which stands for "integer"). There are other data types which we will discover subsequently. To read up on the structure of the dataset, type the code below. This code starts the help server and displays information about the dataset. Make a note of the proper names of the variables:

?mpg
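As an aside, dplyr (loaded as part of the tidyverse) provides glimpse(), which gives a similarly compact overview of a dataset; a minimal sketch:

library(dplyr)
glimpse(mpg)   ## one row per variable: name, type and the first few values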
3.1 Beginning Charting
Once we are familiar with the variable names and the structure of the data, let us now start learning how to plot using ggplot2. Type the following lines and R will come up with the following output:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
[Figure 1: Scatter plot of displ (x-axis) against hwy (y-axis)]
As we already know, ggplot() charts are made by adding layers. ggplot(data = mpg) is the base layer of ggplot2, where we specify the data to be used for plotting. geom_point() tells ggplot2 to add a layer of points to the chart to create a scatterplot. The set of arguments inside geom_point(), namely mapping = aes(x = displ, y = hwy), tells it which variables to plot on the x-axis and y-axis. As we can see, on the x-axis we have asked ggplot() to plot displ (engine displacement) and on the y-axis we have asked it to plot hwy (highway mileage). And that is exactly what ggplot() has plotted for us. The mapping argument is always paired with aes(), which is shorthand for aesthetics. We can see from the figure that as the displacement increases the mileage tends to decrease. The problem with the above chart is that the black points make it rather difficult to see. Would it be possible to add color to the chart? Well, we can do that by adding the color outside of the aes() function as follows:
[Figure 2: Scatter plot of displ against hwy with colored points]
Not quite colorful, but for the time being this will do. Now looking at the chart we see that some cars (situated on the right-hand side) have huge displacements but also good highway mileage. Are these cars of some different class? Let's try to answer this question by coloring the chart with the class attribute. We will allow ggplot() to handle the coloring, and therefore this command differs from the previous one in that the color-by-class request is passed inside the aes() function. This essentially tells ggplot2 that we want the chart colored by a nominal variable and that it should decide how to handle the coloring. Here is the code:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
[Figure 3: Scatter plot of displ against hwy, points colored by class]
If you zoom in to the chart (figure 3) you will find that the dots which were causing us problems belong to the 2seater category. These are usually sports cars, which have large engine displacements but, due to the lower passenger loads, give better mileage on the highway.

3.2 Using Facets

Another way to bring a categorical variable into a plot is to split the chart into facets, i.e. subplots that each display one subset of the data. To facet the plot by a single variable we use facet_wrap():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~ class, nrow = 2)
[Figure 4: A Simple Facet Example (one panel of displ against hwy per class)]
The sign ~ is read as 'by'; thus facet_wrap(~ class, nrow = 2) tells ggplot() that we want the graph split into subplots on the basis of class, arranged in two rows (nrow = 2). Now suppose we want to create slices by two different facets, the drive-train type (drv) and the number of cylinders (cyl); then we proceed as follows:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(drv ~ cyl)
[Figure 5: Scatter plots of displ against hwy faceted by drv (rows) and cyl (columns), colored by class]
This chart covers a lot of ground. On the extreme right it gives us the drive-train types (f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive). On the extreme top it gives us the number of cylinders (4, 5, 6 and 8). Each individual subplot shows displ (displacement) on the x-axis and hwy (highway mileage) on the y-axis, and the points are colored by class. In case we want to facet by only one variable we can use a dot ('.') to represent the second variable. For example:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(drv ~ .)
Try the above chunk of code. It should produce a graph similar to the previous one, but the number of cylinders on the top would be missing, as we have only requested that the faceting be done using drv. Try replacing facet_grid(drv ~ .) with facet_grid(. ~ cyl) and check the results; a sketch of the variant is given below.
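A minimal sketch of the suggested variant, faceting into columns by cyl only:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(. ~ cyl)   ## columns by number of cylinders; rows unfaceted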
3.3 Using Multiple ggplot2 Layers

As discussed earlier, ggplot2 builds charts layer by layer. The chunk

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

will produce a chart which looks identical to figure 1. Good enough, but can we add another layer to the plot? The answer is yes, and we will add a smoother layer to the chart as follows:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +    ## This is layer 1 with the points
  geom_smooth()     ## This is layer 2 adding the smoother
[Figure 6: Scatter plot of displ against hwy with a smoother and its standard-error band]
The geom_smooth() function adds a loess smoother for a small dataset; for a large dataset it adds a gam fit. The function, by default, also draws the standard-error band, as this makes it easier to identify the trend. Replace geom_smooth() with geom_smooth(se = FALSE) and see the output.
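A minimal sketch of that replacement:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)   ## se = FALSE suppresses the standard-error band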
Let us plot another graph and see the results, this time mapping the drive type (drv) to color so that the layers are drawn per drive type:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()
[Figure 7: Scatter plot of displ against hwy with points and smoothers colored by drive type (drv)]
In figure 7 we have plotted the points and added a smoother, but we have also instructed ggplot2 to add the layers by drive type (drv), leading to ggplot2 adding three smoothers, one per drive type.
3.4 Changing ggplot2 themes
ggplot2 allows the expert user to do a lot of customization of the charts drawn. We shall not go into how to control the individual elements, but rather into changing the appearance of the output. The theme used by default in ggplot2 is theme_gray(); here we shall see the use of two built-in themes, theme_classic() and theme_void(). As R is an object-oriented language, we can assign a whole series of commands to a single object. Here we have assigned the entire command for the class-colored scatter plot to a new object GPLOT:

GPLOT <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Now we will add the theme layer and see how it works:
GPLOT + theme_void()
[Figure 8: The GPLOT chart rendered with theme_void(); only the points and the class legend remain]
GPLOT + theme_classic()
[Figure 9: The GPLOT chart rendered with theme_classic(): axis lines, no grid lines]
The advantage of putting a long series of commands into an object is that it saves us a lot of typing. If the procedure confuses you, then you should stick to the copy + paste + modify method of working. Secondly, the ggplot2 package gives you the following themes. Please feel free to try these themes at your leisure:
Theme              Description
theme_gray()       The default theme, with a grey background and white grid lines
theme_bw()         The classic dark-on-light ggplot2 theme
theme_linedraw()   A theme with only black lines of various widths on a white background
theme_light()      Similar to theme_linedraw(), but with light grey lines and axes
theme_dark()       The dark cousin of theme_light(), with a dark background
theme_minimal()    A minimalistic theme with no background annotations
theme_classic()    A classic-looking theme, with x and y axis lines and no grid lines
theme_void()       A completely empty theme
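Related to this, the default theme can also be changed for the whole session with theme_set(), a standard ggplot2 function; a minimal sketch:

theme_set(theme_minimal())   ## every subsequent plot now uses theme_minimal() by default
GPLOT                        ## the saved plot is rendered with the new default theme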
4 Doing Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to (a) maximize insight into a data set, (b) uncover the underlying structure of the variables, (c) extract important variables, (d) detect outliers and anomalies, (e) test underlying assumptions, and (f) develop parsimonious models and determine optimal factor settings. The EDA approach is precisely that, an approach: not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out. EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and each focusing on one data-characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics per se.
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives analysts unparalleled power to do so, enticing the data to reveal its structural secrets and keeping the analyst always ready to gain some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides unparalleled power to carry this out.
The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

• Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots and Youden plots).

• Plotting simple statistics of the raw data, such as mean plots, standard-deviation plots, box plots and main-effects plots.

• Positioning such plots so as to maximize our natural pattern-recognition abilities, for example by using multiple plots per page.
In this section we shall explore some of the techniques of EDA using the mpg dataset. We saw from its structure that the mpg dataset has a total of 234 rows and 11 columns. Of these, six variables are character variables (alternatively known as discrete or categorical): manufacturer, model, trans, drv, fl and class. displ is a numeric variable, and year, cyl, cty and hwy are integer variables.
4.1 Analysis of a Single Variable
How you visualize a variable depends on whether the variable is categorical or continuous in nature.

4.1.1 Single Categorical Variable

A variable is said to be categorical if it takes a small number of discrete values. In R, categorical variables are saved as "factor" or "character" variables, and to examine them we use the bar chart:

ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))
[Figure 10: Vertical Bar Chart of Car Types]
The same chart can be drawn horizontally by mapping class to the y aesthetic instead:

ggplot(data = mpg) +
  geom_bar(mapping = aes(y = class))

[Figure 11: Horizontal Bar Chart of Car Types]
Pretty good, but the bar chart does not give us any major information. A question that naturally comes up is whether we can count the number of cars in each class. This requires a little trickery and the use of the dplyr package. If you have not loaded the dplyr package, please do so using the library() function; if you loaded the tidyverse package, then dplyr is already loaded. In any case, try the following command:

mpg %>% count(class)
## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62
The %>% is known as the piping operator and is used to redirect the output of one command to another command. The count() function is available in the dplyr package, and the command effectively says: from the dataset mpg, take the class variable, run it through the count() function and return the results. The results are illuminating: SUVs tend to dominate the market, followed by compacts, midsize cars, subcompacts, pickups, minivans and two-seaters.
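As an aside, count() also accepts sort = TRUE, which returns the classes in descending order of frequency, matching the ranking just described:

mpg %>% count(class, sort = TRUE)   ## most frequent class first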
Let us now go one step further and ask ggplot2 to add colour as per the drive train:

ggplot(data = mpg) +
  geom_bar(mapping = aes(y = class, fill = drv))
[Figure 12: Horizontal Bar Chart of Car Types filled by Drive Train]
The results are even more interesting. Four-wheel drive is most common in the SUV and pickup segments. Rear-wheel drive is mostly seen in the two-seater and SUV segments, while the other car types generally have front-wheel drive.
4.1.2 Single Continuous Variable

To examine the distribution of a continuous variable such as cty (city mileage) we use the histogram:

ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = cty))

[Figure 13: Histogram of City Mileage (cty)]
The diagram tells us that the city mileage is slightly skewed to the right. To check whether it is really so, we run the summary command and get the following results:

summary(mpg$cty)

The results confirm our belief from the histogram. The distance between the 3rd quartile and the maximum value is 16.00, whereas the distance between the 1st quartile and the minimum is 5.00. To reconfirm the values, let us take the same data and plot it as a boxplot:

ggplot(data = mpg, mapping = aes(x = cty)) +
  geom_boxplot()
[Figure 14: Boxplot of City Mileage (cty)]
A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of a box that stretches from the 25th percentile (1st quartile) of the distribution to the 75th percentile (3rd quartile), a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. the 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and of whether or not it is symmetric about the median or skewed to one side. Points display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual, so they are plotted individually and are known as outliers. A line (or whisker) extends from each end of the box to the farthest non-outlier point in the distribution. The outlier fences can also be computed by hand, as sketched below.
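A short sketch computing the 1.5 x IQR fences for cty with base R (the object names are illustrative):

q   <- quantile(mpg$cty, c(0.25, 0.75))   ## 1st and 3rd quartiles
iqr <- unname(q[2] - q[1])                ## interquartile range
c(lower = unname(q[1]) - 1.5 * iqr,       ## points beyond these fences
  upper = unname(q[2]) + 1.5 * iqr)       ## are drawn as outliers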
Another important plot for a continuous variable is the QQ (quantile-quantile) plot, which is used to test whether the continuous variable follows a normal distribution or not. This is achieved by the following commands:

ggplot(data = mpg, mapping = aes(sample = cty)) +
  stat_qq(color = "red") +
  stat_qq_line(color = "blue")
[Figure 15: Normal QQ plot of city mileage]
The quantile-quantile plot is an interesting plot, but before explaining it, please note the difference in the aes() function of the plot. The QQ plot takes sample = variable, as opposed to x = variable. This difference arises because on the x-axis the normal distribution is plotted and on the y-axis the data are plotted. The y-axis data are treated as a sample, and the quantiles are computed from the data by the stat_qq() function. For ease of visibility, I have requested ggplot() to plot the sample quantiles in red and the normal-distribution line (stat_qq_line()) in blue. For interpretation purposes, if the points are far away from the normal-distribution line then the sample is non-normal, and vice versa. This is an eyeballing technique; more precise tests are available in R.
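One such precise test is the Shapiro-Wilk test from base R's stats package, shown here as an illustration:

shapiro.test(mpg$cty)   ## a small p-value suggests departure from normality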
4.2 Analysing Two Variables

4.2.1 Scatter Plots

One of the most used plots for data analysis is the scatter plot. Scatter plots help us understand the nature of the relationship between two continuous variables. In order to make a scatter plot we use the same set of commands that were used to produce figures 1–3. Let's make a new scatter plot of city against highway mileage:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))
[Figure 16: Scatter Plot of City against Highway Mileage]
Figure 16 tells us that city and highway mileage have a very strong positive relationship. The original data has 234 data points, but the chart seems to display fewer points. This is because both cty and hwy are integers, and many overlapping points appear as a single dot. This can be handled by using geom_jitter(). The jitter function randomly displaces the original points by an amount controlled by its width and height arguments. Let's also add a straight-line smoother to the plot while we are at it:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter() +
  geom_smooth(method = "lm")
[Figure 17: Scatter Plot with Jitter + Smoother Line of City & Highway Mileage]
4.2.2 Multiple Box Plots
For two variables, one continuous and the other categorical, it sometimes makes more sense to use multiple box plots. We have already seen how to create a boxplot (refer to figure 14); we shall use the same approach to draw multiple boxplots, segregated on the basis of class:
ggplot(data = mpg, mapping = aes(x = cty, y = class)) +
  geom_boxplot(fill = "plum")
[Figure 18: Boxplot of City Mileage by Class]
Another way of looking at the data would be a violin plot, in which each violin is effectively a mirrored, smoothed histogram of the variable within that category. It is created by adding geom_violin(); this exercise is left to the readers of the document, with a minimal sketch given after this paragraph. The analysis of the boxplots shows up some interesting results. For example, the maximum variation in city mileage occurs for subcompacts, and the least variation occurs for two-seaters. The compact category, for example, ranges from about 15 miles per gallon to 23 miles per gallon, but has three outliers. SUVs and pickups provide the lowest mileage, but there are two SUVs which outperform other car categories in terms of city mileage. Another set of insights concerns the skewness of the data. Subcompact mileage is highly skewed to the right, which in effect is a good thing, as the subcompact industry may be investing more in improving the mileage of subcompacts. On the other hand, two-seaters have among the lowest city mileages in the data, and little spread is seen in the category. This may lead one to infer that the two-seater category does not sell on the basis of mileage but on other factors. A similar analysis can be done for highway mileage, and some useful insights can be drawn from the data.
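A minimal sketch of the violin-plot exercise, reusing the mapping from figure 18:

ggplot(data = mpg, mapping = aes(x = cty, y = class)) +
  geom_violin(fill = "plum")   ## each violin shows the density of cty within a class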
4.2.3 Correlogram and Heat-map

To study the correlations among several continuous variables at once we can draw a correlogram; for this we shall use the ggcorrplot package and the diamonds dataset that ships with ggplot2:

library(ggcorrplot)   ## load the library ggcorrplot
str(diamonds)         ## check the structure of the diamonds dataset
From the structure we see that carat, depth, table, price, x, y and z are the numeric variables. However, we may wish to relabel some of the variables, because "x" is length, "y" is width, "z" is depth, and the original variable "depth" is actually the depth percentage. To check this, let's consult the dataset's help page:

?diamonds
We also need to pull out the numeric variables and discard the non-numeric variables for the correlation to be computed. We do this as follows:

attach(diamonds)   ## attach the dataset to the workspace
DIA <- cbind(price, carat, depth, table, x, y, z)
head(DIA)
As can be seen, the dataset DIA has the variables x, y and z. We shall now replace the entire set of column names as follows:

colnames(DIA) <- c("Price", "Carat", "DepthPct", "Table", "Length", "Width", "Depth")
head(DIA)
Okay, now that we have got the variables in one place, it is time to compute the correlation matrix. The function cor(DIA) will compute the correlation matrix on all the variables in the matrix DIA. Let us compute the correlations and store them in a matrix named CORMAT. Used on its own, cor() reports the correlations to about 7 decimal places, which makes reading the output pretty exhausting, so we also apply the round() function and round the results to 3 decimal places:

CORMAT <- round(cor(DIA), 3)
CORMAT
ggcorrplot(CORMAT)
[Figure 19: Correlogram of the diamonds variables drawn with ggcorrplot(), with a correlation scale from -1.0 to 1.0]
From the default correlogram it is hard to make out which way each relationship is going. Fortunately, ggcorrplot() provides a lot of customisations which can be used to make things easier. Let us run a command along the following lines and see the output:
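A sketch of such a customised call; the specific options hc.order, type and lab are assumptions, though all three are documented ggcorrplot() arguments:

ggcorrplot(CORMAT,
           hc.order = TRUE,   ## reorder variables by hierarchical clustering
           type = "lower",    ## show only the lower triangle
           lab = TRUE)        ## print the correlation coefficients in the cells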
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
[Figure 20: Customised correlogram with correlation coefficients printed in the cells]
5 Dealing with Missing Data

Missing values occur when no data value is stored for a variable in an observation; this can happen for many reasons, for example when data collection is done improperly or mistakes are made in data entry. Missing values can be treated using the following methods:
5.1 Deletion

The deletion method is used when the probability of a value being missing is the same for all observations, i.e. each observation has an equal chance of containing a missing value. Deletion can be performed in two ways: listwise deletion and pairwise deletion. In listwise deletion, we delete every observation in which any of the variables is missing; put simply, this method deletes the whole row of observations in which any data are missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size. In pairwise deletion, we perform each analysis with all the cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for each analysis; one of its disadvantages is that it uses different sample sizes for different variables. A sketch of both approaches follows.
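A minimal base-R sketch of both deletion approaches, using the msleep dataset that is analysed later in this section (the function choices are standard base R, not named in the text):

## Listwise deletion: keep only rows with no missing values
msleep_listwise <- na.omit(msleep)

## Pairwise deletion: use all observations complete for each pair of variables
cor(msleep$sleep_rem, msleep$brainwt, use = "pairwise.complete.obs")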
5.2 Imputation Methods

Imputation is a method of filling in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (for a quantitative attribute) or the mode (for a qualitative attribute) of all known values of that variable. It can be of two types. In generalized imputation, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. In similar-case imputation, we calculate the averages of the non-missing values separately for each group (for example, for each gender) and then replace a missing value with the average of the group that observation belongs to. Sketches of both variants follow.
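Minimal sketches of both variants, again using msleep; grouping by the vore variable stands in for the gender example and is an assumption:

library(dplyr)

## Generalized imputation: one overall mean for the variable
x <- msleep$sleep_rem
x[is.na(x)] <- mean(x, na.rm = TRUE)

## Similar-case imputation: replace NAs with the mean of the observation's group
msleep %>%
  group_by(vore) %>%
  mutate(sleep_rem = ifelse(is.na(sleep_rem),
                            mean(sleep_rem, na.rm = TRUE),
                            sleep_rem)) %>%
  ungroup()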
5.3 Prediction Model

A prediction model is one of the more sophisticated methods for handling missing data. Here we create a predictive model to estimate the values that will substitute for the missing data. In this case we divide our data set into two sets: one set with no missing values for the variable, and another with missing values. The first data set becomes the training data set of the model, while the second data set, with missing values, is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set, and use it to populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to do this; a regression sketch is given below. There are two drawbacks to this approach. Firstly, the model-estimated values are usually better behaved than the true values. Secondly, if there are no relationships between the attribute with missing values and the other attributes in the data set, then the model will not be precise in estimating the missing values.
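A minimal regression-imputation sketch on msleep; the choice of predictors (sleep_total and awake, which have no missing values) is an assumption for illustration:

train <- msleep[!is.na(msleep$sleep_rem), ]   ## rows where the target is observed
test  <- msleep[is.na(msleep$sleep_rem), ]    ## rows where the target is missing
fit   <- lm(sleep_rem ~ sleep_total + awake, data = train)
msleep$sleep_rem[is.na(msleep$sleep_rem)] <- predict(fit, newdata = test)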
5.4 KNN Imputation

In this method of imputation, the missing values of an attribute are imputed using a given number of instances that are most similar to the instance whose values are missing. The similarity of two instances is determined using a distance function. The advantages of KNN are that it can predict both qualitative and quantitative attributes; that the creation of a predictive model for each attribute with missing data is not required; that attributes with multiple missing values are easily treated; and that the correlation structure of the data is taken into consideration. However, it also has some disadvantages. The KNN algorithm is very time-consuming when analyzing a large database, as it searches through the whole dataset looking for the most similar instances. The choice of the k value is also very critical: a higher value of k would include attributes which are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes. A sketch follows.
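One readily available implementation is kNN() from the VIM package; the package choice and parameter values are assumptions, as the text does not name an implementation:

library(VIM)   ## provides kNN() for k-nearest-neighbour imputation

## impute the variables with missing values using the 5 nearest neighbours
msleep_knn <- kNN(msleep, variable = c("sleep_rem", "sleep_cycle", "brainwt"), k = 5)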
5.5 Using R to deal with Missing Data

To see how missing values are handled in practice we shall use the msleep dataset (which ships with ggplot2) and the Hmisc package. Running summary(msleep) shows where the NA's are (the output below is partial):
##  Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833
##  Mode  :character   Median :10.10   Median :1.500   Median :0.3333
##                     Mean   :10.43   Mean   :1.875   Mean   :0.4396
##                     3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792
##                     Max.   :19.90   Max.   :6.600   Max.   :1.5000
##                                     NA's   :22      NA's   :51
##      awake           brainwt            bodywt
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174
##  Median :13.90   Median :0.01240   Median :   1.670
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000
##                  NA's   :27
From the output we see that sleep_rem, sleep_cycle and brainwt have 22, 51 and 27 NA's respectively. Let's try to impute the missing values in sleep_rem using the arithmetic mean:

msleep$sleep_rem <- with(msleep, impute(sleep_rem, mean))
summary(msleep$sleep_rem)
##
## 22 values imputed to 1.87541
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.100   1.150   1.875   1.875   2.200   6.600
The above command requires a bit of explanation. The LHS tells R that in the dataset msleep the variable sleep_rem should be assigned the output generated on the RHS. On the RHS, the impute() function from the Hmisc package is told to take the variable sleep_rem from msleep and fill in all its missing values with the mean. The summary(msleep$sleep_rem) gives us the summary output for the variable, and as can be seen, 22 values were imputed with the arithmetic mean. Similarly, if we specify random then the package replaces the missing values with random observed values. The default replacement value is the median. We can also use max to replace the missing values with the maximum value, or min to replace them with the minimum value.
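Sketches of these alternatives with Hmisc's impute(), applied here to brainwt for illustration:

library(Hmisc)

summary(impute(msleep$brainwt))             ## default: impute with the median
summary(impute(msleep$brainwt, 'random'))   ## impute with random observed values
summary(impute(msleep$brainwt, max))        ## impute with the maximum value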
Let us now use a slightly upgraded imputation method. The aregImpute() function, by default, uses the predictive mean matching (PMM) method to impute the missing data. One thing to note here is that the result needs to be assigned to a new object for analysis purposes. Let's first try the function (this is the call echoed in the output further below):

msleep.imputed <- aregImpute(~ sleep_total + sleep_rem + sleep_cycle +
                               awake + brainwt + bodywt,
                             data = msleep, n.impute = 10)
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
## Iteration 9
## Iteration 10
## Iteration 11
## Iteration 12
## Iteration 13
msleep.imputed
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~sleep_total + sleep_rem + sleep_cycle +
##     awake + brainwt + bodywt, data = msleep, n.impute = 10)
##
## n: 83   p: 6   Imputations: 10   nk: 3
##
## Number of NAs: ...
##
## Imputed missing values with the following frequencies ...
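To pull a completed dataset out of the aregImpute object one can use Hmisc's impute.transcan() function; a minimal sketch, where the argument values are illustrative:

## imputation = 1 selects the first of the 10 imputations;
## list.out = TRUE returns the completed variables as a list
completed <- impute.transcan(msleep.imputed, imputation = 1, data = msleep,
                             list.out = TRUE, pr = FALSE, check = FALSE)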