MICE
List of R Packages
1. MICE
2. Amelia
3. missForest
4. Hmisc
5. mi
MICE Package
MICE (Multivariate Imputation by Chained Equations) is one of the most commonly used
packages among R users. Creating multiple imputations, rather than a single imputation
(such as the mean), accounts for the uncertainty in the missing values.
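To see why a single imputation such as the mean understates uncertainty, here is a small base R sketch. It is only an illustration: the seed and the choice of 30 removed values are arbitrary assumptions, not part of the walkthrough.

```r
# Illustration only: mean imputation shrinks the spread of a variable.
set.seed(1)                          # arbitrary seed, for reproducibility
x <- iris$Sepal.Length
x_mis <- x
x_mis[sample(length(x), 30)] <- NA   # remove 30 values (arbitrary count)

# Single imputation: replace every NA with the observed mean
x_mean <- ifelse(is.na(x_mis), mean(x_mis, na.rm = TRUE), x_mis)

sd_observed <- sd(x_mis, na.rm = TRUE)  # spread of the values actually observed
sd_mean_imp <- sd(x_mean)               # spread after mean imputation

# The mean-imputed column looks less variable than the observed data
sd_mean_imp < sd_observed
```

Every imputed value sits exactly at the mean, so the filled-in column appears more certain than the data justify. Multiple imputation avoids this by producing several plausible values for each gap.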
MICE assumes that the missing data are Missing at Random (MAR), which means that
the probability that a value is missing depends only on the observed values and can be
predicted using them. It imputes data on a variable-by-variable basis by specifying an
imputation model per variable.
For example, suppose we have variables X1, X2, ..., Xk. If X1 has missing values, it
is regressed on the other variables X2 to Xk, and the missing values in X1 are then
replaced by the predicted values. Similarly, if X2 has missing values, then X1 and X3
to Xk are used as independent variables in the prediction model, and the missing
values in X2 are replaced with the predicted values.
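The idea can be sketched in base R as a single pass over the variables. This is a simplified illustration, not the mice algorithm itself: mice repeats such passes over several iterations and draws imputations with random variation rather than plugging in fitted values. The names dat, miss and filled, the seed and the number of removed cells are all made up for this sketch.

```r
# Simplified sketch of ONE pass of variable-by-variable imputation.
set.seed(42)                                   # arbitrary seed
dat <- iris[, 1:4]                             # four numeric columns play X1..Xk
dat$Sepal.Length[sample(nrow(dat), 15)] <- NA  # seed some gaps (arbitrary count)
dat$Sepal.Width[sample(nrow(dat), 15)]  <- NA

miss <- is.na(dat)                             # remember where the gaps were

# Start by filling every gap with the column mean
filled <- dat
for (v in names(filled)) {
  filled[[v]][miss[, v]] <- mean(dat[[v]], na.rm = TRUE)
}

# Regress each incomplete variable on all the others, then overwrite its
# originally missing cells with the model's predictions
for (v in names(filled)) {
  if (!any(miss[, v])) next                    # nothing to impute here
  fit <- lm(reformulate(setdiff(names(filled), v), response = v),
            data = filled[!miss[, v], ])
  filled[[v]][miss[, v]] <- predict(fit, newdata = filled[miss[, v], ])
}

colSums(is.na(filled))   # no missing values remain
```

Only the cells that were originally missing are overwritten. The observed values stay exactly as they were.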
#load data
> data <- iris
#Get summary
> summary(iris)
Since MICE assumes that values are missing at random, let’s seed missing values in our data
set using the prodNA function. You can access this function by installing the missForest
package.
I’ve removed the categorical variable; let’s focus on the continuous values here. To treat a
categorical variable, simply encode the levels and follow the procedure below.
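The two steps just described could look like the sketch below. The 10% missing rate and the seed are assumptions made for this example, so the counts you get may differ from the figures quoted later. Note that prodNA removes values completely at random, which is a special case of MAR.

```r
# install.packages("missForest")   # run once if the package is not installed
library(missForest)

set.seed(81)                           # arbitrary seed so the missing cells are reproducible
iris.mis <- prodNA(iris, noNA = 0.1)   # assumed rate: 10% of all cells set to NA

# Drop the categorical column and keep the four continuous measurements
iris.mis <- subset(iris.mis, select = -c(Species))
summary(iris.mis)
```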
#install MICE
> install.packages("mice")
> library(mice)
The mice package has a function called md.pattern(). It returns a table of the missing
values present in each variable in a data set.
> md.pattern(iris.mis)
Let’s understand this table. There are 98 observations with no missing values. There are
10 observations with missing values in Sepal.Length. Similarly, there are 13 missing
values in Sepal.Width, and so on.
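If you want to cross-check such counts without reading the pattern table, base R can give them directly. The sketch below builds its own stand-in iris.mis so that it runs on its own. The seed and the 15 missing values per column are arbitrary assumptions, so its numbers will not match the ones quoted above.

```r
# Stand-in for iris.mis so the snippet runs on its own (base R only)
set.seed(7)                                    # arbitrary seed
iris.mis <- iris[, 1:4]
for (v in names(iris.mis)) {
  iris.mis[[v]][sample(nrow(iris.mis), 15)] <- NA
}

n_complete <- sum(complete.cases(iris.mis))    # rows with no missing values
na_per_var <- colSums(is.na(iris.mis))         # missing values per variable

n_complete
na_per_var
```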
This looks ugly, right? We can also create a visual that represents the missing values. It
looks pretty cool too. Let’s check it out.
> install.packages("VIM")
> library(VIM)
> mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(iris.mis), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
Let’s quickly understand this. 67% of the observations in the data set have no missing
values. Petal.Length has 10% missing values, Petal.Width has 8%, and so on. You can
also look at the histogram, which clearly depicts the proportion of missing values in
each variable.
> imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 500)
> summary(imputed_Data)
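In the call above, m is the number of imputed data sets, maxit is the number of
iterations, and method = 'pmm' selects predictive mean matching.
Since there are 5 imputed data sets, you can select any of them using the complete()
function. If you wish to build models on all 5 data sets, you can do it in one go
using the with() command. You can then combine the results from these models and
obtain a consolidated output using the pool() command.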
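A minimal sketch of these three steps follows. The stand-in iris.mis, the reduced maxit, and the particular linear model are assumptions chosen only so that the example is quick and self-contained. Swap in your own data and formula.

```r
library(mice)

# Stand-in for iris.mis so the snippet runs on its own
set.seed(7)                                    # arbitrary seed
iris.mis <- iris[, 1:4]
for (v in names(iris.mis)) {
  iris.mis[[v]][sample(nrow(iris.mis), 15)] <- NA
}

# Fewer iterations than in the text, only to keep the example quick
imputed_Data <- mice(iris.mis, m = 5, maxit = 10, method = "pmm",
                     seed = 500, printFlag = FALSE)

# Pick one of the 5 completed data sets (here the 2nd)
completeData <- complete(imputed_Data, 2)

# Fit the same (illustrative) linear model on each of the 5 imputed data sets
fit <- with(data = imputed_Data,
            expr = lm(Sepal.Width ~ Sepal.Length + Petal.Width))

# Combine the 5 sets of estimates into one pooled result
combined <- pool(fit)
summary(combined)
```

Note that with() is given the imputed object (imputed_Data), not the original data frame with missing values. That is what lets it fit the model once per imputed data set.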
Please note that I’ve used the command above just for demonstration purposes. You can
replace the variable values at your end and try it.