I am happy to announce that my simputation package appeared on CRAN this weekend. This package aims to simplify missing value imputation. In particular, it offers standardized interfaces that
- make it easy to define both imputation method and imputation model;
- for multiple variables at once;
- while grouping data by categorical variables;
- all fitting in the magrittr not-a-pipeline.
A few examples
To start with an example, let us first create a data set with some missings.
dat <- iris
# empty a few fields
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
head(dat,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 NA 3.5 1.4 0.2 setosa
## 2 NA 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.6 NA 1.5 0.2 setosa
## 5 5.0 NA 1.4 0.2 setosa
## 6 5.4 NA 1.7 0.4 setosa
## 7 4.6 NA 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 <NA>
## 9 4.4 2.9 1.4 0.2 <NA>
## 10 4.9 3.1 1.5 0.1 <NA>
Below, we first impute Sepal.Width and Sepal.Length by regression on Petal.Width and Species. After this we impute Species using a decision tree model (CART), with every other variable as a predictor (including the ones just imputed).
library(magrittr) # load the %>% operator
library(simputation)
imputed <- dat %>%
  impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) %>%
  impute_cart(Species ~ .)
head(imputed,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.979844 3.500000 1.4 0.2 setosa
## 2 4.979844 3.000000 1.4 0.2 setosa
## 3 4.979844 3.409547 1.3 0.2 setosa
## 4 4.600000 3.409547 1.5 0.2 setosa
## 5 5.000000 3.409547 1.4 0.2 setosa
## 6 5.400000 3.561835 1.7 0.4 setosa
## 7 4.600000 3.485691 1.4 0.3 setosa
## 8 5.000000 3.400000 1.5 0.2 setosa
## 9 4.400000 2.900000 1.4 0.2 setosa
## 10 4.900000 3.100000 1.5 0.1 setosa
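A quick way to check what the chain above did is to count missing values per column before and after imputation. This is only a sketch that repeats the example data and uses base R's is.na and colSums for the check; nothing here goes beyond the calls already shown.

```r
library(magrittr)     # load the %>% operator
library(simputation)

# Same example data as above
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

imputed <- dat %>%
  impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) %>%
  impute_cart(Species ~ .)

# Number of missing values per column, before and after imputation
colSums(is.na(dat))
colSums(is.na(imputed))
```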
The package is fairly lenient when imputation fails. For example, if one of the predictors is missing, the field simply remains unimputed, and if one of the models cannot be fitted, only a warning is issued (not shown here).
dat %>% impute_lm(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.076579 3.5 1.4 0.2 setosa
## 2 4.675654 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
So here, the third Sepal.Length value could not be imputed since the predictor Sepal.Width is missing.
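Because unimputed fields simply stay NA, one way to deal with them is to chain a second, less demanding method after the first. The sketch below assumes a group-wise median (impute_median, with the grouping variable on the right-hand side of the formula) is an acceptable fallback. To keep the sketch simple, Species is left complete here, unlike in the data set above.

```r
library(magrittr)
library(simputation)

# Assumption for this sketch: only the two Sepal columns get missing values
dat2 <- iris
dat2[1:3, 1] <- dat2[3:7, 2] <- NA

# Step 1: regression; row 3 stays NA because its predictor Sepal.Width is missing
step1 <- dat2 %>% impute_lm(Sepal.Length ~ Sepal.Width + Species)
sum(is.na(step1$Sepal.Length))

# Step 2: fall back to the median within each Species for what is left
step2 <- step1 %>% impute_median(Sepal.Length ~ Species)
sum(is.na(step2$Sepal.Length))
```

Whether a median fallback is appropriate depends on the analysis; the point is only that imputation steps can be stacked in the same pipeline.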
It is possible to split data into groups before estimating the imputation model and predicting missing values. There are two ways. The first is to use the | operator to specify grouping variables.
# We first need to complete 'Species'. Here, we use sequential
# hot deck after sorting by Petal.Length
dat %<>% impute_shd(Species ~ Petal.Length)
# Now impute Sepal.Length by regressing on
# Sepal.Width, computing a model for each Species.
dat %>% impute_lm(Sepal.Length ~ Sepal.Width | Species) %>% head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.067813 3.5 1.4 0.2 setosa
## 2 4.725677 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
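For readers unfamiliar with sequential hot deck: the idea behind the impute_shd step above can be sketched in base R. This is not the package's implementation, only an illustration assuming last-observation-carried-forward after sorting by Petal.Length. The names ord, species_sorted and filled are made up for this sketch.

```r
dat <- iris
dat[8:10, 5] <- NA            # blank out a few Species values

# Sort positions by Petal.Length, then carry the last observed Species forward
ord <- order(dat$Petal.Length)
species_sorted <- as.character(dat$Species[ord])
for (i in seq_along(species_sorted)) {
  if (is.na(species_sorted[i]) && i > 1) {
    species_sorted[i] <- species_sorted[i - 1]
  }
}

# Write the filled values back in the original row order
filled <- dat
filled$Species[ord] <- factor(species_sorted, levels = levels(dat$Species))
filled[8:10, ]
```

In this data set the neighbours of the blanked rows in Petal.Length order are all setosa, so the donor values coincide with the original labels.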
The second way is to use the group_by command from dplyr:
dat %>% dplyr::group_by(Species) %>%
  impute_lm(Sepal.Length ~ Sepal.Width) %>%
  head(3)
## Source: local data frame [3 x 5]
## Groups: Species [1]
##
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.067813 3.5 1.4 0.2 setosa
## 2 4.725677 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
Note: by using group_by, we also transformed the data.frame to a tibble, which not only sounds funny when you pronounce it (tibble, TIBBLE, tibble? tibbebbebbebble) but is also pretty useful.
Supported methods and how to specify them
Currently, the package supports the following methods:
- Model based (optionally add [non-]parametric random residual)
  - linear regression
  - robust linear regression
  - CART models
  - Random forest
- Donor imputation (including various donor pool specifications)
  - k-nearest neighbour (based on Gower's distance)
  - sequential hotdeck (LOCF, NOCB)
  - random hotdeck
  - Predictive mean matching
- Other
  - (groupwise) median imputation (optional random residual)
  - Proxy imputation (copy from other variable)
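As an illustration of the optional random residual for model-based methods, the sketch below compares a deterministic regression imputation with two stochastic variants. It assumes the add_residual argument of impute_lm accepts "normal" (parametric) and "observed" (nonparametric, sampled from the model residuals); the seed and the object names det, rnd and obs are arbitrary choices for this example.

```r
library(magrittr)
library(simputation)

dat <- iris
dat[1:3, 1] <- NA

set.seed(1)  # arbitrary seed, only to make the random draws repeatable

# Deterministic: imputed values are the model predictions
det <- dat %>% impute_lm(Sepal.Length ~ Petal.Width + Species)

# Prediction plus a residual drawn from a normal distribution
rnd <- dat %>% impute_lm(Sepal.Length ~ Petal.Width + Species,
                         add_residual = "normal")

# Prediction plus a residual sampled from the observed model residuals
obs <- dat %>% impute_lm(Sepal.Length ~ Petal.Width + Species,
                         add_residual = "observed")

# Compare the three imputations for the first three records
cbind(deterministic = det$Sepal.Length[1:3],
      normal        = rnd$Sepal.Length[1:3],
      observed      = obs$Sepal.Length[1:3])
```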
Any call to one of the impute_ functions looks as follows:
impute_<method>(data, formula [, <method-specific options>])
and the formula always has the following form:
<imputed variables> ~ <model specification> [|<grouping variables>]
The parts in square brackets are optional.
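To make the three parts of the formula concrete, here is a small sketch that uses all of them in one call, written without the pipe to match the signature above. Species is left complete in this example so it can serve as the grouping variable.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# <imputed variables>   : Sepal.Length + Sepal.Width
# <model specification> : Petal.Width
# <grouping variables>  : Species (one model per variable per group)
out <- impute_lm(dat, Sepal.Length + Sepal.Width ~ Petal.Width | Species)
head(out, 7)
```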
Please see the package vignette for more examples and details, or ?simputation::impute_ for an overview of all imputation functions.
Happy imputing!
Another truly great and easy to use R-PKG from Mark.
Read this clear, brief tutorial ...very useful!
@sf99
Good article! I like the fact that you use magrittr syntax to simplify things. I remember trying to do econometric work in R and found doing things that are almost basic in STATA to be a nightmare in R, probably because of bad syntax. More packages like this are needed.
Cheers!
I've always loved R but over the last couple of years much of the plumbing became easier with the introduction of the tidyverse packages (dplyr, tidyr, reshape2, broom,...).
Hi Mark,
I'm getting quite excited about using your simputation package. Very very easy to use.
Now I have a question with the grouping. I'm having an error and I'm not sure why it keeps throwing it although I do have a factor variable.
I'm using the Pima Indian diabetes data set here.
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
library(readr)        # read_csv()
library(dplyr)        # group_by() and %>%
library(simputation)  # impute_lm()
df <- read_csv("dataset/diabetes.csv")
colnames(df) <- c("pregnant", "glucose", "diastolic", "triceps", "insulin", "bmi", "diabetes", "age", "test")
# The 0's are actually NAs
df2 <- df
df2$glucose[df2$glucose == 0] <- NA
df2$diastolic[df2$diastolic == 0] <- NA
df2$triceps[df2$triceps == 0] <- NA
df2$insulin[df2$insulin == 0] <- NA
df2$bmi[df2$bmi == 0] <- NA
df3 <- df2 %>% group_by(test) %>%
  impute_lm(glucose ~ .)
And I'm always getting this error:
Could not execute fun for 'glucose': contrasts can be applied only to factors with 2 or more levels
Could not execute fun for 'glucose': contrasts can be applied only to factors with 2 or more levels
The message is quite clear. But I do not understand why I get it as the test variable is of type factor with 2 levels.
Do you know what I might be missing or misunderstanding?
Thank you in advance and thanks again for sharing that wonderful package.
Hard to say w/o seeing the data, but try to exclude the grouping variable from the model:
df3 %>% group_by(test) %>% impute_lm(glucose ~ . - test)
cheers
-M
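A plausible explanation for the error, sketched in base R on iris rather than on the diabetes data: within a single group the grouping variable takes only one value, so a model formula with . pulls in a factor with one used level and lm refuses to build contrasts for it. This assumes the per-group fit behaves like a plain lm call on the group's rows; the names one_group, msg and fit are made up for the sketch.

```r
# Rows of a single group: Species has only one level in use here
one_group <- subset(iris, Species == "setosa")

# '.' includes Species as a predictor, which triggers the contrasts error
msg <- tryCatch(
  {
    lm(Sepal.Length ~ ., data = one_group)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Removing the grouping variable from the model avoids the problem
fit <- lm(Sepal.Length ~ . - Species, data = one_group)
coef(fit)
```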