Make your package spherical

Larger packages typically consist of functions that are visible to the users (exported functions) and functions that are used by the exported functions, but that are invisible to the user. For example:


# exported, user-visible function
inch2cm <- function(x){
  x*conversion_factor("inch")
}
# not exported function, package-internal
conversion_factor <- function(unit){
  confac <- c(inch=2.54, pound=1/2.2056)
  confac[unit]
}
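
If you generate the NAMESPACE with roxygen2 (an assumption here; a hand-written export() directive works just as well), the distinction is made by tagging only the user-visible function with @export:

#' Convert inches to centimeters
#' @export
inch2cm <- function(x){
  x*conversion_factor("inch")
}

# no @export tag, so this helper stays package-internal
conversion_factor <- function(unit){
  confac <- c(inch=2.54, pound=1/2.2056)
  confac[unit]
}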

We can think of the exported functions (or more correctly, the interface of the exported functions) as the surface of a package, and all the other functions as the volume. The surface is what a user sees, the volume is what the developer sees. The surface is how a user interacts with a package.

If the surface is small (few functions exported, no unnecessary parameters in the interface), users are limited in the ways they can interact with your package, and that means there is less to test. It also means that you, as a package developer, have more room to move and change things in the volume. So as a rule of thumb, it is a good idea to keep the surface small.

Since a sphere has the smallest surface-to-volume ratio possible, I refer to this rule of thumb as make your package spherical.

Note
This post was first published as a paragraph in the vignette of the tinytest package. I repeat it here with a few changes for more visibility.


Checking reverse dependencies: the tiny way

The tools package that comes with base R makes checking reverse dependencies super easy.

  1. Build your package tarball (the pkg_x.y.z.tar.gz file).

    R CMD build /your/package/location

It is a good idea to make sure that the tarball is in a dedicated directory, because the next step will download and install reverse dependencies in the directory where the tarball resides.

  2. In an R session, type

library(tools)
result <- check_packages_in_dir("/directory/containing/tarball",
                                reverse = list())

The result can be printed, summarized, and analyzed further if there is any breakage. Here is some example output from running this on my gower package today.


> result
Check results for packages in dir '/home/mark/projects/gower/output':
Package sources: 1, Reverse depends: 5
Use summary() for more information.
> summary(result)
Check results for packages in dir '/home/mark/projects/gower/output':

Check status summary:
                  ERROR NOTE OK
  Source packages     0    0  1
  Reverse depends     1    3  1

Check results summary:
gower ... OK
rdepends_ceterisParibus ... NOTE
* checking dependencies in R code ... NOTE
rdepends_lime ... ERROR
* checking tests ... ERROR
* checking re-building of vignette outputs ... WARNING
rdepends_live ... NOTE
* checking dependencies in R code ... NOTE
rdepends_recipes ... NOTE
* checking dependencies in R code ... NOTE
rdepends_simputation ... OK

(Checking the logs in output/rdepends_lime.Rcheck/00check.log shows that lime fails because of a missing Java engine [I just updated my OS and have no Java installed yet].)

Notes

  1. Checking reverse dependencies can be done in parallel by setting the Ncpus argument to a value larger than one (see the sketch below these notes).
  2. Be aware that the documentation (as of R 3.5.2) states: "This functionality is still experimental: interfaces may change in future versions." Nevertheless, it has worked fine for me so far.
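
For example, a parallel run could look like this (the path and the number of cores are placeholders; the Ncpus argument is documented in ?check_packages_in_dir):

library(tools)
# check the source package and its reverse dependencies on four cores
result <- check_packages_in_dir("/directory/containing/tarball",
                                reverse = list(), Ncpus = 4)
summary(result)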

gower 0.2.0 is on CRAN

A new version of R package gower has just been released on CRAN.

Thanks to our new contributor David Turner, who was kind enough to provide a pull request, gower now also computes weighted Gower distances.

From the NEWS file:

  • gower_dist and gower_topn gain weight argument for weighted matching (thanks to David Turner)
  • gower_dist and gower_topn gain ignore_case argument for automatic column matching.
  • gower_dist now returns numeric(0) invisibly when there are no columns to compare.
  • gower_topn now returns a list with empty matrices when there are no columns to compare.
  • gower_topn now warns when n>nrow(y) and sets n=nrow(y)
  • bugfix: comparing factors with characters would cause a crash (thanks to Max Kuhn)

In short, gower computes Gower's distance (or similarity) coefficient between records and finds the top-n matches between records. Core algorithms are executed in parallel on systems supporting OpenMP.
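
A minimal usage sketch (not from the original announcement; the data and calls are just illustrations, and the new weighting and case-matching options are documented in ?gower_dist):

library(gower)
# pairwise Gower distances between corresponding rows of two data frames
d <- gower_dist(iris[1:3, ], iris[4:6, ])
# top-5 closest matches in iris for each of the first three records
top <- gower_topn(iris[1:3, ], iris, n = 5)
top$index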


uRos2019: tutorials, keynote speakers, registration and call for papers!

The use of R in Official Statistics (uRos) conference is the event for all things R in the production and use of government statistics. Its 7th installment will take place from 20 to 21 May 2019 at the National Institute of Statistics in Bucharest, Romania.

Keynote Speakers

We are very proud to announce that we have two excellent keynote speakers.

  • Julie Josse will talk about her work on theory and tools related to imputation and inference in the presence of missing data.
  • Giulio Barcaroli will talk about 12 years of using R at ISTAT, the Italian Statistical Office.

Full abstracts can be found here.

Tutorials

The conference is preceded by three tutorials on Data Cleaning, Statistical Disclosure Control and Optimal Sampling Stratification.

Call for papers

Yes, abstracts and papers are welcome until 12 April 2019! You can contribute by sending an abstract on any of the following topics (relating to official statistics):

Sampling and estimation | R in organization | Data cleaning | R in production: data analysis | Methods for official statistics | Shiny applications | Time series | Report and GUI programming | R in production: automation | Big data | Dissemination and visualization

Registration is open

You can now register by following instructions here.


Add a static pdf vignette to an R package

Most vignettes are built when a package is built, but there are occasions where you just want to include a ready-made pdf, for example a paper. Of course there is a package supporting this, but in this post I will show you how to do it yourself with ease.

The idea is very simple: vignettes can be in LaTeX, and it is possible to include pdf documents in LaTeX using the pdfpages package. So here's the step-by-step recipe:

  1. If you do not already have it, create the vignettes folder in your package directory.
  2. Put your static pdf there. Let's call it mypaper.pdf for now.
  3. Create a .Rnw file with the following content.
\documentclass{article}
\usepackage{pdfpages}
%\VignetteIndexEntry{author2019mypaper}

\begin{document}
\includepdf[pages=-, fitpaper=true]{mypaper.pdf}
\end{document}

That's it.
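
After building and installing the package you can check that R picks up the static vignette; the package name and vignette topic below are placeholders:

# list the vignettes of the installed package
vignette(package = "yourpackage")
# open one by the base name of its .Rnw file
vignette("name-of-your-rnw-file", package = "yourpackage")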

Some notes.

  1. This repo contains an example.
  2. The option fitpaper=true is necessary because the Sweave style file that is included when the vignette is built somehow causes the pages to rescale otherwise.
  3. If you post your package to CRAN, mypaper.pdf will be deleted from the directory so it is not part of a binary download.
  4. You can include errata or other notes, for example as follows:
\documentclass{article}
\usepackage{pdfpages}
%\VignetteIndexEntry{author2019mypaper}

\begin{document}
\includepdf[pages=-, fitpaper=true]{mypaper.pdf}

\newpage{}
\subsection*{Errata}

A few things were borked in the original publication, here
is a list of sto0pid things I did:

\begin{itemize}
\item{fubar 1}
\item{fubar 2}
\end{itemize}

\end{document}

The program for uRos2018 is online

The uRos2018 conference is aimed at professionals and academics who are involved in producing or consuming official (government) statistics.

We are happy to announce that we recently posted the full program of the 6th international conference on the use of R in official Statistics (uRos2018) on our website.

In summary:

  • Six tutorials in the areas of
    • Data Cleaning
    • Network Analyses
    • Survey Estimation
    • Data manipulation with data.table
    • Analyzing spatial data
    • Visualizing spatial data.
  • Two keynote speakers:
    • Alina Matei, professor of statistics at the University of Neuchatel and maintainer of the sampling package.
    • Jeroen Ooms, R superprogrammer and maintainer of R and Rtools for Windows (UC Berkeley)
  • Eleven sessions of contributed talks, with five presentations each, from all over the world.
  • One session devoted to the results of a two-day unconf that is held prior to the conference.
  • One social dinner 🙂
  • Two journals will devote a special topic to the conference.

All the abstracts will be published online soon.

Registration is still open

  • You are welcome to register
  • Follow us on twitter for the latest news and updates!

stringdist 0.9.5.1: now with C API

Version 0.9.5.1 of stringdist is on CRAN. The main new feature, with a huge thanks to our awesome new contributor Chris Muir, is that we made it easy to call stringdist functionality from your package's C or C++ code.

The main steps to get it done are:

  1. Make sure to add stringdist to the Imports: and LinkingTo: fields in your DESCRIPTION file.
  2. Add #include <stringdist_api.h> to your C/C++ source file.
  3. Start using stringdist from C!

Here's an example source file

#include <R.h>
#include <Rdefines.h>
#include <stringdist_api.h>


SEXP my_soundex(SEXP strings, SEXP useBytes){
  Rprintf("\nWow, using 'stringdist' soundex encoding, from my own C code!\n");
  return sd_soundex(strings, useBytes);
}
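
On the R side, such a routine is typically exposed through .Call. Here is a hypothetical sketch, assuming the C function above is compiled into a package called yourpkg and registered via useDynLib (both names are placeholders):

# R wrapper around the C routine defined above
my_soundex <- function(x) {
  .Call("my_soundex", as.character(x), FALSE, PACKAGE = "yourpkg")
}
my_soundex(c("Euler", "Gauss"))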

Great! How can I learn more?

  • The full API is described in a pdf file, generated with doxygen, that comes with the package. You can find it by typing ?stringdist_api on the R command line.
  • A minimal example package that links to stringdist is available on GitHub
  • A more sophisticated package with more elaborate examples can be found here: refinr (By Chris)

Any other news?

A few fixes, and a couple of long-deprecated function arguments have finally been removed. Check out the NEWS file on CRAN for a complete overview.

Happy coding!


The use of R in official statistics conference 2018

On September 12-14 the 6th international conference on the use of R in official statistics (#uRos2018) will take place at the Dutch National Statistical Office in Den Haag, the Netherlands. The conference is aimed at producers and users of official statistics from government, academia, and industry. The conference is modeled after the useR! conference and will consist of one day of tutorials (12th September 2018) followed by two days of conference (13, 14 September 2018). Topics include:

  • Examples of applying R in statistical production.
  • Examples of applying R in dissemination of statistics (visualisation, apps, reporting).
  • Analyses of big data and/or application of machine learning for official statistics.
  • Implementations of statistical methodology in the areas of sampling, editing, modelling and estimation, or disclosure control.
  • R packages connecting R to other standard tools/technical standards.
  • Organisational and technical aspects of introducing R to the statistical office.
  • Teaching R to users in the office.
  • Examples of accessing or using official statistics publications with R in other fields.

Keynote speakers

We are very happy to announce that we confirmed two fantastic keynote speakers.

  • Alina Matei is a professor of statistics at the University of Neuchatel and maintainer of the important sampling package.
  • Jeroen Ooms is a postdoc at UC Berkeley, author of many infrastructural R packages and maintainer of R and Rtools for Windows.

Call for abstracts

The call for abstracts is open until 31 May. You can contribute to the conference by proposing a 20-minute talk, or a 3-hour tutorial. Also, authors have the opportunity to submit a paper for one of the two journals that will devote a special issue to the conference. Read all about it over here.

Pointers

  • conference website
  • Follow uRos2018 on twitter


    Track changes in data with the lumberjack %>>%

    So you are using the %>% pipeline to have your data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:

    > data(retailers, package="validate")
    > head(retailers, 3)
      size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
    1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
    2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
    3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA
    

    This data is dirty with missings and full of errors. Let us do some imputations with simputation.

    > out <- retailers %>% 
    +   impute_lm(other.rev ~ turnover) %>%
    +   impute_median(other.rev ~ size)
    > 
    > head(out,3)
      size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
    1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
    2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
    3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
    > 
    

    Ok, cool, we know all that. But what if you'd like to know what value was imputed with which method? That's where the lumberjack comes in.

    The lumberjack operator is a `pipe'[1] operator that allows you to track changes in data.

    > library(lumberjack)
    > retailers$id <- seq_len(nrow(retailers))
    > out <- retailers %>>% 
    +   start_log(log=cellwise$new(key="id")) %>>%
    +   impute_lm(other.rev ~ turnover) %>>%
    +   impute_median(other.rev ~ size) %>>%
    +   dump_log(stop=TRUE)
    Dumped a log at cellwise.csv
    > 
    > read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
      step                     time                      expression key  variable old      new
    1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
    2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
    3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
    > 
    

    So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls to the data pipeline. (To be sure: this works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.

    There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.

    If this post got you interested, please install the package using

    install.packages('lumberjack')
    

    You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.

    As always, I am open to suggestions and comments, for example through the package's GitHub page.

    Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.

    And finally, here's a picture of a lumberjack smoking a pipe.

    [1] It really should be called a function composition operator, but potatoes/potahtoes.


    Announcing the simputation package: make imputation simple

    I am happy to announce that my simputation package has appeared on CRAN this weekend. This package aims to simplify missing value imputation. In particular it offers standardized interfaces that

    • make it easy to define both imputation method and imputation model;
    • for multiple variables at once;
    • while grouping data by categorical variables;
    • all fitting in the magrittr not-a-pipeline.

    A few examples

    To start with an example, let us first create a data set with some missings.

    dat <- iris
    # empty a few fields
    dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
    head(dat,10)
    
    ##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1            NA         3.5          1.4         0.2  setosa
    ## 2            NA         3.0          1.4         0.2  setosa
    ## 3            NA          NA          1.3         0.2  setosa
    ## 4           4.6          NA          1.5         0.2  setosa
    ## 5           5.0          NA          1.4         0.2  setosa
    ## 6           5.4          NA          1.7         0.4  setosa
    ## 7           4.6          NA          1.4         0.3  setosa
    ## 8           5.0         3.4          1.5         0.2    <NA>
    ## 9           4.4         2.9          1.4         0.2    <NA>
    ## 10          4.9         3.1          1.5         0.1    <NA>
    

    Below, we first impute Sepal.Width and Sepal.Length by regression on Petal.Width and Species. After this we impute Species using a decision tree model (CART) using every other variable as a predictor (including the ones just imputed).

    library(magrittr)    # load the %>% operator
    library(simputation) 
    imputed <- dat %>% 
      impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) %>%
      impute_cart(Species ~ .)
    head(imputed,10)
    
    ##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1      4.979844    3.500000          1.4         0.2  setosa
    ## 2      4.979844    3.000000          1.4         0.2  setosa
    ## 3      4.979844    3.409547          1.3         0.2  setosa
    ## 4      4.600000    3.409547          1.5         0.2  setosa
    ## 5      5.000000    3.409547          1.4         0.2  setosa
    ## 6      5.400000    3.561835          1.7         0.4  setosa
    ## 7      4.600000    3.485691          1.4         0.3  setosa
    ## 8      5.000000    3.400000          1.5         0.2  setosa
    ## 9      4.400000    2.900000          1.4         0.2  setosa
    ## 10     4.900000    3.100000          1.5         0.1  setosa
    

    The package is pretty lenient against failure of imputation. For example, if one of the predictors is missing, fields just remain unimputed, and if one of the models cannot be fitted, only a warning is issued (not shown here).

    dat %>% impute_lm(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
    
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1     5.076579         3.5          1.4         0.2  setosa
    ## 2     4.675654         3.0          1.4         0.2  setosa
    ## 3           NA          NA          1.3         0.2  setosa
    

    So here, the third Sepal.Length value could not be imputed since the predictor Sepal.Width is missing.

    It is possible to split data into groups before estimating the imputation model and predicting missing values. There are two ways. The first is to use the | operator to specify grouping variables.

    # We first need to complete 'Species'. Here, we use sequential 
    # hot deck after sorting by Petal.Length
    dat %<>% impute_shd(Species ~ Petal.Length) 
    # Now impute Sepal.Length by regressing on 
    # Sepal.Width, computing a model for each Species.
    dat %>% impute_lm(Sepal.Length ~ Sepal.Width | Species) %>% head(3)
    
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1     5.067813         3.5          1.4         0.2  setosa
    ## 2     4.725677         3.0          1.4         0.2  setosa
    ## 3           NA          NA          1.3         0.2  setosa
    

    The second way is to use the group_by function from dplyr:

    dat %>% dplyr::group_by(Species) %>% 
        impute_lm(Sepal.Length ~ Sepal.Width) %>% 
        head(3)
    
    ## Source: local data frame [3 x 5]
    ## Groups: Species [1]
    ## 
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
    ## 1     5.067813         3.5          1.4         0.2  setosa
    ## 2     4.725677         3.0          1.4         0.2  setosa
    ## 3           NA          NA          1.3         0.2  setosa
    

    Note: by using group_by, we also transformed the data.frame to a tibble, which not only sounds funny when you pronounce it (tibble, TIBBLE, tibble? tibbebbebbebble) but is also pretty useful.

    Supported methods and how to specify them

    Currently, the package supports the following methods:

    • Model based (optionally add [non-]parametric random residual)
      • linear regression
      • robust linear regression
      • CART models
      • Random forest
    • Donor imputation (including various donor pool specifications)
      • k-nearest neighbour (based on Gower's distance)
      • sequential hotdeck (LOCF, NOCB)
      • random hotdeck
      • Predictive mean matching
    • Other
      • (groupwise) median imputation (optional random residual)
      • Proxy imputation (copy from other variable)

    Any call to one of the impute_ functions looks as follows:

    impute_<method>(data, formula [, <method-specific options>])
    

    and the formula always has the following form:

    <imputed variables> ~ <model specification> [|<grouping variables>]
    

    The parts in square brackets are optional.
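
    For instance, one of the method-specific options for the model-based methods is adding a random residual to the predictions. A small sketch, continuing with dat from above (the argument name add_residual is my recollection of the package documentation; please verify with ?impute_lm):

    library(simputation)   # already loaded above
    library(magrittr)      # for %>%
    # impute Sepal.Length per Species, adding a normally distributed residual
    dat %>%
      impute_lm(Sepal.Length ~ Sepal.Width | Species, add_residual = "normal") %>%
      head(3)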

    Please see the package vignette for more examples and details, or ?simputation::impute_ for an overview of all imputation functions.

    Happy imputing!
