stringdist 0.9.6 on CRAN: new features

stringdist version 0.9.6 arrived on CRAN on 16 July 2020.

This release brings a few new features.

Fuzzy text search

Search text for approximate matches of a search string using any stringdist distance. There are several functions that allow you to

  • detect whether there is a match within a certain maximum distance
  • return the position of the first best match
  • return the best match.

There are several interfaces for this. Functions grab and grabl work like base grep and grepl. The function extract has output similar to stringr::str_extract. The workhorse function is called afind (approximate find), which returns all results for multiple search patterns.
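As a minimal sketch of how these interfaces fit together (the example texts and search patterns are made up for illustration, and the argument names follow the stringdist documentation as I understand it):

```r
library(stringdist)

# Made-up example texts to search in
texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")

# grepl-like: does a text contain a substring within distance 1 of "grew"?
hit <- grabl(texts, "grew", maxDist = 1)

# grep-like: indices of the texts that contain such a substring
idx <- grab(texts, "grew", maxDist = 1)

# str_extract-like: the best matching substring, or NA when nothing is
# found within the maximum distance
best <- extract(texts, "harvested", maxDist = 3)

# workhorse: location, distance and matched text for every
# text/pattern combination
res <- afind(texts, c("fish", "gone", "to be"),
             method = "running_cosine", q = 3)
str(res)
```

Any stringdist distance can be selected through the method argument; 'running_cosine' is used here only because it was written with this kind of search in mind.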

There is also a new implementation of the popular 'cosine' distance that I developed especially for this purpose. It is called 'running_cosine' and it avoids the double work that the standard 'cosine' method would otherwise do. The result is a much faster implementation (up to about 100 times faster).

String similarity matrices

Thanks to a PR by Johannes Gruber, stringdist now has a function to compute string similarity matrices: stringsimmatrix.
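A small sketch of what a call might look like, assuming stringsimmatrix takes two character vectors and a method, analogous to stringdistmatrix (the input strings are made up):

```r
library(stringdist)

# Made-up input strings
a <- c("hello", "hallo", "world")
b <- c("hello", "word")

# One row per element of a, one column per element of b;
# similarities are scaled to lie between 0 and 1
S <- stringsimmatrix(a, b, method = "osa")
S
```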


gower 1.2.2 is on CRAN

R package gower was accepted on CRAN on 23 June 2020.

This release fixes a bug in an edge case that occurred with a small number of records and a large number of threads. Thanks to Max Austin for kindly reporting the issue.

Here's a picture of John C. Gower: the man who came up with this famous (dis)similarity measure.

John C. Gower


tinytest 1.2.2 is on CRAN

The tinytest package was accepted on CRAN on 18 June.

This release improves the build_install_test() function with two extra arguments that were previously missing, as reported by Patrick Breheny.

Meanwhile, usage of the package continues to grow steadily. At the time of writing, 72 packages are using tinytest.

growth of tinytest


lumberjack 1.2.0 is on CRAN and paper accepted by JSS

lumberjack is a package that allows you to track (log) changes in data while an R script is running. This allows you to detect exactly which code had what effect on your data.

The only thing you have to do is

  1. add one line of code at the top of your script, for example:

start_log(mydata, expression_logger$new(m = mean(salary, na.rm=TRUE)))

to follow the mean of variable salary in mydata as mydata gets processed one line of code at a time.

  2. Run your script with

and your script will run as usual, except that the mean salary is tracked across the run, and (in this case) automatically written to a file. The package is extensible so you can use one of the built-in loggers, or write your own.
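The two steps above can be sketched end to end as follows. This is an illustration under assumptions: the script contents and the salary data are made up, and it assumes that lumberjack's run_file() is the function used to run a script with logging and that it invisibly returns the environment in which the script was evaluated.

```r
library(lumberjack)

# A hypothetical script, written to a temporary file for this sketch.
script <- tempfile(fileext = ".R")
writeLines(c(
  "mydata <- data.frame(salary = c(1000, NA, 3000))",
  "start_log(mydata, expression_logger$new(m = mean(salary, na.rm = TRUE)))",
  "mydata$salary[is.na(mydata$salary)] <- 0",
  "mydata$salary <- mydata$salary * 2"
), script)

# Work from a scratch directory so that a log file written during
# the run does not end up in the project directory.
old <- setwd(tempdir())
env <- run_file(script)
setwd(old)

# The processed data lives in the environment the script ran in
env$mydata
```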

If you want to know more, I highly recommend reading through my short paper on lumberjack that was recently accepted by the Journal of Statistical Software.

New release

Version 1.2.0 was accepted by CRAN on 8 May 2020. There are a few new features and fixes, some of them suggested by one of the JSS reviewers.

  • The most important change is that loggers now not only know what expression is running, but also from which file it originated, and from which line. This means that loggers can give even more detailed information on when a script did what to data.
  • The JSS paper is now included as a vignette.


Here's a picture of a man, trying hard to look like a lumberjack while staring into the void.

I'm a lumberjack and I'm ok


tinytest 1.2.0 is on CRAN

tinytest is a relatively new, light-weight (no-dependency) but full-featured unit testing framework for R. It is currently used by 60+ packages, including the famous Rcpp package.

onwards to the tinyverse

The latest version of tinytest was accepted by CRAN on 5 May 2020. This is a minor release with some fixes and a few new features.

New features

tinytest now gives you some extra control over the environment the tests are run in. The workhorse function run_test_file now has an argument set_env, so you can do

run_test_file("test_mypkg.R", set_env=list(LC_TIME = "nl_NL.utf8", FOO="bar"))

and your time locale settings will be Dutch during the time of your test run (and who wouldn't want that?). Also your FOO will be bar, but only during the test run. All functions that rely on run_test_file get the same argument via the ... argument.

You can now compare output to stored output using expect_equal_to_reference() and expect_equivalent_to_reference(). These compare a value to a value stored in an RDS file.

There's a new function expect_stdout() that catches everything that is printed or cat'ed to the terminal. The expect_message() function used to do this, but this task is now split over two functions, making the interface somewhat more consistent.

The new function get_call_wd() can be used from within a test file. It returns the working directory that was active when the test sequence was invoked (e.g. by test_all()). Useful because tinytest temporarily and safely switches working directory to the location of the test file (and returns afterwards).

The condition-catching functions expect_error, expect_warning, and expect_message gain a class argument that can be used to check whether the signaled condition inherits from a certain class.
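A short sketch of how the split between expect_stdout and expect_message, and the new class argument, might be used. The condition class my_error is a hypothetical name invented for this example.

```r
library(tinytest)

# Printed or cat'ed output is caught by expect_stdout()
r1 <- expect_stdout(cat("hello world\n"), pattern = "hello")

# A proper 'message' condition is caught by expect_message()
r2 <- expect_message(message("careful now"), pattern = "careful")

# Output sent to stdout no longer counts as a message, so this test fails
r3 <- expect_message(cat("hello world\n"))

# A hypothetical custom condition, to illustrate the 'class' argument
my_error <- structure(
  class = c("my_error", "error", "condition"),
  list(message = "something broke", call = NULL)
)
r4 <- expect_error(stop(my_error), class = "my_error")
```

Each expectation returns a tinytest object that behaves like a logical, so the results can be inspected directly when trying things out interactively.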

Updates, fixes

  • Argument 'tol' now renamed 'tolerance'. Also removed internal reliance on
    partial argument matching (Thanks to Michel Lang).
  • Updated documentation on how to extend tinytest with a new recommendation.
  • Using tinytest vignette gains section on testing internal functions.
  • Breaking: function 'expect_message' no longer intercepts messages sent
    to stdout (e.g. via 'print' or 'cat'), but only messages sent as a 'message'
    condition (Thanks to Michel Lang for pointing this out).
  • Fix: 'test_package' would return NULL when called interactively and the package
    contained a failing test.


I'm truly grateful for the contributions by people who have helped me improve tinytest by providing suggestions, bug reports, and comments. For this release, I am especially grateful to Michel Lang, Maja Zaloznik and Jon Clayden.


validate 0.9.3 is on CRAN

CRAN just accepted the latest version of our R package validate.

The validate package provides an infrastructure to perform any data quality check in a flexible and extensible way.

This is a minor update with the following new features:

  • New functions exists_any and exists_one to help define cross-record validation rules (thanks to David Salgado)
  • results of sort and aggregate now include key columns (if any)
  • Added JSS paper and CITATION file.
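As an illustration of the cross-record rules, here is a minimal sketch that checks whether each household has a head of household. The data set and the rule names are made up, and the by argument is used as I understand it from the validate documentation.

```r
library(validate)

# Made-up data: one record per person, 'h' marks a head of household
households <- data.frame(
  hhid   = c(1, 1, 2, 1, 2, 2, 3),
  person = c(1, 2, 3, 4, 5, 6, 7),
  hhrole = c("h", "h", "m", "m", "h", "m", "m")
)

rules <- validator(
  # exactly one head per household
  one_head = exists_one(hhrole == "h", by = hhid),
  # at least one head per household
  any_head = exists_any(hhrole == "h", by = hhid)
)

out <- confront(households, rules)

# One row per record, one column per rule
values(out)
```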

We are also very happy to report that our paper on validate has been accepted by the Journal of Statistical Software. It will take a while before it is published but a preprint was added as a vignette.

  • M. van der Loo and E. de Jonge, "Data validation infrastructure for R," Journal of Statistical Software, accepted for publication, 2019.

darth validator


Call for abstracts and tutorials: use of R in official statistics 2020 in Vienna


The eighth international conference on the Use of R in Official Statistics (#uRos2020) will take place from 6 to 8 May 2020 at Statistics Austria, the Austrian office of National Statistics.

The meeting in a nutshell

  • 4-5 May: unconfUROS hackathon and General R tutorials for beginners
  • 6 May: Specific R tutorials
  • 7-8 May: uRos2020 Conference

The call for papers is still open.

Keynote speakers: R Core is coming to uRos!

We are very excited to announce that Isabel Molina Peralta and Matthias Templ will give keynotes at uRos2020.

Moreover we are proud to announce that a member of the R core team will join uRos2020 as well! Stay tuned for further announcements.



lintools 0.1.3 is on CRAN

Version 0.1.3 of the lintools package was accepted on CRAN today.

This version brings a few internal improvements and switches the testing suite to the tinytest test infrastructure.

lintools provides basic manipulations of linear systems of equalities and inequalities, including: variable elimination (Gaussian elimination, Fourier-Motzkin elimination), Moore-Penrose pseudoinverse, reduction to reduced row echelon form, value substitution, projecting a vector on the convex polytope described by a system of (in)equations, simplifying systems by removing spurious columns and rows and collapsing implied equalities, testing whether a matrix is totally unimodular, and computing variable ranges implied by linear (in)equalities.
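To give a flavor of two of these operations, here is a small sketch using pinv and is_totally_unimodular. The system of equations is made up, and the function names are used as I understand them from the lintools documentation.

```r
library(lintools)

# A made-up system of two equalities in three variables: A x = b
A <- matrix(c(1, 1, 0,
              0, 1, 1), nrow = 2, byrow = TRUE)
b <- c(1, 2)

# Moore-Penrose pseudoinverse of the coefficient matrix
Aplus <- pinv(A)

# A has full row rank, so this x solves A x = b
# (it is the solution with the smallest Euclidean norm)
x <- Aplus %*% b

# Test whether the coefficient matrix is totally unimodular
tu <- is_totally_unimodular(A)
```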


Make your package spherical

Larger packages typically consist of functions that are visible to the users (exported functions) and functions that are used by the exported functions, but that are invisible to the user. For example:

# exported, user-visible function
inch2cm <- function(x){
  x*conversion_factor("inch")
}
# not exported function, package-internal
conversion_factor <- function(unit){
  confac <- c(inch=2.54, pound=1/2.2056)
  confac[unit]
}

We can think of the exported functions (or more correctly, the interface of the exported functions) as the surface of a package, and all the other functions as the volume. The surface is what a user sees, the volume is what the developer sees. The surface is how a user interacts with a package.

If the surface is small (few functions exported, no unnecessary parameters in the interface), users are limited in the ways they can interact with your package, and that means there is less to test. It also means that you, as a package developer, have more room to move and change things in the volume. So as a rule of thumb, it is a good idea to keep the surface small.

Since a sphere has the smallest surface-to-volume ratio possible, I refer to this rule of thumb as make your package spherical.
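The idea can be imitated outside a package with a closure. This is only a toy sketch: make_pkg is a hypothetical stand-in for a package namespace, not how R packages actually export functions.

```r
# A toy 'package': only inch2cm is handed to the user (the surface),
# conversion_factor stays hidden inside (the volume).
make_pkg <- function(){
  conversion_factor <- function(unit){
    confac <- c(inch = 2.54, pound = 1/2.2056)
    unname(confac[unit])
  }
  inch2cm <- function(x) x * conversion_factor("inch")
  list(inch2cm = inch2cm)
}

pkg <- make_pkg()

# Users (and tests) only ever touch the surface
pkg$inch2cm(10)

# The internal helper can be renamed or rewritten freely,
# since nothing outside make_pkg can call it.
names(pkg)
```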

This post was first published as a paragraph in the vignette of the tinytest package. I repeat it here with a few changes for more visibility.


Checking reverse dependencies: the tiny way

The tools package that comes with base R makes checking reverse dependencies super easy.

  1. Build your package tarball (the pkg_x.y.z.tar.gz file).

    R CMD build /your/package/location

It is a good idea to make sure that the tarball is in a dedicated directory, because the next step will download and install reverse dependencies in the directory where the tarball resides.

  2. In an R terminal type

result <- tools::check_packages_in_dir("/directory/containing/tarball"
                    , reverse = list() )

The result can be printed and summarized and analyzed further if there is any breakage. Here's an example of output when I ran this on my gower package today.

> result
Check results for packages in dir '/home/mark/projects/gower/output':
Package sources: 1, Reverse depends: 5
Use summary() for more information.
> summary(result)
Check results for packages in dir '/home/mark/projects/gower/output':

Check status summary:
                  ERROR NOTE OK
  Source packages     0    0  1
  Reverse depends     1    3  1

Check results summary:
gower ... OK
rdepends_ceterisParibus ... NOTE
* checking dependencies in R code ... NOTE
rdepends_lime ... ERROR
* checking tests ... ERROR
* checking re-building of vignette outputs ... WARNING
rdepends_live ... NOTE
* checking dependencies in R code ... NOTE
rdepends_recipes ... NOTE
* checking dependencies in R code ... NOTE
rdepends_simputation ... OK

(Checking the logs in output/rdepends_lime.Rcheck/00check.log shows that lime fails because of a missing Java engine [I just updated my OS and have no Java installed yet].)


  1. Checking reverse dependencies can be done in parallel by setting the Ncpus argument larger than one.
  2. Be aware that the documentation (R 3.5.2) states: "This functionality is still experimental: interfaces may change in future versions." Nevertheless, it has worked fine for me so far.
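A small sketch of a parallel run, wrapped in a helper so that nothing is downloaded until you call it. The function name check_revdeps and its ncpus parameter are hypothetical; only tools::check_packages_in_dir and its reverse and Ncpus arguments come from base R.

```r
# Hypothetical helper: check the tarball in 'dir' together with its
# reverse dependencies, using 'ncpus' parallel processes.
check_revdeps <- function(dir, ncpus = 2L){
  result <- tools::check_packages_in_dir(dir
              , reverse = list()
              , Ncpus   = ncpus)
  print(summary(result))
  invisible(result)
}

# Not run here: this downloads, installs and checks packages.
# result <- check_revdeps("/directory/containing/tarball", ncpus = 4L)
```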