tinytest 1.2.4 is on CRAN

Version 1.2.4 of tinytest arrived on CRAN on 17 December 2020.

This release has a couple of new features.

  • You can specify the library location in test_package with lib.loc.
  • New function expect_inherits() to check the class of an object.
  • Printing of file exit message is now shorter and on same line as
    test report.
  • Duration per file is now reported, total duration is stored.
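
As a minimal sketch of the new expectation: expect_inherits() takes an object and a class name. The objects tested below are made up for illustration.

```r
library(tinytest)

# A data.frame inherits from class "data.frame", so this expectation passes.
ok <- expect_inherits(data.frame(x = 1:3), "data.frame")

# An integer vector does not inherit from "character", so this one fails.
bad <- expect_inherits(1:3, "character")

print(ok)
print(bad)
```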

For all features and fixes, see NEWS.

I owe a big thanks to Dirk Eddelbuettel and Sebastian Meyer for numerous suggestions and feedback while developing the new features.

Test durations are printed in two colors: anything measured in ms is printed light blue, test durations lasting longer than that are printed in dark blue. Here's an example testing RcppArmadillo on five cores.

screenshot

Currently 127 packages are suggesting tinytest.

growth of tinytest usage

Posted in programming, R | Tagged , , | Leave a comment

How to set library path on a {parallel} R cluster

In R you can add extra library locations (directories where your packages are installed) with the .libPaths() function. For example, to add "~/my/lib", you can do

libs <- c("~/my/lib", .libPaths())
.libPaths(new = libs)

If you want to set library locations for all workers in a cluster using the parallel package, the intuitive way of doing this is as follows.

libs <- c("~/my/lib", .libPaths())
cluster <- parallel::makeCluster(2)
parallel::clusterCall(cluster, .libPaths, new=libs)

However, this does not work. I have not spent any time figuring out why, but presumably the side effect caused by .libPaths() is sent to the wrong place. Here are the internals of .libPaths().

> .libPaths
function (new) 
{
    if (!missing(new)) {
        new <- Sys.glob(path.expand(new))
        paths <- c(new, .Library.site, .Library)
        paths <- paths[dir.exists(paths)]
        .lib.loc <<- unique(normalizePath(paths, "/"))
    }
    else .lib.loc
}

The side effect is where .lib.loc is altered.

In any case, the following approach does work. We export the libs variable to the workers and then set the library paths with .libPaths() using clusterEvalQ().

library(parallel)

e <- new.env()
e$libs <- c("~/my/lib", .libPaths())

cluster <- makeCluster(2)
clusterExport(cluster, "libs", envir=e)
clusterEvalQ(cluster, .libPaths(libs))

Update (2020-12-23): I posted a question about this on the R-devel mailing list, and Luke Tierney was kind enough to explain what is happening here. He also provided a simpler workaround, namely passing .libPaths as a string.

clusterCall(cluster, ".libPaths", new=libs)
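
To check that the workers really picked up the new location, you can ask each worker for its library paths afterwards. The sketch below assumes a throwaway library directory created under tempdir() (a stand-in for "~/my/lib"); the names lib_dir and worker_paths are only used for illustration.

```r
library(parallel)

# Hypothetical library directory, created in a temporary location.
lib_dir <- file.path(tempdir(), "mylib")
dir.create(lib_dir, showWarnings = FALSE)
libs <- c(lib_dir, .libPaths())

cluster <- makeCluster(2)

# Pass the function name as a string, so it is looked up on each worker.
invisible(clusterCall(cluster, ".libPaths", new = libs))

# Ask every worker which library paths are now active.
worker_paths <- clusterEvalQ(cluster, .libPaths())
stopCluster(cluster)

print(worker_paths[[1]])
```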

Author with affiliation in bookdown: HTML and pdf

If you are using Yihui Xie's bookdown package, it is quite easy to add authors and affiliations for HTML output. You only need to put the following pandoc options in the header of index.Rmd.

author:
- name: John Doe
  email: me@johndoe.com
  affiliation: A really important institute

However, this will destroy the rendering of author in pdf output. In fact there is an issue in the bookdown repo exactly about this. The only solution mentioned there is to create your own pandoc template, but I found a somewhat simpler solution. Since the issue is closed, I will post my solution here.

In your YAML header you can add a header-includes: option where you can put raw input for the latex preamble. Here's what I use in my Data Validation Cookbook:

header-includes:
- |
  ```{=latex}
  \usepackage{authblk}
  \author{Mark P.J. van der Loo\\ mpj.vanderloo@cbs.nl}
  \affil{Statistics Netherlands \\
         \href{https://www.cbs.nl}{\texttt{https://www.cbs.nl}} 
        }
  ```

Here, the authblk LaTeX package takes care of typesetting the author and affiliations. I am not sure if this will work with multiple authors and multiple affiliations.

As a reference: the complete YAML header.


validate 1.0.1: new features and a cookbook

Version 1.0.1 of our validate package arrived on CRAN on 2020-12-08. At the same time, a complete Data Validation Cookbook has been published online, and is also included with the package as a vignette.

The new features of validate 1.0.1 include

  • Improved visualisation for data validation results
  • Easier to select data that passes or fails at least one data validation rule
  • Improved support for data in long format, including
    • Checking aggregations over hierarchical classifications
    • Checking linear sequences, such as time series
    • Checking generic part-whole relations
    • Checking for presence or absence of whole records
    • Checking for presence of forbidden value combinations
  • Support for checking field formats and value ranges
    • Checking numeric formats
    • Checking field lengths
    • Checking numeric ranges
  • Support for globbing and regular expressions.
  • Several new example data sets
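
As a rough illustration of the range and field-length checks, and of selecting failing records, here is a small sketch. The data set dat and its columns are invented for this example; in_range(), field_length() and violating() are functions from validate.

```r
library(validate)

# Made-up data: one implausible age and one id that is too long.
dat <- data.frame(
  id  = c("A1", "A2", "A33"),
  age = c(25, 130, 40),
  stringsAsFactors = FALSE
)

rules <- validator(
  in_range(age, min = 0, max = 120),  # numeric range check
  field_length(id, n = 2)             # field length check
)

out <- confront(dat, rules)
print(summary(out))

# Records that fail at least one rule.
print(violating(dat, out))
```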

As mentioned, a brand new cookbook was added that not only teaches you how to use validate, but also demonstrates many examples of common data validation tasks and how they can be performed with validate. The cookbook replaces two old vignettes and can be opened from the command line with:

vignette("cookbook", package="validate")

For a full list of changes, please have a look at the release notes in the NEWS file.

For suggestions, questions, or bug reports, please create an issue at GitHub.


stringdist 0.9.6 on CRAN: new features

stringdist version 0.9.6 arrived on CRAN on 16 July 2020.

This release brings a few new features.

Fuzzy text search

Search text for approximate matches of a search string using any stringdist distance. There are several functions that allow you to

  • detect whether there is a match within a certain maximum distance
  • return the position of the first best match
  • return the best match.

There are several interfaces for this. Functions grab and grabl work like base grep and grepl. The function extract has output similar to stringr::str_extract. The workhorse function is called afind (approximate find), which returns all results for multiple search patterns.

There is also a new implementation of the popular 'cosine' distance that I developed especially for this purpose. It is called 'running_cosine' and it avoids double work otherwise done by the standard 'cosine' method. The result is a much faster implementation (up to about 100 times faster).
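
A short sketch of how these functions can be called. The texts and patterns are made up for illustration, and the maxDist and q values are arbitrary choices.

```r
library(stringdist)

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")
patterns <- c("fish", "gone", "to be")

# Workhorse: best match of every pattern in every text,
# here with the running cosine distance on 3-grams.
res <- afind(texts, patterns, method = "running_cosine", q = 3)
print(res$location)
print(res$match)

# grepl-like interface: is "grew" found within distance 1?
hits <- grabl(texts, "grew", maxDist = 1)
print(hits)

# Return the best-matching substring, if it is within distance 3.
print(extract(texts, "harvested", maxDist = 3))
```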

String similarity matrices

Thanks to a PR by Johannes Gruber, stringdist now has a function to compute string similarity matrices: stringsimmatrix.
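
For example, a minimal sketch with two made-up character vectors and the Levenshtein-based similarity (the method choice is arbitrary):

```r
library(stringdist)

a <- c("hello", "world")
b <- c("hello", "word")

# One row per element of a, one column per element of b;
# similarities are scaled between 0 and 1.
m <- stringsimmatrix(a, b, method = "lv")
print(m)
```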


gower 1.2.2 is on CRAN

R package gower was accepted on CRAN on 23 June 2020.

This release fixes an edge case that occurs with a small number of records and a large number of threads. Thanks to Max Austin for kindly reporting the issue.

Here's a picture of John C. Gower: the man who came up with this famous (dis)similarity measure.

John C. Gower


tinytest 1.2.2 is on CRAN

The tinytest package was accepted on CRAN on 18 June.

This release improves the build_install_test() function with two extra arguments that were previously missing, as reported by Patrick Breheny.

Meanwhile, usage of the package continues to grow steadily. At the time of writing, 72 packages are using tinytest.

growth of tinytest


lumberjack 1.2.0 is on CRAN and paper accepted by JSS

lumberjack is a package that allows you to track (log) changes in data while an R script is running. This allows you to detect exactly which code had what effect on your data.

The only thing you have to do is

  1. Add one line of code at the top of your script, for example:
start_log(mydata, expression_logger$new(m = mean(salary, na.rm=TRUE)))

to follow the mean of variable salary in mydata as mydata gets processed, one line of code at a time.

  2. Run your script with
library(lumberjack)
run_file("myscript.R")

and your script will run as usual, except that the mean salary is tracked across the run, and (in this case) automatically written to a file. The package is extensible so you can use one of the built-in loggers, or write your own.
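
If you just want to try a logger interactively, the same expression logger can also be used with lumberjack's pipe operator instead of run_file(). This is a sketch under assumptions: it uses the built-in women data set rather than mydata, and the log file location is an arbitrary temporary file.

```r
library(lumberjack)

# Temporary file for the log (an arbitrary choice for this sketch).
logfile <- tempfile(fileext = ".csv")

# Track the mean height after every processing step.
logger <- expression_logger$new(m = mean(height, na.rm = TRUE))

out <- women %L>%
  start_log(logger) %L>%
  within(height <- height * 2) %L>%
  within(height <- height * 3) %L>%
  dump_log(file = logfile)

# The log contains one record per processing step.
print(read.csv(logfile))
```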

If you want to know more, I highly recommend reading through my short paper on lumberjack that was recently accepted by the Journal of Statistical Software.

New release

Version 1.2.0 was accepted by CRAN on 8 May 2020. There are a few new features and fixes, some of them suggested by one of the JSS reviewers.

  • The most important change is that loggers now not only know what expression is running, but also from which file it originated, and from which line. This means that loggers can give more detailed information on when a script did what to data.
  • The JSS paper is now included as a vignette.

Finally

Here's a picture of a man, trying hard to look like a lumberjack while staring into the void.

I'm a lumberjack and I'm ok


tinytest 1.2.0 is on CRAN

tinytest is a relatively new, light-weight (no-dependency) but full-featured unit testing framework for R. It is currently used by 60+ packages, including the famous Rcpp package.

onwards to the tinyverse

The latest version of tinytest was accepted by CRAN on 5 May 2020. This is a minor release with some fixes and a few new features.

New features

tinytest now gives you some extra control over the environment the tests are run in. The workhorse function run_test_file now has an argument set_env, so you can do

run_test_file("test_mypkg.R", set_env=list(LC_TIME = "nl_NL.utf8", FOO="bar"))

and your time locale settings will be Dutch during the time of your test run (and who wouldn't want that?). Also your FOO will be bar, but only during the test run. All functions that rely on run_test_file accept the same argument via the ... argument.
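
Here is a small self-contained sketch of this behaviour. It writes a hypothetical one-line test file to a temporary directory; the file name and the FOO variable are only for illustration.

```r
library(tinytest)

# A throwaway test file that checks the environment variable.
test_file <- file.path(tempdir(), "test_env.R")
writeLines('expect_equal(Sys.getenv("FOO"), "bar")', test_file)

# FOO is set to "bar" while the test file runs.
res <- run_test_file(test_file, set_env = list(FOO = "bar"), verbose = 0)
print(res)

# Inspect FOO in the calling session after the run.
print(Sys.getenv("FOO"))
```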

You can now compare output to stored output using expect_equal_to_reference() and expect_equivalent_to_reference(). This compares a value to a value stored in an RDS file.

There's a new function expect_stdout() that catches everything that is printed or cat'ed to the terminal. The expect_message() function used to do this, but this task is now split over two functions, making the interface somewhat more consistent.

The new function get_call_wd() can be used from within a test file. It returns the working directory that was active when the test sequence was invoked (e.g. by test_all()). Useful because tinytest temporarily and safely switches working directory to the location of the test file (and returns afterwards).

The condition-catching functions expect_error, expect_warning, and expect_message gain a class argument that can be used to check whether a signaled condition inherits from a certain class.
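
A brief sketch of these new expectations in action. The values, messages and the reference file are made up; the reference file is written explicitly with saveRDS() so the example does not depend on what happens when the file is missing.

```r
library(tinytest)

# Output printed or cat'ed to the terminal is caught by expect_stdout().
r1 <- expect_stdout(cat("hello\n"), pattern = "hello")

# Only real 'message' conditions are caught by expect_message().
r2 <- expect_message(message("hi there"), pattern = "hi")

# The class argument checks what the signaled condition inherits from.
r3 <- expect_error(stop("boom"), class = "simpleError")

# Compare a value against a value stored in an RDS file.
ref <- tempfile(fileext = ".rds")
saveRDS(1:3, ref)
r4 <- expect_equal_to_reference(1:3, ref)

print(r1); print(r2); print(r3); print(r4)
```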

Updates, fixes

  • Argument 'tol' now renamed 'tolerance'. Also removed internal reliance on
    partial argument matching (Thanks to Michel Lang).
  • Updated documentation on how to extend tinytest with new recommendation.
  • Using tinytest vignette gains section on testing internal functions.
  • Breaking: function 'expect_message' no longer intercepts messages sent
    to stdout (e.g. via 'print' or 'cat'), but only messages sent as a 'message'
    condition (Thanks to Michel Lang for pointing this out).
  • Fix: 'test_package' would return NULL when called interactively and the package
    contained a failing test.

Thanks

I'm truly grateful for the contributions by people who have helped me improve tinytest by providing suggestions, bug reports, and comments. For this release, I am especially grateful to Michel Lang, Maja Zaloznik and Jon Clayden.


validate 0.9.3 is on CRAN

CRAN just accepted the latest version of our R package validate.

The validate package provides an infrastructure to perform any data quality check in a flexible and extensible way.

This is a minor update with the following new features:

  • New functions exists_any and exists_one to help define cross-record validation rules (thanks to David Salgado)
  • Results of sort and aggregate now include key columns (if any)
  • Added JSS paper and CITATION file.
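
To give an idea of the cross-record rules, here is a sketch with a made-up household data set: each household (hhid) is checked for having exactly one, or at least one, head of household (hhrole equal to "h"). The data and the rule names are invented for illustration.

```r
library(validate)

# Made-up data: household 1 has two heads, household 2 has one,
# household 3 has none.
dd <- data.frame(
  hhid   = c(1, 1, 2, 1, 2, 2, 3),
  person = 1:7,
  hhrole = c("h", "h", "m", "m", "h", "m", "m"),
  stringsAsFactors = FALSE
)

rules <- validator(
  one_head = exists_one(hhrole == "h", by = hhid),
  any_head = exists_any(hhrole == "h", by = hhid)
)

vals <- values(confront(dd, rules))
print(cbind(dd, vals))
```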

We are also very happy to report that our paper on validate has been accepted by the Journal of Statistical Software. It will take a while before it is published but a preprint was added as a vignette.

  • M. van der Loo and E. de Jonge, "Data validation infrastructure for R," Journal of Statistical Software, vol. 97, pp. 1–22, 2021.
    @article{loo2019validation,
    title = {Data Validation Infrastructure for R},
    year = {2021},
    author = {MPJ van der Loo and E de Jonge},
    journal = {Journal of Statistical Software},
    volume = {97},
    pages = {1--22},
    pdf = {https://www.markvanderloo.eu/files/statistics/jss3483.pdf}
    }

darth validator
