stringdist released

stringdist was accepted on CRAN at the end of last week.

This release just fixes a few bugs affecting the stringdistmatrix function when it is called with a single argument.

From the NEWS file:

  • bugfix in stringdistmatrix(a): value of p, for jw-distance was ignored (thanks to Max Fritsche)
  • bugfix in stringdistmatrix(a): Would segfault on q-gram w/input > ~7k strings and q>1 (thanks to Connor McKay)
  • bugfix in jaccard distance: distance not always correct when passing multiple strings (thanks to Robert Carlson)

Actually the last bug has not bitten anyone since it was masked by the second one 🙂 (it was reported and fixed a long time ago but popped up again after fixing the second bug -- hat tip to Hadley for testthat!). The second fix also ensures that stringdist's memory allocator for q-gram storage is called fewer times which yields a speed gain in computation of q-gram based distances.


validate version 0.1.5 is out

A new version of the validate package for data validation was just accepted on CRAN and will be available on all mirrors in a few days.

The most important addition is that you can now reference the data set as a whole, using the "dot" syntax like so:

iris %>% check_that(
  nrow(.) > 100
  , "Sepal.Width" %in% names(.)) %>% 
  summary()

  rule items passes fails nNA error warning                  expression
1   V1     1      1     0   0 FALSE   FALSE               nrow(.) > 100
2   V2     1      1     0   0 FALSE   FALSE "Sepal.Width" %in% names(.)

Also, it is now possible to return a logical, even when the result is NA, by passing the na.value option.

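dat = data.frame(x=c(1,NA,-1))
v = validator(x > 0)
# default: a rule evaluating to NA yields NA
values(confront(dat,v))
[1,]  TRUE
[2,]    NA
[3,] FALSE
# with na.value=FALSE, NA results are reported as FALSE
values(confront(dat,v,na.value=FALSE))
[1,]  TRUE
[2,] FALSE
[3,] FALSE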

A complete list of changes and bugfixes can be found in the NEWS file. Below I also include the changes in 0.1.4, since I did not write about that version before.

I will be talking about this package at the upcoming useR!2016 event, so join me if you're interested!

version 0.1.5

  • The '.' is now used to reference the validated data set as a whole.
  • Small change in output of 'compare' to match the table in van den Broek et al. (2013)

version 0.1.4

  • 'confront' now emits a warning when a variable name conflicts with the name of a reference data set
  • Deprecated 'validate_reset', in favour of the shorter 'reset' (use 'validate::reset' in case of ambiguity)
  • Deprecated 'validate_options' in favour of the shorter 'voptions'
  • New option na.value with default value NA, controlling the output when a rule evaluates to NA.
  • Added rules from the ESSnet on validation (deliverable 17) to automated tests.
  • added 'grepl' to allowed validation syntax (suggested by Dusan Sovic)
  • exported a few functions w/ keywords internal for extensibility
  • Bugfix: blocks sometimes reported wrong nr of blocks (in case of a single connected block.)
  • Bugfix: macro expansion failed when macros were reused in other macros.
  • Bugfix: certain nonlinear relations were recognized as linear
  • Bugfix: rules that use (anonymous) function definitions raised error when printed.

Easy data validation with the validate package

The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.


iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length
  , mean(Sepal.Width) > 0
  , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
) %>% summary()

#  rule items passes fails nNA error warning                                              expression
# 1   V1   150     66    84   0 FALSE   FALSE                        Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE                                   mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10

The summary gives an overview of the number of items checked. For an aggregated test, such as the one where we test the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. Furthermore, the summary reports the number of items that pass, fail, or could not be evaluated because of missing values.

In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.

v <-  validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
)
v

# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10

We can confront the iris data set with this validator. The results are stored in a validation object.

cf <- confront(iris, v)
cf

# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0

These are just the basics of what can be done with this package.

  • If this post got you interested, you can go through our introductory vignette
  • Some theory on data validation can be found here
  • We'd love to hear your suggestions, opinions, bugreports here
  • An introduction on how to retrieve and store rules from textfiles can be found in a second vignette
  • Github repo, CRAN page

settings 0.2.3

An updated version of the settings package has been accepted on CRAN.

The settings package provides an alternative options management system for R. It aims to allow layered options management, where global options serve as defaults that can easily be overruled locally (e.g. when calling a function, or as options that are part of an object).

New features:

  • Setting ranges or lists of allowed values.
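
As a rough sketch of what restricting option values could look like: the option names below are made up, and the `.allowed` argument with the `inlist` and `inrange` helpers is used as I understand it from the package documentation, so check the vignette for the authoritative interface.

```r
library(settings)

# Hypothetical option names; .allowed restricts the values each option may take
opt <- options_manager(
  direction = "up", level = 2,
  .allowed = list(
    direction = inlist("up", "down"),       # a list of allowed values
    level     = inrange(min = 0, max = 3)   # a numeric range
  )
)

# A value within the range is accepted
opt(level = 3)
ok <- opt("level")

# A value outside the allowed set is expected to raise an error
res <- tryCatch(opt(direction = "sideways"), error = function(e) "rejected")
```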

See the vignette to get started. All code is on github.


stringdist 0.9.4 and 0.9.3: distances between integer sequences

A new release of stringdist has been accepted on CRAN.

stringdist offers a number of popular distance functions between sequences of integers or characters that are independent of character encoding.

version 0.9.4

  • bugfix: edge case for zero-size for lower tridiagonal dist matrices (caused UBSAN to fire, but gave correct results).
  • bugfix in jw distance: not symmetric for certain cases (thanks to github user gtumuluri)

Since 0.9.3, stringdist can compute distances between integer sequences, and you can use hashr to compute an integer representation of any (sequence of) R objects, based on the superFastHash algorithm of Paul Hsieh.

version 0.9.3

  • new functions computing distances between integer sequences: seq_dist, seq_distmatrix
  • new function for tokenizing integer sequences: seq_qgrams
  • new function for matching integer sequences: seq_amatch
  • q-gram based distances are now always 0 when q=0 (used to be Inf if at least one of the arguments was not the empty string)
  • stringdist, stringdistmatrix now emit warning when presented with list argument
  • small c-side code optimizations
  • bugfix in dl, lv, osa distance: weights were not taken into account properly (thanks to Zach Price)
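
A minimal sketch of the integer-sequence functions listed above, assuming stringdist 0.9.3 or later is installed. The sequences are made up for illustration.

```r
library(stringdist)

# Two made-up integer sequences that differ by one adjacent transposition
a <- c(1L, 2L, 3L, 4L)
b <- c(1L, 3L, 2L, 4L)

# Optimal string alignment counts the swap as a single edit,
# while plain Levenshtein needs two substitutions
d_osa <- seq_dist(a, b, method = "osa")
d_lv  <- seq_dist(a, b, method = "lv")

# With a single list argument, seq_distmatrix returns a 'dist' object
d_mat <- seq_distmatrix(list(a, b, c(4L, 3L, 2L, 1L)), method = "lv")
```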

All code on github.


Stringdist 0.9.2: dist objects, string similarities and some deprecated arguments

On 24-06-2015 stringdist 0.9.2 was accepted on CRAN. A summary of new features can be found in the NEWS file; here I discuss the changes with some examples.

Computing 'dist' objects with 'stringdistmatrix'

The R dist object is used as input for many clustering algorithms such as stats::hclust. It stores the lower triangle of a matrix of distances between a vector of objects. The function stringdist::stringdistmatrix now takes a variable number of character arguments. If two vectors are given, it behaves the same as it used to.

> x <- c("fu","bar","baz","barb")
> stringdistmatrix(x,x,useNames="strings")
     fu bar baz barb
fu    0   3   3    4
bar   3   0   1    1
baz   3   1   0    2
barb  4   1   2    0

However, we're doing more work than necessary. Feeding stringdistmatrix just a single character argument yields the same information, but at half the computational and storage cost.

> stringdistmatrix(x,useNames="strings")
     fu bar baz
bar   3        
baz   3   1    
barb  4   1   2

The output is a dist object storing only the subdiagonal triangle. This makes it particularly easy to cluster texts using any algorithm that takes a dist object as argument. Many such algorithms available in R do, for example:

d <- stringdistmatrix(x,useNames="strings")
h <- stats::hclust(d)


(By the way, parallelizing the calculation of a lower triangle of a matrix poses an interesting exercise in index calculation. For those interested, I wrote it down.)
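
To give a flavour of the index calculation involved, here is a small base R sketch. It is not the derivation referred to above, just the storage formula documented in ?dist together with one possible inverse. The function names are made up.

```r
# Position of element (i, j), with i > j, in the vector underlying a 'dist'
# object for n items (lower triangle stored column by column; see ?dist)
dist_index <- function(i, j, n) {
  n * (j - 1) - j * (j - 1) / 2 + i - j
}

# Inverse: recover (i, j) from a linear position k by locating its column
dist_subscript <- function(k, n) {
  # first linear position of each column j = 1, ..., n - 1
  col_start <- dist_index(2:n, 1:(n - 1), n)
  j <- findInterval(k, col_start)
  i <- k - col_start[j] + j + 1
  cbind(i = i, j = j)
}

# Check against base R on a small made-up example
n  <- 5
d  <- dist(seq_len(n)^2)
m  <- as.matrix(d)
ij <- dist_subscript(seq_along(d), n)
```

A mapping like `dist_subscript` lets each worker turn its share of linear positions into the pairs of strings it has to compare, without any coordination between workers.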

Better labeling of distance matrices

Distance matrices can be labeled with the input strings by setting the useNames argument in stringdistmatrix to TRUE or FALSE (the default). However, if you're computing distances between looooong strings, like complete texts, it is more convenient to use the names attribute of the input vector. So, the useNames argument now takes three different values.

> x <- c(one="fu",two="bar",three="baz",four="barb")
> y <- c(a="foo",b="fuu")
> # the default:
> stringdistmatrix(x,y,useNames="none") 
     [,1] [,2]
[1,]    2    1
[2,]    3    3
[3,]    3    3
[4,]    4    4
> # like useNames=TRUE
> stringdistmatrix(x,y,useNames = "strings")
     foo fuu
fu     2   1
bar    3   3
baz    3   3
barb   4   4
> # use labels
> stringdistmatrix(x,y,useNames="names")
      a b
one   2 1
two   3 3
three 3 3
four  4 4

String similarities

Thanks to Jan van der Laan, a string similarity convenience function has been added. It computes the distance $d$ between two strings and then rescales it as $1 - d/d_{\max}$, where the maximum possible distance $d_{\max}$ depends on the type of distance metric and (depending on the metric) the length of the strings.

# similarity based on the damerau-levenshtein distance
> stringsim(c("hello", "World"), c("Ola", "Mundo"),method="dl")
[1] 0.2 0.0
# similarity based on the jaro distance
> stringsim(c("hello", "World"), c("Ola", "Mundo"),method="jw")
[1] 0.5111111 0.4666667

Here a similarity of 0 means completely different and 1 means exactly the same (within the chosen metric).
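
The rescaling can be checked by hand for the Levenshtein distance, where the maximum possible distance between two strings is the length of the longer one. This is a sketch assuming stringdist 0.9.2 or later. The variable names are made up.

```r
library(stringdist)

a <- c("hello", "World")
b <- c("Ola", "Mundo")

# Levenshtein distance and its maximum possible value for each pair:
# the length of the longer string
d     <- stringdist(a, b, method = "lv")
d_max <- pmax(nchar(a), nchar(b))

# Similarity computed by hand, and by the convenience function
sim_manual  <- 1 - d / d_max
sim_package <- stringsim(a, b, method = "lv")
```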

Deprecated arguments

The stringdistmatrix function had the option to compute in parallel, based on facilities of the parallel package. However, as of stringdist 0.9.0, all distance calculations run on multiple cores by default.

Therefore, I'm phasing out the following options in stringdistmatrix:

  • ncores (how many R-sessions should be started by parallel to compute the matrix?)
  • cluster (optionally, provide your own cluster, created by parallel::makeCluster).

These arguments are now ignored with a message, but they'll remain available until somewhere in 2016 so users have time to adapt their code. Please mail me if you have any trouble doing so.


stringdist 0.9: exercise all your cores

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R.

Secondly, stringdist now employs multithreading based on the openmp protocol. This means that calculations are now parallelized on multicore machines running OS's that support openmp.

The stringdist package offers two main functions, both of which are now parallelized with openmp:

  • stringdist can compute a number of different string metrics between vectors of strings (see here)
  • amatch is an approximate text matching version of R's native match function.

By default, the package now uses the following number of cores: if your machine has one or two cores, all of them are used. If your machine has $n \geq 3$ cores, $n-1$ cores are used, where $n$ is determined by a call to parallel::detectCores(). This way, you can still use your computer for other things while stringdist is doing its job. I set this default since I noticed in some benchmarks that using all cores in a computation is sometimes slower than using $n-1$ cores. This is probably because one of the cores is occupied with (for example) competing OS tasks, but I haven't thoroughly investigated that. You may still decrease or increase the maximum amount of resources consumed, since both amatch and stringdist now have an nthread argument. You may also alter the corresponding global option or change the environment variable OMP_THREAD_LIMIT prior to loading stringdist, but I'm digressing into details now.
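
A small sketch of the ways to control threading, assuming stringdist 0.9.0 or later. The default rule is restated in plain R as a paraphrase of the text, not the package's internal code, and the global option name `sd_num_thread` is taken from my reading of the package documentation.

```r
library(stringdist)

# The default rule described above, restated in plain R
n_cores <- parallel::detectCores()
default_threads <- if (is.na(n_cores) || n_cores < 3) n_cores else n_cores - 1

x <- c("foo", "bar", "baz")
y <- c("fu", "bra", "bas")

# Per-call control through the nthread argument
d1 <- stringdist(x, y, nthread = 1)

# Session-wide control through the global option (assumed name: sd_num_thread)
old <- options(sd_num_thread = 2)
d2 <- stringdist(x, y)
options(old)  # restore the previous setting
```

The number of threads only affects speed, so `d1` and `d2` hold the same distances.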

A simple benchmark on my quadcore Linux machine (code at the end of the post) shows a near linear speedup as a function of the number of cores. The (default) distance computed here is the optimal string alignment distance. For this benchmark I sampled 10k strings of lengths between 5 and 31 characters. The first benchmark (left panel) shows the time it takes to compute 10k pairwise distances as a function of the number of cores used (nthread=1,2,3,4). The right panel shows how much time it takes to fuzzy-match 15 strings against a table of 10k strings as a function of the number of threads. The areas around the lines show the 1st and 3rd quartile interval of timings (thanks to the awesome microbenchmark package of Olaf Mersmann).


According to the Writing R extensions manual, certain commercially available operating systems have extra (fixed?) overhead when running openmp-based multithreading. However, for larger computations this shouldn't really matter.


library(stringdist)
library(microbenchmark)

# number of strings
N <- 10000

# Generate N random strings of length min_len to max_len
rand_str <- function(N, min_len=5, max_len=31){
  len <- sample(min_len:max_len, size=N, replace=TRUE)
  sapply(len,function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
}

# plot results. bm: an object of class microbenchmark
bmplot <- function(bm,...){
  s <- summary(bm)
  unit <- attr(s,"unit")
  med <- s$median
  uq <- s$uq
  lq <- s$lq
  cores <- seq_along(med)
  # set up the axes (points are drawn in white, so they stay invisible)
  plot(cores,med, col='white'
    , xlab = "Cores used"
    , ylab = sprintf("Time (%s)",unit)
    , ...
  )
  # shaded band between the 1st and 3rd quartile of the timings
  polygon(c(cores,rev(cores)), c(lq,rev(uq))
    , col=adjustcolor('blue',alpha.f = 0.1)
    , border=NA)
  # median timings
  lines(cores, med)
}

x <- rand_str(N)
y <- rand_str(N)

# pairwise distances between x and y, using 1 to 4 threads
bm_sd <- microbenchmark(times=100
  , stringdist(x,y,nthread=1)
  , stringdist(x,y,nthread=2)
  , stringdist(x,y,nthread=3)
  , stringdist(x,y,nthread=4)
)

# fuzzy-match the first n strings of x against y, using 1 to 4 threads
n <- 15
x1 <- x[1:n]
bm_am <- microbenchmark(times=25
  , amatch(x1,y,nthread=1)
  , amatch(x1,y,nthread=2)
  , amatch(x1,y,nthread=3)
  , amatch(x1,y,nthread=4)
)

bmplot(bm_sd,main=sprintf("stringdist %d strings",N))
bmplot(bm_am,main=sprintf("amatch %dx%d strings",n,N))

Easy to use option settings management with the 'settings' package

Last week I released a new package called settings. It grew out of my frustration built up during several small projects where I'm generating heavily parameterized d3/js output. What I wanted was support to

  • define a whole bunch of option settings with default values;
  • be able to set them globally or locally within a function or object without explicitly re-assigning every setting;
  • reset (global) option settings to default with ease.

Turns out, the first and last wishes on my list are fulfilled with the futile.options package. I really wanted the inheritance features though so I experimented a bunch of times with different implementations. Most of those were based on reference classes holding (global) option settings. In the end I chose a functional approach, inspired by futile.options. I feel this approach is both lightweight (the package's code basically fits readably on an A4 page) and elegant[1].

I'm going to give a quick tour of the package here, and refer to the package vignette for extensive examples.

You can define an options manager like this.

opt <- options_manager(foo=0,bar=1)

opt is a function that acts like R's default options function.

# get option settings:
> opt('foo')
[1] 0
# change option settings
> opt(bar=10,foo=6)
> opt()
$foo
[1] 6

$bar
[1] 10

The cool thing is that you can reset it to defaults like this.

> reset(opt)
> opt()
$foo
[1] 0

$bar
[1] 1

The second cool thing is that you can create a copy, where the copy has the same defaults but new current settings.

> loc_opt <- clone_and_merge(opt,foo=7)
> loc_opt()
$foo
[1] 7

$bar
[1] 1

# loc_opt can be reset locally:
> reset(loc_opt)
> loc_opt()
$foo
[1] 0

$bar
[1] 1

Resetting or otherwise altering loc_opt does not affect the global options set in opt. Of course, loc_opt can be cloned again and again.
This stuff is useful when you write a function and you want to merge options in dot-dot-dot arguments with global options. For example

# user may or may not want to add options like foo=10 when calling 'myfunc'
myfunc <- function(x,...){

  # merge user-defined options with global options in globally defined 'opt'
  loc_opt <- clone_and_merge(opt,...)
  # use local options
  loc_opt('foo') + loc_opt('bar') * x
}
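
To make the merging concrete, here is a self-contained sketch that repeats the definitions from this post and calls the function with and without a local option. The result variable names are made up.

```r
library(settings)

# Global options manager, as defined earlier in the post
opt <- options_manager(foo = 0, bar = 1)

myfunc <- function(x, ...) {
  # merge options passed through ... with the global settings
  loc_opt <- clone_and_merge(opt, ...)
  loc_opt("foo") + loc_opt("bar") * x
}

r_default <- myfunc(2)            # global settings: 0 + 1 * 2
r_local   <- myfunc(2, foo = 10)  # foo overruled for this call only: 10 + 1 * 2
foo_after <- opt("foo")           # the global setting is untouched
```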

For more examples, including on how to use this in S4 or reference classes, or how to use settings as an options manager in a package, please see the package vignette. As always, the code is available on github.

[1] Well, it hurts to say there's a bit of abstraction leakage here: there are two option names that cannot be used: .__defaults and .__reset, but the package provides methods to protect against that.


stringdist 0.8: now with soundex

An update to the stringdist package was released earlier this month. Thanks to a contribution of Jan van der Laan the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to translate words that sound similar (when pronounced in English) to the same code.

Soundex codes can be computed with the new phonetic function, for example:

> phonetic(c('Euler','Gauss','Hilbert','Knuth','Lloyd','Lukasiewicz','Wachs'))
[1] "E460" "G200" "H416" "K530" "L300" "L222" "W200"

Since two strings are considered equal when they have the same soundex code, this gives a two-valued distance function.

> stringdist('Claire','Clare',method='soundex')
[1] 0
[1] 1

Since soundex is really only defined on the printable ASCII character set, a warning is given when non-ascii or non-printable ascii characters are encountered.

> phonetic("Jörgen")
[1] "J?62"
Warning message:
In phonetic("Jörgen") :
  soundex encountered 1 non-printable ASCII or non-ASCII
  characters. Results may be unreliable, see ?printable_ascii

The new function printable_ascii can help you detect such characters.

> printable_ascii(c("jörgen","jurgen"))

To get rid of such characters in a sensible way there are a few options. First of all, you may want to try R's built-in iconv interface to translate accented characters to ascii.

> iconv("jörgen",to="ASCII//TRANSLIT")
[1] "jorgen"

However, behaviour of iconv may be system-dependent, see the iconv documentation for a thorough discussion. Another option is to install the stringi package.

> stri_trans_general("jörgen","Latin-ASCII")
[1] "jorgen"

This package should yield the same result, regardless of the OS you're working on.


I came across this post on SO, where several solutions to sorting data.frames are presented. It must have been solved a million times, but here's a solution I like to use. It benefits from the fact that sort is an S3 generic.

sort.data.frame <- function(x, decreasing=FALSE, by=1, ... ){
  # order on the selected columns, passing 'decreasing' along
  f <- function(...) order(...,decreasing=decreasing)
  i <- do.call(f,x[by])
  # return the rows in sorted order
  x[i,,drop=FALSE]
}

It sorts on the first column by default, but you may use any vector of valid column indices. Here are some examples.

sort(iris, by="Sepal.Length")
sort(iris, by=c("Species","Sepal.Length"))
sort(iris, by=1:2)
sort(iris, by="Sepal.Length",decreasing=TRUE)