stringdist 0.9: exercise all your cores

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R.

Secondly, stringdist now employs multithreading based on the openmp protocol. This means that calculations are now parallelized on multicore machines running OS's that support openmp.

The stringdist package offers two main functions, both of which are now parallelized with openmp:

  • stringdist can compute a number of different string metrics between vectors of strings (see here)
  • amatch is an approximate text matching version of R's native match function.

By default, the package now uses the following number of cores: if your machine has one or two cores, all of them are used. If your machine has 3 or more cores, cores are used and the number of cores is determined by a call to parallel::detectCores(). This way, you can still use your computer for other things while stringdist is doing its job. I set this default since I noticed in some benchmarks that using all cores in a computation is sometimes slower than using cores. This is probably because one of the cores is occupied with (for example) competing OS tasks, but I haven't thourougly investigated that. You may still de- or increase the maximum amount of resources consumed since both amatch and stringdist now have a nthread argument. You may also alter the global option

  options("sd_num_thread")

or change the environmental variable OMP_THREAD_LIMIT prior to loading stringdist, but I'm digressing in details now.

A simple benchmark on my quadcore Linux machine (code at the end of the post) shows a near linear speedup as a function of the number of cores. The (default) distance computed here is the optimal string alignment distance. For this benchmark I sampled 10k strings of lengths between 5 and 31 characters. The first benchmark (left panel) shows the time it takes to compute 10k pairwise distances as a function of the number of cores used (nthread=1,2,3,4). The right panel shows how much time it takes to fuzzy-match 15 strings against a table of 10k strings as a function of the number of threads. The areas around the lines show the 1st and 3rd quartile interval of timings (thanks to the awesome microbenchmark package of Olaf Mersmann).

benchmark

According to the Writing R extensions manual, certain commercially available operating systems have extra (fixed?) overhead when running openmp-based multithreading. However, for larger computations this shouldn't really matter.


library(microbenchmark)
library(stringdist)
set.seed(2015)

# number of strings
N <- 10000

# Generate N random strings of length min_len to max_len
rand_str <- function(N, min_len=5, max_len=31){
  len <- sample(min_len:max_len, size=N, replace=TRUE)
  sapply(len,function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
}

# plot results. bm: an object of class microbenchmark
bmplot <- function(bm,...){
  s <- summary(bm)
  unit <- attr(s,"unit")
  med <- s$median
  uq <- s$uq
  lq <- s$lq
  cores <- seq_along(med)
  plot(cores,med, col='white'
    , xlab = "Cores used"
    , ylab = sprintf("Time (%s)",unit)
    , ...
  )
  polygon(c(cores,rev(cores)), c(lq,rev(uq))
    , col=adjustcolor('blue',alpha.f = 0.1)
    , border=NA)
  lines(cores,med,lwd=2)
  points(cores,med,pch=16)
}


x <- rand_str(N)
y <- rand_str(N)


bm_sd <- microbenchmark(times=100
  , stringdist(x,y,nthread=1)               
  , stringdist(x,y,nthread=2)
  , stringdist(x,y,nthread=3)
  , stringdist(x,y,nthread=4)
)

n <- 15
x1 <- x[1:n]
bm_am <- microbenchmark(times=25
  , amatch(x1,y,nthread=1)               
  , amatch(x1,y,nthread=2)
  , amatch(x1,y,nthread=3)
  , amatch(x1,y,nthread=4)
)

par(mfrow=c(1,2))
bmplot(bm_sd,main=sprintf("stringdist %d strings",N))
bmplot(bm_am,main=sprintf("amatch %dx%d strings",n,N))
This entry was posted in programming, R, string metrics. Bookmark the permalink.

4 Responses to stringdist 0.9: exercise all your cores

  1. Pingback: Stringdist 0.9.2: dist objects, string similarities and some deprecated arguments | Mark van der Loo

  2. N says:

    It seems like nthread has not effect on the benchmark I did with the same code.
    Is there a need for additional packages (e.g., parallel, snow)?

    • N says:

      Oh, it's probably because I tried on OS X w/o openmp.

      • mark says:

        yep, I just wanted to guess you're on OSX... To get it working on OSX you need to compile R and then install stringdist from source, after installing the gcc toolchain. AFAIK openmp will be supported by OSX/clang in the future, but not sure how remote.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.

*