stringdist 0.9.4 and 0.9.3: distances between integer sequences

A new release of stringdist has been accepted on CRAN.

stringdist offers a number of popular distance functions between sequences of integers or characters that are independent of character encoding.

version 0.9.4

  • bugfix: edge case for zero-size lower triangular dist matrices (caused UBSAN to fire, but results were correct).
  • bugfix in jw distance: it was not symmetric in certain cases (thanks to github user gtumuluri)

Since 0.9.3, stringdist can compute distances between integer sequences, and you can use hashr to compute an integer representation of any (sequence of) R objects, based on the SuperFastHash algorithm of Paul Hsieh.
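
Below is a minimal sketch of how this looks (assuming hashr is installed; it hashes character vectors element-wise and lists recursively, which is what seq_dist expects):

library(stringdist)
library(hashr)

# distances between integer sequences: pass lists of integer vectors
a <- list(c(1L, 2L, 3L, 4L))
b <- list(c(1L, 2L, 4L))
seq_dist(a, b, method = "lv")    # a single deletion, so this should be 1

# hash word tokens to integers, then compare the resulting integer sequences
s1 <- hash(strsplit("the quick brown fox", " "))
s2 <- hash(strsplit("the quick red fox", " "))
seq_dist(s1, s2, method = "lv")  # one substituted token, so this should be 1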

version 0.9.3

  • new functions computing distances between integer sequences: seq_dist, seq_distmatrix
  • new function for tokenizing integer sequences: seq_qgrams
  • new function for matching integer sequences: seq_amatch
  • q-gram based distances are now always 0 when q=0 (used to be Inf if at least one of the arguments was not the empty string)
  • stringdist, stringdistmatrix now emit warning when presented with list argument
  • small c-side code optimizations
  • bugfix in dl, lv, osa distance: weights were not taken into account properly (thanks to Zach Price)

All code on github.


Stringdist 0.9.2: dist objects, string similarities and some deprecated arguments

On 24-06-2015 stringdist 0.9.2 was accepted on CRAN. A summary of new features can be found in the NEWS file; here I discuss the changes with some examples.

Computing 'dist' objects with 'stringdistmatrix'

The R dist object is used as input for many clustering algorithms, such as stats::hclust. It stores the lower triangle of a matrix of distances between the elements of a vector of objects. The function stringdist::stringdistmatrix now takes a variable number of character arguments. If two vectors are given, it behaves the same as it used to.

> x <- c("fu","bar","baz","barb")
> stringdistmatrix(x,x,useNames="strings")
     fu bar baz barb
fu    0   3   3    4
bar   3   0   1    1
baz   3   1   0    2
barb  4   1   2    0

However, we're doing more work than necessary. Feeding stringdistmatrix just a single character argument yields the same information, but at half the computational and storage cost.

> stringdistmatrix(x,useNames="strings")
     fu bar baz
bar   3        
baz   3   1    
barb  4   1   2

The output is a dist object storing only the lower triangle. This makes it particularly easy to cluster texts using any algorithm that takes a dist object as its argument, and many such algorithms available in R do. For example:

d <- stringdistmatrix(x,useNames="strings")
h <- stats::hclust(d)
plot(h)

(figure: dendrogram of the example strings, produced by plot(h))

(By the way, parallelizing the calculation of the lower triangle of a matrix poses an interesting exercise in index calculation. For those interested, I wrote it down.)

Better labeling of distance matrices

Distance matrices can be labeled with the input strings by setting the useNames argument of stringdistmatrix to TRUE or FALSE (the default). However, if you're computing distances between looooong strings, like complete texts, it is more convenient to use the names attribute of the input vector. So, the useNames argument now takes three different values.

> x <- c(one="fu",two="bar",three="baz",four="barb")
> y <- c(a="foo",b="fuu")
> # the default:
> stringdistmatrix(x,y,useNames="none") 
     [,1] [,2]
[1,]    2    1
[2,]    3    3
[3,]    3    3
[4,]    4    4
> # like useNames=TRUE
> stringdistmatrix(x,y,useNames = "strings")
     foo fuu
fu     2   1
bar    3   3
baz    3   3
barb   4   4
> # use labels
> stringdistmatrix(x,y,useNames="names")
      a b
one   2 1
two   3 3
three 3 3
four  4 4

String similarities

Thanks to Jan van der Laan, a string similarity convenience function has been added. It computes the distance d between two strings and then rescales it as 1 - d/d_max, where the maximum possible distance d_max depends on the type of distance metric and (depending on the metric) the length of the strings.

# similarity based on the damerau-levenshtein distance
> stringsim(c("hello", "World"), c("Ola", "Mundo"),method="dl")
[1] 0.2 0.0
# similarity based on the jaro distance
> stringsim(c("hello", "World"), c("Ola", "Mundo"),method="jw")
[1] 0.5111111 0.4666667

Here a similarity of 0 means completely different and 1 means exactly the same (within the chosen metric).
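
For edit-like distances such as (Damerau-)Levenshtein, the maximum possible distance is simply the length of the longer string, so the similarity can also be computed by hand. A small sketch (the exact normalization per method is documented in the stringsim help file):

a <- c("hello", "World")
b <- c("Ola", "Mundo")
d <- stringdist(a, b, method = "dl")
d_max <- pmax(nchar(a), nchar(b))   # maximum possible dl distance per pair
1 - d / d_max                       # should reproduce stringsim(a, b, method = "dl")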

Deprecated arguments

The stringdistmatrix function had the option to be computed in parallel using facilities of the parallel package. However, as of stringdist 0.9.0, all distance calculations are multithreaded by default.

Therefore, I'm phasing out the following options in stringdistmatrix:

  • ncores (how many R-sessions should be started by parallel to compute the matrix?)
  • cluster (optionally, provide your own cluster, created by parallel::makeCluster).

These arguments are now ignored with a message, but they'll remain available until somewhere in 2016 so users have time to adapt their code. Please mail me if you have any trouble doing so.


stringdist 0.9: exercise all your cores

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R.

Secondly, stringdist now employs multithreading based on the openmp protocol. This means that calculations are now parallelized on multicore machines running OS's that support openmp.

The stringdist package offers two main functions, both of which are now parallelized with openmp:

  • stringdist can compute a number of different string metrics between vectors of strings (see here)
  • amatch is an approximate text matching version of R's native match function.

By default, the package now uses the following number of cores: if your machine has one or two cores, all of them are used. If your machine has 3 or more cores, n-1 cores are used, where n is determined by a call to parallel::detectCores(). This way, you can still use your computer for other things while stringdist is doing its job. I set this default since I noticed in some benchmarks that using all cores in a computation is sometimes slower than using n-1 cores. This is probably because one of the cores is occupied with (for example) competing OS tasks, but I haven't thoroughly investigated that. You may still decrease or increase the maximum amount of resources consumed, since both amatch and stringdist now have an nthread argument. You may also alter the global option

  options("sd_num_thread")

or change the environment variable OMP_THREAD_LIMIT prior to loading stringdist, but I'm digressing into details now.
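
A quick sketch of these knobs (the value 2 is just an example):

library(stringdist)
# per call: limit this computation to two threads
stringdist("foo", "bar", nthread = 2)
# per session: set the default number of threads for stringdist and amatch
options(sd_num_thread = 2L)
# or cap openmp globally; this must happen before stringdist is loaded
# Sys.setenv(OMP_THREAD_LIMIT = 2)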

A simple benchmark on my quadcore Linux machine (code at the end of the post) shows a near linear speedup as a function of the number of cores. The (default) distance computed here is the optimal string alignment distance. For this benchmark I sampled 10k strings of lengths between 5 and 31 characters. The first benchmark (left panel) shows the time it takes to compute 10k pairwise distances as a function of the number of cores used (nthread=1,2,3,4). The right panel shows how much time it takes to fuzzy-match 15 strings against a table of 10k strings as a function of the number of threads. The areas around the lines show the 1st and 3rd quartile interval of timings (thanks to the awesome microbenchmark package of Olaf Mersmann).

(figure: benchmark results; left panel: stringdist on 10k string pairs, right panel: amatch of 15 strings against a table of 10k strings, both as a function of the number of threads used)

According to the Writing R extensions manual, certain commercially available operating systems have extra (fixed?) overhead when running openmp-based multithreading. However, for larger computations this shouldn't really matter.


library(microbenchmark)
library(stringdist)
set.seed(2015)

# number of strings
N <- 10000

# Generate N random strings of length min_len to max_len
rand_str <- function(N, min_len=5, max_len=31){
  len <- sample(min_len:max_len, size=N, replace=TRUE)
  sapply(len,function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
}

# plot results. bm: an object of class microbenchmark
bmplot <- function(bm,...){
  s <- summary(bm)
  unit <- attr(s,"unit")
  med <- s$median
  uq <- s$uq
  lq <- s$lq
  cores <- seq_along(med)
  plot(cores,med, col='white'
    , xlab = "Cores used"
    , ylab = sprintf("Time (%s)",unit)
    , ...
  )
  polygon(c(cores,rev(cores)), c(lq,rev(uq))
    , col=adjustcolor('blue',alpha.f = 0.1)
    , border=NA)
  lines(cores,med,lwd=2)
  points(cores,med,pch=16)
}


x <- rand_str(N)
y <- rand_str(N)


bm_sd <- microbenchmark(times=100
  , stringdist(x,y,nthread=1)               
  , stringdist(x,y,nthread=2)
  , stringdist(x,y,nthread=3)
  , stringdist(x,y,nthread=4)
)

n <- 15
x1 <- x[1:n]
bm_am <- microbenchmark(times=25
  , amatch(x1,y,nthread=1)               
  , amatch(x1,y,nthread=2)
  , amatch(x1,y,nthread=3)
  , amatch(x1,y,nthread=4)
)

par(mfrow=c(1,2))
bmplot(bm_sd,main=sprintf("stringdist %d strings",N))
bmplot(bm_am,main=sprintf("amatch %dx%d strings",n,N))

Easy to use option settings management with the 'settings' package

Last week I released a new package called settings. It grew out of my frustration built up during several small projects where I'm generating heavily parameterized d3/js output. What I wanted was support to

  • define a whole bunch of option settings with default values;
  • be able to set them globally or locally within a function or object without explicitly re-assigning every setting;
  • reset (global) option settings to default with ease.

Turns out, the first and last wishes on my list are fulfilled with the futile.options package. I really wanted the inheritance features though so I experimented a bunch of times with different implementations. Most of those were based on reference classes holding (global) option settings. In the end I chose a functional approach, inspired by futile.options. I feel this approach is both lightweight (the package's code basically fits readably on an A4 page) and elegant[1].

I'm going to give a quick overview of the package here, and refer to the package vignette for extensive examples.

You can define an options manager like this.

library(settings)
opt <- options_manager(foo=0,bar=1)

opt is a function that acts like R's default options function.

# get option settings:
> opt('foo')
[1] 0
# change option settings
> opt(bar=10,foo=6)
> opt()
$foo
[1] 6

$bar
[1] 10

The cool thing is that you can reset it to defaults like this.

> reset(opt)
> opt()
$foo
[1] 0

$bar
[1] 1

The second cool thing is that you can create a copy, where the copy has the same defaults but new current settings.

> loc_opt <- clone_and_merge(opt,foo=7)
> loc_opt()
$foo
[1] 7

$bar
[1] 1
# loc_opt can be reset locally:
> reset(loc_opt)
> loc_opt()
$foo
[1] 0

$bar
[1] 1

Resetting or otherwise altering loc_opt does not affect the global options set in opt. Of course, loc_opt can be cloned again and again.

This stuff is useful when you write a function and you want to merge options given in dot-dot-dot arguments with global options. For example:

# user may or may not want to add options like foo=10 when calling 'myfunc'
myfunc <- function(x,...){

  # merge user-defined options with global options in globally defined 'opt'
  loc_opt <- clone_and_merge(opt,...)
  # use local options
  loc_opt('foo') + loc_opt('bar') * x
}
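
A usage sketch with opt as defined above (redefined here so the snippet is self-contained):

opt <- options_manager(foo = 0, bar = 1)
myfunc(2)            # uses the global defaults: 0 + 1 * 2 = 2
myfunc(2, foo = 10)  # overrides foo locally:   10 + 1 * 2 = 12
opt("foo")           # the global value is untouched: still 0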

For more examples, including on how to use this in S4 or reference classes, or how to use settings as an options manager in a package, please see the package vignette. As always, the code is available on github.

[1] Well, it hurts to say there's a bit of abstraction leakage here: there are two option names that cannot be used: .__defaults and .__reset, but the package provides methods to protect against that.


stringdist 0.8: now with soundex

An update to the stringdist package was released earlier this month. Thanks to a contribution by Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to translate words that sound similar (when pronounced in English) into the same code.

Soundex codes can be computed with the new phonetic function, for example:

library(stringdist)
> phonetic(c('Euler','Gauss','Hilbert','Knuth','Lloyd','Lukasiewicz','Wachs'))
[1] "E460" "G200" "H416" "K530" "L300" "L222" "W200"

Two strings are considered equal when they have the same soundex code, so we have a two-valued distance function:

> stringdist('Claire','Clare',method='soundex')
[1] 0
> stringdist('Harry','Joe',method='soundex')
[1] 1
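
Since all of stringdist's distance methods are also available in amatch and ain, soundex lends itself to fuzzy matching of names. A small sketch with made-up data:

reference <- c("Claire", "Harry", "Joe")
amatch(c("Clare", "Jo", "Bob"), reference, method = "soundex")
# should give 1, 3 and NA: "Bob" shares no soundex code with the reference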

Since soundex is really only defined on the printable ASCII character set, a warning is given when non-ASCII or non-printable ASCII characters are encountered.

> phonetic("Jörgen")
[1] "J?62"
Warning message:
In phonetic("Jörgen") :
  soundex encountered 1 non-printable ASCII or non-ASCII
  characters. Results may be unreliable, see ?printable_ascii

The new printable_ascii function, also added in this release, can help you detect such characters.

> printable_ascii(c("jörgen","jurgen"))
[1] FALSE  TRUE

To get rid of such characters in a sensible way, there are a few options. First of all, you may want to try R's built-in iconv interface to translate accented characters to ASCII.

> iconv("jörgen",to="ASCII//TRANSLIT")
[1] "jorgen"

However, behaviour of iconv may be system-dependent, see the iconv documentation for a thorough discussion. Another option is to install the stringi package.

library(stringi)
> stri_trans_general("jörgen","Latin-ASCII")
[1] "jorgen"

This package should yield the same result, regardless of the OS you're working on.


sort.data.frame

I came across this post on SO, where several solutions for sorting data.frames are presented. It must have been solved a million times, but here's a solution I like to use. It benefits from the fact that sort is an S3 generic.

sort.data.frame <- function(x, decreasing=FALSE, by=1, ...){
  # order the rows on the columns selected by 'by', passing 'decreasing' on to order()
  f <- function(...) order(..., decreasing=decreasing)
  i <- do.call(f, x[by])
  x[i, , drop=FALSE]
}

It sorts on the first column by default, but you may use any vector of valid column indices. Here are some examples.

sort(iris, by="Sepal.Length")
sort(iris, by=c("Species","Sepal.Length"))
sort(iris, by=1:2)
sort(iris, by="Sepal.Length",decreasing=TRUE)

Review of "Building interactive graphs with ggplot2 and shiny"

Recently, Packt published a video course with the above title, and I've just spent a pleasant morning reviewing it at Packt's request. Pleasant, because I think the course gives an excellent introduction to both ggplot2 and shiny. The course is authored by Christophe Ladroue.

Course material

Both video and sound are of good quality, and Christophe's clearly pronounced English is easy to follow. I sometimes find that the pace of the course is a bit too high, but it is easy to rewind and pause. All the scripts shown in the course come with the downloaded package, and I highly recommend that people who follow the course run the examples and alter them.

Required knowledge

You need to be familiar with R. The course assumes you have some programming experience and if you've never written a function in R, you will need to brush up your R knowledge for the more advanced parts of the course.

Structure

The course is divided into 8 chapters, each consisting of 5 videos of about 2-3 minutes. It is set up in a top-down manner, starting (after some download-and-install instructions) with a quick intro to the grammar of graphics, after which a number of common visualizations are treated without lingering on details: default settings are used throughout to introduce line plots, paths, density plots, histograms and (multiple) boxplots. The course continues by explaining how to create grouped plots (Ch. 3) and add smoothers (Ch. 4), and then follows with details on how to tweak things like axes, scales and colors (Ch. 5). After this, chapters 6, 7 and 8 introduce shiny, reactive programming, and a more complex application.

Pros

One of the things that really adds value to the course is that Christophe points out some commonly made mistakes by (new) users of ggplot. Having someone explain to you the difference between setting point color as an aesthetic rather than a fixed color can be a real time-saver for beginners.

Many subjects are discussed using a task-based structure. Flipping axes, changing the order of bars in a bar chart: all such tasks are easily found by browsing the index. Also, the section on tweaking axes and colors includes a short discussion on the BigVis package for visualizing larger datasets.

Explanation of shiny is well-structured: first, the application is shown, next the essential parts of the code are explained. I find that the course does a good job in explaining fairly complex material (reactive programming, scopes) in a concise and clear way.

Cons

As said above, I sometimes feel that the course is a little fast. I had to rewind a number of times. Since it's a video course I don't really consider this a big issue.

The course could have included a few more pointers on how to graphically analyze data. For example, when discussing density plots and histograms, it could have been emphasized that it is very important to play with the bandwidth/binwidth. On the other hand: it is a course on ggplot2 and shiny, not on graphical data analysis.

The discussion of condense and bin from the BigVis package is a bit short. I needed to read their help files to really understand what they do.

A discussion of the formula interface of facet_wrap and facet_grid would probably be helpful for users following the course.

Conclusion

Learning ggplot2 is not easier than learning base graphics. In fact, one may argue that the learning curve for ggplot2 is a bit steeper since you need to familiarize yourself with concepts of the grammar of graphics. The big plus is that ggplot2 makes a lot of things easy once you learn it: axes, scales and legends have really good defaults in ggplot2. Learning shiny is another step up for R programmers since you need to learn about reactive programming.

I find that this course introduces both tools well and in a practical manner. I recommend this course to anyone who has sufficient R experience (see above) and who seriously wants to get going with ggplot2 and shiny. After that, I'd still keep Hadley's book at hand as a reference.


A bit of benchmarking with string distances

After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are about two to three times faster. His benchmark compares randomly generated character strings of 5-25 characters, which probably covers many use cases involving natural language. That is, if your language happens to be written in a single-byte encoding, but more on that below. Here's Zachary's benchmark (rerun by myself).

library(rbenchmark)
library(RecordLinkage)
library(stringdist)

> x <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
> y <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
> benchmark(
+   d1 <- stringdist(x,y,method='lv') 
+   , d2 <- levenshteinDist(x,y)
+   , replications=10
+ )
                                   test replications elapsed relative user.self
1 d1 <- stringdist(x, y, method = "lv")           10   4.939    1.876     4.924
2           d2 <- levenshteinDist(x, y)           10   2.633    1.000     2.632

Ouch, indeed.

As we all know, premature optimization is the root of all evil so I felt it was time for some mature optimization.

Checking the differences between stringdist and RecordLinkage I found that RecordLinkage uses C code that works directly on the char representation of strings while I designed stringdist to make sure that multibyte characters are treated as a single character. This is done by converting them to integers before feeding them to my underlying C routines (using native R functionality). Hence, there is some conversion overhead that is linear in string length while the time spent on computing the Levenshtein distance grows with the product of the length of strings compared (say, quadratically). This is easily checked by benchmarking over longer strings.
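
To see the difference between the character-wise and byte-wise view of a string (just an illustration in a UTF-8 locale, not stringdist's actual internals):

utf8ToInt("ü")    # a single integer code point (252)
charToRaw("ü")    # two bytes in UTF-8: c3 bc
nchar("ü")        # one character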

> x <- sapply(sample(25:250,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
> y <- sapply(sample(25:250,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> 
> benchmark(
+   d1 <- stringdist(x,y,method='lv') 
+   , d2 <- levenshteinDist(x,y)
+   , replications=10
+ )
                                   test replications elapsed relative user.self
1 d1 <- stringdist(x, y, method = "lv")           10 111.668    1.258   111.584
2           d2 <- levenshteinDist(x, y)           10  88.784    1.000    88.716

And indeed, the relative difference decreased from a factor of 1.8 to 1.25. So, I rolled up my sleeves and built in a useBytes option that allows one to skip the conversion step. Here's the result on Zachary's original benchmark.

> x <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
> y <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
>
> 
> benchmark(
+   d1 <- stringdist(x,y,method='lv') 
+   , d2 <- stringdist(x,y,method='lv',useBytes=TRUE) 
+   , d3 <- levenshteinDist(x,y)
+   , replications=10
+ )
                                                    test replications elapsed
1                  d1 <- stringdist(x, y, method = "lv")           10   4.806
2 d2 <- stringdist(x, y, method = "lv", useBytes = TRUE)           10   1.463
3                            d3 <- levenshteinDist(x, y)           10   2.942

Yowza! Not only did I get rid of some overhead, stringdist is also about twice as fast as RecordLinkage in this benchmark. I haven't really profiled this, but the main reason is probably that stringdist has a better memory allocation strategy [*].

We can also check this for the Jaro-Winkler distance, which is also implemented by RecordLinkage.

# (output of rbenchmark)
1                  d1 <- stringdist(x, y, method = "jw")           10   4.526
2 d2 <- stringdist(x, y, method = "jw", useBytes = TRUE)           10   0.513
3                                d3 <- jarowinkler(x, y)           10   1.303

Again, stringdist is a factor of two faster.

But, there's always a but. To be fair we need to do the benchmark on longer strings as well, and that's where RecordLinkage comes back with a vengeance.

> x <- sapply(sample(50:250,1e4,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
> y <- sapply(sample(50:250,1e4,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse="")) 
> 
> 
> benchmark(
+   d1 <- stringdist(x,y,method='lv') 
+   , d2 <- stringdist(x,y,method='lv',useBytes=TRUE) 
+   , d3 <- levenshteinDist(x,y)
+   , replications=10
+ )
                                                    test replications elapsed
1                  d1 <- stringdist(x, y, method = "lv")           10  12.557
2 d2 <- stringdist(x, y, method = "lv", useBytes = TRUE)           10  11.972
3                            d3 <- levenshteinDist(x, y)           10  10.091

Here, stringdist is 10-20 percent slower. Again, I have not seriously profiled this, but there is still some overhead in stringdist that's not in RecordLinkage: stringdist needs to select the algorithm, perform some more argument checking, and I still have to do a (very simple) conversion from char to int. Oh well, you can't win 'em all.

The main point for users is to decide whether the useBytes option is appropriate for the problem they're trying to solve. The problem of working directly on the bytes in a character string is that for non-ASCII characters, the results depend on the character encoding scheme (and therefore most likely on the OS you're using). Here's an example.

# the function from RecordLinkage
> levenshteinDist("ü","u")
[1] 2
# stringdist
> stringdist("ü","u",method='lv')
[1] 1
> stringdist("ü","u",method='lv',useBytes=TRUE)
[1] 2

On my system (Linux) strings are encoded in utf8, where u-umlaut is a 2-byte character [**]. Since RecordLinkage treats strings byte-by-byte, it 'sees' two characters. It needs to substitute one and delete another, yielding a distance of 2. stringdist sees u-umlaut as a single character, but this may be turned off with the useBytes switch.

The bottom line is that it depends on the application which solution is best for your purpose. Here are some quick rules of thumb:

  • Comparing strings with characters possibly outside of the ASCII range, with results that need to be portable: stringdist.
  • Comparing single-byte encoded strings (like ASCII): for strings up to ~100 characters, stringdist with useBytes=TRUE; otherwise RecordLinkage.
  • Distances other than Levenshtein or Jaro-Winkler: stringdist.

The latest version of stringdist (0.7.0) was accepted on CRAN yesterday (06.09.2013) and comes with the useBytes option in all functions. Also read the NEWS for bugfixes and new features.

The latest version always comes fully packed with the newest, most advanced bugs that offer game-changing insights[***] into the cranial processes of the programmer, so please don't forget to drop me a line when you find one.

[*] By the way, you can do really cool things with oprofile under Linux if you want to profile C code linked to R, but that's for another blog post.
[**] Well, there are several ways to represent u-umlaut in utf8, but let's not get too technical.
[***] Buzzword bingo!


Approximate string matching in R

I have released a new version of the stringdist package.

Besides some new string distance algorithms, it now contains two convenient matching functions:

  • amatch: Equivalent to R's match function but allowing for approximate matching.
  • ain: Similar to R's %in% operator.

# here's an example of amatch
> x <- c('foo', 'bar')
> amatch('fu',x,maxDist=2)
[1] 1

# if we decrease the maximum allowed distance, we get
> amatch('fu',x,maxDist=1)
[1] NA

# just like with 'match' you can control the output of no-matches:
> amatch('fu',x,maxDist=1,nomatch=0)
[1] 0

# to see if 'fu' matches approximately with any element of x:
> ain('fu',x)
[1] FALSE

# however, if we allow for larger distances
> ain('fu',x,maxDist=2)
[1] TRUE

Check the help file for other options, like how to choose the string distance algorithm.

Note that previously, stringdist and stringdistmatrix returned -1 if a distance was undefined or exceeded a predefined maximum. Now, these functions return Inf in such cases, making it easier to do comparisons. This may break your code if you explicitly test the output for -1.
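
A small sketch of the new behaviour, using the hamming distance (which is undefined for strings of unequal length):

stringdist("foo", "fooo", method = "h")              # now Inf instead of -1
is.finite(stringdist("foo", "fooo", method = "h"))   # easy to test for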

With the latest release also arrive the latest bugs, so please drop me a line if you happen to stumble upon one.

The next release will probably not include any user-facing changes, but I'm planning to improve performance by smarter memory allocation and better maxDist handling for some of the string distance algorithms. I've also started working on a 'useBytes' option, which gives considerable performance improvement for edit-based distances, at the cost of getting encoding-dependent results.


The stringdist package

String metrics have important applications in web search, spelling correction and computational biology amongst others. Many different metrics exist, but the most well-known are based on counting the number of basic edit operations it takes to turn one string into another.

String distance functions seem to have been partly missing and partly scattered around R and CRAN. For example, the generalized Levenshtein distance (aka restricted Damerau-Levenshtein distance) is implemented in R's native adist function as well as in the RecordLinkage package. The latter also implements the Jaro-Winkler distance.

I've just published a package that (re-)implements four different string metrics and offers them through a uniform interface:

  • Hamming distance: for strings of equal size only; counts the number of different characters.
  • Levenshtein distance: counts the weighted number of deletions, insertions and substitutions.
  • Restricted Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions (character swaps); each character may be transposed only once.
  • True Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions.

As far as I know, no weighted Damerau-Levenshtein distance existed in R before (but note that the restricted Damerau-Levenshtein distance is sometimes mistaken for the true DL-distance on the web - including in our own deducorrect package). The metrics mentioned above have been reimplemented in C. In one case I borrowed some C code from the web and altered it to my liking (check the repo for the reference).

The package offers two basic interfaces:

  • stringdist: computes pairwise distances between character vectors, where the shorter one is recycled.
  • stringdistmatrix: computes the full distance matrix, optionally using multiple cores.

See the built-in manual for more details.
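
A quick sketch of both interfaces (using the default distance method):

library(stringdist)
# element-wise distances; the shorter vector is recycled
stringdist(c("foo", "bar", "boo"), "baz")
# the full distance matrix between two character vectors
stringdistmatrix(c("foo", "bar", "boo"), c("baz", "buz"))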

I'm planning to add more distance metrics in the future, and I'm happy to receive suggestions, comments, bug reports, etc.

The github repo is here and the CRAN page is here.
