The stringdist package

String metrics have important applications in web search, spelling correction and computational biology amongst others. Many different metrics exist, but the most well-known are based on counting the number of basic edit operations it takes to turn one string into another.

String distance functions seem to have been partly missing and partly scattered around R and CRAN. For example, the generalized Levenshtein distance (aka restricted Damerau-Levenshtein distance) is implemented in R's native adist function as well as in the RecordLinkage package. The latter also implements the Jaro-Winkler distance.

I've just published a package that (re-)implements four different string metrics and offers them through a uniform interface:

  • Hamming distance: for strings of equal size only; counts the number of different characters.
  • Levenshtein distance: counts the weighted number of deletions, insertions and substitutions.
  • Restricted Damerau-Levenstein: counts the weighted number of deletions, insertions, substitutions and transpositions (character swaps); each character may be transposed only once.
  • True Damerau-Levenshtein distance counts the weighted number of deletions, insertions, substitutions and transpositions.

As far as I know, no weighted Damerau-Levenshtein distance existed in R before (but note that the restricted Damerau-Levenshtein distance is sometimes mistaken for the true DL-distance on the web - including in our own deducorrect package). The metrics mentioned above have been reimplemented in C. In one case I borrowed some C-code from the web and altered it to my liking (check the repo) for the reference.

The package offers two basic interfaces:

  • stringdist computes pairwise distance between character vectors,where the shorter one is recycled.
  • stringdistmatrix: computes the full distance matrix, optionally using multiple cores.

See the built-in manual for more details.

I'm planning to add more distance metrics in the future and I'm happy to receive suggestions, comments, bugreports etc.

The github repo is here and the CRAN page is here.

This entry was posted in R, string metrics. Bookmark the permalink.

6 Responses to The stringdist package

  1. John R. Vokey says:

    Mark,
    As you don't list it among related packages, let me draw your attention to the `vwr' package by Emmanuel Keuleers. AFAIK, I don't think it implements the weighted Damerau-Levenshtein distance either, but is otherwise a convenient and very easy-to-use package.

  2. Ruprecht says:

    Hi,
    thanks for your work!
    Two suggestions: you might want to allow the user to specify missing data (i.e. '-'), so those can be excluded in the computation. Also, it would be nice if the package generated a distance matrix, similar as dist(x) does for numerical matrices. To have those functions would be very handy.
    Ruprecht

    • mark says:

      Hey Ruprecht, (sorry for the late reply, I've been traveling)

      The package has a 'stringdistmatrix' function that generates the distance matrix. I wanted to separate the use case where you may want to recycle the shortes character vector, e.q. when matching a word against a vector of possibilities.

      About the NA request. I consider this a part of "string normalisation" which is something that is better done separate from the distance calculation. The tools in this package are ment to be pretty low-level.

      After all, it is easy for the user to do

      x[grepl("-",x,fixed=TRUE)] <- NA

      Having said that, I'll probably add some normalisation tools in the future, depending on what's already out there (e.g. in the tm package).

      btw: version 0.5 was submitted to CRAN yesterday.

  3. Jason says:

    I am trying to match two data sets (~65,000 and ~180,000) across a number of biographical fields (Name (first, last), DOB, address, etc.). The data as you can imagine is "dirty". I am very much a novice with this stuff, so knowledge of application isn't very deep. I would like to identify at least 3 near-matches for each observation. Will this package allow me to do this? Any suggestions would greatly be appreciated as well.

    • mark says:

      Hi Jason,

      Yes it can. With the newest version (0.6 uploaded just now and hopefully on CRAN soon) also comes the 'amatch' function which resembles R's match function but allows for inexact matches.

      Your main problem will be the large number of comparisons to make: just using stringdistmatrix will produce a 65k times 180k numerical array. That's about 1.2E10 elements, gobbling up about 10G of RAM.

      To get the closest three matches of a key in a vector of keys, you can do something like this:

      x = 'hello'
      y = c('world!', 'hello', 'this', 'is', 'foobar')
      m <- stringdistmatrix(x,y)
      y[order(x,dercreasing=FALSE)[1:3]]

      You may also want to have a look at the RecordLinkage package which uses the Jaro-Winkler distance for approximate matching.

Leave a Reply

Your email address will not be published. Required fields are marked *


2 + six =


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">