The stringdist package

String metrics have important applications in web search, spelling correction and computational biology, among other fields. Many different metrics exist, but the best known are based on counting the number of basic edit operations it takes to turn one string into another.

String distance functions seem to have been partly missing and partly scattered around R and CRAN. For example, the generalized Levenshtein distance (weighted insertions, deletions and substitutions, without transpositions) is implemented in R's native adist function, and a Levenshtein distance is also available in the RecordLinkage package. The latter also implements the Jaro-Winkler distance.

I've just published a package that (re-)implements four different string metrics and offers them through a uniform interface:

  • Hamming distance: for strings of equal size only; counts the number of positions at which the characters differ.
  • Levenshtein distance: counts the weighted number of deletions, insertions and substitutions.
  • Restricted Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions (swaps of adjacent characters); each character may be transposed only once.
  • True Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions, without that restriction.
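
To make the first three definitions concrete, here is a minimal Python sketch. It is not the package's C code, only an illustration of the textbook dynamic-programming recurrences. The function names (hamming, edit_distance) and the weight parameters (w_del, w_ins, w_sub, w_trn) are made up for this example and are not part of stringdist.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b, w_del=1.0, w_ins=1.0, w_sub=1.0, w_trn=None):
    """Weighted edit distance from a to b by dynamic programming.

    With w_trn=None only deletions, insertions and substitutions are
    allowed (Levenshtein). With a numeric w_trn, swapping two adjacent
    characters is allowed as well, but the swapped pair cannot be edited
    again (restricted Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the cost of turning a[:i] into b[:j].
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * w_del
    for j in range(1, cols):
        d[0][j] = j * w_ins
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            best = min(
                d[i - 1][j] + w_del,                         # delete a[i-1]
                d[i][j - 1] + w_ins,                         # insert b[j-1]
                d[i - 1][j - 1] + (0.0 if same else w_sub),  # substitute
            )
            if (w_trn is not None and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + w_trn)    # swap neighbours
            d[i][j] = best
    return d[-1][-1]


print(hamming("karolin", "kathrin"))          # 3
print(edit_distance("ab", "ba"))              # 2.0, two substitutions
print(edit_distance("ab", "ba", w_trn=1.0))   # 1.0, one transposition
```

With unit weights, swapping two neighbouring characters costs two edits under Levenshtein and one under the restricted Damerau-Levenshtein distance, which is the practical difference between the two.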

As far as I know, no weighted Damerau-Levenshtein distance existed in R before (but note that the restricted Damerau-Levenshtein distance is sometimes mistaken for the true DL-distance on the web, including in our own deducorrect package). The metrics mentioned above have been reimplemented in C. In one case I borrowed some C code from the web and altered it to my liking (check the repo for the reference).
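
The difference between the restricted and the true distance shows up when characters have to be edited after a swap. The Python sketch below implements the usual textbook algorithm for the unrestricted distance with unit weights. It is an illustration only and not the C code in the package, and the name damerau_levenshtein is invented for this example.

```python
def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance with unit weights."""
    last_row = {}              # last row of a in which each character was seen
    inf = len(a) + len(b)      # larger than any achievable distance
    # The table carries one extra sentinel row and column filled with inf,
    # so d[i + 1][j + 1] is the distance between a[:i] and b[:j].
    d = [[inf] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0           # last column in this row where characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                            # substitute or match
                d[i + 1][j] + 1,                           # insert
                d[i][j + 1] + 1,                           # delete
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),   # transpose, with edits in between
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(damerau_levenshtein("ca", "abc"))   # 2
```

For "ca" and "abc" the true distance is 2 (swap to "ac", then insert "b" between the swapped characters), while the restricted variant cannot touch the swapped pair again and needs 3 edits.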

The package offers two basic interfaces:

  • stringdist: computes pairwise distances between two character vectors, where the shorter one is recycled.
  • stringdistmatrix: computes the full distance matrix, optionally using multiple cores.
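
The difference between the two interfaces is easiest to see side by side. The Python sketch below mimics their semantics as described above, under the assumption that recycling means repeating the shorter vector until it is as long as the longer one. The helpers pairwise and distance_matrix are hypothetical names for this illustration, not the package's functions, and the sketch runs on a single core.

```python
from itertools import cycle, islice


def levenshtein(a, b):
    """Plain Levenshtein distance with unit weights (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def pairwise(a, b, dist):
    """Element-by-element distances; the shorter list is recycled."""
    n = max(len(a), len(b)) if a and b else 0
    return [dist(x, y) for x, y in zip(islice(cycle(a), n), islice(cycle(b), n))]


def distance_matrix(a, b, dist):
    """Full matrix: one row per element of a, one column per element of b."""
    return [[dist(x, y) for y in b] for x in a]


print(pairwise(["abc", "abd", "xyz"], ["abc"], levenshtein))
# [0, 1, 3]
print(distance_matrix(["ab", "ba"], ["ab", "bc", "abc"], levenshtein))
# [[0, 2, 1], [2, 1, 2]]
```

The pairwise form returns one distance per position, while the matrix form returns one distance per combination of elements, so its size grows with the product of the two vector lengths.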

See the built-in manual for more details.

I'm planning to add more distance metrics in the future and I'm happy to receive suggestions, comments, bug reports etc.

The GitHub repo is here and the CRAN page is here.


2 Responses to The stringdist package

  1. John R. Vokey says:

    Mark,
    As you don't list it among related packages, let me draw your attention to the `vwr' package by Emmanuel Keuleers. AFAIK, I don't think it implements the weighted Damerau-Levenshtein distance either, but is otherwise a convenient and very easy-to-use package.
