String metrics have important applications in web search, spelling correction and computational biology amongst others. Many different metrics exist, but the most well-known are based on counting the number of basic edit operations it takes to turn one string into another.
String distance functions seem to have been partly missing and partly scattered around R
and CRAN. For example, the generalized (i.e. weighted) Levenshtein distance is implemented in R's native adist
function as well as in the RecordLinkage package. The latter also implements the Jaro-Winkler distance.
I've just published a package, stringdist, that (re-)implements four different string metrics and offers them through a uniform interface:
- Hamming distance: for strings of equal size only; counts the number of different characters.
- Levenshtein distance: counts the weighted number of deletions, insertions and substitutions.
- Restricted Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions (character swaps); each character may be transposed only once.
- True Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions, without that restriction.
As far as I know, no weighted Damerau-Levenshtein distance existed in R before (but note that the restricted Damerau-Levenshtein distance is sometimes mistaken for the true DL-distance on the web, including in our own deducorrect package). The metrics mentioned above have been reimplemented in C. In one case I borrowed some C code from the web and altered it to my liking (check the repo for the reference).
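To see the difference between the restricted and the true variant, here's a minimal sketch (method names as listed in the package manual, with 'osa', for optimal string alignment, denoting the restricted variant):

library(stringdist)
# turning 'ca' into 'abc': after swapping 'c' and 'a', the restricted variant
# may not edit the swapped characters again, so it needs three operations
stringdist("ca", "abc", method = "lv")   # 3: plain Levenshtein
stringdist("ca", "abc", method = "osa")  # 3: restricted Damerau-Levenshtein
stringdist("ca", "abc", method = "dl")   # 2: true DL ('ca' -> 'ac' -> 'abc')
stringdist("abcd", "abdc", method = "hamming")  # 2: equal-length strings only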
The package offers two basic interfaces:
- stringdist: computes pairwise distances between character vectors, where the shorter one is recycled.
- stringdistmatrix: computes the full distance matrix, optionally using multiple cores.
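For example (a minimal sketch of both interfaces):

library(stringdist)
# pairwise distances; the single key is recycled over the longer vector
stringdist("foo", c("foo", "fu", "bar", "foobar"))  # 0 2 3 3
# the full 2 x 3 distance matrix
stringdistmatrix(c("foo", "bar"), c("fu", "bar", "foobar"))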
See the built-in manual for more details.
I'm planning to add more distance metrics in the future and I'm happy to receive suggestions, comments, bug reports, etc.
Mark,
As you don't list it among related packages, let me draw your attention to the 'vwr' package by Emmanuel Keuleers. As far as I know it doesn't implement the weighted Damerau-Levenshtein distance either, but it is otherwise a convenient and very easy-to-use package.
Thanks! I'll look into that.
Hi,
thanks for your work!
Two suggestions: you might want to allow the user to specify missing data (e.g. '-'), so those can be excluded from the computation. Also, it would be nice if the package generated a distance matrix, similar to what dist(x) does for numerical matrices. Having those functions would be very handy.
Ruprecht
Hey Ruprecht, (sorry for the late reply, I've been traveling)
The package has a 'stringdistmatrix' function that generates the distance matrix. I wanted to separate out the use case where you may want to recycle the shortest character vector, e.g. when matching a word against a vector of possibilities.
About the NA request: I consider this a part of "string normalisation", which is something that is better done separately from the distance calculation. The tools in this package are meant to be pretty low-level.
After all, it is easy for the user to do
x[grepl("-", x, fixed = TRUE)] <- NA

Having said that, I'll probably add some normalisation tools in the future, depending on what's already out there (e.g. in the tm package).
By the way: version 0.5 was submitted to CRAN yesterday.
I am trying to match two data sets (~65,000 and ~180,000 records) across a number of biographical fields (name (first, last), DOB, address, etc.). The data, as you can imagine, are "dirty". I am very much a novice with this stuff, so my knowledge of the application isn't very deep. I would like to identify at least 3 near-matches for each observation. Will this package allow me to do this? Any suggestions would be greatly appreciated as well.
Hi Jason,
Yes it can. With the newest version (0.6 uploaded just now and hopefully on CRAN soon) also comes the 'amatch' function which resembles R's match function but allows for inexact matches.
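For example (a minimal sketch; the maxDist argument is per the built-in manual):

library(stringdist)
# like match(), but tolerating up to maxDist edit operations
amatch("jasen", c("jason", "mason", "susan"), maxDist = 1)  # 1: 'jason' is one substitution away
amatch("jasen", c("mason", "susan"), maxDist = 1)           # NA: nothing close enough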
Your main problem will be the large number of comparisons to make: just using stringdistmatrix will produce a 65k by 180k numerical array. That's about 1.2E10 elements which, at 8 bytes per double, gobbles up roughly 90GB of RAM.
To get the closest three matches of a key in a vector of keys, you can do something like this:
x <- 'hello'
y <- c('world!', 'hello', 'this', 'is', 'foobar')
m <- stringdistmatrix(x, y)
y[order(m, decreasing = FALSE)[1:3]]

You may also want to have a look at the RecordLinkage package, which uses the Jaro-Winkler distance for approximate matching.
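For the full problem you could loop over the keys and keep only the top three per key, so the 65k by 180k matrix never has to exist in memory all at once (a sketch, assuming per-key scoring is fast enough):

library(stringdist)
closest3 <- function(key, candidates) {
  d <- stringdist(key, candidates)  # one row of the big matrix at a time
  candidates[order(d)[1:3]]
}
closest3("jasen", c("jason", "mason", "jay", "susan"))  # "jason" "mason" "jay"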
Hi Mark
What a pity the package is not working in RStudio because of the R version compatibility!
Otherwise, very useful! Thanks
Hi Leila,
The package works fine with RStudio. You do need to have R >= 2.15.3 installed, because a bug in R's native 'utf8ToInt' in older versions interferes with the package.
Cheers,
Mark