Approximate string matching in R

I have released a new version of the stringdist package.

Besides some new string distance algorithms, it now contains two convenient matching functions:

  • amatch: Equivalent to R's match function, but allowing for approximate matching.
  • ain: Similar to R's %in% operator, but allowing for approximate matching.

# here's an example of amatch
> x <- c('foo', 'bar')
> amatch('fu',x,maxDist=2)
[1] 1

# if we decrease the maximum allowed distance, we get no match
> amatch('fu',x,maxDist=1)
[1] NA

# just like with 'match' you can control the output of no-matches:
> amatch('fu',x,maxDist=1,nomatch=0)
[1] 0

# to see if 'fu' matches approximately with any element of x:
> ain('fu',x,maxDist=1)
[1] FALSE

# however, if we allow for larger distances
> ain('fu',x,maxDist=2)
[1] TRUE

Check the helpfile for other options, like how to choose the string distance algorithm.
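For example (a sketch; argument names as in the current helpfiles), the distance algorithm is selected with the method argument:

```r
# assuming the 'stringdist' package is installed
library(stringdist)

x <- c('foo', 'bar')
# the default algorithm is optimal string alignment ('osa');
# plain Levenshtein distance can be requested explicitly:
amatch('fu', x, maxDist = 2, method = 'lv')
# [1] 1
```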

Note: previously, stringdist and stringdistmatrix returned -1 if a distance was undefined or exceeded a predefined maximum. These
functions now return Inf in such cases, making it easier to do comparisons. This may break your code if you explicitly test the output for -1.
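The practical upshot: with Inf, threshold comparisons work without special-casing the "no distance" value. A base-R sketch of the difference:

```r
# new convention: Inf marks an undefined or over-maximum distance
d_new <- c(1, 3, Inf)
d_new <= 2               # FALSE for the undefined case, automatically

# old convention: -1 required an explicit guard to avoid a false match
d_old <- c(1, 3, -1)
d_old <= 2 & d_old >= 0  # same result, but only with the extra test
```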

With the latest release also arrive the latest bugs, so please drop me a line if you happen to stumble upon one.

The next release will probably not include any user-facing changes, but I'm planning to improve performance by smarter memory allocation and better maxDist handling for some of the string distance algorithms. I've also started working on a 'useBytes' option, which gives considerable performance improvement for edit-based distances, at the cost of getting encoding-dependent results.

This entry was posted in data correction methods, R, string metrics.

12 Responses to Approximate string matching in R

  1. Matthieu says:

Great! Thanks a lot for this, it is a feature I have personally been waiting for for a long time!

    Just one issue, which is quite tricky to solve: the case where there are multiple matches. I understand it returns only the first match, but in some cases it might be useful to return all matches?

    I understand, however, that this would create a problem when x has length greater than 1, as there would then be confusion; maybe returning a list (or a data frame) would then be appropriate?


  2. mark says:

    Hi Matthieu,

    You can find multiple matches by using the stringdist function directly. For example:

    x <- "abc"
    y <- c("abc", "pqr", "bac")
    stringdist(x, y) <= 2
    # or:
    which(stringdist(x, y) <= 2)

    You do have to handle NA's yourself in this case. Cheers!
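    A vectorized variant of the same idea can be sketched with stringdistmatrix (the helper name amatch_all is made up, not part of the package):

```r
library(stringdist)

# hypothetical helper: for each element of x, return the indices of ALL
# elements of 'table' within maxDist, not just the first as amatch does
amatch_all <- function(x, table, maxDist = 2) {
  d <- stringdistmatrix(x, table)   # one row per element of x
  lapply(seq_len(nrow(d)), function(i) which(d[i, ] <= maxDist))
}

amatch_all(c("fu", "zzz"), c("foo", "bar", "fu"), maxDist = 2)
```

    This returns a list with one vector of matching indices per query string, which addresses the length-greater-than-one ambiguity Matthieu mentions.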

  3. Bob Muenchen says:

    Thanks for these handy functions Mark!

    Bob Muenchen

  4. First of all, thank you for releasing this package. It's nice to have so many string distance implementations in one place.

    I compared your functions against strcmp in the RecordLinkage package, which is my current go-to package for string distance. On my machine, your implementation of Levenshtein distance seems to be about 2x slower than RecordLinkage, and your implementation of Jaro-Winkler seems to be about 3x slower than RecordLinkage. Not to be critical of a new package that I already love, but it might be worth taking a look at the source code for the RecordLinkage package for some ideas for increasing performance.

    Also, it seems that the RecordLinkage implementation of jaro-winkler distance allows for weights for insertion, deletion, and substitution. I'm not too familiar with the particulars of this algorithm, but do you plan to implement weights when method='jw'?

    Here's the code for my comparison:

    require('RecordLinkage'); require('stringdist'); require('rbenchmark')
    x <- sapply(sample(5:25, 1e5, replace=TRUE), function(x) paste(sample(letters, x, replace=TRUE), collapse=''))
    y <- sapply(sample(5:25, 1e5, replace=TRUE), function(x) paste(sample(letters, x, replace=TRUE), collapse=''))

    benchmark(
      dist1 <- levenshteinDist(x, y),
      dist2 <- stringdist(x, y, method='lv'),
      columns=c('test', 'elapsed', 'relative'))
    all.equal(dist1, dist2)

    benchmark(
      sim1 <- jarowinkler(x, y),
      sim2 <- 1-stringdist(x, y, method='jw'),
      columns=c('test', 'elapsed', 'relative'))
    all.equal(sim1, sim2)

    • mark says:

      Hi Zachary,

      First of all, thanks for the response. I am very happy to see that you're benchmarking it; this is something that I have not done yet. Also: the more people test the package, the more reliable it gets, so thanks!

      I will for sure have a look at the RecordLinkage source code. I have not profiled my code yet so I'm curious where the speed difference comes from. My code does character-wise comparisons (not byte-wise) and therefore all strings are re-encoded to utf8 and then to integers (this seems to be the only platform-independent way in R). I could imagine that this double conversion gobbles up some time.

      I will implement the weights for the Jaro-Winkler distance as well. I've added an issue to the repo so it won't be forgotten.


      • mark says:

        Ok, just a quick test: RecordLinkage does seem to handle non-ASCII characters byte-wise:

        # the function from RecordLinkage
        > levenshteinDist("ü","u")
        [1] 2
        # stringdist
        > stringdist("ü","u",method='lv')
        [1] 1

        RecordLinkage gives a distance of two since u-umlaut is a 2-byte character (in utf8): you need to delete one byte and substitute the other to turn it into an ASCII 'u'.

        stringdist looks at u-umlaut as a single character and measures that you need only a single substitution, which probably makes more sense in many applications.
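        The character/byte distinction can be seen directly in base R (the byte count assumes the string is stored as UTF-8, which is what the \u escape produces):

```r
# "ü" is one character, but two bytes in UTF-8
u <- "\u00fc"                 # u-umlaut, written portably as an escape
nchar(u)                      # counts characters: 1
nchar(u, type = "bytes")      # counts bytes: 2
```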

        I have been thinking about a useBytes option: this would save some time in the conversion to integers. However, the distances would become OS-dependent: R uses the locale encoding internally, which is usually 'latin1' on Windows machines and UTF-8 on unix-like OSes.

        • Thanks for explaining the difference between byte-wise and character-wise comparisons. I did not realize that could be an issue until now!

          Personally, I'd love to see a useBytes option, perhaps with a warning about encodings. Sometimes speed matters more than accuracy.

  5. Hi Mark,

    if you are curious why you have been mentioned with Cosmo Kramer in the same paragraph, then check this out:

    Just kidding 😉

    Well, thanks for this very useful package! I would feel honored if you'd find the time to comment or give me feedback/corrections on the article.



  6. Pingback: A bit of benchmarking with string distances | Mark van der Loo

  7. Shivi says:

    Hi Mark,
    Thanks for the new version.
    I tried using amatch on a data which had more than 1 million records.

    I am using this on a data which has duplicate consumer name. PFA the syntax:
    aa<-read.csv("consignee.csv",header = TRUE)
    ab<-amatch('safex',aa$name,maxDist=4)

    Here I am trying to find all those cases of consumer name which contain 'safex'. Two things:
    1 - As the data is huge, this code is taking a lot of time; is there a way I can improve its performance?
    2 - This is just one case. I have multiple cases of incorrect consumer names entered in the system. Is there a way I can add multiple names at once, or is there an alternative?

    Thanks, Shivi

    • mark says:

      Hi Shivi,

      If you are looking for consumer names that contain a certain substring, you could use 'grep' or 'grepl', and not amatch.

      The call

      i <- grepl("safex",aa$name)

      stores a logical vector 'i' that is 'TRUE' where aa$name includes 'safex'.

      The str_match function of the stringr package can handle a number of search patterns.

      To improve performance of amatch: buy a bigger computer :). Seriously, amatch (and all other stringdist functions) use multiple cores, so if you have a chance to run your code on bigger machines then you will get speedup.
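      For searching several (sub)strings at once with base R, the patterns can be combined into a single regular expression (a sketch with made-up example data):

```r
name    <- c("safex ltd", "safx corp", "acme inc")   # made-up consumer names
targets <- c("safex", "acme")                        # names to look for

# combine the targets into one alternation pattern: "safex|acme"
hits <- grepl(paste(targets, collapse = "|"), name)
hits
# [1]  TRUE FALSE  TRUE
```

      For approximate rather than substring matching of several names, note that amatch's first argument is vectorized, so something like amatch(targets, aa$name, maxDist = 4) looks them all up in one call.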
