After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are roughly two to three times faster. His benchmark compares randomly generated character strings of 5-25 characters, which probably covers many use cases involving natural language. That is, if your language happens to be written in a single-byte encoding, but more on that below. Here's Zachary's benchmark (rerun by myself).
library(rbenchmark)
library(RecordLinkage)
library(stringdist)
> x <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> y <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> benchmark(
+ d1 <- stringdist(x,y,method='lv')
+ , d2 <- levenshteinDist(x,y)
+ , replications=10
+ )
test replications elapsed relative user.self
1 d1 <- stringdist(x, y, method = "lv") 10 4.939 1.876 4.924
2 d2 <- levenshteinDist(x, y) 10 2.633 1.000 2.632
Ouch, indeed.
As we all know, premature optimization is the root of all evil so I felt it was time for some mature optimization.
Checking the differences between stringdist and RecordLinkage, I found that RecordLinkage uses C code that works directly on the char representation of strings, while I designed stringdist to make sure that multibyte characters are treated as a single character. This is done by converting them to integers before feeding them to my underlying C routines (using native R functionality). Hence, there is some conversion overhead that is linear in string length, while the time spent on computing the Levenshtein distance grows with the product of the lengths of the strings compared (say, quadratically). This is easily checked by benchmarking over longer strings.
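To give a rough idea of what that conversion amounts to, base R's utf8ToInt() turns a string into a vector of integer code points. This is just a sketch of the kind of per-string work involved, not stringdist's actual internal code:
> utf8ToInt("abc")
[1] 97 98 99
> utf8ToInt("ü") # a single code point, even though it takes two bytes in utf8
[1] 252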
> x <- sapply(sample(25:250,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> y <- sapply(sample(25:250,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
>
> benchmark(
+ d1 <- stringdist(x,y,method='lv')
+ , d2 <- levenshteinDist(x,y)
+ , replications=10
+ )
test replications elapsed relative user.self
1 d1 <- stringdist(x, y, method = "lv") 10 111.668 1.258 111.584
2 d2 <- levenshteinDist(x, y) 10 88.784 1.000 88.716
And indeed, the relative difference decreased from a factor of 1.8 to 1.25. So, I rolled up my sleeves and built in a useBytes option that allows one to skip the conversion step. Here's the result on Zachary's original benchmark.
> x <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> y <- sapply(sample(5:25,1e5,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
>
>
> benchmark(
+ d1 <- stringdist(x,y,method='lv')
+ , d2 <- stringdist(x,y,method='lv',useBytes=TRUE)
+ , d3 <- levenshteinDist(x,y)
+ , replications=10
+ )
test replications elapsed
1 d1 <- stringdist(x, y, method = "lv") 10 4.806
2 d2 <- stringdist(x, y, method = "lv", useBytes = TRUE) 10 1.463
3 d3 <- levenshteinDist(x, y) 10 2.942
Yowza! Not only did I get rid of some overhead, stringdist is also about twice as fast as RecordLinkage in this benchmark. I haven't really profiled this, but the main reason is probably that stringdist has a better memory allocation strategy[*].
We can also check this for the Jaro-Winkler distance, which is also implemented by RecordLinkage.
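The benchmark call itself is not shown below, only the rbenchmark output; it presumably followed the same pattern as before, along these lines (a sketch, not a verbatim copy of the original call):
> benchmark(
+ d1 <- stringdist(x,y,method='jw')
+ , d2 <- stringdist(x,y,method='jw',useBytes=TRUE)
+ , d3 <- jarowinkler(x,y)
+ , replications=10
+ )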
# (output of rbenchmark)
1 d1 <- stringdist(x, y, method = "jw") 10 4.526
2 d2 <- stringdist(x, y, method = "jw", useBytes = TRUE) 10 0.513
3 d3 <- jarowinkler(x, y) 10 1.303
Again, stringdist (with useBytes=TRUE) is more than a factor of two faster.
But, there's always a but. To be fair we need to run the benchmark on longer strings as well, and that's where RecordLinkage comes back with a vengeance.
> x <- sapply(sample(50:250,1e4,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> y <- sapply(sample(50:250,1e4,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
>
>
> benchmark(
+ d1 <- stringdist(x,y,method='lv')
+ , d2 <- stringdist(x,y,method='lv',useBytes=TRUE)
+ , d3 <- levenshteinDist(x,y)
+ , replications=10
+ )
test replications elapsed
1 d1 <- stringdist(x, y, method = "lv") 10 12.557
2 d2 <- stringdist(x, y, method = "lv", useBytes = TRUE) 10 11.972
3 d3 <- levenshteinDist(x, y) 10 10.091
Here, stringdist is 10-20 percent slower again. I have not seriously profiled this either, but there is still some overhead in stringdist that's not in RecordLinkage: stringdist needs to select the algorithm, perform some more argument checking, and I still have to do a (very simple) conversion from char to int. Oh well, you can't win 'em all.
The main point for users is to decide whether the useBytes option is appropriate for the problem they're trying to solve. The problem with working directly on the bytes in a character string is that for non-ASCII characters, the results depend on the character encoding scheme (and therefore most likely on the OS you're using). Here's an example.
# the function from RecordLinkage
> levenshteinDist("ü","u")
[1] 2
# stringdist
> stringdist("ü","u",method='lv')
[1] 1
> stringdist("ü","u",method='lv',useBytes=TRUE)
[1] 2
On my system (Linux) strings are encoded in utf8, where u-umlaut is a 2-byte character[**]. Since RecordLinkage treats strings byte-by-byte, it 'sees' two characters: it needs to substitute one and delete another, yielding a distance of 2. stringdist sees u-umlaut as a single character, but this may be turned off with the useBytes switch.
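On such a utf8 system the same byte-versus-character distinction shows up in base R's nchar():
> nchar("ü")
[1] 1
> nchar("ü", type="bytes")
[1] 2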
The bottom line is that it depends on the application which solution is best for your purpose. Here are some quick rules of thumb:
- Comparing strings with characters possibly outside of the ASCII range, with results that need to be portable: stringdist.
- Comparing single-byte encoded strings (like ASCII): for strings up to ~100 characters, stringdist with useBytes=TRUE (a quick sanity check for this is sketched below); otherwise RecordLinkage.
- Distances other than Levenshtein or Jaro-Winkler: stringdist.
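If you are unsure whether useBytes is safe for a particular data set, a quick sanity check (a sketch using only the functions shown above) is to compare the default and byte-wise results on a sample of your strings; for pure-ASCII input they must agree, while strings containing multibyte characters may give different distances:
> # TRUE for ASCII-only input; a FALSE indicates multibyte characters matter here
> all(stringdist(x, y, method='lv') ==
+ stringdist(x, y, method='lv', useBytes=TRUE))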
The latest version of stringdist (0.7.0) was accepted on CRAN yesterday (06.09.2013) and comes with the useBytes option in all functions. Also read the NEWS for bug fixes and new features.
The latest version always comes fully packed with the newest, most advanced bugs that offer game-changing insights[***] into the cranial processes of the programmer, so please don't forget to drop me a line when you find one.
[*] By the way, you can do really cool things with oprofile under Linux if you want to profile C code linked to R, but that's for another blog.
[**] Well, there are several ways to represent u-umlaut in utf8, but let's not get too technical.
[***] Buzzword bingo!
Thanks for the update to the package! I'm very excited to see this continue to evolve.
Hi
Nice benchmark! For some time now there has been another function in the base package utils, adist(), which does similar things, computing however the distances between all pairs of strings.
And a (naive) benchmark shows it's quite a bit slower (maybe due to the mapply?):
benchmark(
+ d1 <- stringdist(x,y,method='lv'),
+ d2 <- levenshteinDist(x,y),
+ d3 <- mapply(adist, x,y),
+ replications=10)
test replications elapsed relative user.self
1 d1 <- stringdist(x, y, method = "lv") 10 0.386 2.573 0.384
2 d2 <- levenshteinDist(x, y) 10 0.150 1.000 0.152
3 d3 <- mapply(adist, x, y) 10 2.652 17.680 2.648
Hi Mat, thanks for the reply!
It is probably more honest to compare 'adist' with 'stringdistmatrix', but even then the stringdist package still outperforms 'adist' by a factor of ~1.3.
> x <- sapply(sample(5:25,1000,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
> y <- sapply(sample(5:25,1000,replace=TRUE), function(n) paste(sample(letters,n,replace=TRUE),collapse=""))
>
> benchmark(
+ d1 <- adist(x,y)
+ , d2 <- stringdistmatrix(x,y,method='lv')
+ , replications=1
+ )
test replications elapsed relative
1 d1 <- adist(x, y) 1 2.469 1.345
2 d2 <- stringdistmatrix(x, y, method = "lv") 1 1.836 1.000
user.self sys.self user.child sys.child
1 2.469 0.000 0 0
2 1.827 0.007 0 0