Category Archives: string metrics

stringdist 0.9.5.1: now with C API

Version 0.9.5.1 of stringdist is on CRAN. The main new feature, with a huge thanks to our awesome new contributor Chris Muir, is that we made it easy to call stringdist functionality from your package's C or C++ code. The … Continue reading

Posted in data cleaning, programming, R, string metrics | Leave a comment

stringdist 0.9.4.2 released

stringdist 0.9.4.2 was accepted on CRAN at the end of last week. This release fixes a few bugs affecting the stringdistmatrix function when it is called with a single argument. From the NEWS file: bugfix in stringdistmatrix(a): value of p, for … Continue reading

Posted in programming, R, string metrics | 2 Comments

stringdist 0.9.4 and 0.9.3: distances between integer sequences

A new release of stringdist has been accepted on CRAN. stringdist offers a number of popular distance functions between sequences of integers or characters that are independent of character encoding. version 0.9.4 bugfix: edge case for zero-size for lower tridiagonal … Continue reading

Posted in programming, R, string metrics | Leave a comment

stringdist 0.9.2: dist objects, string similarities and some deprecated arguments

On 24-06-2015 stringdist 0.9.2 was accepted on CRAN. A summary of new features can be found in the NEWS file; here I discuss the changes with some examples. Computing 'dist' objects with 'stringdistmatrix' The R dist object is used as … Continue reading

Posted in programming, R, string metrics | Leave a comment

stringdist 0.9: exercise all your cores

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First, encoding conversion got a lot faster because it is now done in C rather than in R. Second, stringdist now employs multithreading … Continue reading

Posted in programming, R, string metrics | 4 Comments

stringdist 0.8: now with soundex

An update to the stringdist package was released earlier this month. Thanks to a contribution by Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to translate words … Continue reading

Posted in data correction methods, data manipulation, R, string metrics | Leave a comment

A bit of benchmarking with string distances

After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are about two to three times faster. His benchmark compares randomly generated character strings … Continue reading

Posted in R, string metrics | 4 Comments

Approximate string matching in R

I have released a new version of the stringdist package. Besides some new string distance algorithms, it now contains two convenient matching functions: amatch: Equivalent to R's match function but allowing for approximate matching. ain: Similar to R's %in% … Continue reading

Posted in data correction methods, R, string metrics | 12 Comments

The stringdist package

String metrics have important applications in web search, spelling correction and computational biology, among others. Many different metrics exist, but the best known are based on counting the number of basic edit operations it takes to turn one string into … Continue reading

Posted in R, string metrics | 8 Comments