Author Archives: mark

stringdist 0.9.4 and 0.9.3: distances between integer sequences

A new release of stringdist has been accepted on CRAN. stringdist offers a number of popular distance functions between sequences of integers or characters that are independent of character encoding. Version 0.9.4 bugfix: edge case for zero-size for lower tridiagonal …

Posted in programming, R, string metrics

stringdist 0.9.2: dist objects, string similarities and some deprecated arguments

On 24-06-2015 stringdist 0.9.2 was accepted on CRAN. A summary of new features can be found in the NEWS file; here I discuss the changes with some examples. Computing 'dist' objects with 'stringdistmatrix': The R dist object is used as …

Posted in programming, R, string metrics

stringdist 0.9: exercise all your cores

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R. Secondly, stringdist now employs multithreading …

Posted in programming, R, string metrics | 4 Comments

Easy to use option settings management with the 'settings' package

Last week I released a new package called settings. It grew out of my frustration built up during several small projects where I'm generating heavily parameterized d3/js output. What I wanted was support to define a whole bunch of option …

Posted in programming, R

stringdist 0.8: now with soundex

An update to the stringdist package was released earlier this month. Thanks to a contribution by Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to translate words …

Posted in data correction methods, data manipulation, R, string metrics

I came across this post on SO, where several solutions to sorting data.frames are presented. It must have been solved a million times, but here's a solution I like to use. It benefits from the fact that sort is an …

Posted in data manipulation, R | 5 Comments

Review of "Building interactive graphs with ggplot2 and shiny"

Recently, Packt published a video course with the above title, and I've just spent a pleasant morning reviewing it at Packt's request. Pleasant, because I think the course gives an excellent introduction to both ggplot2 and shiny. The course is …

Posted in R

A bit of benchmarking with string distances

After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are about two to three times faster. His benchmark compares randomly generated character strings …

Posted in R, string metrics | 4 Comments

Approximate string matching in R

I have released a new version of the stringdist package. Besides some new string distance algorithms, it now contains two convenient matching functions: amatch: Equivalent to R's match function but allowing for approximate matching. ain: Similar to R's %in% …

Posted in data correction methods, R, string metrics | 12 Comments

The stringdist package

String metrics have important applications in web search, spelling correction and computational biology, among others. Many different metrics exist, but the best known are based on counting the number of basic edit operations it takes to turn one string into …

Posted in R, string metrics | 8 Comments