stringdist 0.9.5.1: now with C API

Version 0.9.5.1 of stringdist is on CRAN. The main new feature, with a huge thanks to our awesome new contributor Chris Muir, is that we made it easy to call stringdist functionality from your package's C or C++ code.

The main steps to get it done are:

  1. Make sure to add stringdist to the Imports: and LinkingTo: fields in your DESCRIPTION file (see the sketch after this list)
  2. Add #include <stringdist_api.h> to your C/C++ source file.
  3. Start using stringdist from C!
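
For example, the relevant DESCRIPTION fields might look like this (the package name and the version requirement are placeholders):

Package: mypkg
Imports: stringdist (>= 0.9.5.1)
LinkingTo: stringdist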

Here's an example source file

#include <R.h>
#include <Rdefines.h>
#include <stringdist_api.h>


SEXP my_soundex(SEXP strings, SEXP useBytes){
  Rprintf("\nWow, using 'stringdist' soundex encoding, from my own C code!\n");
  return sd_soundex(strings, useBytes);
}
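
On the R side of your package you could then call this routine along the following lines (a minimal sketch; the wrapper name and the PACKAGE argument 'mypkg' are placeholders, and native routine registration is left out):

# hypothetical R wrapper around the C routine defined above
my_soundex <- function(x, useBytes = FALSE) {
  .Call("my_soundex", as.character(x), as.logical(useBytes), PACKAGE = "mypkg")
}

my_soundex(c("stringdist", "soundex"))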

Great! How can I learn more?

  • The full API is described in a pdf file, generated with doxygen, that comes with the package. You can find it by typing ?stringdist_api on the R command line.
  • A minimal example package that links to stringdist is available on GitHub.
  • A more sophisticated package with more elaborate examples can be found here: refinr (by Chris).

Any other news?

There are a few bugfixes, and a couple of long-deprecated function arguments have finally been removed. Check out the NEWS file on CRAN for a complete overview.

Happy coding!


The use of R in official statistics conference 2018

On September 12-14, the 6th international conference on the use of R in official statistics (#uRos2018) will take place at the Dutch National Statistical Office in Den Haag, the Netherlands. The conference is aimed at producers and users of official statistics from government, academia, and industry. It is modeled after the useR! conference and will consist of one day of tutorials (12 September 2018) followed by two conference days (13-14 September 2018). Topics include:

  • Examples of applying R in statistical production.
  • Examples of applying R in dissemination of statistics (visualisation, apps, reporting).
  • Analyses of big data and/or application of machine learning for official statistics.
  • Implementations of statistical methodology in the areas of sampling, editing, modelling and estimation, or disclosure control.
  • R packages connecting R to other standard tools or technical standards.
  • Organisational and technical aspects of introducing R to the statistical office.
  • Teaching R to users in the office.
  • Examples of accessing or using official statistics publications with R in other fields.

    Keynote speakers
    We are very happy to announce that we have confirmed two fantastic keynote speakers.

  • Alina Matei is a professor of statistics at the University of Neuchâtel and maintainer of the important R package sampling.
  • Jeroen Ooms is a postdoc at UC Berkeley, author of many infrastructural R packages and maintainer of R and Rtools for Windows.

    Call for abstracts

    The call for abstracts is open until 31 May. You can contribute to the conference by proposing a 20-minute talk, or a 3-hour tutorial. Also, authors have the opportunity to submit a paper for one of the two journals that will devote a special issue to the conference. Read all about it over here.

    Pointers

  • conference website
  • Follow uRos2018 on twitter


    Track changes in data with the lumberjack %>>%

    So you are using a pipeline to have your data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:

    > data(retailers, package="validate")
    > head(retailers, 3)
      size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
    1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
    2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
    3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA
    

    This data is dirty, with missing values and errors. Let us do some imputations with simputation.

    > out <- retailers %>% 
    +   impute_lm(other.rev ~ turnover) %>%
    +   impute_median(other.rev ~ size)
    > 
    > head(out,3)
      size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
    1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
    2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
    3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
    > 
    

    Ok, cool, we know all that. But what if you'd like to know what value was imputed with which method? That's where the lumberjack comes in.

    The lumberjack operator is a 'pipe'[1] operator that allows you to track changes in data.

    > library(lumberjack)
    > retailers$id <- seq_len(nrow(retailers))
    > out <- retailers %>>% 
    +   start_log(log=cellwise$new(key="id")) %>>%
    +   impute_lm(other.rev ~ turnover) %>>%
    +   impute_median(other.rev ~ size) %>>%
    +   dump_log(stop=TRUE)
    Dumped a log at cellwise.csv
    > 
    > read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
      step                     time                      expression key  variable old      new
    1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
    2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
    3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
    > 
    

    So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() calls to the data pipeline. (To be clear: this works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.

    There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
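
    As a rough, hypothetical sketch of what a custom logger could look like (the add() and dump() method names follow the pattern of the built-in loggers as I understand it; the exact interface lumberjack requires is documented in the extending vignette):

    library(R6)

    # a toy logger that only counts how many cells changed over all steps
    counter <- R6Class("counter", public = list(
      n = 0,
      add = function(meta, input, output) {
        # compare the data before and after a single pipeline step
        self$n <- self$n + sum(input != output, na.rm = TRUE) +
          sum(is.na(input) != is.na(output))
      },
      dump = function(...) {
        cat("Total number of changed cells:", self$n, "\n")
      }
    ))

    # usage (hypothetical): retailers %>>% start_log(log = counter$new()) %>>%
    #   impute_lm(other.rev ~ turnover) %>>% dump_log(stop = TRUE)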

    If this post got you interested, please install the package using

    install.packages('lumberjack')
    

    You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.

    As always, I am open to suggestions and comments, for example through the package's GitHub page.

    Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.

    And finally, here's a picture of a lumberjack smoking a pipe.

    [1] It really should be called a function composition operator, but potato, potahto.


    Announcing the simputation package: make imputation simple

    I am happy to announce that my simputation package has appeared on CRAN this weekend. This package aims to simplify missing value imputation. In particular it offers standardized interfaces that

    • make it easy to define both imputation method and imputation model;
    • for multiple variables at once;
    • while grouping data by categorical variables;
    • all fitting in the magrittr not-a-pipeline.

    A few examples

    To start with an example, let us first create a data set with some missings.

    dat <- iris
    # empty a few fields
    dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
    head(dat,10)
    
    ##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1            NA         3.5          1.4         0.2  setosa
    ## 2            NA         3.0          1.4         0.2  setosa
    ## 3            NA          NA          1.3         0.2  setosa
    ## 4           4.6          NA          1.5         0.2  setosa
    ## 5           5.0          NA          1.4         0.2  setosa
    ## 6           5.4          NA          1.7         0.4  setosa
    ## 7           4.6          NA          1.4         0.3  setosa
    ## 8           5.0         3.4          1.5         0.2    <NA>
    ## 9           4.4         2.9          1.4         0.2    <NA>
    ## 10          4.9         3.1          1.5         0.1    <NA>
    

    Below, we first impute Sepal.Width and Sepal.Length by regression on Petal.Width and Species. After this we impute Species using a decision tree model (CART), with every other variable as a predictor (including the ones just imputed).

    library(magrittr)    # load the %>% operator
    library(simputation) 
    imputed <- dat %>% 
      impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) %>%
      impute_cart(Species ~ .)
    head(imputed,10)
    
    ##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1      4.979844    3.500000          1.4         0.2  setosa
    ## 2      4.979844    3.000000          1.4         0.2  setosa
    ## 3      4.979844    3.409547          1.3         0.2  setosa
    ## 4      4.600000    3.409547          1.5         0.2  setosa
    ## 5      5.000000    3.409547          1.4         0.2  setosa
    ## 6      5.400000    3.561835          1.7         0.4  setosa
    ## 7      4.600000    3.485691          1.4         0.3  setosa
    ## 8      5.000000    3.400000          1.5         0.2  setosa
    ## 9      4.400000    2.900000          1.4         0.2  setosa
    ## 10     4.900000    3.100000          1.5         0.1  setosa
    

    The package is pretty lenient about imputation failures. For example, if one of the predictors is missing, fields simply remain unimputed, and if one of the models cannot be fitted, only a warning is issued (not shown here).

    dat %>% impute_lm(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
    
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1     5.076579         3.5          1.4         0.2  setosa
    ## 2     4.675654         3.0          1.4         0.2  setosa
    ## 3           NA          NA          1.3         0.2  setosa
    

    So here, the third Sepal.Length value could not be imputed since the predictor Sepal.Width is missing.

    It is possible to split data into groups before estimating the imputation model and predicting missing values. There are two ways. The first is to use the | operator to specify grouping variables.

    # We first need to complete 'Species'. Here, we use sequential 
    # hot deck after sorting by Petal.Length
    dat %<>% impute_shd(Species ~ Petal.Length) 
    # Now impute Sepal.Length by regressing on 
    # Sepal.Width, computing a model for each Species.
    dat %>% impute_lm(Sepal.Length ~ Sepal.Width | Species) %>% head(3)
    
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1     5.067813         3.5          1.4         0.2  setosa
    ## 2     4.725677         3.0          1.4         0.2  setosa
    ## 3           NA          NA          1.3         0.2  setosa
    

    The second way is to use the group_by command from dplyr

    dat %>% dplyr::group_by(Species) %>% 
        impute_lm(Sepal.Length ~ Sepal.Width) %>% 
        head(3)
    
    ## Source: local data frame [3 x 5]
    ## Groups: Species [1]
    ## 
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
    ## 1     5.067813         3.5          1.4         0.2  setosa
    ## 2     4.725677         3.0          1.4         0.2  setosa
    ## 3           NA          NA          1.3         0.2  setosa
    

    Note: by using group_by, we also transformed the data.frame to a tibble, which not only sounds funny when you pronounce it (tibble, TIBBLE, tibble? tibbebbebbebble) but is also pretty useful.

    Supported methods and how to specify them

    Currently, the package supports the following methods:

    • Model based (optionally add [non-]parametric random residual)
      • linear regression
      • robust linear regression
      • CART models
      • Random forest
    • Donor imputation (including various donor pool specifications)
      • k-nearest neighbour (based on Gower's distance)
      • sequential hotdeck (LOCF, NOCB)
      • random hotdeck
      • Predictive mean matching
    • Other
      • (groupwise) median imputation (optional random residual)
      • Proxy imputation (copy from other variable)

    Any call to one of the impute_ functions looks as follows:

    impute_<method>(data, formula [, <method-specific options>])
    

    and the formula always has the following form:

    <imputed variables> ~ <model specification> [|<grouping variables>]
    

    The parts in square brackets are optional.
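
    For instance, a concrete call combining grouping with a method-specific option could look as follows (a sketch only; add_residual is a method-specific argument of the model-based functions as far as I recall, so check ?impute_rlm for the exact name and values):

    # robust regression imputation of Sepal.Length on Petal.Width,
    # estimated per Species, with a random normal residual added
    imputed <- impute_rlm(dat, Sepal.Length ~ Petal.Width | Species,
                          add_residual = "normal")
    head(imputed, 3)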

    Please see the package vignette for more examples and details, or ?simputation::impute_ for an overview of all imputation functions.

    Happy imputing!


    stringdist 0.9.4.2 released

    stringdist 0.9.4.2 was accepted on CRAN at the end of last week.

    This release just fixes a few bugs affecting the stringdistmatrix function when called with a single argument.

    From the NEWS file:

    • bugfix in stringdistmatrix(a): value of p, for jw-distance was ignored (thanks to Max Fritsche)
    • bugfix in stringdistmatrix(a): Would segfault on q-gram w/input > ~7k strings and q>1 (thanks to Connor McKay)
    • bugfix in jaccard distance: distance not always correct when passing multiple strings (thanks to Robert Carlson)

    Actually the last bug has not bitten anyone since it was masked by the second one 🙂 (it was reported and fixed a long time ago but popped up again after fixing the second bug -- hat tip to Hadley for testthat!). The second fix also ensures that stringdist's memory allocator for q-gram storage is called fewer times which yields a speed gain in computation of q-gram based distances.


    validate version 0.1.5 is out

    A new version of the validate package for data validation was just accepted on CRAN and will be available on all mirrors in a few days.

    The most important addition is that you can now reference the data set as a whole, using the "dot" syntax like so:

    iris %>% check_that(
        nrow(.)>100
      , "Sepal.Width" %in% names(.)) %>% 
    summary()
    
      rule items passes fails nNA error warning                  expression
    1   V1     1      1     0   0 FALSE   FALSE               nrow(.) > 100
    2   V2     1      1     0   0 FALSE   FALSE "Sepal.Width" %in% names(.)
    

    Also, it is now possible to obtain a TRUE or FALSE result even when a rule evaluates to NA, by passing the na.value option.

    dat = data.frame(x=c(1,NA,-1))
    v = validator(x > 0)
    values(confront(dat,v))
            V1
    [1,]  TRUE
    [2,]    NA
    [3,] FALSE
    values(confront(dat,v,na.value=FALSE))
            V1
    [1,]  TRUE
    [2,] FALSE
    [3,] FALSE
    

    A complete list of changes and bugfixes can be found in the NEWS file. Below I also include the changes of version 0.1.4, since I did not write about that release before.

    I will be talking about this package at the upcoming useR!2016 event, so join me if you're interested!

    version 0.1.5

    • The '.' is now used to reference the validated data set as a whole.
    • Small change in output of 'compare' to match the table in van den Broek et al. (2013)

    version 0.1.4

    • 'confront' now emits a warning when a variable name conflicts with the name of a reference data set
    • Deprecated 'validate_reset', in favour of the shorter 'reset' (use 'validate::reset' in case of ambiguity)
    • Deprecated 'validate_options' in favour of the shorter 'voptions'
    • New option na.value with default value NA, controlling the output when a rule evaluates to NA.
    • Added rules from the ESSnet on validation (deliverable 17) to automated tests.
    • Added 'grepl' to allowed validation syntax (suggested by Dusan Sovic)
    • Exported a few functions with keyword 'internal' for extensibility
    • Bugfix: 'blocks' sometimes reported the wrong number of blocks (in the case of a single connected block).
    • Bugfix: macro expansion failed when macros were reused in other macros.
    • Bugfix: certain nonlinear relations were recognized as linear
    • Bugfix: rules that use (anonymous) function definitions raised error when printed.

    Easy data validation with the validate package

    The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.

    library(magrittr)
    library(validate)
    
    iris %>% check_that(
      Sepal.Width > 0.5 * Sepal.Length
      , mean(Sepal.Width) > 0
      , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
    ) %>% summary()
    
    #  rule items passes fails nNA error warning                                              expression
    # 1   V1   150     66    84   0 FALSE   FALSE                        Sepal.Width > 0.5 * Sepal.Length
    # 2   V2     1      1     0   0 FALSE   FALSE                                   mean(Sepal.Width) > 0
    # 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
    

    The summary gives an overview of the number of items checked. For an aggregated test, such as the one where we test the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. Furthermore, the number of items that pass, fail, or could not be evaluated because of missing values is reported.

    In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.

    v <-  validator(
      ratio = Sepal.Width > 0.5 * Sepal.Length
      , mean = mean(Sepal.Width) > 0
      , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
      )
    v
    
    # Object of class 'validator' with 3 elements:
    #  ratio: Sepal.Width > 0.5 * Sepal.Length
    #  mean : mean(Sepal.Width) > 0
    #  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
    

    We can confront the iris data set with this validator. The results are stored in a validation object.

    cf <- confront(iris, v)
    cf
    
    # Object of class 'validation'
    # Call:
    #     confront(x = iris, dat = v)
    #
    # Confrontations: 3
    # With fails    : 2
    # Warnings      : 0
    # Errors        : 0
    barplot(cf,main="iris")
    

    (Figure validate-iris: bar plot of the confrontation results.)

    These are just the basics of what can be done with this package.
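
    For instance, besides the bar plot, the confrontation object can be summarized or its raw results extracted:

    summary(cf)   # per-rule counts of items, passes, fails and NAs
    values(cf)    # the raw TRUE/FALSE/NA results per rule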

    • If this post got you interested, you can go through our introductory vignette
    • Some theory on data validation can be found here
    • We'd love to hear your suggestions, opinions, bug reports here
    • An introduction on how to retrieve and store rules from text files can be found in a second vignette
    • Github repo, CRAN page

    settings 0.2.3

    An updated version of the settings package has been accepted on CRAN.

    The settings package provides alternative options settings management for R. It aims to allow for layered options management, where global options serve as defaults that can easily be overruled locally (e.g. when calling a function, or for options stored as part of an object).
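
    For example, a minimal sketch of the layered pattern (the option names here are made up for illustration):

    library(settings)

    # global defaults, e.g. defined at package or script level
    opt <- options_manager(line_width = 1, color = "black")

    # a local copy that overrules one option, e.g. inside a function call
    local_opt <- clone_and_merge(opt, color = "red")

    opt("color")        # "black": the global default is untouched
    local_opt("color")  # "red":   the local override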

    New features:

    • Setting ranges or lists of allowed values.

    See the vignette to get started; all code is on GitHub.


    stringdist 0.9.4 and 0.9.3: distances between integer sequences

    A new release of stringdist has been accepted on CRAN.

    stringdist offers a number of popular distance functions between sequences of integers or characters that are independent of character encoding.

    version 0.9.4

    • bugfix: edge case for zero-size lower triangular dist matrices (caused UBSAN to fire, but gave correct results).
    • bugfix in jw distance: not symmetric for certain cases (thanks to github user gtumuluri)

    Since 0.9.3, stringdist can compute distances between integer sequences, and you can use the hashr package to compute an integer representation of any (sequence of) R objects, based on the SuperFastHash algorithm of Paul Hsieh.
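
    For example, a small sketch of a word-level edit distance between two sentences, obtained by hashing the words to integers (hash() is the hashr function as I recall it; seq_dist is the new stringdist function):

    library(stringdist)
    library(hashr)

    # hash each word to an integer, turning sentences into integer sequences
    a <- hash(strsplit("the quick brown fox", " "))
    b <- hash(strsplit("the quick red fox", " "))

    # optimal string alignment distance between the two integer sequences;
    # one word differs, so the distance is 1
    seq_dist(a, b, method = "osa")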

    version 0.9.3

    • new functions computing distances between integer sequences: seq_dist, seq_distmatrix
    • new function for tokenizing integer sequences: seq_qgrams
    • new function for matching integer sequences: seq_amatch
    • q-gram based distances are now always 0 when q=0 (used to be Inf if at least one of the arguments was not the empty string)
    • stringdist, stringdistmatrix now emit warning when presented with list argument
    • small c-side code optimizations
    • bugfix in dl, lv, osa distance: weights were not taken into account properly (thanks to Zach Price)

    All code on github.


    Stringdist 0.9.2: dist objects, string similarities and some deprecated arguments

    On 24-06-2015 stringdist 0.9.2 was accepted on CRAN. A summary of new features can be found in the NEWS file; here I discuss the changes with some examples.

    Computing 'dist' objects with 'stringdistmatrix'

    The R dist object is used as input for many clustering algorithms, such as cluster::hclust. It stores the lower triangle of a matrix of distances between a vector of objects. The function stringdist::stringdistmatrix now takes a variable number of character arguments. If two vectors are given, it behaves the same as it used to.

    > x <- c("fu","bar","baz","barb")
    > stringdistmatrix(x,x,useNames="strings")
         fu bar baz barb
    fu    0   3   3    4
    bar   3   0   1    1
    baz   3   1   0    2
    barb  4   1   2    0
    

    However, we're doing more work than necessary. Feeding stringdistmatrix just a single character argument yields the same information, but at half the computational and storage cost.

    > stringdistmatrix(x,useNames="strings")
         fu bar baz
    bar   3        
    baz   3   1    
    barb  4   1   2
    

    The output is a dist object storing only the subdiagonal triangle. This makes it particularly easy to cluster texts using any algorithm that takes a dist object as argument, and many algorithms available in R do. For example:

    d <- stringdistmatrix(x,useNames="strings")
    h <- stats::hclust(d)
    plot(h)
    

    (Figure cluster: the dendrogram produced by plot(h).)

    (By the way, parallelizing the calculation of a lower triangle of a matrix poses an interesting exercise in index calculation. For those interested, I wrote it down.)

    Better labeling of distance matrices

    Distance matrices can be labeled with the input strings by setting the useNames argument of stringdistmatrix to TRUE or FALSE (the default). However, if you're computing distances between looooong strings, like complete texts, it is more convenient to use the names attribute of the input vector. So, the useNames argument now accepts three different values.

    > x <- c(one="fu",two="bar",three="baz",four="barb")
    > y <- c(a="foo",b="fuu")
    > # the default:
    > stringdistmatrix(x,y,useNames="none") 
         [,1] [,2]
    [1,]    2    1
    [2,]    3    3
    [3,]    3    3
    [4,]    4    4
    > # like useNames=TRUE
    > stringdistmatrix(x,y,useNames = "strings")
         foo fuu
    fu     2   1
    bar    3   3
    baz    3   3
    barb   4   4
    > # use labels
    > stringdistmatrix(x,y,useNames="names")
          a b
    one   2 1
    two   3 3
    three 3 3
    four  4 4
    

    String similarities

    Thanks to Jan van der Laan, a string similarity convenience function has been added. It computes the distance d between two strings and then rescales it as 1 - d/d_max, where the maximum possible distance d_max depends on the type of distance metric and (depending on the metric) the length of the strings.

    # similarity based on the damerau-levenshtein distance
    > stringsim(c("hello", "World"), c("Ola", "Mundo"),method="dl")
    [1] 0.2 0.0
    # similarity based on the jaro distance
    > stringsim(c("hello", "World"), c("Ola", "Mundo"),method="jw")
    [1] 0.5111111 0.4666667
    

    Here a similarity of 0 means completely different and 1 means exactly the same (within the chosen metric).

    Deprecated arguments

    The stringdistmatrix function had the option to be computed in parallel using facilities of the parallel package. However, as of stringdist 0.9.0, all distance calculations are multithreaded by default.

    Therefore, I'm phasing out the following options in stringdistmatrix:

    • ncores (how many R sessions should be started by parallel to compute the matrix?)
    • cluster (optionally, provide your own cluster, created by parallel::makeCluster)

    These arguments are now ignored with a message, but they'll remain available until somewhere in 2016 so users have time to adapt their code. Please mail me if you have any trouble doing so.
