Easy data validation with the validate package

The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.

library(magrittr)
library(validate)

iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length
  , mean(Sepal.Width) > 0
  , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
) %>% summary()

#  rule items passes fails nNA error warning                                              expression
# 1   V1   150     66    84   0 FALSE   FALSE                        Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE                                   mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10

The summary gives an overview of the number of items checked. For an aggregated test, such as the one on the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. The summary also reports the number of items that pass, that fail, and that could not be evaluated because of missing values (nNA).

In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.

v <-  validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
  )
v

# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
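
As a minimal sketch of what treating rules as objects looks like in practice, the validator defined above can be inspected and subsetted much like other R objects. The calls used here (length, names, variables and numeric indexing with [) reflect our reading of the package interface, so check them against the version you have installed.

```r
library(validate)

# Same rule set as above
v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)

length(v)      # number of rules in the set
names(v)       # the rule names given in the call
variables(v)   # variables referenced by the rules

# Keep only the first two rules and reuse them on the data
v_sub <- v[1:2]
cf_sub <- confront(iris, v_sub)
summary(cf_sub)
```

Because the rules live in an object rather than in the checking code, the same subset can be applied to any data set that has the referenced columns.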

We can confront the iris data set with this validator. The results are stored in a validation object.

cf <- confront(iris, v)
cf

# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
#
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0

barplot(cf, main = "iris")

[Figure: validate-iris]

These are just the basics of what can be done with this package.

  • If this post got you interested, you can go through our introductory vignette
  • Some theory on data validation can be found here
  • We'd love to hear your suggestions, opinions and bug reports here
  • An introduction on how to retrieve and store rules from text files can be found in a second vignette
  • Github repo, CRAN page

14 Responses to Easy data validation with the validate package

  1. Dusan Sovic says:

    Thanks for this package, found it very handy.
    I am just wondering why I can't use 'grepl' as a validator expression:

    v <- validator(
    grepl('pattern',COL1)
    )
    Warning message:
    In ini_validator(.self, ..., .file = .file) :
    Invalid syntax detected, the following expressions have been ignored:
    [001] grepl("pattern", COL1)

    • mark says:

      Dear Dusan,

      'validator' checks whether each rule is a 'validating' statement (a statement that yields TRUE/FALSE/NA) by checking the top of the Abstract Syntax Tree against a list of allowed symbols.

      For the moment, you can use grepl(...) == TRUE

      We will probably add 'grepl' to the list of allowed validating syntax in the next version.

      More generally, in the future we will make the list of allowed validating symbols user-extendable (it actually started out that way but we decided to move that feature to a later release).
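
A minimal sketch of the workaround mentioned above, wrapping grepl in a comparison so that the top-level operator is one the validator accepts. The data frame dat, its column COL1 and the rule name has_pattern are made up for illustration, and later package versions may accept grepl directly.

```r
library(validate)

# Hypothetical data: one text column to be checked against a pattern
dat <- data.frame(COL1 = c("pattern-a", "other"), stringsAsFactors = FALSE)

# grepl(...) == TRUE has '==' at the top of the expression
v <- validator(has_pattern = grepl("pattern", COL1) == TRUE)

cf <- confront(dat, v)
values(cf)   # one logical result per record
```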

  2. Stewart Ross says:

    Hello Mark. This is a very helpful development indeed. I am looking at automation of data validation for fire and rescue incident data, and there are many rules which I could automate using this package. Could you tell me how I could identify the rows which fail validation please? I have gone through the package documentation and the vignette but cannot see how to retrieve the individual row results so that rows which fail validation can then be corrected.

    • mark says:

      Thanks Stewart,

      I believe the function you are looking for is 'values'. With this function you can extract the array of TRUE/FALSE/NA values. The dimension of the array is nrows x nrules.

      For example

      v <- validator(height > weight/2, height > 0)
      cf <- confront(women,v)
      out <- values(cf)

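      You can then flag the records as follows. Note that each element is TRUE when the record satisfies all rules, so the records that violate at least one rule are given by which(!ifail):

      ifail <- apply(out, 1, all, na.rm=TRUE)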

      In the future, we will no doubt make this easier. One issue is that validation rules need not be evaluated recordwise (e.g. the rule mean(height) > 0), so some kind of convention will be necessary there. Until now we've focused on the basic infrastructure.
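
A sketch of how the values() matrix can be used to locate failing rows, either across all rules or for one rule. The rule names ratio and positive are added here for illustration, and the example assumes that the columns of the matrix carry the rule names.

```r
library(validate)

# Two record-wise rules on the built-in 'women' data set
v <- validator(
  ratio = height > weight / 2
  , positive = height > 0
)
cf <- confront(women, v)
out <- values(cf)   # logical matrix: one row per record, one column per rule

# Row numbers of records that fail at least one rule
fails_any <- which(!apply(out, 1, all, na.rm = TRUE))

# Row numbers of records that fail one particular rule
fails_ratio <- which(!out[, "ratio"])

# The offending records themselves
women[fails_ratio, ]
```

An aggregate rule such as mean(height) > 0 yields a single value instead of one per record, so it is left out of this sketch.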

      • Chong SH says:

        Hi Mark. I found this validate package is very useful and easy to use when I'm applying it in my data cleaning and checking process. Thank you!

        I have the same issue as Stewart in identifying which records violate a rule. Since I have a few hundred rules, it’s pretty hard for me to use only “ifail <- apply(out, 1, all, na.rm=TRUE)” to check ALL the violated records in one shot.

        Is there a function that lets me specify only the particular rule I’m interested in, instead of checking ALL, and that identifies the violating rows by ID [key identifier in my data file] instead of just row numbers?

        • mark says:

          You can index a 'confrontation' as you would a vector. That way you select the results for a certain rule.

          See also the introductory vignette: https://cran.r-project.org/web/packages/validate/vignettes/intro.html#confrontation-objects
          at the end of the section on confrontation objects.

          • Kira says:

            Hi Mark,

            This doesn't really answer the question Chong SH posed though, which is also the question I have.

            You're saying you can index the confrontation to get the results for a certain rule, but that just gets you the aggregated results for that rule.

            What I think both of us are looking for is to get the row-by-row results for that rule, telling us which row passed or failed that given rule.

            Is that possible to get with this package? Otherwise, it's of pretty limited use to us.

          • mark says:

            Using values() on a confrontation object gives you all boolean results in an array (or a list of arrays, if the output dimension structure demands it).

            You can do as.data.frame() on a confrontation object and get all the results in data frame format. After that you can filter by rule. Make sure that you specify the key when confronting, so confront(data, rules, key=...)
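
A minimal sketch of the key-based approach described above. The data frame dat, its id column and the rule names are hypothetical, and the column names of the resulting data frame (name, value, plus the key column) are what we expect from recent package versions rather than something this thread confirms.

```r
library(validate)

# Hypothetical data with an identifying column
dat <- data.frame(
  id = c("a", "b", "c"),
  height = c(1.80, -1.00, 1.65),
  weight = c(80, 70, NA),
  stringsAsFactors = FALSE
)

rules <- validator(
  pos_height = height > 0
  , pos_weight = weight > 0
)

# Pass the key so each result can be traced back to a record
cf <- confront(dat, rules, key = "id")
res <- as.data.frame(cf)

# Records that fail one particular rule, identified by key
failed <- res[res$name == "pos_height" & !is.na(res$value) & !res$value, ]
failed$id
```

Filtering on res$name selects a single rule, and the explicit is.na() check keeps records whose result is NA out of the list of failures.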

        • Lalit says:

          cf$X <- (as.data.frame(check_that(data, is.na(X))))$value

          This will populate the $value variable with TRUE and FALSE for each record.

  3. Pingback: Data checking packages in R | tonyladson

  4. trends says:

    GREAT pkg, Mark!
    Very, very useful.

    I have a simple syntax question:
    - The commas (",") in a Rule
    seems to be ANDs. Ok!
    - Why not make the "|" (single bar character)
    an OR between conditions?.

    This way you would not need the
    if(...) construct
    + an empty space in the if() btw conditions
    to define an OR condition,
    as in your present example:

    v <- validator(
    ratio = Sepal.Width > 0.5 * Sepal.Length
    , mean = mean(Sepal.Width) > 0
    , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10 # Here!!
    )

    so cnd (the last line) would be:
    ,cnd = Sepal.Width > 0.5*Sepal.Length | Sepal.Length > 10

    Would make the syntax more consistent
    and easier to remember. 🙂

    Thanks! / Dank U!

    • mark says:

      Thanks for the compliment :-).

      The suggestion you make, to replace 'if' with '|', is a bit unclear to me. In the example you give, the two statements are not equivalent.
      I mean, when interpreted as a validation rule the statement

      if ( Sepal.Width>0.5*Sepal.Length ) Sepal.Length > 10
      

      is equivalent to

      !( Sepal.Width>0.5*Sepal.Length ) | Sepal.Length > 10
      

      By material implication. In fact, that's what validate uses to vectorize and speed up execution of such statements.
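
The equivalence can be checked in base R without the package. This is only an illustration of material implication on the iris data, not the package's internal code; the names p, q, implied, vectorized and or_version are introduced for the example.

```r
# Condition and consequence of the rule, evaluated per record
p <- iris$Sepal.Width > 0.5 * iris$Sepal.Length
q <- iris$Sepal.Length > 10

# Record-by-record reading of "if (p) q": the rule holds trivially when p is FALSE
implied <- vapply(
  seq_along(p),
  function(i) if (p[i]) q[i] else TRUE,
  logical(1)
)

# Vectorized form via material implication
vectorized <- !p | q

identical(implied, vectorized)     # both forms agree on every record

# The plain OR proposed in the comment gives a different result
or_version <- p | q
identical(vectorized, or_version)
```

The vectorized form avoids the explicit loop over records, which is why it is the faster way to evaluate a conditional rule.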

  5. Songezo Rexe says:

    Which function can I use to list records from my dataset that have violated the defined Rules? Thanx for the great work. Much appreciated.
