The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.
library(magrittr)
library(validate)
iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length
  , mean(Sepal.Width) > 0
  , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
) %>% summary()
#   rule items passes fails nNA error warning expression
# 1   V1   150     66    84   0 FALSE   FALSE Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
The summary gives an overview of the number of items checked. For an aggregated test, such as the one where we test the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. Furthermore, the number of items that pass, fail, or could not be evaluated because of missingness is reported.
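The counts in the summary can be reproduced by hand. The sketch below uses base R only. `count_items` is a hypothetical helper written for this illustration, not a function from validate, and it only mimics the items, passes, fails and nNA columns.

```r
# Hypothetical helper: tally the outcome of one rule from its logical result.
count_items <- function(result) {
  c(items  = length(result),
    passes = sum(result, na.rm = TRUE),
    fails  = sum(!result, na.rm = TRUE),
    nNA    = sum(is.na(result)))
}

# A record-wise rule yields one value per row of iris.
recordwise <- with(iris, Sepal.Width > 0.5 * Sepal.Length)
# An aggregated rule yields a single value for the whole column.
aggregated <- mean(iris$Sepal.Width) > 0

counts <- rbind(V1 = count_items(recordwise),
                V2 = count_items(aggregated))
print(counts)
```

The record-wise rule is counted once per row, while the aggregated rule contributes a single item.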
In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.
v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
)
v
# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
We can confront the iris data set with this validator. The results are stored in a validation object.
cf <- confront(iris, v)
cf
# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
#
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0
barplot(cf,main="iris")
These are just the basics of what can be done with this package.
- If this post got you interested, you can go through our introductory vignette
- Some theory on data validation can be found here
- We'd love to hear your suggestions, opinions, and bug reports here
- An introduction on how to retrieve and store rules from text files can be found in a second vignette
- GitHub repo, CRAN page
Thanks for this package, found it very handy.
I'm just wondering why I can't use 'grepl' as a validator expression.
v <- validator(
grepl('pattern',COL1)
)
Warning message:
In ini_validator(.self, ..., .file = .file) :
Invalid syntax detected, the following expressions have been ignored:
[001] grepl("pattern", COL1)
Dear Dusan,
'validator' checks whether each rule is a 'validating' statement (a statement that yields TRUE/FALSE/NA) by checking the top of the Abstract Syntax Tree against a list of allowed symbols.
For the moment, you can use grepl(...) == TRUE
We will probably add 'grepl' to the list of allowed validating syntax in the next version.
More generally, in the future we will make the list of allowed validating symbols user-extendable (it actually started out that way but we decided to move that feature to a later release).
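To make the idea of checking the top of the syntax tree concrete, here is a rough base-R sketch. The allow list and the `is_validating` function are assumptions made up for this illustration. They are not validate's actual list or implementation.

```r
# Assumed allow list, for illustration only; validate's real list may differ.
allowed <- c("<", "<=", "==", "!=", ">=", ">", "!", "|", "&", "if")

# Hypothetical check: is the outermost call of the expression in the allow list?
is_validating <- function(expr) {
  if (!is.call(expr)) return(FALSE)
  top <- expr[[1]]
  is.name(top) && as.character(top) %in% allowed
}

plain   <- quote(grepl("pattern", COL1))
wrapped <- quote(grepl("pattern", COL1) == TRUE)

is_validating(plain)    # the outermost call is grepl
is_validating(wrapped)  # the outermost call is ==

# On data, both forms produce the same logical vector.
COL1 <- c("a pattern here", "nothing to see")
same <- identical(eval(plain), eval(wrapped))
```

Under these assumptions, wrapping the call in `== TRUE` changes only the outermost symbol, not the result of the rule.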
Thank you Mark.
Now it's more clear to me.
Hello Mark. This is a very helpful development indeed. I am looking at automation of data validation for fire and rescue incident data, and there are many rules which I could automate using this package. Could you tell me how I could identify the rows which fail validation please? I have gone through the package documentation and the vignette but cannot see how to retrieve the individual row results so that rows which fail validation can then be corrected.
Thanks Steward,
I believe the function you are looking for is 'values'. With this function you can extract the array of TRUE/FALSE/NA values. The dimension of the array is nrows x nrules.
For example
v <- validator(height > weight/2, height > 0)
cf <- confront(women,v)
out <- values(cf)
You can find out which records violate at least one rule as follows (all() is TRUE for records that satisfy every rule, so the result is negated):
ifail <- !apply(out, 1, all, na.rm=TRUE)
In the future, we will no doubt make this easier. One issue is that validation rules need not be evaluated recordwise (e.g. the rule mean(height) > 0), so some kind of convention will be necessary there. Until now we've focused on the basic infrastructure.
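A minimal base-R sketch of the record selection described above. The data frame and the matrix `out` are made up for illustration. `out` only mimics the nrows x nrules logical array that 'values' is described as returning. Since all() is TRUE for a record that satisfies every rule, the result is negated to flag violators. With na.rm=TRUE, a rule that could not be evaluated does not count as a violation.

```r
# Made-up data; the id column is a hypothetical key.
dat <- data.frame(
  id     = c("a", "b", "c", "d"),
  height = c(60, 70, NA, -1),
  weight = c(100, 150, 120, 10),
  stringsAsFactors = FALSE
)

# One logical column per rule, one row per record.
out <- cbind(
  V1 = dat$height > dat$weight / 2,
  V2 = dat$height > 0
)

# TRUE where at least one rule is violated; NA results are ignored.
ifail <- !apply(out, 1, all, na.rm = TRUE)

# The violating records themselves.
print(dat[ifail, ])
```

Records with a missing height are not flagged here. Dropping na.rm=TRUE would give NA for them instead, which may be preferable if missing values should be reviewed too.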
Hi Mark. I find the validate package very useful and easy to use in my data cleaning and checking process. Thank you!
I have the same issue as Stewart in identifying which records violate a rule. Since I have a few hundred rules, it's pretty hard for me to use only the “ifail” line above to check ALL the violated records in one shot.
I would like to ask whether there is a function that lets me specify only the particular rule I'm interested in, instead of checking ALL, and also identify the violated rows by ID [key identifier in my data file] instead of just row numbers.
You can index a 'confrontation' as you would a vector. That way you select the results for a certain rule.
See also the introductory vignette: https://cran.r-project.org/web/packages/validate/vignettes/intro.html#confrontation-objects
at the end of the section on confrontation objects.
Hi Mark,
This doesn't really answer the question Chong SH posed though, which is also the question I have.
You're saying you can index the confrontation to get the results for a certain rule, but that just gets you the aggregated results for that rule.
What I think both of us are looking for is to get the row-by-row results for that rule, telling us which row passed or failed that given rule.
Is that possible to get with this package? Otherwise, it's of pretty limited use to us.
cf$X <- (as.data.frame(check_that(data, is.na(X))))$value
This populates the $value variable with TRUE and FALSE for each record.
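The row-by-row, per-rule view asked about in this thread can also be sketched in base R. Everything here is a hypothetical illustration: the `id` column, the rule names, and the long table layout (one row per record and rule) are assumptions, and the logical matrix is built by hand rather than taken from a confrontation.

```r
# Made-up data with a hypothetical key column.
dat <- data.frame(
  id     = c("inc-01", "inc-02", "inc-03"),
  height = c(60, 70, -1),
  weight = c(100, 150, 10),
  stringsAsFactors = FALSE
)

# One logical column per rule, named after the rule.
out <- cbind(
  ratio    = dat$height > dat$weight / 2,
  positive = dat$height > 0
)

# Long format: one row per (record, rule) pair, keyed by id.
long <- data.frame(
  id    = rep(dat$id, times = ncol(out)),
  rule  = rep(colnames(out), each = nrow(out)),
  value = as.vector(out),
  stringsAsFactors = FALSE
)

# Records that fail one particular rule, reported by id.
failed_ratio <- long$id[long$rule == "ratio" & !long$value]
print(failed_ratio)
```

Keeping the key next to each result means the failing records can be looked up by ID rather than by row number, even after the data are sorted or filtered.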
GREAT pkg, Mark!
Very, very useful.
I have a simple syntax question:
- The commas (",") in a rule seem to be ANDs. OK!
- Why not make the "|" (single bar character) an OR between conditions? This way you would not need the if(...) construct, plus an empty space in the if() between conditions, to define an OR condition, as in your present example:
v <- validator(
ratio = Sepal.Width > 0.5 * Sepal.Length
, mean = mean(Sepal.Width) > 0
, cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10 # Here!!
)
So cnd (the last line) would be:
,cnd = Sepal.Width > 0.5*Sepal.Length | Sepal.Length > 10
This would make the syntax more consistent
and easier to remember. 🙂
Thanks! / Dank U!
Thanks for the compliment :-).
The suggestion you make, to replace 'if' with '|', is a bit unclear to me. In the example you give, the two statements are not equivalent.
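I mean, when interpreted as a validation rule the statement
if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
is equivalent to
!(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
by material implication. In fact, that's what validate uses to vectorize and speed up execution of such statements.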
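The equivalence can be checked in base R. The sketch below only illustrates material implication and is not validate's internal code. It evaluates the conditional rule one record at a time with if, then compares the outcome with the vectorized !A | B form.

```r
# Truth table: "if (A) B" holds unless A is TRUE and B is FALSE.
tt <- expand.grid(A = c(TRUE, FALSE), B = c(TRUE, FALSE))
tt$if_then    <- mapply(function(a, b) if (a) b else TRUE, tt$A, tt$B)
tt$not_a_or_b <- !tt$A | tt$B
print(tt)

# Same comparison on iris, using the conditional rule from the post.
by_record <- vapply(seq_len(nrow(iris)), function(i) {
  if (iris$Sepal.Width[i] > 0.5 * iris$Sepal.Length[i]) {
    iris$Sepal.Length[i] > 10
  } else {
    TRUE  # the condition does not apply, so the record passes
  }
}, logical(1))

vectorized <- with(iris, !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10)

# TRUE when both evaluations agree on every record.
print(all(by_record == vectorized))
```

The vectorized form avoids the explicit loop over records, which is why it is the faster way to evaluate such rules.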
Which function can I use to list records from my dataset that have violated the defined Rules? Thanx for the great work. Much appreciated.