Representation of numerical NA's in R and the 1954 enigma

I've always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:

  • A missing integer is represented by the most negative (4-byte) signed integer that can be represented by your computer.
  • A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?

Read on if you want to dig a little deeper.

Missing integers

As you may know, a lot of R's core is written in the C language. However, an int variable in C does not support the concept of a missing value. So, what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which gives the most negative value that can be represented by an int variable in C. On most computers, an int variable will be 32 bits (four 8-bit bytes). To make things easier, we'll assume that's always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

[-2^{31},\: 2^{31}-1] =[-2147483648,\: 2147483647].

(The range is asymmetric because 0 occupies the place of one positive number).
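
To make the sentinel idea concrete, here is a minimal C sketch. It assumes a 32-bit int, as above, and the names NA_INT_SKETCH, check_int_is_32_bits, is_na_int and smallest_valid_int are made up for this example (R's own macro for the missing integer is NA_INTEGER).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical name for this sketch; R's own macro is NA_INTEGER. */
#define NA_INT_SKETCH INT_MIN

/* The text assumes a 32-bit int; abort if this platform differs. */
void check_int_is_32_bits(void)
{
    assert(sizeof(int) * CHAR_BIT == 32);
}

/* True when x holds the value reserved for "missing". */
bool is_na_int(int x)
{
    return x == NA_INT_SKETCH;
}

/* Most negative ordinary integer once INT_MIN is reserved.
   On two's-complement machines this equals -INT_MAX. */
int smallest_valid_int(void)
{
    return INT_MIN + 1;
}
```

Under the 32-bit assumption, smallest_valid_int() returns -2147483647, so the usable range becomes symmetric around zero.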

Now let's compare this with R's integer range. The maximum integer is easily found, since it is stored in the hidden .Machine variable: .Machine$integer.max evaluates to 2147483647.

So this corresponds with C's INT_MAX. The most negative integer is not present in .Machine, but we can do some tests: computing -.Machine$integer.max - 1L returns NA with an integer overflow warning.

The result is out of R's integer range. The integer range of R is \pm 2147483647: one integer fewer than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
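
To see what reserving INT_MIN means for arithmetic, here is a rough C sketch of an addition that propagates missing values and turns out-of-range results into the missing value. This is an illustration under the 32-bit assumption, not R's actual implementation; NA_INT_SKETCH and na_aware_add are hypothetical names.

```c
#include <limits.h>

/* Hypothetical name for this sketch; R's own macro is NA_INTEGER. */
#define NA_INT_SKETCH INT_MIN

/* Add two ints. Returns the sentinel when either operand is missing
   or when the true sum falls outside [-INT_MAX, INT_MAX]. */
int na_aware_add(int a, int b)
{
    if (a == NA_INT_SKETCH || b == NA_INT_SKETCH)
        return NA_INT_SKETCH;            /* missing in, missing out */

    long long sum = (long long)a + b;    /* wide enough to hold any int + int */
    if (sum > INT_MAX || sum < -(long long)INT_MAX)
        return NA_INT_SKETCH;            /* out of range becomes missing */

    return (int)sum;
}
```

With this sketch, na_aware_add(-INT_MAX, -1) returns the sentinel instead of a regular number, because the true sum would land exactly on the reserved value.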

Missing doubles

To explain how real (\mathbb{R}) missing values are represented, we first need to spend a few words on the double type. A double is short for double precision and it is the variable type used to represent (approximations to) the real numbers in a computer.

Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

\textrm{sign}\times 2^{e} \times 1.F .
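
To make the notation concrete, here is a minimal C sketch that pulls a double apart into sign, e and F and puts it back together. It assumes the usual IEEE 754 64-bit layout (1 sign bit, 11 exponent bits stored with a bias of 1023, 52 fraction bits); the bias is a detail the formula glosses over. The names double_fields, split_double and join_double are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three bit fields of a 64-bit double. */
typedef struct {
    unsigned sign;       /* 1 bit                          */
    unsigned exponent;   /* 11 bits, stored with bias 1023 */
    uint64_t fraction;   /* 52 bits, the F in 1.F          */
} double_fields;

double_fields split_double(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);      /* assumes a 64-bit double */
    memcpy(&bits, &x, sizeof bits);       /* portable way to view the bits */

    double_fields f;
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7ff);
    f.fraction = bits & (((uint64_t)1 << 52) - 1);   /* low 52 bits */
    return f;
}

/* Rebuild sign * 2^e * 1.F. Only valid for normal numbers,
   i.e. a stored exponent that is neither 0 nor 0x7ff. */
double join_double(double_fields f)
{
    double value = 1.0 + (double)f.fraction / (double)((uint64_t)1 << 52);
    int e = (int)f.exponent - 1023;       /* remove the bias */

    for (; e > 0; --e) value *= 2.0;      /* scaling by 2 is exact */
    for (; e < 0; ++e) value /= 2.0;
    return f.sign ? -value : value;
}
```

For example, split_double(-2.5) gives sign 1, a stored exponent of 1024 (so e = 1) and F = 2^50, which corresponds to 1.F = 1.25.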

The sign is represented by 1 bit, the exponent e by 11 bits and the mantissa F by 52 bits, so we have 64 bits in total. The special value NaN (and also \pm Inf) is coded using a value of e that is not used to represent numbers. NaN is represented by e=0x7ff (hexadecimal) and F\neq 0. The important thing is that it does not matter what the value of F is when representing NaN, as long as it is nonzero (F=0 with the same exponent encodes \pm Inf). This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose F=1954 in the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in NaN values.
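
The construction can be sketched in a few lines of C. This is an illustration, not R's source: it assumes a 64-bit double, builds a NaN with exponent bits 0x7ff and 1954 in the low 32 bits of the mantissa, and then tests for that tag. The names make_na_real and is_na_real are hypothetical; R's own detector is R_IsNA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Build a NaN whose low 32 bits carry the 1954 tag. */
double make_na_real(void)
{
    /* High 32 bits: sign 0, exponent 0x7ff. Low 32 bits: 1954. */
    uint64_t bits = ((uint64_t)0x7ff00000u << 32) | 1954u;
    double x;
    assert(sizeof x == sizeof bits);      /* assumes a 64-bit double */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* True only for NaN values whose low 32 bits equal 1954. */
bool is_na_real(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bool is_nan = (x != x);               /* only NaN differs from itself */
    return is_nan && (uint32_t)bits == 1954u;
}
```

An ordinary NaN, such as the result of dividing zero by zero, still compares unequal to itself but fails the is_na_real test, which is how NA and NaN can be told apart even though both are NaN as far as the hardware is concerned.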

A funny question is why did the R developers choose 1954? Any ol' number would have been fine. Was it because

  • It's the year of birth of one of the developers? (I couldn't find a match here)
  • Alan Turing died in 1954? (macabre)
  • President Eisenhower met with aliens in 1954? (ehm...)
  • In 1954 Queen Elizabeth II became the first reigning monarch to visit Australia? (well...)

Leave an answer in the comments if you have a better idea...


2 Responses to Representation of numerical NA's in R and the 1954 enigma

  1. Markk says:

    I believe this is an R and S thing. It comes (or did) from arithmetic.c

    static double R_ValueOfNA(void)
    {
        ieee_double x;
        x.word[hw] = 0x7ff00000;   /* high word: exponent bits all set */
        x.word[lw] = 1954;         /* low word: the tag that marks NA */
        return x.value;
    }

    Some more guesses: Fortran was designed in 1954, IBM 704 came out in 1954 ... it is not a year at all. The IEEE standard defined NaN by the exponent but gives an example of a special value like R's somewhere that I can't find now. Not the same but it is allowed to use "special" NaN's like this.

  2. Pat Burns says:

    It's a typo, should be 1984.
