I've always wondered how exactly the missing value (`NA`) in `R` is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:

- A missing integer is represented by the most negative signed (4-byte) integer that can be represented by your computer.
- A missing `double` (real number) is represented by a special version of the default `NaN` (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?

Read on if you want to dig a little deeper.

### Missing integers

As you may know, a lot of `R`'s core is written in the `C` language. However, an `int` variable in `C` does not support the concept of a missing value. So, what happens in `R` is that a single value of the integer range is set aside to represent a missing value. In this case it is `INT_MIN` (a `C` macro from `limits.h`), which gives the most negative value that can be represented by an `int` variable in `C`. On most computers, an `int` variable will be 32 bits (four 8-bit bytes). To make things easier, we'll assume that's always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

$$[-2^{31},\; 2^{31}-1] = [-2147483648,\; 2147483647].$$

(The range is asymmetric because 0 occupies the place of one positive number).

Now let's compare this with `R`'s integer range. The maximum integer is easily found, since it is stored in the hidden `.Machine` variable.

```r
> .Machine$integer.max
[1] 2147483647
```

So this corresponds with `C`'s `INT_MAX`. The largest negative integer is not present in `.Machine` but we can do some tests:

```r
# Store the second-smallest C integer. The trailing L forces the number
# to be "integer", not "numeric".
> x <- -2147483647L
> typeof(x)
[1] "integer"

# Adding an integer works fine, since we move further into the range:
> typeof(x+1L)
[1] "integer"

# Subtracting an integer gives a warning telling us that the result is out of range:
> typeof(x-1L)
[1] "integer"
Warning message:
In x - 1L : NAs produced by integer overflow

# Subtracting a non-integer 1 ("numeric") yields a non-integer:
> typeof(x-1)
[1] "double"
```

The result of `x-1L` is out of `R`'s integer range. The integer range of `R` is $[-2^{31}+1,\; 2^{31}-1] = [-2147483647,\; 2147483647]$: one integer less than you get in `C`. So by sacrificing only *one* of your *four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six* integers, you get the truly awesome feature of computing with missing values.

### Missing doubles

To explain how real missing values are represented, we first need to spend a few words on the `double` type. A `double` is short for *double precision* and it is the variable type used to represent (approximations to) the real numbers in a computer.

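Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

$$x = (-1)^{s} \times 1.m \times 2^{e-1023},$$

where $s$ is the sign bit, $e$ the (biased) exponent and $m$ the mantissa.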

The sign is represented by 1 bit, the exponent by 11 bits and the *mantissa* by 52 bits, so we have 64 bits in total. The special value `NaN` (and also `Inf`) is coded using values of the exponent $e$ that are not used to represent numbers. `NaN` is represented by $e =$ `0x7ff` (hexadecimal) and a nonzero mantissa, $m \neq 0$. The important thing is that, apart from being nonzero, it does not matter what the value of $m$ is when representing `NaN`. This leaves developers with lots of room in the mantissa to give different meanings to `NaN`. In `R` the developers chose $m = 1954$ in the mantissa to represent `NA`. A `C`-level function called `R_IsNA` detects the 1954 in `NaN` values.

A funny question is **why did the R developers choose 1954**? Any ol' number would have been fine. Was it because

- It's the year of birth of one of the developers? (I couldn't find a match here)
- Alan Turing died in 1954? (macabre)
- President Eisenhower met with aliens in 1954? (ehm...)
- In 1954 Queen Elizabeth II became the reigning monarch of Australia? (well...)

Leave an answer in the comments if you have a better idea...

I believe this is an R and S thing. It comes (or did) from arithmetic.c:

```c
static double R_ValueOfNA(void)
{
    ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}
```

Some more guesses: Fortran was designed in 1954, the IBM 704 came out in 1954 ... or it is not a year at all. The IEEE standard defines NaN by the exponent, but somewhere it gives an example of a special value like R's that I can't find now. It is not the same, but using "special" NaNs like this is allowed.

It's a typo, should be 1984.