I've always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough of an excuse to spend some time finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:
- A missing integer is represented by the most negative 4-byte (32-bit) signed integer that your computer can represent.
- A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE 754 standard. A special role is given to the number 1954 here, but why?
Read on if you want to dig a little deeper.
As you may know, a lot of R's core is written in the C language. However, an int variable in C does not support the concept of a missing value. So, what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which gives the most negative value that can be represented by an int variable in C. On most computers, an int variable will be 32 bits (four 8-bit bytes). To make things easier, we'll assume that's always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is $-2^{31}, \ldots, 2^{31}-1$, that is, -2147483648 up to 2147483647. (The range is asymmetric because 0 occupies the place of one positive number.)
Now let's compare this with R's integer range. The maximum integer is easily found, since it is stored in the hidden .Machine variable: .Machine$integer.max equals 2147483647. So this corresponds with INT_MAX. The most negative integer is not present in .Machine, but we can do some tests:
# Store the second-smallest C integer. The trailing L forces the number
# to be "integer", not "numeric".
x <- -2147483647L
# Adding an integer works fine, since we move further into the range:
x + 1L
# [1] -2147483646
# Subtracting an integer gives NA, with a warning that the result is out of range:
x - 1L
# [1] NA
# Warning message:
# In x - 1L : NAs produced by integer overflow
# Subtracting a non-integer 1 ("numeric") yields a non-integer:
x - 1
# [1] -2147483648
That last result is out of R's integer range. The integer range of R is $-2147483647, \ldots, 2147483647$: one integer less than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
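If you'd like to peek at this from within R itself, here is a minimal sketch. It assumes a 32-bit int (as above) and uses base R's writeBin to dump the raw bytes of an integer, most significant byte first. The variable names na_bytes and min_bytes are just my own labels.

```r
# Raw bytes of the missing integer, most significant byte first.
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
print(na_bytes)
# [1] 80 00 00 00

# Raw bytes of the most negative integer R will actually compute with.
min_bytes <- writeBin(-.Machine$integer.max, raw(), endian = "big")
print(min_bytes)
# [1] 80 00 00 01
```

The pattern 80 00 00 00 is exactly the bit pattern of INT_MIN on a 32-bit two's complement machine, one step below the smallest integer R lets you work with.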
To explain how missing real values are represented, we first need to spend a few words on the double type. A double is short for double precision, and it is the variable type used to represent (approximations to) the real numbers in a computer.
Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):
$$x = (-1)^{s} \times 1.m \times 2^{e - 1023}$$
Here $s$ is the sign, $e$ the exponent and $m$ the mantissa. The sign is represented by 1 bit, the exponent by 11 bits and the mantissa by 52 bits, so we have 64 bits in total. The special value NaN (and also Inf) is coded using values of $e$ that are not used to represent numbers. NaN is represented by $e =$ 0x7ff (hexadecimal) and $m \neq 0$. The important thing is that, as long as it is nonzero, it does not matter what the value of $m$ is when representing NaN. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose 1954 in the mantissa to represent NA. The C-level function called R_IsNA detects the 1954 in the lower 32 bits of a NaN.
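You can dig up the 1954 yourself without leaving R. The sketch below is only an illustration, not how R does it internally: double_fields is a hypothetical helper of my own (it is not part of R) that dumps the 8 bytes of a double with writeBin and picks out the sign bit, the 11-bit exponent and the lower 32 bits of the mantissa.

```r
# Hypothetical helper (not part of R): split a double into some of its
# IEEE 754 fields, using the raw bytes with the most significant byte first.
double_fields <- function(x) {
  bytes <- writeBin(as.double(x), raw(), endian = "big")
  b1 <- as.integer(bytes[1])
  b2 <- as.integer(bytes[2])
  list(
    sign     = b1 %/% 128L,                      # top bit of the first byte
    exponent = (b1 %% 128L) * 16L + b2 %/% 16L,  # the next 11 bits
    low_word = readBin(bytes[5:8], "integer", size = 4, endian = "big")
  )
}

one <- double_fields(1)
one$exponent   # 1023, so the factor 2^(e - 1023) equals 1

na <- double_fields(NA_real_)
na$exponent    # 2047, which is 0x7ff
na$low_word    # 1954

# Both values are NaNs at the bit level, but R tells them apart:
is.na(NaN)        # TRUE
is.nan(NA_real_)  # FALSE
```

So NA is a perfectly ordinary NaN as far as the hardware is concerned; only the 1954 tucked away in the mantissa tells R that it means "missing".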
A fun question is: why did the R developers choose 1954? Any ol' number would have been fine. Was it because
- It's the year of birth of one of the developers? (I couldn't find a match here)
- Alan Turing died in 1954? (macabre)
- President Eisenhower met with aliens in 1954? (ehm...)
- In 1954 Queen Elizabeth II became the first reigning monarch to visit Australia? (well...)
Leave an answer in the comments if you have a better idea...