Integers in R are stored in 32-bits. Out of the box, the largest integer R can handle is 2147483647. Sometimes this isn’t enough space. I’ve run into this issue a few times when dealing with large ID numbers. Here’s an example ID from Twitter 912411948163137536, let’s see if it’s larger than R can handle:
912411948163137536 > .Machine$integer.max
## [1] TRUE
If we try to store this number as an integer, R warns us that it does not fit within the 32-bit integer range and it returns NA
as this integer value is not available.
as.integer(912411948163137536)
## Warning: NAs introduced by coercion to integer range
## [1] NA
Reading
Like R
, readr::read_csv
cannot parse integers larger than
2147483647:
library(readr)
parse_integer(s)
## [1] NA
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 1 NA an integer 21474836470
Following the advice of Win-Vector, we can read large integers as character vectors.
# Write a test file
command <- paste0("echo 'big\n", s, "' > in.csv")
system(command)
# Read the integer data as character
my_data <- read_csv(
"in.csv",
col_types = cols(
big = col_character()
)
)
# Show we got the right value
my_data$big[1] == s
## [1] TRUE
Writing
The bit64
library facilitates working with 64-bit integers in R.
library(bit64)
# We can store `s` as a 64-bit integer
i <- as.integer64(s)
as.character(i) == s
## [1] TRUE
It looks like readr::write_csv
can handle integer64
objects:
# Write out an integer64 object
my_data$big <- as.integer64(my_data$big)
write_csv(my_data, "out.csv")
# Make sure it was written properly
text <- system("cat out.csv", intern = TRUE)
text[2] == s
## [1] TRUE
Answer
Thanks to Hadley Wickham for a pragmatic answer: “use doubles”
There is an important caveat associated with using doubles. Win-Vector notes:
IEEE 754 doubles define a 53 bit mantissa (separate from the sign and exponent), so with a proper floating point implementation we expect a double can faithfully represent an integer range of -2^53 through 2^53. But only as long as you don’t accidentally convert to or round-trip through a string/character type.
Here is an example where using doubles can get one into trouble:
> library(bit64)
> library(readr)
> # Set up CSV
> a <- as.integer64(2) ^ 53
> b <- a + 1
> text <- paste0("x\n", a, "\n", b, "\n")
> cat(text)
x
9007199254740992
9007199254740993
> # Read CSV
> data <- read_csv(text, col_types = cols(x = col_double()))
> # Show that b was not read properly
> as.integer64(data$x)
integer64
[1] 9007199254740992 9007199254740992
My plan moving forward is to use doubles and check that abs(x) <= as.integer64(2) ^ 53
.
Conclusion
The safest approach seems to be reading and writing string representations when integers cannot be stored in 32-bits.
readr::write_csv
appears capable of writing integer64
objects,
can we be certain this will always work?