Using external storrs.
Often it is useful to retrieve data from an external resource (especially websites). The way this works is:
We do a key lookup on the storr; if that succeeds (i.e. the key maps to a hash), continue as normal.
If the lookup fails, pass the key (and namespace) to a “hook” function that generates an R object (in any way).
This is in some ways a variant on the memoisation pattern; if the key refers to a set of arguments to a long running function we get something like memoisation (see the bottom of this file).
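The lookup-then-fetch flow described above can be sketched without storr at all. This is only an illustration of the pattern, not storr's implementation: the names make_external_cache and slow_square are hypothetical, and the sketch stores values directly rather than going through hashes as storr does.

```r
# An environment acts as the cache; the hook is called only on a miss.
make_external_cache <- function(hook) {
  cache <- new.env(parent = emptyenv())
  calls <- 0L  # how many times the hook has actually been invoked
  list(
    get = function(key, namespace = "objects") {
      id <- paste(namespace, key, sep = "::")
      if (!exists(id, envir = cache, inherits = FALSE)) {
        # Lookup failed: pass key and namespace to the hook, store the result
        calls <<- calls + 1L
        cache[[id]] <- hook(key, namespace)
      }
      cache[[id]]
    },
    hook_calls = function() calls
  )
}

# Hypothetical hook: any function of (key, namespace) returning an R object
slow_square <- function(key, namespace) as.numeric(key)^2

store <- make_external_cache(slow_square)
store$get("4")       # miss: the hook runs
store$get("4")       # hit: served from the cache
store$hook_calls()   # the hook ran once
```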
As an example, this vignette will download some DESCRIPTION files from GitHub, using the name of the repository as the key.
The first step is writing a hook function; this is a function with arguments (key, namespace) that returns an R object. For packages stored in the root directory of a repository we can build URLs of the form
https://raw.githubusercontent.com/<username>/<repo>/master/DESCRIPTION
So if the key is a username/repo pair and we ignore namespace we can write a function:
fetch_hook_gh_description <- function(key, namespace) {
  if (!isTRUE(unname(capabilities("libcurl")))) {
    stop("This vignette requires libcurl support in R to run")
  }
  fmt <- "https://raw.githubusercontent.com/%s/master/DESCRIPTION"
  path <- tempfile("gh_description_")
  on.exit(file.remove(path))
  code <- download.file(sprintf(fmt, key), path, mode = "wb")
  if (code != 0L) {
    stop("Error downloading file")
  }
  as.list(read.dcf(path)[1, ])
}
This function downloads the requested DESCRIPTION file into a temporary file (which it promises to delete later using on.exit), checks that the download was successful, then reads in the downloaded file and converts it into a list.
The httr and curl packages make it a little easier to add authorisation, so that this approach would also work for private repositories by using a token.
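As a rough sketch of the token-based variant, the hook below uses the curl package to send an Authorization header. Several details are assumptions rather than anything this vignette specifies: the function name fetch_hook_gh_description_auth, reading the token from a GITHUB_PAT environment variable, and the "token ..." header form being accepted by raw.githubusercontent.com.

```r
fetch_hook_gh_description_auth <- function(key, namespace) {
  # Assumption: the personal access token lives in GITHUB_PAT
  token <- Sys.getenv("GITHUB_PAT")
  if (!nzchar(token)) {
    stop("Set the GITHUB_PAT environment variable to use this hook")
  }
  url <- sprintf("https://raw.githubusercontent.com/%s/master/DESCRIPTION",
                 key)
  # Attach the token to the request as an Authorization header
  handle <- curl::new_handle()
  curl::handle_setheaders(handle, Authorization = paste("token", token))
  path <- tempfile("gh_description_")
  on.exit(unlink(path))  # unlink() is silent if the file was never created
  curl::curl_download(url, path, handle = handle)
  as.list(read.dcf(path)[1, ])
}
```

The function has the same (key, namespace) signature as the unauthenticated hook, so it could be passed to a storr in the same way.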
With this in place, we can build a storr:
The first argument to storr_external is a storr driver (i.e., a driver_ function). If you have a storr that you want to use, pass st$driver, which extracts the underlying driver (and shares storage with your existing storr).
As with other storr creation functions, you can set the default namespace using the default_namespace argument.
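A small offline sketch of these construction options follows. The hook fake_description_hook is a hypothetical stand-in that returns a made-up list instead of downloading anything, and the in-memory environment driver and the "descriptions" namespace are likewise choices made for illustration only.

```r
# Hypothetical stand-in hook: no network access, returns a fake record
fake_description_hook <- function(key, namespace) {
  list(Package = basename(key), Repository = key)
}

# First argument: a storr driver; second: the hook.
# default_namespace is optional.
st_demo <- storr::storr_external(storr::driver_environment(),
                                 fake_description_hook,
                                 default_namespace = "descriptions")
st_demo$get("richfitz/storr")

# Sharing storage with an existing storr by passing its driver
st_plain <- storr::storr_environment()
st_shared <- storr::storr_external(st_plain$driver, fake_description_hook)
st_shared$get("richfitz/storr")
st_plain$exists("richfitz/storr")  # the fetched object is visible here too
```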
The returned object is exactly the same as a usual storr except that the get method has changed (this is done by inheritance). The get method only behaves differently when the object is not present in the storr, in which case it will try to fetch the object and insert it into the storr.
At first there is nothing in here:
## character(0)
But we can still get things from the storr:
Once a key has been fetched, it will be retrieved locally:
## [1] TRUE
And it will be present within the storr, as shown by list:
## [1] "richfitz/storr"
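The same sequence (empty storr, first fetch, local retrieval, key now listed) can be reproduced offline. The sketch below is an assumption-laden illustration: counting_hook and the calls counter are hypothetical names, and the hook returns a fake record rather than a real DESCRIPTION file.

```r
calls <- 0L
# Hypothetical hook that counts how often it is invoked
counting_hook <- function(key, namespace) {
  calls <<- calls + 1L
  list(Package = basename(key))
}

st_demo <- storr::storr_external(storr::driver_environment(), counting_hook)
st_demo$list()                               # no keys yet
d <- st_demo$get("richfitz/storr")           # miss: the hook is called
identical(st_demo$get("richfitz/storr"), d)  # hit: no second hook call
st_demo$list()                               # the key is now stored
calls                                        # number of hook invocations
```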
If an external resource cannot be located, storr will throw an error of class KeyErrorExternal:
tryCatch(st$get("richfitz/no_such_repo"),
         KeyErrorExternal = function(e)
           message(sprintf("** Repository %s not found", e$key)))
## Warning in download.file(sprintf(fmt, key), path, mode = "wb"): downloaded
## length 0 != reported length 14
## Warning in download.file(sprintf(fmt, key), path, mode = "wb"): cannot open URL
## 'https://raw.githubusercontent.com/richfitz/no_such_repo/master/DESCRIPTION':
## HTTP status was '404 Not Found'
## Warning in file.remove(path): cannot remove file
## '/tmp/RtmpmJpesl/gh_description_b7642a28927', reason 'No such file or
## directory'
## ** Repository richfitz/no_such_repo not found
This would happen for all errors, including lack of internet connectivity, corrupt file downloads, etc. The original error is returned as the $e element of the error if you need to distinguish between types of failure. A KeyErrorExternal error also has class KeyError, so code that catches KeyError conditions will still work as expected.
For more details on storr exception handling, see the storr vignette (vignette("storr", package = "storr")).
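To make the error handling concrete without relying on a network failure, the sketch below uses a hypothetical hook, failing_hook, that always errors. It shows the original error being recovered from $e and the same failure being caught by a KeyError handler.

```r
# Hypothetical hook that always fails
failing_hook <- function(key, namespace) {
  stop(sprintf("no resource for '%s'", key))
}
st_fail <- storr::storr_external(storr::driver_environment(), failing_hook)

# Catch the specific class and inspect the wrapped original error
res <- tryCatch(
  st_fail$get("missing"),
  KeyErrorExternal = function(e) {
    list(key = e$key, original = conditionMessage(e$e))
  })
res$original

# A generic KeyError handler also catches it
caught_as_key_error <- tryCatch(
  st_fail$get("missing"),
  KeyError = function(e) inherits(e, "KeyErrorExternal"))
```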
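Note that if you want to persist the descriptions to disk (for example with the rds driver, which uses keys as file names) you would need to mangle the key, because keys such as richfitz/storr contain a slash: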
## [1] "richfitz/storr"
## [1] "1.2.5"
The st_rds storr does not include the fetch hook; it is a plain storr.
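One way the persistence step might look is sketched below. The source storr st_src, its contents (including the placeholder version string), and the temporary path are assumptions for illustration; mangle_key = TRUE asks the rds driver to encode keys before using them on the filesystem.

```r
# Hypothetical source storr holding one slash-containing key
st_src <- storr::storr_environment()
st_src$set("richfitz/storr", list(Package = "storr", Version = "0.0.0"))

# On-disk storr with key mangling enabled
st_rds <- storr::storr_rds(tempfile("storr_"), mangle_key = TRUE)
st_rds$import(st_src)
st_rds$list()
st_rds$get("richfitz/storr")$Version

# No fetch hook here: a missing key is an ordinary KeyError
no_hook <- tryCatch(st_rds$get("richfitz/other"),
                    KeyError = function(e) "KeyError")
```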
The external storr can support a form of memoisation, though it might be simpler to implement this directly (see below).
Suppose you have some expensive function f(a, b):
f <- function(a, b) {
  message(sprintf("Computing f(%.3f, %.3f)", a, b))
  ## ...expensive computation here...
  list(a, b)
}
and a set of parameters to run the function over, with each parameter set (row) associated with an id:
The hook here simply looks the parameters up and arranges to run them:
hook <- function(key, namespace) {
  p <- pars[match(key, pars$id), -1]
  f(p$a, p$b)
}
st <- storr::storr_external(storr::driver_environment(), hook)
The first time the result is retrieved the message will be printed (the function is evaluated):
## Computing f(0.999, 0.342)
The second time it will not be, as the result is retrieved from the storr:
## [1] TRUE
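For a version of this example that runs on its own, the sketch below supplies a parameter table. The contents of pars are an assumption (a small fixed table with columns id, a and b, so the printed values will differ from the output above); f and hook are repeated from the text so the block is self-contained.

```r
# Assumed parameter table: one row per parameter set, keyed by id
pars <- data.frame(id = c("1", "2", "3"),
                   a = c(0.1, 0.2, 0.3),
                   b = c(1, 2, 3),
                   stringsAsFactors = FALSE)

f <- function(a, b) {
  message(sprintf("Computing f(%.3f, %.3f)", a, b))
  list(a, b)
}

hook <- function(key, namespace) {
  p <- pars[match(key, pars$id), -1]
  f(p$a, p$b)
}

st <- storr::storr_external(storr::driver_environment(), hook)
x <- st$get("1")           # first retrieval: f is evaluated
identical(st$get("1"), x)  # second retrieval: read from the storr
```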
This idea can be generalised by storing the parameters and the functions in the storr so that we lose the dependency on the global variables:
st <- storr::storr_environment()
st$set("experiment1", pars, namespace = "parameters")
st$set("experiment1", f, namespace = "functions")
hook2 <- function(key, namespace) {
  f <- st$get(namespace, namespace = "functions")
  pars <- st$get(namespace, namespace = "parameters")
  p <- pars[match(key, pars$id), -1]
  f(p$a, p$b)
}
st_use <- storr::storr_external(st$driver, hook2)
x1 <- st_use$get("1", "experiment1")
## Computing f(0.999, 0.342)
Memoisation in the style of the memoise package is possible to implement, but is not provided by storr. Implementation is straightforward and will work with any driver:
memoise <- function(f, driver = storr::driver_environment()) {
  force(f)
  st <- storr::storr(driver)
  function(...) {
    ## NOTE: also digesting the inputs as a key here (in addition to
    ## storr's usual digesting of values)
    key <- digest::digest(list(...))
    tryCatch(
      st$get(key),
      KeyError = function(e) {
        ans <- f(...)
        st$set(key, ans)
        ans
      })
  }
}
Here’s a function that will print when it is evaluated:
Create the memoised function:
The first time an argument is seen, f() will be run, printing a message:
## computing...
## [1] 2
Subsequent times will be looked up from the storr:
## [1] 2
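A runnable sketch of this memoisation example is given below. The memoise() definition is repeated from the text so that the block stands alone; the body of f (doubling its input and printing "computing...") is an assumption chosen to be consistent with the output shown, and the n_evaluations counter is a hypothetical addition for checking how often f really runs.

```r
memoise <- function(f, driver = storr::driver_environment()) {
  force(f)
  st <- storr::storr(driver)
  function(...) {
    # The digest of the arguments is used as the storr key
    key <- digest::digest(list(...))
    tryCatch(
      st$get(key),
      KeyError = function(e) {
        # Key not found: evaluate f, store the result, return it
        ans <- f(...)
        st$set(key, ans)
        ans
      })
  }
}

n_evaluations <- 0L
# Assumed example function: prints when evaluated
f <- function(x) {
  message("computing...")
  n_evaluations <<- n_evaluations + 1L
  x * 2
}

g <- memoise(f)
g(1)            # first call: f runs and prints the message
g(1)            # second call: looked up from the storr
n_evaluations   # number of times f was actually evaluated
```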
Storr takes about twice as long as memoise (memoise does a direct key->value mapping rather than going through hashed values because it is the only thing that ever touches its cache). However, the overhead is approximately half of one call to message() so it’s not that bad.