OpenSAFELY, but new-R

Since it started, OpenSAFELY has provided execution environments for running code in Python, Stata, and R. These are packaged up using a tool called Docker, which produces images that contain the language version and all supporting libraries. These images are then used to provide identical execution environments for our users’ code. Whether code is run locally with opensafely run against dummy data, or submitted to the secure backends to run against real patient data, the same image is used. This is necessary to enable the local development of OpenSAFELY code, which is key to the design goals of the platform (namely, transparency and reproducibility).

When we created the first versions of these language images, we pinned the version of the core language and of all its libraries, and we have not updated them since. This is to provide a reliable and reproducible experience for users over the lifetime of their project, which can be multiple years. The code a user wrote at the start of their project should still run successfully a year or more later. Over time, we have added new packages upon user request, but once added, a package stays at the same version and is not upgraded.

However, software does not stand still. New versions of languages are released, with useful new features and performance improvements. New versions of libraries are released, with important bug fixes and other benefits. Our current R image is based on R 4.0, which is 4 years behind the current version, and it similarly has many out-of-date libraries. This is a pain point we have heard about from our users, and we have been working towards addressing it.

A new version of the R image

So we now have a new version of the R image, which we call r:v2. This includes R 4.4.3, and an updated suite of libraries. This has been in a beta testing period for a while, but is now generally available.

Compared to R 4.0.5, 4.4.3 brings with it desirable language features like the crowd-pleasing native pipe operator |> and the shorthand \(x) syntax for anonymous functions, both introduced in R 4.1. Here are some links to read more on new R language features included in this new image: 4.1, 4.2, 4.3, 4.4.
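As a small illustration of the two R 4.1 features, here is the same calculation written in the older style and in the newer one. The values are made up for the example; \(x) is shorthand for function(x).

```r
# Illustrative values only.
values <- c(1, 4, 9, 16)

# R 4.0 style: nested calls and the full function keyword.
nested <- sum(sapply(sqrt(values), function(x) x * 2))

# R 4.1+ style: the native pipe passes the left-hand side as the
# first argument of the call on its right, and \(x) is shorthand
# for function(x).
piped <- values |>
  sqrt() |>
  sapply(\(x) x * 2) |>
  sum()

# Both forms compute the same result.
stopifnot(identical(nested, piped))
```

The piped version only parses on R 4.1 or later, so code written this way needs the r:v2 image.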

To upgrade, you should just be able to change the relevant run: commands for your actions in your project.yaml file to use run: r:v2 ... rather than run: r:latest .... Our testing indicates that most existing R code should work with the v2 image unchanged. However, there may be some breaking changes, and you may need to fix your code to work with the newer version of R or of some of the libraries (this is why we do not upgrade things by default!). Once that is done, you can use all the new features and updated libraries that the r:v2 image brings.
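For example, the change to a single action might look like the fragment below. The action name, script path and output path are hypothetical; only the image name after run: is what changes.

```yaml
actions:
  # Hypothetical action: the name, script path and output path are illustrative.
  fit_model:
    # Before: run: r:latest analysis/fit_model.R
    run: r:v2 analysis/fit_model.R
    outputs:
      moderately_sensitive:
        estimates: output/estimates.csv
```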

The old version of the R image is still available under r:v1, so you do not need to change to v2 if you do not want to (for example, if your project is nearly completed). However, we do recommend that ongoing projects switch to using the r:v2 action image, and the documentation and tooling have been changed to default to the v2 image for new projects.

r:latest is not the same as r:v2

I am afraid so: r:latest still refers to the original image, not to r:v2. This is an historical artefact of assuming a single version of an action image when we started OpenSAFELY. There are a lot of existing OpenSAFELY projects that use r:latest in their project.yaml files. If we were to switch r:latest to point to r:v2, we could unknowingly break lots of code for our users. Migrating to a new image version should be an explicit step by our users.

We know this is confusing. Our plan to fix it is to move all projects to an explicit :v1 or :v2 version, and to deprecate the use of :latest altogether at some point in the future.

We got some help from an expert

It has taken us a long time to produce a new version of the R image. There are two main reasons for this.

One was that the R language ecosystem at the time did not provide the tooling and metadata needed for us to reliably and efficiently reproduce an installation of R with a specific set of packages. The R ecosystem moves very fast, and comes from a research context. It does not have fully developed tooling and infrastructure to support strictly managing versions of libraries, something that is more common in general purpose programming languages like Python. Since we created r:v1, the R community has done great work improving things, and we are able to benefit from their work in r:v2.

The second reason is that the OpenSAFELY engineering team do not have a lot of experience or expertise in R or knowledge of its ecosystem - we’re mostly Pythonistas! When we built the initial R image, we did so somewhat naively, in a way that worked but was very expensive to build and maintain. Adding a new R library required manually chasing down dependency versions, and sometimes meant we were not able to add user-requested packages. Additionally, rebuilding the image meant installing and compiling every installed R library from scratch, a process that took 3-4 hours!

However, there was an excellent solution to this problem, in the form of Dr Tom Palmer from Prof. Jonathan Sterne’s Electronic Health Records research group in Bristol Medical School and the MRC Integrative Epidemiology Unit. Tom is an experienced OpenSAFELY user, and is also an expert in the R language and its ecosystem. Collaborating with the University of Bristol, we were able to arrange a 6-month part-time secondment for Tom to come and work with the Bennett Institute on improving our R language support. He is primarily responsible for the technical foundations of our new r:v2 image, and has greatly improved our understanding of R and its ecosystem. We are grateful to Tom, and the University of Bristol, for his time, energy and enthusiasm - this work would not have happened without him.

Here’s what’s different

Rather than downloading and compiling packages from source directly from CRAN, our new image instead leverages newer R community efforts.

Firstly, instead of renv, it uses the pak library for installing packages. pak allows faster downloads, is much better suited to Docker images, and is in many ways more similar to the packaging tools of other languages.

Secondly, it uses the excellent pre-built CRAN package repositories provided by Posit. They provide date-based snapshots of the entire CRAN archive, which is very useful for providing a static set of R libraries, as we can hardcode this date into our R image. Packages added to the image will come from CRAN as it was on that date, ensuring we do not upgrade the package, and providing consistency between supported library versions.

Another big benefit of using these repositories is that we install built binaries directly, which is much faster than compiling from source - rebuilding the R Docker image now takes less than 5 minutes instead of multiple hours. It also means it is much easier for us to add a user-requested package without worrying about its dependencies and their versions, and it provides some basic quality assurance that the R packages we install are built correctly and work well with other libraries.
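To give a rough idea of how a date-pinned snapshot and pak fit together, here is a minimal sketch. It is not the build script used for the OpenSAFELY image: the snapshot_repo helper is hypothetical, the URL layout is assumed to follow Posit Public Package Manager, and the date, Linux distribution name and package name are purely illustrative.

```r
# Hypothetical helper: build the URL of a date-pinned CRAN snapshot
# that serves pre-built Linux binaries.
# Assumption: the URL layout follows Posit Public Package Manager;
# the distro and date used below are illustrative.
snapshot_repo <- function(date, distro = "jammy") {
  sprintf(
    "https://packagemanager.posit.co/cran/__linux__/%s/%s",
    distro, date
  )
}

# Point R's default CRAN repository at the snapshot.
repo_url <- snapshot_repo("2025-01-01")
options(repos = c(CRAN = repo_url))

# With the repository pinned, package installation resolves against
# CRAN as it was on that date. Left commented out because it needs
# network access and the pak package to be installed:
# pak::pkg_install("dplyr")
```

Because the date is fixed in one place, adding another package later resolves it and its dependencies against the same snapshot rather than against whatever is newest on CRAN.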

As with all things OpenSAFELY, this work was done out in public - you can see more of the technical details in the GitHub repository for the OpenSAFELY R image.

This is a much stronger technical foundation for our R support. It will make it far easier for us to maintain, and to respond to R user needs, and also to release future new versions of the image with the latest versions of R - perhaps slightly more often than every 4.5 years.