Recently, we have been working on a program to make the work of data scientists discoverable and reusable, and I want to share a bit about it with a wider audience. As part of this effort, we built a DevOps system that enforces hermetic R and Python container builds, and we have seen lower error counts and shorter build cycle times as a result.
What is a hermetic build?
The goal of a hermetic build system is to operate as a pure function: in build terms, to always return the same output assets given the same input source code. To do this, the system must provide a way to isolate the build from changes to the state of the host system, and a way to ensure that the inputs are identical from run to run.
- Source Identity. Local source assets are protected by git, which identifies each set of code changes with a unique hash. External assets, however, can arrive in many forms, depending on the development idiom and culture of a language or tool. In 2020, a significant percentage of external code assets are hosted on services whose APIs provide a stable, discoverable URL that can be used to download a specific version of an asset as of a given change set, at the level of individual commits, tagged releases, or both. A hermetic build system leverages this API to specify exactly which version of an asset to download and build with, and it checks the hash of each retrieved asset against a stored value so it can detect unexpected changes. If a retrieved external asset does not match its stored hash, the build fails.
- Isolation. The other half of the hermetic build goal is to specify the versions of the tools required to build the source. These build systems must also provide ways to prevent accidental contamination (timely!) from unchecked source code resources, locally modified shell symbols, environment variables, et cetera. In their fullest form, hermetic build systems treat tools such as compilers like source code: they download their own copies of the tools and manage their storage and use inside managed file trees that serve as secure sandboxes. The isolation from the host machine and local user, including locally installed versions of languages, should be total.
Building with R
The R language presents interesting challenges to these goals. R is used by data scientists precisely because it wraps arbitrarily complex behind-the-scenes tool use in simple, semantically clean functional wrappers, allowing its users to focus solely on Getting Things to Work. For the would-be hermetic builder, however, it’s complicated:
1. Underlying dependencies. Most of the R authors in our application are blissfully unaware of which underlying technologies any given R library uses. In practice, anything is possible, including Java, C++, Python, Tcl/Tk, and any other tool stack. These tools have dependencies on underlying system libraries which are not specified anywhere. The sadly non-hermetic R idiom is to assume that the user can install any missing packages locally, and that transitive dependencies will be resolved by the host OS package manager.
2. Version control. R’s idiomatic development culture predates universal access to free source control, so code releases are typically identified by a version number in the name of an asset bundle. R is not alone in this, but version numbers alone are not hermetic: nothing enforces version number changes in lockstep with modifications to source code. Version numbers can be updated, left unchanged, or changed arbitrarily for any reason, independent of source modification. Further, while dependencies between R libraries are declared in text form, no package management tools exist to warn users about version incompatibilities.
3. Poor location stability. R does have its own website for package distribution, located at https://cran.r-project.org and commonly called CRAN. CRAN maintains an archive of current and recent versions of R libraries. Unfortunately, it uses one URL format for the current version of a package (under src/contrib/), a different format for recent historical versions (under src/contrib/Archive/), and returns a 404 for older historical versions. A sudden burst of new releases can cause specific versions of packages to move from the “current” URL format to the “historical” one without notice.
The problem
We needed to standardize, simplify and hermetically seal the build process around R libraries.
Our project enables model authors (data scientists with varying degrees of cloud engineering knowledge) to develop models in their preferred coding language on their desktops as they always have, and then upload these models to a cloud-hosted central registry for discovery. Our user interface then enables other users to run these models against their own data sets.
For model authors who choose to write R scripts, we offer hundreds of hosted R libraries to write against, and we are constantly adding more by request, so we needed a way to keep our builds stable while remaining open to additions.
How we solved it
To make this happen, we built DevOps pipelines which, when a user uploads model code, build it into a custom Docker container. These pipelines had to provide compile-time access to hundreds of R and Python libraries in a hermetically safe form, and produce containers with the correct pre-installed versions of any system packages those libraries might depend on.
We decided to use Bazel (https://bazel.build), a hermetic build system developed at Google for internal use and subsequently open sourced. Bazel is a rules-based, cross-language build system which provides hermetic build rules for many languages (including Python) out of the box. For each language, Bazel provides hermetic source-identity rules which can download external files but check the hash of each downloaded asset against a provided value before building. Each language requires a separate set of rules because package management varies widely and can involve custom toolchains, all of which must be managed according to the language’s idiom.
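To make the source-identity idea concrete, the generic http_archive rule that ships with Bazel pins an external archive to explicit URLs and a SHA-256, and the build fails if the downloaded bytes do not match. The repository name, URL, and hash below are placeholders for illustration, not values from our build:

```starlark
# WORKSPACE (sketch): pin an external source archive by URL and hash.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "some_external_dep",  # hypothetical repository name
    urls = ["https://example.com/dep/dep-1.2.3.tar.gz"],  # placeholder URL
    strip_prefix = "dep-1.2.3",
    # Placeholder hash; Bazel refuses to use the download if its SHA-256 differs.
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",
)
```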
Because R is a niche language, support for it is not provided in the base product, but Bazel includes an extension language which has enabled third parties to add many useful capabilities. In the case of R, we found a GitHub-hosted project which provides the R rules we use (https://github.com/grailbio/rules_r).
Our solutions to the challenges mentioned above are as follows:
1. Underlying dependencies. We use Bazel package-manager rules to build base containers with the underlying system packages installed. These rules download package lists as of a specified timestamp from the Debian snapshot server, so all containers are guaranteed to be built with the same OS updates. We also use Bazel at build time to test that each R library loads successfully in our containers, which verifies that all of its required system dependencies have been met.
2. Version control. For hermetic purposes, we ignore advertised R library version numbers and instead use hashes to identify code assets. We track version numbers only because they form part of the download URL for a given library. We manage version dependencies between libraries by hand, so there are occasional cascading transitive dependency updates, but so far big changes have been rare.
3. Poor location stability. In the wild, our experience has been that a library can be hosted in one of four ways, each of which requires specific Bazel rules:
- For current CRAN versions, the R rules package includes a macro which generates package rules from a CSV file, with each line naming a library, the desired version, and the associated hash. The macro builds the CRAN URL and knows where to find the hash, so it can generate a rule for each package.
- For the remaining cases below, the rules for libraries in each of these states are built by hand (a sketch appears after this list):
- Historical CRAN versions
- GitHub-hosted, versioned by full release
- GitHub-hosted, versioned by commit
- Each category has its own URL format, but all of them are a pure function of the library name and version. We record the hash by hand in the rule for the package. As a result, our builds only fail when the hosting of a library version we need moves from one of these categories to another, or when a hash changes, meaning that the code is no longer safe.
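As an illustration of a hand-written rule, here is roughly what the historical-CRAN case looks like using the r_repository repository rule from rules_r, which behaves like http_archive but also generates the R build targets. The package, version, and hash below are placeholders, the attribute names follow our reading of the rules_r documentation rather than a copy of our build files, and the CSV consumed by the macro above carries essentially the same package, version, and hash information, one line per library:

```starlark
# WORKSPACE (sketch): a hand-written rule for a package whose version has moved to CRAN's Archive.
# Names, versions, and the hash are placeholders; check the rules_r docs for the exact attributes.
load("@com_grail_rules_r//R:repositories.bzl", "r_repository")

r_repository(
    name = "R_somepkg",        # hypothetical package
    strip_prefix = "somepkg",
    urls = [
        # Historical CRAN ("Archive") URL format:
        "https://cran.r-project.org/src/contrib/Archive/somepkg/somepkg_1.0.2.tar.gz",
        # For comparison, the other categories use different but equally predictable formats:
        #   current CRAN:   https://cran.r-project.org/src/contrib/somepkg_1.0.2.tar.gz
        #   GitHub release: https://github.com/someorg/somepkg/archive/v1.0.2.tar.gz
        #   GitHub commit:  https://github.com/someorg/somepkg/archive/<commit-sha>.tar.gz
    ],
    # Hash recorded by hand (placeholder here); any change to the downloaded bytes fails the build.
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",
)
```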
In addition to helping us tame R builds, Bazel has many other useful features. In general, it replaces what developers used to do with makefiles: it computes the chain of dependencies required to build a given target and builds them in the correct order. It also caches the intermediate products of dependent rules, so when a source file changes it only rebuilds the code that depends on the change. Bazel also differs from make and similar tools in many ways. For instance, Bazel parallelizes builds: out of the box, if you put Bazel on a multicore machine, it will use every core. For large-scale continuous integration, that saves a ton of time.
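A minimal sketch of what that looks like in practice, using two hypothetical genrule targets rather than anything from our build: Bazel sees that the second target consumes the output of the first, builds them in that order, and serves the first from cache on later builds if its inputs have not changed.

```starlark
# BUILD (sketch): two chained targets; Bazel derives the build order from the dependency.
genrule(
    name = "cleaned",
    srcs = ["raw_data.csv"],
    outs = ["cleaned.csv"],
    cmd = "sort $(location raw_data.csv) > $@",  # stand-in for a real data-cleaning step
)

genrule(
    name = "report",
    srcs = [":cleaned"],
    outs = ["report.txt"],
    cmd = "wc -l $(location :cleaned) > $@",     # reruns only when cleaned.csv changes
)
```

Running `bazel build :report` builds both targets in order, and independent targets elsewhere in the tree are scheduled in parallel across the available cores.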
Third-party extension libraries allow Bazel to do a lot more than build code. We have adopted third-party rule sets which extend its reach and capabilities across our DevOps systems. We use it to:
1. Tag and upload container assets to container repositories and then deploy them to Kubernetes (see the sketch after this list).
2. Install and deploy third-party Kubernetes objects with helm charts.
3. Perform all of our module- and container-based tests.
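As one example of those third-party rule sets, rules_docker provides a container_push rule along the lines of the sketch below. The registry, repository, and image target are placeholders, and the attributes shown follow our reading of the rules_docker documentation rather than a copy of our pipeline:

```starlark
# BUILD (sketch): tag a Bazel-built container image and push it to a registry.
load("@io_bazel_rules_docker//container:container.bzl", "container_push")

container_push(
    name = "push_model_image",
    image = ":model_image",             # hypothetical container_image target defined elsewhere
    format = "Docker",
    registry = "registry.example.com",  # placeholder registry
    repository = "models/example",      # placeholder repository path
    tag = "latest",
)
```

A `bazel run :push_model_image` invocation then performs the tag-and-push step.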
Conclusion
The rules we’ve created now cover a couple hundred R libraries. With parallelized, hermetic builds, every rule is matched to exact source code from square one. This reduces the time required to build the libraries; it now takes about half an hour using 8 cores, which allows data scientists to work faster and more cleanly.
Like what you see?
Jason Harenski has over 20 years of experience in project management, internet product design, and engineering management at organizations ranging from startups to the Fortune 50.