

How Descartes Labs achieved safe, mission-critical continuous delivery

Louis Vernon, Site Reliability Engineer, Descartes Labs

At Descartes Labs we have built a data refinery to collect, process, and analyze sensor data to quantify changes in the Earth. Our platform runs on Kubernetes in Google Cloud, powering global-scale machine learning across more than 10 petabytes of geospatial data.

At any given time we may have tens of thousands of CPUs extracting meaning from complex geospatial datasets, with models generating insights within minutes of satellites passing overhead (such as this one, for wildfire detection).

Supporting a rapidly evolving API platform with time-sensitive components means that we have developed some insights into safe, mission-critical continuous delivery.

Here are key points we've learned along the way that could benefit your organization.

Empower your application developers

Descartes Labs has approximately 30 software engineers and three site reliability engineers (SREs); this means we have comparatively little capacity dedicated to our continuous integration and deployment infrastructure.

In our efforts to build out a robust continuous delivery ecosystem, a few things became clear early on:

  • Manually creating deployment pipelines for each application is error-prone and does not scale.
  • Having SREs in the critical path for adding and configuring specific application pipelines is inefficient.
  • Having SREs responsible for day-to-day operations of deployment pipelines is ineffective and does not scale.

Using the Spinnaker continuous delivery platform, we were able to address these issues without a significant engineering investment. Spinnaker supports many cloud vendors, with rich and continuously improving support for Kubernetes via the V2 provider.

Pipeline templates and pipelines as code

Spinnaker pipelines are defined in JSON, and by using the Spin command-line interface we can create and configure application pipelines in an automated fashion.
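
As a minimal sketch of that automation (the file path and application name below are illustrative assumptions, not our actual setup), saving a pipeline definition with spin can be scripted in a few lines:

```python
import subprocess

# Illustrative path to a rendered Spinnaker pipeline definition (JSON).
PIPELINE_FILE = "rendered/geodata-api-deploy.json"

# "spin pipeline save" reads the JSON definition and creates or updates the
# corresponding pipeline in Spinnaker.
subprocess.run(["spin", "pipeline", "save", "--file", PIPELINE_FILE], check=True)
```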

In practice, however, continuous deployment pipelines aren't one size fits all; different applications might require schema migrations, ConfigMap changes, custom environment variables, and so on.

By establishing some conventions around application naming and namespacing, we were able to build simple yet powerful pipeline templates using Jinja, an open-source templating language for Python modeled after Django's templates.

This allowed us to configure pipelines for individual applications using lightweight JSON configuration files.
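
As a rough sketch of how those pieces fit together (the directory layout, template name, and config path are assumptions for illustration, not our exact repository structure), rendering a shared Jinja template with a per-application config might look like this:

```python
import json

from jinja2 import Environment, FileSystemLoader, StrictUndefined

# Load the SRE-owned pipeline templates; directory and template names here
# are illustrative assumptions.
env = Environment(
    loader=FileSystemLoader("pipeline-templates"),
    undefined=StrictUndefined,  # fail loudly if the config misses a variable
)
template = env.get_template("deploy-pipeline.json.j2")

# The lightweight per-application configuration lives alongside the
# application code (path is hypothetical).
with open("apps/geodata-api/pipeline-config.json") as f:
    config = json.load(f)

# Render the template into a concrete Spinnaker pipeline definition, ready to
# be saved with the spin CLI.
with open("rendered/geodata-api-deploy.json", "w") as f:
    f.write(template.render(**config))
```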

By separating out configuration from the pipeline, we arrived at an ownership model where the Jinja templates are managed by the SRE team via code owners. The configuration files live with the application code, where they can be managed by the application developers.

Configuring a new application with the standard suite of pipelines—including configuring the target cluster, horizontal pod auto-scalers, container image prefix, and some basic environment variables—can be done in less than 20 lines of JSON. The application team can review this code without involving a single SRE.
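
To make the shape of that configuration concrete, here is a hypothetical config in roughly that spirit (field names are illustrative assumptions, not our actual schema), shown as the Python-dict equivalent of the JSON file:

```python
# Hypothetical per-application pipeline configuration; field names are
# assumptions chosen to mirror the items listed above.
app_pipeline_config = {
    "application": "geodata-api",
    "namespace": "geodata",
    "cluster": "prod-us-central1",                    # target Kubernetes cluster
    "image_prefix": "gcr.io/example-project/geodata-api",
    "hpa": {"min_replicas": 3, "max_replicas": 50},   # autoscaler bounds
    "env": {
        "LOG_LEVEL": "info",
        "WORKERS": "4",
    },
}
```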

Our use of Jinja is an artifact of Spinnaker's maturity level at the time we started developing this approach. Today both Managed Pipeline Templates and Sponnet provide a richer, more native templating ability for Spinnaker pipelines.

Pushing operational responsibility for pipelines to the developers

Application developers are often best equipped to understand expected behavior and diagnose any issues. (This is in addition to avoiding the SRE scaling issues highlighted above.) Spinnaker provides several features that facilitate pushing pipeline operation to application developers:

  • Authorization: Pipeline execution can be restricted to individual teams.
  • Audit trail: Clear tracking of who executed manual triggers or approvals.
  • Rich diagnostics: View deployment/pod health and logs from within the Spinnaker UI.

It is worth noting that Spinnaker's diagnostics let us restrict developer access to production Kubernetes clusters without sacrificing visibility.

The value of high-velocity continuous delivery

You might assume that a high deployment frequency in production would correspond with a higher number of errors being introduced. We have found the opposite is generally true, with a higher deploy frequency corresponding with smaller changes that are easier to understand, debug, and reconcile.

We use trunk-based development, where deployments happen on demand upon merge into master. Our core applications can be deployed multiple times each day, and require less than an hour from code merge to receiving production traffic—a measure known as lead time for changes.

High-velocity deployments mean that we are generally shipping small, well-understood changes backed by unit and integration tests, which run against production. When changes introduce new problems into production (and they do!), these tend to be low impact and we can pin them down quickly.

Coupled with a short lead time for changes, it is often easy to fix the problem and quickly roll forward.

Our Spinnaker templates include a one-click rollback pipeline for each application, making it easy for developers to quickly revert a problematic deployment that made it into production. (In preparing for this article, I found that none of our core applications had been rolled back in the past three months.)
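
Beyond the one-click UI flow, a rollback pipeline like this could also be triggered from the spin CLI; here is a minimal sketch, with hypothetical application and pipeline names:

```python
import subprocess

# Start an execution of an existing (hypothetically named) rollback pipeline.
subprocess.run(
    ["spin", "pipeline", "execute",
     "--application", "geodata-api",
     "--name", "rollback"],
    check=True,
)
```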

This anecdotal experience aligns with what others in the industry have observed. For example, the Accelerate: State of DevOps Report shows that increasing deployment frequency and reducing lead time for changes correlate with a lower change failure rate and a shorter time to recover from service incidents and defects.

A path to fully automated production deployment

Canary deployments incrementally roll out application changes to subsets of users, to validate behavior before updating all of production. Spinnaker has powerful native support for canary analysis that allows for safe, fully automated deployments into production.

Adding a canary stage that evaluates and compares metrics such as CPU and memory utilization across deployments is straightforward. Unfortunately, native Kubernetes service routing distributes traffic evenly across pods, so a canary's share of traffic is tied to its share of the pod count; this constraint makes automated canary analysis difficult.

By incorporating the Istio service mesh, we improved the Spinnaker and Kubernetes canary story in two critical ways:

  • Istio provides controls that allow Spinnaker to manipulate traffic flow between deployments, independently of pod counts. This overcomes the constraints of native Kubernetes routing (see the sketch after this list).
  • Istio provides L7 metrics out of the box, allowing automated canary analysis on application metrics like HTTP response codes and request latencies.
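
To illustrate the traffic-shifting piece, here is a minimal sketch of an Istio VirtualService that splits traffic by weight between a stable and a canary subset, built as a Python dict (the service name, namespace, and subset names are illustrative assumptions):

```python
import json


def weighted_virtual_service(host: str, stable_weight: int, canary_weight: int) -> dict:
    """Build an Istio VirtualService that splits traffic between a stable and
    a canary subset by weight, independent of how many pods each subset runs."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-traffic-split", "namespace": "geodata"},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": "stable"},
                     "weight": stable_weight},
                    {"destination": {"host": host, "subset": "canary"},
                     "weight": canary_weight},
                ],
            }],
        },
    }


# Send 5% of traffic to the canary; kubectl (or a Spinnaker manifest stage)
# accepts the manifest as JSON as well as YAML.
print(json.dumps(weighted_virtual_service("geodata-api", 95, 5), indent=2))
```

This sketch assumes a DestinationRule that defines the stable and canary subsets by pod labels; the weights can then be adjusted between canary stages without touching replica counts.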

What you can learn from our experience

Combining Kubernetes, Istio, and Spinnaker helps you build a deployment ecosystem that allows for safe, mission-critical continuous delivery, without sacrificing velocity. Focus your resources on building out a self-service pipeline architecture, which will scale across development teams very effectively.

High-frequency deployments, a short lead time for changes, and application-level canaries provide the confidence you need for fully automated deployments into production. Together, these factors can improve reliability with comparatively few engineering resources.

To learn more of the technical details of Descartes Labs' approach to continuous delivery, don't miss my talk, "A Journey to Safe, Mission Critical Continuous Delivery at Descartes Labs," at the Spinnaker Summit, which runs November 15-17, 2019, in San Diego, California. My presentation takes place on November 16.
