Chaos engineering and testing: 34 tools and tutorials

Testing in production (TiP) is gaining steam as an accepted practice in DevOps and testing communities, but no amount of preproduction QA testing can foresee all the possible scenarios in your real production deployment. The prevailing wisdom is that you will see failures in production; the only question is whether you'll be surprised by them or inflict them intentionally to test system resilience and learn from the experience. The latter approach is chaos engineering.

The idea of the chaos-testing toolkit originated with Netflix’s Chaos Monkey and continues to expand. Today many companies have adopted chaos engineering as a cornerstone of their site reliability engineering (SRE) strategy, and best practices around chaos engineering have matured.

The time is right to gain a comprehensive understanding of this approach. And to help you do that, we gathered over 30 of the most popular and well-reviewed links to tools and resources on chaos testing and chaos engineering, neatly grouped and categorized. Read on to see if this relatively new strategy is right for you.

World Quality Report 2017-18: The state of QA and testing

Test in production

Test faster and smarter by testing in production

The advice from Sauce Labs, aimed at TiP beginners, is to intentionally test in production. The real-world feedback it provides is the perfect supplement to your internal QA process.

Testing in production: Yes, you can (and should)

"Only production is production," says Charity Majors, CEO at monitoring-tool vendor Honeycomb, and you are de facto testing in production every time you deploy code there. Lean into it and practice failure regularly so you can get better at handling it.

TestOps #2—Testing in production

Here's how you can implement TiP through canary releases, blue-green deployments, slow rollout techniques such as controlled test flight, A/B testing, synthetic user/bot-based testing to generate production load, fault injection testing/chaos engineering, and dogfooding.
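The simplest of these techniques to start with is a slow rollout: route a small, stable percentage of users to the new code path and watch your error rates before widening the ring. Here is a minimal sketch of deterministic percentage bucketing in Python; the feature name and threshold are hypothetical, and real feature-flag systems add targeting rules and kill switches on top of this idea.

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into [0, 100) for a given feature.

    The same user always lands in the same bucket, so each user's
    experience stays stable as the rollout percentage ramps up.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0  # 0.00 .. 99.99
    return bucket < rollout_percent

# Route roughly 5% of users to the new code path.
path = "candidate" if in_rollout("user-42", "new-checkout", 5.0) else "stable"
```

Hashing the feature name together with the user ID means different features ramp up over independent slices of your user base, so one experiment's cohort doesn't contaminate another's.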

Every release is a production test

When the Twitter team started on the path to TiP back in 2010, it ran a cost-benefit analysis and concluded that TiP would have high value, but at a high cost (risk of outage), so it came up with these risk-mitigation strategies.

Tie your production tests into your CI/CD pipeline

Testing in production: rethinking the conventional deployment pipeline

The Guardian integrates its production tests into the CI/CD pipeline, linking test results directly to GitHub pull requests to complete the feedback loop for developers. Its RiffRaff deployment tool and Prout pull-request feedback tool are both available as open source.

Salesforce testing best practice: why you should regularly run production tests

Salesforce advocates re-running tests in production on a regular cadence (not only at release time) to catch failures caused by system changes early, rather than discovering them only after a later deployment. It provides a framework for doing so in its Gearset testing system.

Minimize the negative impact of production tests

Scientist: Measure twice, cut over once

Use GitHub’s Scientist framework to deploy new releases and send production requests down a new path to potentially uncover new bugs while also preventing end users from experiencing errors due to those bugs. Scientist serves the correct output to your users, compares old (control) and new (experimental) outputs, and alerts you if there's a mismatch. The two-year-old framework has been ported to multiple languages.
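The core pattern is simple enough to sketch: run the trusted code path and the candidate side by side, always return the trusted result to the user, and record any mismatch. The following is a hand-rolled Python illustration of that idea, not Scientist's actual API; the real library (and its ports) also randomizes execution order, measures timing, and supports sampling.

```python
import logging

def experiment(name, control, candidate, publish=logging.warning):
    """Run both code paths, return the control's result, report mismatches."""
    control_result = control()  # the trusted path; its result is what users see
    try:
        candidate_result = candidate()
        if candidate_result != control_result:
            publish(f"{name}: mismatch control={control_result!r} "
                    f"candidate={candidate_result!r}")
    except Exception as exc:  # a buggy candidate must never break the user
        publish(f"{name}: candidate raised {exc!r}")
    return control_result

# Example: comparing an old permissions check against a (buggy) rewrite.
allowed = experiment("permissions", lambda: True, lambda: False)
# 'allowed' is True: the user sees the control result despite the mismatch.
```

The key property is that the candidate's bugs, including raised exceptions, surface as published mismatch reports for you rather than as errors for your users.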

Move fast and fix things

GitHub uses Scientist for its own releases. Here it shares the details of one release experiment where the team found and fixed serious issues in its merge code over four days of testing in production—without affecting its users.

Understand the principles of chaos engineering

Principles of chaos engineering

This community-maintained document is a great first introduction to chaos engineering. It defines "chaos engineering"—experimentation on a system to uncover its weaknesses—and lists the principles agreed upon by the chaos-engineering community.
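Those principles boil down to a repeatable experiment: define a measurable steady state, inject a fault, and check whether the steady-state hypothesis still holds during and after the fault. Here is a minimal sketch of that loop, assuming hypothetical `measure`, `inject`, and `stop` hooks supplied by your own monitoring and fault-injection tooling.

```python
def run_chaos_experiment(measure, inject, stop, threshold):
    """Test a steady-state hypothesis before, during, and after a fault.

    measure()        -> current value of the steady-state metric (e.g. error rate)
    inject() / stop()-> start and stop the fault injection
    threshold        -> the metric must stay below this for the hypothesis to hold
    """
    baseline = measure()
    if baseline >= threshold:
        raise RuntimeError("system is not in steady state; aborting experiment")
    inject()
    try:
        during = measure()
    finally:
        stop()  # always contain the blast radius, even if measurement fails
    after = measure()
    return {
        "baseline": baseline,
        "during": during,
        "after": after,
        "hypothesis_held": during < threshold and after < threshold,
    }
```

Note the two guardrails: the experiment aborts if the system isn't healthy to begin with, and the fault is always stopped even if measurement blows up mid-experiment.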

Chaos engineering upgraded

The principles of chaos engineering originated at Netflix, which documented them during the development of Chaos Monkey, its open-source tool for random fault injection. In 2015, the Netflix team augmented its chaos toolkit with Chaos Kong, a tool that mimics the outage of an entire AWS region. This post describes Netflix's Chaos Kong exercise and another experiment, with the Subscriber service.

Breaking things on purpose

Breaking things on purpose is preferable to being surprised when things break, says Mathias Lafeldt, infrastructure developer at Gremlin. When you do it on purpose, you can break things at a time and place that's convenient for you, he explains in this blog post.

Chaos testing—Preventing failure by instigation

Learn the definition of "chaos testing" from Mark Harrison, senior consultant at Cake Solutions, and get some thoughts from the Chaos Community Day conference.

Practice chaos engineering techniques

Chaos engineering 101

Get started with chaos experiments, from principles to specific steps, with this article by Mathias Lafeldt.

The discipline of chaos engineering

The what, why, and how of chaos engineering as described by chaos-as-a-service provider Gremlin.

Planning for chaos with MongoDB Atlas: Using the "test failover" button

Here's an example of how to do chaos testing for MongoDB. MongoDB Atlas makes it easy with its built-in "Test Failover" button, which simulates the failure of a replica set's primary node so you can verify that your application rides out the resulting election.

A primer on automating chaos

Here's a walkthrough of the progression toward fully automated chaos engineering. Don’t worry; you don’t have to automate it all at once.

The limitations of chaos engineering

While chaos engineering is a great tool for improving the resilience of your system, it is not a panacea. Here's where it's a fit—and where it's not.

Use fault injection and chaos tools

Chaos toolkit

This resource provides a command-line interface that encapsulates the chaos-engineering workflow, along with tutorials.

The Netflix Simian Army

The well-known “Chaos Monkey” and the rest of Netflix's Simian Army have been used since 2011 to randomly break production systems and see if they are fault-tolerant.

Using Chaos Monkey whenever you feel like it

The original purpose of Chaos Monkey was to test resilience by killing off parts of your production system at random; this engineer uses it to kill Amazon EC2 instances in a controlled way.
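The "controlled" part is the point: limit the blast radius to instances that have explicitly opted in, and kill at most one per run. Here is a sketch of that selection logic in Python; the opt-in tag name is hypothetical, and in production the `ec2` argument would be a real boto3 EC2 client rather than a test double.

```python
import random

def pick_victim(instances, opt_in_tag="chaos-monkey:enabled"):
    """Pick one random instance from those explicitly opted in to chaos.

    'instances' is a list of dicts like
    {"id": "i-0abc123", "tags": {"chaos-monkey:enabled": "true"}}.
    """
    candidates = [i for i in instances
                  if i.get("tags", {}).get(opt_in_tag) == "true"]
    return random.choice(candidates) if candidates else None

def terminate_one(ec2, instances, dry_run=True):
    """Terminate at most one opted-in instance per run.

    'ec2' is anything with a boto3-style terminate_instances method.
    Killing a single tagged instance at a time, defaulting to a dry run,
    is what keeps this an experiment rather than an outage.
    """
    victim = pick_victim(instances)
    if victim is None:
        return None
    ec2.terminate_instances(InstanceIds=[victim["id"]], DryRun=dry_run)
    return victim["id"]
```

Because the EC2 client is passed in, the selection and safety logic can itself be tested with a fake client before it ever touches a real fleet.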

FIT: Failure injection testing

Netflix developed the FIT framework in 2014 to give its engineers more control over the chaos.

From chaos to control—Testing the resiliency of Netflix’s content discovery platform

This is an example of using Latency Monkey (from the Simian Army suite) and FIT to test Netflix’s Merchandise Application Platform.

Automated failure testing: Training smarter monkeys

Netflix continued iterating on its toolkit with this 2016 prototype tool based on Molly, a fault injector that uses request lineage data.

Lineage-driven fault injection

For research background on the Molly approach, read this original University of California at Berkeley paper.

How we break things at Twitter: Failure testing

Twitter’s framework for injecting faults into its production system (power loss, network loss, service unavailability) consists of mischief, monitoring, and notifier modules tied together with a Python library. Sadly, it is not open source, but a good architectural overview is provided.

Systematic resilience testing of microservices with Gremlin

This open-source Python framework from IBM for fault injection testing of microservices should serve as a companion to—not a replacement for—Chaos Monkey.

ChaosCat: Automating fault injection at PagerDuty

ChaosCat is not open source, but serves as an inspiration. PagerDuty implemented it as an always-on service, with a Slack bot interface for one-off invocation. As a service, it continuously throws randomly chosen attacks at PagerDuty’s hosts.

Pumba—Chaos testing for Docker

Pumba is a newer, Chaos Monkey-like tool for resilience testing of Docker containers.

Run game-day exercises

Fault injection in production: Making the case for resilience testing

This seminal 2012 paper from Etsy lays out the argument for testing in production with intentional fault injection, and provides a pattern for constructing a game-day exercise. The exercise helps the system learn from exposure, à la vaccination.

3 lessons learned from an Elasticsearch game day

Datadog describes how it ran a game-day event on its Elasticsearch cluster in order to learn which failure modes it handled easily and which caused unexpected problems.

Game day exercises at Stripe: Learning from 'kill -9'

Stripe suggests that you stick to the simplest failure scenarios when starting out with game-day exercises. Its first choice was a basic “kill -9” on the primary node of a Redis cluster, which unexpectedly resulted in data loss. Here are the lessons learned.

Our first engineering game day

If you're new to game days and not ready to inflict potential pain on your high-value customers, consider this startup’s approach: It ran a game day in a staging environment instead of production. This also works well for intentionally exceeding the tolerance limits of your system, to train the team on incident response. (Full disclosure: I am the author of Quid’s game-day blog post.)

This way to more chaos

If you still can't get enough of chaos engineering and testing in production, you'll find additional resource lists on GitHub, each with a slightly different collection. The first is an "awesome"-style list that includes more articles, tools, books, conferences, and blogs. The second is a curated list of resources on testing distributed systems, which covers chaos engineering, game days, and more.

Those are my picks for the best resources on all things chaos engineering. If you recommend other resources on chaos engineering or TiP, let me know by posting them in the comments below.

Topics: Quality