How Ancestry used AI to optimize its microservices apps

Darek Gajewski, Principal Infrastructure Analyst, Ancestry

IT has been transformed over the past decade. Microservices and DevOps have accelerated time-to-value for code, and cloud computing has made infrastructure almost completely programmable. Everyone is scrambling to adapt as efficiently as possible to this new landscape.

With a microservices model, you can roll out new application features quickly and often. This is highly productive and agile. However, although the continuous integration/continuous delivery (CI/CD) toolchain ensures that everything is tested and deployed rapidly, the post-release portion of the delivery pipeline gets neglected.

Across the board, apps are chronically misconfigured, leading to inefficiency and overspending. Many developers know this, but they don't know how to address it.

Fortunately, there's a way forward: You can continuously optimize application performance in your microservices while avoiding the overprovisioning of resources—without extra person-hours—by leveraging machine learning and artificial intelligence (AI).

At Ancestry, we don't have the bandwidth to spend on investigating, tuning, and subsequently testing to continually optimize our applications. I imagine this is true at other companies also. AI has been essential for us to reduce the pressure on our developers to optimize, increase our overall utilization, and ultimately reduce our cloud spend.

Microservices' complexity makes human optimization impossible

When our teams were pushing out code on a monthly or quarterly basis, they had the bandwidth and the ability to manually tune application performance. Today they are managing weekly and daily releases.

And as the adoption of microservice infrastructures continues to grow, so does the resulting complexity. Application infrastructure—rooted in the dynamism of microservices, containers, and cloud instances—is increasingly hard to manage.

In this scenario, it becomes impossible for engineering teams to select the right resources and parameter settings. Look at it this way: Even a simple five-container application can have more than 255 trillion permutations of resources and basic parameters.
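To see how quickly the numbers get that large, consider a back-of-the-envelope calculation. The per-container option counts below are assumptions chosen for illustration, not Ancestry's actual figures; even these modest counts push past the 255 trillion mark:

```python
# Back-of-the-envelope illustration of the combinatorial explosion.
# The per-container option counts are assumptions for the sake of the
# arithmetic, not Ancestry's actual figures.
containers = 5
instance_types = 10   # assumed candidate instance types per container
cpu_settings = 10     # assumed discrete CPU-share levels
memory_settings = 8   # assumed discrete memory sizes

per_container = instance_types * cpu_settings * memory_settings  # 800
total = per_container ** containers

print(f"{total:,} configurations")  # 327,680,000,000,000 -- past 255 trillion
```

Because the per-container choices multiply across containers, adding a single extra tunable to each container multiplies the total again, which is why no human team can enumerate the space.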

This makes for a daunting number of potential configuration tweaks available every second. To manage this system, you would need flawless knowledge of the entire infrastructure across every layer, and of the application workload itself.

To maximize resources and optimize spending, you would need to measure metrics such as requests per second or response time, while tweaking settings such as virtual machine (VM) instance type, CPU shares, thread count, garbage collection, memory pool sizes, and more.
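One way to frame that measure-and-tweak problem is as a single objective that trades throughput against cost and latency. The function below is a minimal sketch under assumed names and weights; the metric names, the SLO threshold, and the linear penalty are all illustrative, not any real platform's API:

```python
# Sketch of a single optimization objective combining the metrics above.
# The metric names, weights, and SLO threshold are illustrative assumptions.

def score(requests_per_sec: float, p99_latency_ms: float,
          hourly_cost_usd: float, latency_slo_ms: float = 200.0) -> float:
    """Higher is better: reward throughput per dollar, penalize SLO misses."""
    throughput_per_dollar = requests_per_sec / hourly_cost_usd
    slo_penalty = max(0.0, p99_latency_ms - latency_slo_ms)  # linear penalty past SLO
    return throughput_per_dollar - slo_penalty

# Comparing two candidate configurations of the same service:
baseline = score(requests_per_sec=1200, p99_latency_ms=180, hourly_cost_usd=4.0)
tuned    = score(requests_per_sec=1150, p99_latency_ms=190, hourly_cost_usd=2.5)
print(baseline, tuned)  # 300.0 460.0 -- the tuned config wins on cost
```

Collapsing many metrics into one score is exactly what makes automated search tractable: the optimizer only needs to know which of two configurations scores higher.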

And if that isn't challenging enough, all of this is in constant flux: developers are releasing new features, middleware is getting updated, behavior patterns are shifting, and cloud vendors are releasing new resource options all the time.

When you're attempting to optimize, getting the right instance, the right number of instances, and the right settings in each instance involves numerous interdependencies that are simply beyond human reach.

That's why, in a microservices model, cloud and mobile apps chronically run with worse performance and at a higher cost than their workloads warrant. The sheer number of parameter and resource settings available in modern application deployments means that true optimization is impossible without help. Instead, enterprises implement some basic performance monitoring and leave it at that.

Use neural networks and deep reinforcement learning for continuous optimization

If optimizing application performance in the microservices era is too much to put on human teams, what is the answer? Artificial intelligence.

Machine learning, the branch of AI that is revolutionizing so much of the digital and high-tech landscape, offers the level of comprehension and insight that automated optimization requires. Neural networks, based on the interconnectivity and activation of neurons in the human brain, can develop internal states that represent patterns hidden in datasets.

With deep reinforcement learning, you can map and analyze the entire shape of the infrastructure, and register all interdependencies and implications. The system can pay continuous, granular attention to how shifts in every kind of setting affect performance.

With this machine-learning combination, an AI-driven optimization platform can continuously examine millions of combinations of configurations, in order to identify the optimal combination of resources and parameter settings.

An AI-driven platform can absorb metadata about an application, make small tweaks to resource assignments and configuration settings to enhance performance or reduce cost, and then remeasure continuously whenever changes are made.
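The tweak-and-remeasure loop can be sketched in a few lines. This is a toy stand-in, not the platform described here: a real system would use reinforcement learning against live production metrics, whereas the sketch below uses a synthetic cost surface and greedy hill-climbing so it stays self-contained. All names and numbers are illustrative assumptions:

```python
import random

# Toy stand-in for the tweak-and-remeasure loop described above. A real
# platform would learn from live metrics; here a synthetic cost function
# and greedy hill-climbing keep the sketch self-contained. All names and
# numbers are illustrative assumptions.

random.seed(42)

def measured_cost(config: dict) -> float:
    """Synthetic 'cloud bill + latency' surface, best at cpu_shares=512, heap_mb=2048."""
    return (abs(config["cpu_shares"] - 512) / 512
            + abs(config["heap_mb"] - 2048) / 2048)

def tweak(config: dict) -> dict:
    """Make one small change to a randomly chosen setting."""
    candidate = dict(config)
    key = random.choice(list(candidate))
    step = {"cpu_shares": 64, "heap_mb": 256}[key]
    candidate[key] += random.choice([-step, step])
    return candidate

config = {"cpu_shares": 1024, "heap_mb": 4096}   # deliberately overprovisioned
best = measured_cost(config)
for _ in range(500):            # iterate: tweak, remeasure, keep only improvements
    candidate = tweak(config)
    cost = measured_cost(candidate)
    if cost < best:
        config, best = candidate, cost

print(config)  # converges near {'cpu_shares': 512, 'heap_mb': 2048}
```

Even this naive loop shrinks the overprovisioned starting point toward the true optimum; the value of the learned approach is that it does the same across thousands of interdependent settings, under shifting workloads, without a hand-written cost function.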

Best of all, AI can proactively tune the settings that are typically too complex to manage by hand: resources such as CPU and memory, middleware configuration variables such as Java Virtual Machine garbage-collection type and pool sizes, kernel parameters such as page sizes and jumbo-packet sizes, and application parameters such as thread pools, cache timeouts, and write delays.

It can also react constantly to new traffic patterns, new code, new instance types, and all other relevant factors.

An effective combination

With this level of insight and intervention, AI can automatically select the optimal combinations of infrastructure parameters in real time for every given workload, to quickly optimize for performance and efficiency. Developers can set the business goals, while the AI performs the task.

And all of the operating information can be constantly fed back into the neural network, which processes and learns from everything it sees, so that insights compound. With a setup like this, the optimization engine keeps getting better at tuning performance and improving efficiency.

This is how we implemented our platform at Ancestry. After integrating it into our CI/CD pipeline, our AI-driven solution now iterates several times a day to find each application's optimal settings within its thread pools, garbage collectors, resource settings, and any other tunable settings within our Java, .NET, Solr, or Node.js applications.

The result is a more efficient infrastructure, less strain on our developers, and a smaller cloud bill.

End guesswork in provisioning

At present, most tech companies that are operating in a microservice and DevOps setting prioritize reliability over performance and cost. This makes sense. The last thing anyone wants is for things to break down. Overprovisioning is a safety net; we buy ourselves peace of mind with extra, unused resources.

But with the possibility of automated and continuous optimization, this is no longer necessary. Now it's possible for a well-optimized infrastructure, personalized to the workload, to deliver the same or better reliability, at higher performance, and at a much lower cost.

Whether you want to develop your own solution or find something off the shelf, AI is essential to creating an efficient and scalable cloud-based infrastructure.

For more on using AI to optimize your IT infrastructure, come to my presentation at KubeCon + CloudNativeCon North America, where I'll talk about how Ancestry used AI to get Kubernetes running two times better per dollar. The conference runs November 18-21, 2019, in San Diego, California.
