Dog in a sidecar

Make your app architecture cloud-native with a service mesh

As technology advances, developers find themselves solving similar problems in different contexts. Every five to ten years, software engineers must deal with shifts in how they architect and implement their solutions. This time around, it's microservices and the cloud.

With microservices architectures, you make more network calls and need more integration to get your system working. This creates more ways for applications to break and lets failures propagate much faster. You need a way to call your microservices, and to make them resilient to distributed-systems failures as a first-class concern.

Previous generations of tools that solved these problems were built by software engineers at companies like Twitter or Netflix, but they did so with application libraries. For each combination of platform, language and framework you use to build microservices today, you need to solve for the following critical functions:

  • Routing / traffic shaping
  • Adaptive/client-side load balancing
  • Service discovery
  • Circuit breaking
  • Timeouts/retries 
  • Rate limiting
  • Metrics/logging/tracing
  • Fault injection

Doing all of these things in application-layer libraries across all of your languages and frameworks becomes incredibly complex and expensive to maintain. Fortunately, a service mesh solves these problems more elegantly by pushing those concerns down to the infrastructure layer.
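To make the library approach concrete, here is a rough sketch (not any particular library's real API) of what hand-rolled, client-side retry logic might look like in Python. Every team building on a different stack had to reinvent and maintain something like this:

```python
import time

class RetryPolicy:
    """Minimal sketch of client-side retries.

    Illustrative only: real libraries (Hystrix, Finagle, etc.) layer
    jittered backoff, retry budgets, and circuit breaking on top.
    """

    def __init__(self, max_attempts=3, backoff_s=0.0):
        self.max_attempts = max_attempts
        self.backoff_s = backoff_s

    def call(self, fn):
        last_error = None
        for _attempt in range(self.max_attempts):
            try:
                return fn()
            except ConnectionError as err:  # transient network failure
                last_error = err
                time.sleep(self.backoff_s)
        raise last_error  # budget exhausted: surface the failure

# Simulated flaky upstream: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_service():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("upstream reset")
    return "ok"

policy = RetryPolicy(max_attempts=3)
result = policy.call(flaky_service)  # "ok" on the third attempt
```

Multiply this by timeouts, load balancing, discovery, and tracing, then by every language and framework in your organization, and the maintenance cost becomes clear.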


The cloud and service architectures

Recently, cloud infrastructure has become not only ubiquitous, but the norm. Having on-demand, configurable, elastic infrastructure lets you build capabilities with software that were previously available only to the big “WebOps” shops. These include service-oriented architectures and microservices.

With a services architecture, you take a system view of your applications designed as individually deployable, autonomous services that cooperate and interact with one another to solve business problems.

In the past, developers built “monolith” or “n-tier” applications, where all of the modules and components in an application (or tier) lived in the same memory space. The focus was on abstractions and boundary design between modules, but everything was co-located and ultimately deployed within the monolith. This brought a lot of conveniences and solved some difficult problems. For example, function calls between modules within an application were expected to be reasonably fast, and they would either succeed or fail outright; they would not hang or partially complete.

Additionally, developers could take advantage of things such as ACID (atomicity, consistency, isolation, durability) databases, which shielded them from serious data inconsistencies. They also had unit tests, logging, compilers, and debuggers: powerful tools that helped to build, run, and maintain these applications.

New problems, new opportunities

As you look at building systems with these new infrastructure capabilities and with services architectures, you need to solve many of the same problems you faced in the past. However, since the game has shifted drastically, many previously cherished tools and their accompanying assumptions no longer hold.

Ultimately, you are now building a distributed system where you have little to no control over your collaborating services, and even less over your communication medium. For example, in a monolith, all of the components of the system are written in the same language.

In a services architecture, different parts of the system can be written in different programming languages or frameworks. When communicating over typical cloud networks, you also have to deal with another phenomenon: calls to collaborators can fail, be slow, or only partially succeed. These are issues you don’t necessarily deal with in a monolith, yet any one of them can take down your entire system. You need to design your applications with the network in mind.

These are things you either learn from your own experience or from your predecessors. For example, Netflix has become a poster child of sorts for building these types of architectures. Netflix has been outstanding at sharing its learning and tools with the open-source community, as have a few other firms. These companies solved a lot of problems by investing significant engineering time and talent into solving network problems with language-specific frameworks and application libraries.

Things such as load balancing, service discovery, circuit breaking, retries and retry budgets, timeouts, and so on are all critical to building resiliency in these types of networked environments. For example, Netflix open-sourced its Hystrix library for doing things such as bulkheading and circuit breaking to help reduce the spread of failures across fault zones.
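To make the circuit-breaking idea concrete, here is a toy circuit breaker in Python. It is a sketch of the pattern in the spirit of Hystrix, not Hystrix's actual API: after a threshold of consecutive failures the circuit "opens" and calls fail fast without touching the network, then a "half-open" state lets a trial call through after a cool-down:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker (pattern sketch, not a real library API)."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                # Open circuit: fail fast, don't hammer a sick upstream.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result

# Two consecutive failures trip a breaker with threshold 2.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=60.0)
def down():
    raise ConnectionError("connection refused")

for _ in range(2):
    try:
        breaker.call(down)
    except ConnectionError:
        pass
# The circuit is now open: the next call fails fast with RuntimeError.
```

Failing fast like this is what stops one unhealthy service from tying up threads and cascading failures across fault zones.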

Google and Twitter invested heavily in their RPC systems (such as Stubby and Finagle) to solve these problems. As you become more polyglot (languages, frameworks, technology vintages, etc.), re-implementing this functionality for the many different combinations of languages, frameworks, and runtimes becomes a heavier burden and adds complexity. What's more, consistency across implementations becomes difficult to maintain.

The network is a horizontal concern

These types of horizontal concerns are not service-specific and are not differentiators for services. Architecturally, what you're trying to achieve is some way to intercept (or wrap) your interactions with the network in the logic responsible for providing functionality such as service discovery, retries, and circuit breakers. That’s what tools such as Hystrix, Finagle, Stubby, or ad-hoc application-level network interceptors do.

What if you could just do that irrespective of the implementation stack used to implement the service? That’s where things such as sidecar proxies and service mesh fit into the picture.

Meet Istio

Istio is an open source service mesh platform that helps developers and service operators solve some of these network problems in a framework- and language-neutral way. It deploys a small sidecar proxy (implemented with Lyft’s Envoy Proxy), collocated with your service, that mediates your service's communication with the rest of the system. In other words, the service talks directly to the proxy (possibly unknowingly), and the proxy talks to upstream services (and the reverse).

With this sidecar model, you can intercept and wrap your network calls with functionality such as retries, circuit breaking, timeouts, and load balancing, and relieve your application from having to bring in framework- and language-dependent libraries for this purpose. Moreover, this functionality is implemented consistently, regardless of application or service implementation details (Java, Node, C++, monolith, microservice, microlith, etc.).

With these sidecar proxies in place, collocated with your services, you’re able to consistently solve network resilience problems. You can also do interesting things with respect to exposing usage metrics of the service calls. Since your proxies work at L3/L4 as well as L7, you can mine interesting telemetry at both the network level (connections, bytes, etc.) and the application-request level (requests, retries, failures, etc.). You can even provide consistent distributed-tracing information.

With Istio, the sidecar proxies (the "data plane") collect, batch, and send this telemetry data back to an Istio control-plane component called Mixer. You can use Mixer to tap into the information flowing between services across the network, such as when sending data to metrics-collection tools such as Prometheus or feeding distributed-tracing tools that support OpenTracing, and even to enforce policies such as access-control lists or service-call quotas.

Since Mixer is built on a plugin model, you could plug in any other tools (such as API management) interested in observing and controlling the traffic going through the service mesh.


The instrumentation and resilience Istio offers open up interesting opportunities for deploying your applications. With cloud infrastructure, you can do things such as zero-downtime deployments with blue-green and rolling deployments. With Istio you can more finely control the traffic between services in the mesh by defining routing rules.

The Istio proxies in the mesh periodically check with Istio's control plane for updated configuration information, including routing rules. For example, if you want to do a staged deployment, a.k.a. a “canary deployment,” for a new version of your service (such as for doing A/B-style tests, or simply testing out changes before you initiate a rolling upgrade or blue-green deployment), you can tell Istio to send a specific fraction or percentage of live traffic to your services. You can also do more adaptive routing, such as routing to services in a different zone during failures.
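As an illustration, a weight-based canary rule in Istio's networking API might look something like the following config sketch. The `reviews` hostname and the `v1`/`v2` subsets are hypothetical, and the exact schema varies across Istio versions:

```yaml
# Hypothetical canary rule: send 90% of live traffic to version v1
# of a "reviews" service and 10% to the new version v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
```

The sidecar proxies pick up a rule like this from the control plane and split live traffic accordingly, with no change to application code.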

Go beyond the monolith

Istio provides the tools that you need to run mature services architectures in elastic cloud environments. A lot of these things — resilience, routing, observability — are reincarnations of some of the things you already had in your monolithic environments, but recast for modern architectures. If you’re looking for language- and platform-agnostic implementations of this functionality, or if you're struggling to wrap your head around moving from traditional tools to cloud-native tools, take a good look at Istio.
