How to estimate the IT Ops cost for software before it's written

It is a worrying question for any project or program: “What hardware do I need to buy?” These days you might not need to buy physical hardware at all, but someone still wants to know what your running costs are going to be, whether that’s expressed in virtual or physical compute resources.

In the past, this process was complicated because all projects were included in a yearly budget, and IT defined success as fully spending that budget. Moving to a more granular approach, where you allocate budget for each stage of your project (experiment, exploit, etc.) and define success in terms of customer or operational goals, makes life easier. Either way, you won’t get a green light for any new project unless you can estimate running costs.

So how do you do that?

If your company is using a microservices-based architecture, this task is somewhat easier because there is a strong correlation between the performance requirements of a microservice and its cost. So if you have existing microservices, you can benchmark them under different loads to calculate what it costs for each. Then, to calculate the costs for new microservices, you simply take the performance requirement each must meet and find the closest match from the data you already generated.

Using microservices with NFR testing

Performance is just one of a microservice's many nonfunctional requirements (NFRs), alongside resilience, security, and so on. Performance requirements are typically expressed as a response time at a given number of transactions per second (TPS), measured at a given percentile, typically the 95th or 99th.
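
To make the percentile measurement concrete, here is a minimal sketch of the nearest-rank method applied to raw latency samples. The numbers are invented for illustration; in practice your load-test tool reports these figures:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the response time that pct% of requests met or beat."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy latency sample in milliseconds -- not real benchmark output.
latencies_ms = [40, 42, 45, 47, 50, 52, 55, 60, 70, 120]
print(percentile(latencies_ms, 95))  # 120
print(percentile(latencies_ms, 50))  # 50
```

Note how the 95th percentile is dominated by the slow outlier (120 ms), which is exactly why percentile targets are stricter than averages.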

To prove that a potential release candidate of a microservice meets requirements, you would expect it to pass a series of tests that you can then leverage to create the data you need. These tests should be automated, spinning up test environments on the fly to generate the results. Note that if your tests are manual, you can still use this approach, but it will limit the amount of data you can generate.

For example, the results of a positive performance test for a given microservice that needs to handle 800 TPS in under 80 milliseconds might look like this:

| Microservice    | TPS | Response time (ms) | Resource                | Platform |
| --------------- | --- | ------------------ | ----------------------- | -------- |
| Product service | 800 | 73                 | 1 large virtual machine | IaaS     |


This tells you that in order for the microservice to achieve a response time of 73 milliseconds under a load of 800 transactions per second, the resource required will be a large-sized virtual machine (VM). And in the future, any microservice with the same performance requirement should require a similar level of resources.

The resource in question could be a PaaS, IaaS, or on-premises system. What is important is that you can calculate the cost based on the size of the resource. Actual costs should be published by your cloud or data-center provider and may include different options, depending on such factors as the period of time for which you need the resource.
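
Turning a resource size into a running cost is then a straightforward lookup against published prices. The prices below are invented placeholders, not real provider rates; substitute the figures your cloud or data-center provider publishes:

```python
# Hypothetical monthly prices -- replace with your provider's published price list.
MONTHLY_PRICE = {
    "1 small virtual machine": 35.0,
    "1 medium virtual machine": 70.0,
    "1 large virtual machine": 140.0,
}

def monthly_cost(resource: str) -> float:
    """Cost of the resource a benchmark run required."""
    return MONTHLY_PRICE[resource]

# Cost implied by the Product service benchmark above.
print(monthly_cost("1 large virtual machine"))  # 140.0
```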

Leverage your data: How to extrapolate costs for new microservices

There are, however, two main problems with this approach:

  • You only have figures for a particular performance load/resource combination, which is unlikely to be the same for all future microservices, and…
  • The data comes only from one service, so you can’t be confident that it is representative.

You can solve the first issue by running the performance tests again using different transactions per second/resource combinations.

First run the same performance test, but constrain the resources:

| Microservice    | TPS | Response time (ms) | Resource                 | Platform |
| --------------- | --- | ------------------ | ------------------------ | -------- |
| Product service | 800 | 73                 | 1 large virtual machine  | IaaS     |
| Product service | 800 | 118                | 1 medium virtual machine | IaaS     |
| Product service | 800 | 148                | 1 small virtual machine  | IaaS     |


Although you can see that for this particular service a small VM isn’t enough to hit the performance NFR, this is still useful data: you now know that if a future service has a performance requirement of 800 TPS with a response time under 150 ms, you can meet it for the cost of a small VM.
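
Picking the cheapest resource that still meets an NFR can be sketched as a filter over the constrained benchmark runs above. The size ordering and helper name are assumptions for illustration:

```python
# Benchmark runs for the Product service under constrained resources (table above).
CONSTRAINED_RUNS = [
    {"resource": "1 large virtual machine",  "tps": 800, "response_ms": 73},
    {"resource": "1 medium virtual machine", "tps": 800, "response_ms": 118},
    {"resource": "1 small virtual machine",  "tps": 800, "response_ms": 148},
]

# Assumed cost ordering, smallest (cheapest) first.
SIZE_ORDER = {"1 small virtual machine": 0,
              "1 medium virtual machine": 1,
              "1 large virtual machine": 2}

def cheapest_resource(runs, tps, max_response_ms):
    """Smallest resource whose benchmark met the response-time target at this load."""
    ok = [r for r in runs if r["tps"] == tps and r["response_ms"] < max_response_ms]
    return min(ok, key=lambda r: SIZE_ORDER[r["resource"]])["resource"]

print(cheapest_resource(CONSTRAINED_RUNS, 800, 150))  # 1 small virtual machine
```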

However, each new microservice is likely to have a different performance profile, so you need to know how each will perform under different loads. So this time repeat the tests above using a selection of different transactions per second. You’ll then end up with a matrix of "transactions per second/resource required" with the response time results.
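
Generating that matrix is just a nested loop over TPS levels and resource sizes. In this sketch, `run_perf_test` is a stub returning simulated numbers from a toy model; in practice it would drive your load-test tooling and return the measured percentile response time:

```python
from itertools import product

# Assumed test dimensions -- pick levels that bracket your expected workloads.
TPS_LEVELS = [200, 400, 800]
RESOURCES = ["1 small virtual machine",
             "1 medium virtual machine",
             "1 large virtual machine"]

def run_perf_test(service: str, tps: int, resource: str) -> int:
    """Stub: simulated response time in ms. Replace with a real load-test run."""
    capacity = {"1 small virtual machine": 1,
                "1 medium virtual machine": 2,
                "1 large virtual machine": 4}[resource]
    return 40 + tps // (10 * capacity)  # toy model, not a real benchmark

# One row per TPS/resource combination, with the response-time result.
matrix = [
    {"microservice": "Product service", "tps": tps, "resource": res,
     "response_ms": run_perf_test("Product service", tps, res)}
    for tps, res in product(TPS_LEVELS, RESOURCES)
]
```

With automated environment provisioning, filling in this matrix is one pipeline run per service rather than weeks of manual testing.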

Rinse, dry, repeat the process

Use this approach for your other microservices to increase your confidence level that your forecast will be accurate. The more microservices you create and test, the higher the quality of your data.

For example, imagine that you have done all of the above and want to forecast the cost for a new recommendation service. Based on the requirements you captured, the performance NFR says the system needs to cope with 200 TPS with a response time under 60ms. If you use this to query your data, it might look like this:

Query: TPS = 200, response time < 60 ms

| Microservice      | TPS | Response time (ms) | Resource                 | Platform |
| ----------------- | --- | ------------------ | ------------------------ | -------- |
| Product service   | 200 | 60                 | 1 medium virtual machine | IaaS     |
| Customer service  | 200 | 45                 | 2 small virtual machines | IaaS     |
| Payment service   | 200 | 45                 | 1 medium virtual machine | IaaS     |
| Inventory service | 200 | 55                 | 1 medium virtual machine | IaaS     |
| Loyalty service   | 200 | 58                 | 1 medium cloud service   | PaaS     |


This is now much more powerful. You now know that, based on the performance of your existing microservices, the new service will require a single, medium-sized VM. You can then confirm the current price with your data center or cloud provider to get the estimated operating cost.
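
The query itself can be sketched as a filter plus a vote over the matching rows, using the table above as data. The `forecast_resource` helper is hypothetical, and it treats results at or under the target as a match:

```python
from collections import Counter

# Benchmark rows from the query results above.
BENCHMARKS = [
    {"microservice": "Product service",   "tps": 200, "response_ms": 60, "resource": "1 medium virtual machine"},
    {"microservice": "Customer service",  "tps": 200, "response_ms": 45, "resource": "2 small virtual machines"},
    {"microservice": "Payment service",   "tps": 200, "response_ms": 45, "resource": "1 medium virtual machine"},
    {"microservice": "Inventory service", "tps": 200, "response_ms": 55, "resource": "1 medium virtual machine"},
    {"microservice": "Loyalty service",   "tps": 200, "response_ms": 58, "resource": "1 medium cloud service"},
]

def forecast_resource(benchmarks, tps, max_response_ms):
    """Most common resource among services that met the target at this load."""
    matches = [b["resource"] for b in benchmarks
               if b["tps"] == tps and b["response_ms"] <= max_response_ms]
    resource, _count = Counter(matches).most_common(1)[0]
    return resource

print(forecast_resource(BENCHMARKS, 200, 60))  # 1 medium virtual machine
```

Taking the most common match is one reasonable heuristic; you could equally take the cheapest match, or flag disagreement between services for a human to review.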

Now increase your estimate accuracy

The level of accuracy you can attain using the process I’ve described above might be “good enough,” and is hopefully a step up from what you are doing today. However, if you need greater accuracy because, for example, you are trying to estimate the cost of hundreds of services, you can still increase it.

One way is to add more attributes on which to filter. You might, for example, want to add “technology,” “compute complexity,” or “vendor” attributes. In this way, you can do a better job of comparing like with like.
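
Extra attributes slot naturally into the same filtering approach. This sketch matches rows against arbitrary keyword criteria; the attribute names and sample rows are invented for illustration:

```python
def filter_benchmarks(benchmarks, **criteria):
    """Keep only rows whose attributes match every criterion supplied."""
    return [b for b in benchmarks
            if all(b.get(key) == value for key, value in criteria.items())]

# Invented sample rows carrying the extra attributes.
rows = [
    {"microservice": "Payment service", "technology": "Java",    "vendor": "AWS",   "tps": 200},
    {"microservice": "Loyalty service", "technology": "Node.js", "vendor": "Azure", "tps": 200},
]

print(filter_benchmarks(rows, technology="Java", tps=200))
```

Each added attribute narrows the comparison, so you are matching a new service against benchmarks that genuinely resemble it.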

In the real world, however, you may be faced with many more combinations of hardware options and multiple instances where a specific type of resource is required. This shouldn’t be an issue if you are using automated testing to generate your results, but it will not be feasible if you’re using a manual approach, since the number of combinations required will rise.

Redefine IT Ops success in DevOps

My measure of success for a DevOps project is the time between when you create a business hypothesis and when you have the empirical data needed to either prove or disprove it. The keystone of this is having the right structure for your teams, requirements, and architecture. But these alone won’t guarantee success.

Enterprises in particular have other hurdles to clear, such as forecasting operational costs. In the past, projects have suffered from finger-in-the-air estimates, and IT has repeatedly had to go back to ask for more capital, or simply define success as spending the entire budget. In some cases, potentially successful projects never see the light of day because costs were highballed out of fear.

That’s why I’m an advocate of this different approach. The road to DevOps success requires that you estimate operational costs based on empirical data, have high confidence that your data is accurate, and define success as when the customer need or desired operational outcome has been met.

Want to know more? I'll be speaking in depth on this topic at the upcoming DevOps Enterprise Summit London. Or post your questions and comments below. I hope to see you at the summit.
