
3 production QA practices that will save your business money

Most organizations use quality assurance (QA) practices to improve the quality of their systems, hoping that, in the long run, the increase in quality will lead to lower costs and higher profitability. The problem is that they may be wasting too much time worrying about issues that will never happen.

QA in production is a set of emerging DevOps techniques that focus on fostering a keen awareness of what the actual issues are in production. These practices complement (or in some cases unseat) traditional preproduction QA practices by providing fast feedback and vital diagnostic data. While some organizations may find the idea of adopting production QA practices risky or intimidating, doing so can have a positive impact on a business’s bottom line. Here's my trifecta of techniques you can use, and how they will save your business money.

DevOps Enterprise Summit: Experts share lessons learned

Alert on unexpected scenarios

Every software team I’ve been a part of has spent time worrying about the unexpected. What if the data from that system isn’t what we expect? What if this value is null? What happens if users click on this before they click on that? Such uncertainty is a natural part of software development, given how inherently unpredictable real users, networks, devices, and other systems are. The problem is that this uncertainty costs money.

Systems analysts agonize for days over potential issues. Developers spend hours writing and discussing code to deal with each of these edge cases. Any future developers on that system then also spend extra time maintaining these additional lines of code. Often, such code can materially affect the design of a code base, leading to unnecessary abstraction or complicated code.

An alternative approach is to write no handling code at all and simply have the system tell you when the scenario occurs. The simplest way is to add an error entry to your logs. For example:

if (theData.importantValue !== expectedValue) {
   logger.error('Important data had an unexpected value: ' + theData.importantValue);
}

Now you can forget about the edge case until it actually happens in production. When it does, you can analyze the impact and the frequency of the issue and decide whether it’s worth the cost of addressing. While logs are an important way to get the information you need, I’d highly recommend setting up a corresponding alert to ensure that your team knows as quickly as possible that the scenario has occurred.
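One way to pair the log entry with an alert is to route these checks through a small helper. The sketch below is only an illustration: `createReporter`, `notify`, and the scenario name are hypothetical, and the injected `logger` is assumed to expose an `error` method (here `console` stands in for it). It logs every occurrence, counts occurrences per scenario so you can judge frequency later, and alerts on the first one only.

```javascript
// Hypothetical helper: logs an unexpected scenario and raises an alert.
// `logger` and `notify` are injected, so any logging or paging tool can be plugged in.
function createReporter(logger, notify) {
  const counts = new Map(); // scenario name -> number of occurrences

  return {
    report(scenario, details) {
      const seen = (counts.get(scenario) || 0) + 1;
      counts.set(scenario, seen);
      logger.error('Unexpected scenario: ' + scenario + ' ' + JSON.stringify(details));
      // Alert only on the first occurrence to avoid flooding the team.
      if (seen === 1) {
        notify(scenario, details);
      }
      return seen;
    },
    count(scenario) {
      return counts.get(scenario) || 0;
    }
  };
}

// Usage: console is the logger; an in-memory list stands in for a real alert channel.
const sentAlerts = [];
const reporter = createReporter(console, (scenario, details) => {
  sentAlerts.push({ scenario, details });
});

const theData = { importantValue: null };
const expectedValue = 'ready';
if (theData.importantValue !== expectedValue) {
  reporter.report('important-value-mismatch', { actual: theData.importantValue });
}
```

Keeping a per-scenario count is a design choice: it gives you the frequency data you need when deciding whether the edge case is worth fixing.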

Many of these scenarios will never happen, which means you save time and get to focus on something more important.

Keep an audit log

Data storage is cheap these days. Use this to your advantage and store important metadata about what happens on your system. Let’s take the example of an invoice payment system. There will be several steps during the course of getting an invoice paid. First, the invoice is loaded into the system. Next, a number of people may need to approve the payment. After that, the invoice needs to be scheduled for payment, paid, and marked as fulfilled. A process such as this will often end up being more complex than it sounds, and things might go wrong in the middle somewhere.

So what do we do when an irate creditor phones us because it hasn’t been paid? In most cases, we scramble. We ask a large number of developers to spend many hours trying to figure out what happened. We look at the data, pore over the code, and try to find the bug that may have caused it. QAs spend time helping imagine what scenario may have occurred based on the creditor’s complaint.

Instead, save a little bit of useful metadata at each step. Imagine having the following records available in your audit collection or table:

[{
   invoiceId: '123',
   time: '2017-05-12T10:00:35.123Z',
   type: 'loaded',
   details: { userId: '780', amount: 1000 }
},
{
   invoiceId: '123',
   time: '2017-05-13T11:00:35.123Z',
   type: 'approval_request_sent',
   details: { userId: '44', emailAddress: 'sue@example.com' }
},
{
   invoiceId: '123',
   time: '2017-05-13T11:00:35.123Z',
   type: 'approved',
   details: { userId: '77', amendments: { amount: 445 } }
},
{
   invoiceId: '123',
   time: '2017-05-13T15:00:35.123Z',
   type: 'scheduled',
   details: { userId: '645', expectedPaymentDate: '2017-05-15T00:00:00.000Z', paymentBatchId: 991 }
}]

Based on the above information, the team would be able to see that a payment was scheduled (and the date for which it was scheduled), but that the payment never took place. This helps pinpoint where in a complex process something went wrong. The team can now look at the logs for a specific payment batch to diagnose the issue.
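That reading of the trail can also be automated. The following is a minimal sketch, not a definitive implementation: the function name `summarizeAuditTrail` and the step order in `EXPECTED_STEPS` are assumptions, and the `paid` and `fulfilled` type names are guesses based on the process described earlier. It reports the last step on record for an invoice and the steps that are still missing.

```javascript
// Assumed order of steps; 'paid' and 'fulfilled' are hypothetical type names.
const EXPECTED_STEPS = [
  'loaded', 'approval_request_sent', 'approved', 'scheduled', 'paid', 'fulfilled'
];

// Returns the furthest expected step on record and the steps with no record.
function summarizeAuditTrail(records, invoiceId) {
  const seen = new Set(
    records.filter((r) => r.invoiceId === invoiceId).map((r) => r.type)
  );
  const completed = EXPECTED_STEPS.filter((step) => seen.has(step));
  const missing = EXPECTED_STEPS.filter((step) => !seen.has(step));
  return {
    lastStep: completed.length > 0 ? completed[completed.length - 1] : null,
    missing
  };
}

// The four records from the example above.
const auditTrail = [
  { invoiceId: '123', time: '2017-05-12T10:00:35.123Z', type: 'loaded',
    details: { userId: '780', amount: 1000 } },
  { invoiceId: '123', time: '2017-05-13T11:00:35.123Z', type: 'approval_request_sent',
    details: { userId: '44', emailAddress: 'sue@example.com' } },
  { invoiceId: '123', time: '2017-05-13T11:00:35.123Z', type: 'approved',
    details: { userId: '77', amendments: { amount: 445 } } },
  { invoiceId: '123', time: '2017-05-13T15:00:35.123Z', type: 'scheduled',
    details: { userId: '645', expectedPaymentDate: '2017-05-15T00:00:00.000Z',
               paymentBatchId: 991 } }
];

const summary = summarizeAuditTrail(auditTrail, '123');
console.log(summary);
```

For this trail, `lastStep` is `scheduled` and `missing` contains `paid` and `fulfilled`, which matches the diagnosis above.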

Imagine if the audit trail looked like this instead:

[{
   invoiceId: '123',
   time: '2017-05-12T10:00:35.123Z',
   type: 'loaded',
   details: { userId: '780', amount: 1000 }
}]

In this case, it’s clear that the invoice was never approved because the approval request never reached the user: no approval_request_sent record exists. Now the team can look at the email logs for the time in question and see why the email wasn’t delivered. It’s even possible, as we’ve done at Tes, to set up some automatic self-healing. For this example, you might automatically retry sending the email after a certain amount of time, or you might automatically notify the user who loaded the invoice that something has gone wrong.
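As a rough illustration of the self-healing idea, a scheduled job could scan the audit records for invoices that were loaded but have no approval request on record after some waiting period. This is a sketch under assumptions: the function name `findInvoicesAwaitingApprovalRequest`, the one-hour threshold, and invoice `124` are all hypothetical, and it does not describe how Tes implemented it.

```javascript
// Hypothetical check: find invoices loaded more than maxWaitMs ago
// that have no 'approval_request_sent' record.
function findInvoicesAwaitingApprovalRequest(records, nowMs, maxWaitMs) {
  const loadedAt = new Map();  // invoiceId -> time the invoice was loaded (ms)
  const requested = new Set(); // invoiceIds with an approval request on record

  for (const record of records) {
    if (record.type === 'loaded') {
      loadedAt.set(record.invoiceId, Date.parse(record.time));
    } else if (record.type === 'approval_request_sent') {
      requested.add(record.invoiceId);
    }
  }

  const stalled = [];
  for (const [invoiceId, loadedMs] of loadedAt) {
    if (!requested.has(invoiceId) && nowMs - loadedMs > maxWaitMs) {
      stalled.push(invoiceId);
    }
  }
  return stalled;
}

// Usage: invoice 123 has only a 'loaded' record; invoice 124 (made up) progressed normally.
const stalledTrail = [
  { invoiceId: '123', time: '2017-05-12T10:00:35.123Z', type: 'loaded' },
  { invoiceId: '124', time: '2017-05-12T10:05:00.000Z', type: 'loaded' },
  { invoiceId: '124', time: '2017-05-12T10:06:00.000Z', type: 'approval_request_sent' }
];
const oneHourMs = 60 * 60 * 1000;
const checkTimeMs = Date.parse('2017-05-12T12:00:35.123Z');
const stalledIds = findInvoicesAwaitingApprovalRequest(stalledTrail, checkTimeMs, oneHourMs);
// A real job would now retry the email or notify whoever loaded each stalled invoice.
console.log(stalledIds);
```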

Using audit data for support can save a lot of time and thereby improve your service level when helping customers. Remember to respect people’s privacy and only keep data you really need.
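One way to keep only the data you need is to build each audit record through a helper that copies an explicit allow-list of detail fields and drops everything else. The sketch below assumes hypothetical names (`ALLOWED_DETAILS`, `buildAuditRecord`, `auditStore`) and uses an in-memory array in place of a real collection or table; the allowed fields mirror the example records above.

```javascript
// Hypothetical allow-list: the only detail fields each event type may store.
const ALLOWED_DETAILS = {
  loaded: ['userId', 'amount'],
  approval_request_sent: ['userId', 'emailAddress'],
  approved: ['userId', 'amendments'],
  scheduled: ['userId', 'expectedPaymentDate', 'paymentBatchId']
};

// Builds an audit record, copying only allow-listed detail fields.
function buildAuditRecord(invoiceId, type, details, now = new Date()) {
  const allowed = ALLOWED_DETAILS[type];
  if (!allowed) {
    throw new Error('Unknown audit event type: ' + type);
  }
  const kept = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(details, key)) {
      kept[key] = details[key];
    }
  }
  return { invoiceId, time: now.toISOString(), type, details: kept };
}

// Usage: the phone number is not on the allow-list, so it is never stored.
const auditStore = []; // stands in for a real audit collection or table
auditStore.push(
  buildAuditRecord(
    '123',
    'loaded',
    { userId: '780', amount: 1000, phoneNumber: '555-0100' },
    new Date('2017-05-12T10:00:35.123Z')
  )
);
console.log(auditStore[0]);
```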

Spend less time on worthless tests

Not all automated tests earn their keep. I’ve seen many teams spend hours fine-tuning and debugging tests only to never find real issues. This is particularly true for many UI and performance tests. If your tests are slowing you down, set up the right production monitoring so that you’ll know when things go wrong. That way, you will be alerted when something is broken or slow. Unlike tests that provide many false positives, you can know for sure that an issue is worth addressing, because it happened in production. Overall, this approach leads to less time wasted on tests that aren’t valuable and increases your team’s awareness of what really happens in production.
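As a minimal sketch of what such monitoring could look like in code, the wrapper below times a function and raises an alert when it throws or exceeds a threshold. Everything here is hypothetical: `monitor`, its options, and `renderInvoicePage` are illustrative names, and in practice you would more likely rely on a dedicated monitoring tool than hand-rolled timing.

```javascript
// Hypothetical wrapper: alerts when the wrapped function fails or runs too long.
// `clock` is injectable so the behavior can be demonstrated deterministically.
function monitor(name, fn, options) {
  const { thresholdMs, alert, clock = Date.now } = options;
  return function monitored(...args) {
    const start = clock();
    try {
      return fn(...args);
    } catch (err) {
      alert({ name, kind: 'error', message: err.message });
      throw err; // the caller still sees the failure
    } finally {
      const elapsedMs = clock() - start;
      if (elapsedMs > thresholdMs) {
        alert({ name, kind: 'slow', elapsedMs });
      }
    }
  };
}

// Usage: a fake clock simulates a call that took 500 ms against a 200 ms threshold.
const monitoringAlerts = [];
const fakeTimes = [0, 500];
const renderInvoicePage = monitor('renderInvoicePage', () => 'ok', {
  thresholdMs: 200,
  alert: (a) => monitoringAlerts.push(a),
  clock: () => fakeTimes.shift()
});
renderInvoicePage();
console.log(monitoringAlerts);
```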

Time to build your safety net

Production QA practices act as a safety net. Use these techniques to free your development team from the anxiety that springs from uncertainty. They will help you know when things go wrong, and you'll have more information to hand when they do.

More than anything, these practices will help you stay rooted in reality. This ultimately leads to huge savings in the time your developers spend worrying about, writing code for, and supporting approaches to issues that may never happen. Less time spent on unimportant issues means less money wasted and more new features shipped.

Topics: Quality