Testing for bias in your AI software: Why it's needed, how to do it

When it comes to artificial intelligence (AI) and machine learning (ML) in testing, much of the interest and innovation today revolves around the concept of using these technologies to improve and accelerate the practice of testing. The more interesting problem lies in how you should go about testing the AI/ML applications themselves. In particular, how can you tell whether or not a response is correct?

Part of the answer involves new ways to look at functional testing, but testers face an even bigger problem: cognitive bias, the possibility that an application returns an incorrect or suboptimal result because a systematic skew in its data or processing produces output that is inconsistent with reality.

This is very different from a bug, which you can define as an identifiable and measurable error in a process or result. A bug can typically be fixed with a code change. Bias can be much more insidious and harder to test.

Here's why you need to test for bias in AI, and how to best go about it.

Where does bias originate?

Bias comes from your data. AI systems are trained with data collected from the problem domain. Even if the data is scientific and objective, it can still be subject to bias.

For example, in 2016 Amazon trained an AI bot to crawl the web to find candidates for IT jobs. To train this bot, the company used the resumes of its existing staff, which was overwhelmingly male. Not surprisingly, what the application "learned" was that males make the best IT employees. Amazon was never able to fix that bias, and withdrew the bot from use.

We’ve also seen examples of commercial facial-recognition systems that badly misclassify dark-skinned subjects, in large part because they are trained overwhelmingly with light-skinned images. People have been arrested and held based on faulty AI-based identifications that police accept without question. Similarly, loan recommendation systems might be trained with data that creates a bias in the system against people living in lower-income or minority neighborhoods.

Identifying bias in an application can be hazardous to your career. In December 2020, Google fired AI ethics researcher Timnit Gebru, who was known for finding bias in facial analysis systems. While the circumstances of this dismissal remain in question, teams that develop these systems are often faced with conflicting data that can have an impact on the quality of the system.

Bias can also expose an organization to negative publicity, such as when beauty pageant contestants were judged by an algorithm that selected almost all white women as winners. Companies, governments, and other groups that allow bias into their decisions will be perceived as untrustworthy and lose credibility and quite possibly business.

Defining bias

There are three main categories of bias.

Latent bias

Here, concepts become incorrectly correlated. For example, you may find a correlation between education and income, but that correlation may in fact be spurious. Instead, intelligence, a different measure, may be the actual causative variable.
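One way to probe for this kind of spurious correlation is to check whether the relationship survives once you hold a suspected third variable fixed. The sketch below is only an illustration: the rows are synthetic numbers invented for this example, the `pearson` helper is a hand-rolled function rather than anything from a particular library, and it assumes you can actually measure the suspected hidden variable.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Synthetic rows of (hidden_level, education_score, income_score).
# All values are made up purely for illustration.
rows = [
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
    (1, 2, 2), (1, 2, 3), (1, 3, 2), (1, 3, 3),
]

# Correlation across the whole dataset.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation within each level of the suspected hidden variable.
within = {}
for level in sorted({r[0] for r in rows}):
    subset = [r for r in rows if r[0] == level]
    within[level] = pearson([r[1] for r in subset], [r[2] for r in subset])

print("overall:", overall)
print("within each level:", within)
```

In this made-up data the two scores are strongly correlated overall, yet the correlation vanishes inside each level of the hidden variable. That pattern is a hint, not proof, that the hidden variable is doing the work.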

Data selection bias

Here your team doesn't have good data, or the data you have doesn't represent the problem domain well. Unless the data accurately and completely represents the problem domain, you are likely to see bias in your results.

Alternatively, the problem domain may have subtly changed over time to the point where your historical data no longer represents it.
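A simple first check for unrepresentative data is to compare how often each group appears in the training set with how often you expect it to appear in the problem domain. The following is a minimal sketch under stated assumptions: the group names, the reference shares, and the 10-point tolerance are all hypothetical, and in practice the reference shares would have to come from a source you trust.

```python
from collections import Counter

def representation_gaps(samples, reference_shares):
    """Return (observed share - expected share) for each group in reference_shares."""
    counts = Counter(samples)
    total = len(samples)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in reference_shares.items()
    }

# Hypothetical group label for each training example.
training_groups = ["group_a"] * 80 + ["group_b"] * 20

# Hypothetical shares of each group in the real problem domain.
reference = {"group_a": 0.5, "group_b": 0.5}

gaps = representation_gaps(training_groups, reference)

# Flag any group that is over- or under-represented by more than the tolerance.
TOLERANCE = 0.10  # an assumed threshold; choose one that suits your domain
flagged = sorted(g for g, gap in gaps.items() if abs(gap) > TOLERANCE)

print(gaps)
print(flagged)
```

Here group_a is over-represented by roughly 30 percentage points and group_b is under-represented by the same amount, so both are flagged. Running the same comparison between historical and recent data is one way to spot a problem domain that has shifted over time.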

Interaction bias

This happens when an application becomes biased through interaction with biased users or other biased systems. This can be the case with systems that keep learning after deployment, where the application continues to learn from the inputs it receives while being used.

For example, Microsoft released Tay, a Twitter chatbot, as a technology demonstration. Tay learned from interactions with other Twitter users, but many of those users were malicious and "trained" Tay to be racist and sexist. Microsoft had to withdraw Tay within 24 hours of launch.

Testing for bias

So what does all of this mean for testers? First, testing these applications is very different from testing traditional applications. At the most basic level, functional testing of traditional applications is all about analyzing the requirements to determine the correct response for a given set of inputs.

In an AI application, however, the requirements aren't that cut-and-dried. For a given set of inputs, it's not at all clear what the output should be.

So the requirements have to be different. Instead of a defined output for a fixed set of inputs, you will likely only have a guess as to what a correct response might be. For example, a human exercising judgment in recognizing a face will not be perfect, so we have to account for the fact that our data may have similar uncertainties.

The biggest issue with testing these systems is that you will occasionally encounter incorrect results, or results that seem incorrect. Testers test to requirements, but those requirements can't be expressed in absolute terms. In many cases the best you can do is specify the probability of a reasonable result, along with a margin of error.
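A probabilistic requirement of this kind can be turned into a pass/fail check. One possible approach, sketched below, is to compute a confidence interval for the observed success rate and pass only if the lower end of the interval clears the required rate. The use of the Wilson score interval, the 95 percent confidence level (z = 1.96), and the sample figures are assumptions made for this example, not something the requirements themselves dictate.

```python
import math

def wilson_lower_bound(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denominator = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denominator

def meets_requirement(successes, trials, required_rate):
    """Pass only if we are reasonably confident the true rate is at least required_rate."""
    return wilson_lower_bound(successes, trials) >= required_rate

# Hypothetical runs: the same 94 percent observed rate at two sample sizes.
print(meets_requirement(940, 1000, 0.92))
print(meets_requirement(94, 100, 0.92))
```

The two calls have the same observed rate, but the smaller run leaves a much wider interval, so it does not support the claim that the system meets a 92 percent requirement while the larger run does. This is one reason AI testing tends to need far more test cases than traditional functional testing.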

Create lots of test cases

Testers need plenty of test cases across the entire range of the problem domain. Based on the data used for training, testers must ensure a solid foundation when the inputs are mainstream and the results are both commonsensical and apparently correct.

Then they need plenty of edge cases, where the expected results aren't clear. Would a human expert make that same decision? Why?
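Once you have test cases that span the problem domain, it helps to break the results down by group rather than looking only at an overall score, because a good aggregate number can hide a subgroup that is served badly. The sketch below assumes each test result has been recorded as a (group, expected, actual) tuple; the group names and counts are hypothetical.

```python
from collections import defaultdict

def error_rate_by_group(results):
    """results: iterable of (group, expected, actual) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, expected, actual in results:
        totals[group] += 1
        if expected != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

def max_gap(rates):
    """Difference between the worst and best per-group error rates."""
    return max(rates.values()) - min(rates.values())

# Hypothetical test run: 100 cases per group, with different error counts.
results = (
    [("group_a", "match", "match")] * 98
    + [("group_a", "match", "no_match")] * 2
    + [("group_b", "match", "match")] * 85
    + [("group_b", "match", "no_match")] * 15
)

rates = error_rate_by_group(results)
print(rates)
print(max_gap(rates))
```

In this invented run the overall error rate is 8.5 percent, which might look acceptable, but group_b fails more than seven times as often as group_a. How large a gap is tolerable is a judgment for the team and the problem domain, not something the code can decide.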

As a tester you must understand the architecture of the application and the design decisions made, as well as the data it uses. One concern is that an application can be architected very specifically to its training data. It may perform very well in testing but not in the real world.

Judea Pearl, a computer scientist known for his work on Bayesian networks and causal inference, calls this "curve-fitting" to a specific set of data. It looks good on paper but is biased against real-world data.
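A common way to detect this kind of over-fitting is to score the application on data it was never trained on and compare that with its score on the training data. The sketch below uses a deliberately extreme stand-in, a "model" that simply memorizes its training examples, to show what a large gap looks like. The data and the `memorizer` function are invented for illustration and are not taken from any real system.

```python
def accuracy(predict, examples):
    """Fraction of (input, expected) pairs that predict() gets right."""
    correct = sum(1 for x, expected in examples if predict(x) == expected)
    return correct / len(examples)

# Hypothetical labeled data, split into training and held-out sets.
training = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]
held_out = [(5, "odd"), (6, "even"), (7, "odd"), (8, "even")]

# A stand-in "model" that memorizes the training set and falls back
# to a constant guess for anything it has not seen before.
lookup = dict(training)

def memorizer(x):
    return lookup.get(x, "even")

train_accuracy = accuracy(memorizer, training)
held_out_accuracy = accuracy(memorizer, held_out)
gap = train_accuracy - held_out_accuracy

print(train_accuracy, held_out_accuracy, gap)
```

The memorizer is perfect on the data it was built from and no better than a coin flip on new data. A real application will rarely be this stark, but a large gap between the two scores is a signal worth raising with the development team.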

It all comes back to data, so testers need to test the data to ensure that it accurately and completely reflects the problem domain. This means using common sense, as well as mapping the characteristics of the problem domain into the data that represents it. That's an inexact science, but one that testers have to become adept at.

Get used to uncertainty

All software, not just AI applications, has the potential for bias. For example, a London doctor was locked out of her fitness center locker room because the smart card wouldn't work. It turned out that the software team had hard-coded the title "doctor" to equate to male, meaning that her access was only to the men's locker room.

In typical applications, this kind of bias is more easily identified and fixed. In an AI application, testers have to accept the fact that they can't always tell for sure. A lot depends on the problem domain; you don't need a high level of accuracy in e-commerce recommendation engines, but you do in autonomous vehicles.

Fixing these issues is often more than a simple code change. The architecture of the application may have to be completely rethought, or you may need to use new datasets for training.

That is almost equivalent to starting the application again from scratch, which no one wants to do. But starting over is almost always a better choice than deploying an application with known biases.
