How to use computer vision for your test automation

If you need to automate test cases that traditional automation frameworks can't handle, you need to expand your toolkit. Examples include gaming console automation, smoke testing an application whose UI changes constantly during a sprint, ad banners, and apps that use the Android or iPhone keyboard.

For example, one of our customers needed an Xbox app automated. For various reasons, we couldn't use the existing tech stack, and we couldn't build a new one either, because we didn't have access to the elements tree, the hierarchical structure of page elements that automation tools normally use to locate controls.

After doing extensive research, we concluded that the only way to solve this problem was to use computer vision technology, which detects the controls and elements on pages.

Here's how my team implemented this approach. 


Defining the problem

For test automation, my organization used Carina, an open-source framework that handles Selenium actions, makes them stable, and provides reports for the automation team. It also integrates well with test management systems, bug-tracking systems, and other tools, but handling Selenium actions is its key feature.

Our team needed an alternative to building an elements tree because we didn't have access to the client's. We wondered: What if we used an approach that comes from manual QA? That is, what if we looked through a page to detect and recognize its elements, as manual testers do?

So the main goal became finding technologies that could support this approach. We decided to use neural networks and computer vision, since a neural network can detect and classify the objects in the images that computer vision works with.

Computer vision and neural networks

Our team didn't include any neural network specialists, so the first step was to find an existing solution for building one. After some research, we found Darkflow, an open-source framework for real-time object detection and classification.

This tool uses TensorFlow as its machine-learning framework. TensorFlow is one of the most popular open-source solutions of its kind, with detailed documentation and a vast community of supporters and contributors. It also lets you export graphs that can be used anywhere.

For its part, Darkflow has pretty clear documentation on GitHub, which makes it easy to train a network and then use it.

We trained the network on page screenshots and detected the elements by their coordinates. To locate an element, we needed the coordinates of two of its points (top left and bottom right). From those we could easily calculate the center of any control, as well as any offset from it.
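The center calculation can be sketched in a few lines of Python. This is a minimal illustration, not our framework code: the `Box` type, the `center` function, and the `dx`/`dy` offset parameters are names assumed for the example. The sample coordinates are those of the text_field in the annotation file shown later in this article.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box given by its top-left and bottom-right corners, in pixels."""
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def center(box: Box, dx: int = 0, dy: int = 0) -> Tuple[int, int]:
    """Return the center of the box, shifted by an optional (dx, dy) offset."""
    cx = (box.xmin + box.xmax) // 2 + dx
    cy = (box.ymin + box.ymax) // 2 + dy
    return cx, cy


# Hypothetical usage with the text_field coordinates from the annotation file.
text_field = Box(xmin=312, ymin=233, xmax=638, ymax=256)
print(center(text_field))           # (475, 244)
print(center(text_field, dx=-100))  # (375, 244)
```

The offset is useful when the point to click is not the exact middle of a control, for example the arrow at the right edge of a select.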

The following screenshot, from our testing process, will serve as an example. 

 
Figure 1. This screenshot, from our testing process, shows how we recognized page controls of different types.
 
For marking screenshots, our team used LabelImg, another open-source tool.
 
The next step was designing the process of creating or finding, and then preparing, data for network training.
 
In the example below, we were attempting to detect these control types: text_field, link, header, label, checkbox, and select.
 
Figure 2. We used LabelImg for marking screenshots.
 
 
We ended up with two files for each screen: the screenshot itself and an XML file listing the expected elements and their coordinates. (Note: These two files should have the same name.) The XML code that matches the screenshot above is as follows:
 
<annotation verified="no">
  <folder>Desktop</folder>
  <filename>ebay.png</filename>
  <path>C:/Users/Oksana/Desktop/ebay.png</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>1349</width>
    <height>2088</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>text_field</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <checked>0</checked>
    <bndbox>
      <xmin>312</xmin>
      <ymin>233</ymin>
      <xmax>638</xmax>
      <ymax>256</ymax>
    </bndbox>
  </object>
  <object>
    <name>select</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <checked>0</checked>
    <bndbox>
      <xmin>643</xmin>
      <ymin>235</ymin>
      <xmax>812</xmax>
      <ymax>258</ymax>
    </bndbox>
  </object>
  <object>
    <name>label</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <checked>0</checked>
    <bndbox>
      <xmin>310</xmin>
      <ymin>203</ymin>
      <xmax>519</xmax>
      <ymax>223</ymax>
    </bndbox>
  </object>
  ……
</annotation>
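If you want to inspect annotation files before training, they are easy to read with Python's standard library. The sketch below is only an illustration: `read_objects` is a hypothetical helper name, and `SAMPLE` is a shortened copy of the file above that keeps just the fields the function reads.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotation above (illustrative, not the full file).
SAMPLE = """<annotation verified="no">
  <filename>ebay.png</filename>
  <size><width>1349</width><height>2088</height><depth>3</depth></size>
  <object>
    <name>text_field</name>
    <bndbox><xmin>312</xmin><ymin>233</ymin><xmax>638</xmax><ymax>256</ymax></bndbox>
  </object>
  <object>
    <name>select</name>
    <bndbox><xmin>643</xmin><ymin>235</ymin><xmax>812</xmax><ymax>258</ymax></bndbox>
  </object>
</annotation>"""


def read_objects(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples from an annotation."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((obj.findtext("name"),) + coords)
    return objects


for item in read_objects(SAMPLE):
    print(item)
```

A quick pass like this over every file can catch typos in label names before they reach the training run.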
 
 
We divided our data and used 70% of our screens for training and 30% for testing.
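A reproducible 70/30 split can be sketched as follows. We are not claiming this is the exact procedure we used; `split_dataset`, the fixed seed, and the file names are assumptions made for the example.

```python
import random


def split_dataset(names, train_fraction=0.7, seed=42):
    """Shuffle the names reproducibly and split them into (train, test) lists."""
    shuffled = sorted(names)               # sort first so input order doesn't matter
    random.Random(seed).shuffle(shuffled)  # a fixed seed gives the same split each run
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical screenshot names.
screens = [f"screen_{i:02d}.png" for i in range(10)]
train, test = split_dataset(screens)
print(len(train), len(test))  # 7 3
```

Fixing the seed keeps the test set stable between runs, so results from different training sessions stay comparable.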
 
We took the screenshots manually. For the most accurate training, each screenshot had to differ from the rest and contain as many graphical elements as possible. For this example, however, we trained on a single screen.
 
Training for the regular project took about five days, but visible results appeared after two days of training. The duration of training for this particular example was four hours. 
 
Once the necessary data was collected, it was possible to create a new model, which we called $MODEL_NAME, as follows:
 
  1. Create a directory with images for training, and copy all screens there ($MODEL_NAME/img).
  2. Create a directory with the XML annotation files for training, and copy all of them there ($MODEL_NAME/ann).
  3. In the $DARKFLOW_HOME directory, create a labels file named labels-$MODEL_NAME.txt, and list in it all the controls that you marked earlier (text_field, link, header, etc.).
  4. You may need to create one file manually, the model configuration file, which is needed to build and save the neural-network graphs: $DARKFLOW_HOME/cfg/$MODEL_NAME.cfg
  5. To start training, just run the command:

    nohup $DARKFLOW_HOME/flow --train \
        --labels $DARKFLOW_HOME/labels-$MODEL_NAME.txt \
        --annotation $MODEL_NAME/ann \
        --dataset $MODEL_NAME/img \
        --model $DARKFLOW_HOME/cfg/$MODEL_NAME.cfg \
        --load $DARKFLOW_HOME/bin/tiny-yolo-voc.weights \
        --trainer adam --gpu 0.9 --lr 1e-5 \
        --keep 10 --backup $DARKFLOW_HOME/ckpt/$MODEL_NAME/ \
        --batch 16 --save 500 --epoch 2000 \
        --verbalise > ../logs/create_model_$MODEL_NAME.log &
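Before starting a long training run, it can help to check the layout described in the steps above. The following sketch is an assumption-laden illustration, not part of Darkflow: `missing_annotations` and `write_labels` are hypothetical helpers, the demo runs in a temporary directory, and the labels file is assumed to hold one label per line.

```python
from pathlib import Path
import tempfile


def missing_annotations(model_dir):
    """Return the stems of images in img/ that have no same-named XML in ann/."""
    model_dir = Path(model_dir)
    images = {p.stem for p in (model_dir / "img").iterdir() if p.is_file()}
    annotations = {p.stem for p in (model_dir / "ann").glob("*.xml")}
    return sorted(images - annotations)


def write_labels(path, labels):
    """Write one label per line (assumed format of the labels file)."""
    Path(path).write_text("\n".join(labels) + "\n")


# Demo in a temporary directory so the sketch does not touch a real project.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "my_model"          # stands in for $MODEL_NAME
    (model / "img").mkdir(parents=True)
    (model / "ann").mkdir()
    (model / "img" / "ebay.png").touch()
    (model / "img" / "login.png").touch()   # deliberately left without an XML file
    (model / "ann" / "ebay.xml").touch()
    write_labels(Path(tmp) / "labels-my_model.txt", ["text_field", "link", "header"])
    print(missing_annotations(model))       # ['login']
```

An empty list means every screenshot has an annotation file with the same name, as the note above requires.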
 
We trained our model for 20,000 epochs and saved graphs every 500 epochs. In this example, an epoch is one full pass of the training process over the training set, that is, the 70% of the screenshots. After training was complete, we tested the model and checked the results in different formats.
 
We used an image format for checking the neural network results, and JSON for integrating with the test framework.
 
This command renders the results as images:
 
    $DARKFLOW_HOME/flow --imgdir $PATH_TO_DIR_WITH_SCREENS_TO_RECOGNIZE --backup $DARKFLOW_HOME/ckpt/$MODEL_NAME/ --load -1 --model $DARKFLOW_HOME/cfg/$MODEL_NAME.cfg --labels $DARKFLOW_HOME/labels-$MODEL_NAME.txt
 
 
Figure 3. The neural network's recognition results, rendered as an image.
 
To get the results as JSON instead, run the same command with a --json flag:
 
    $DARKFLOW_HOME/flow --imgdir $PATH_TO_DIR_WITH_SCREENS_TO_RECOGNIZE --backup $DARKFLOW_HOME/ckpt/$MODEL_NAME/ --load -1 --model $DARKFLOW_HOME/cfg/$MODEL_NAME.cfg --labels $DARKFLOW_HOME/labels-$MODEL_NAME.txt --json
 
[
  {
    "topleft": {
      "y": 106,
      "x": 85
    },
    "confidence": 0.97,
    "caption": "Advanced Search",
    "bottomright": {
      "y": 154,
      "x": 1027
    },
    "label": "header"
  },
  {
    "topleft": {
      "y": 393,
      "x": 305
    },
    "confidence": 0.82,
    "caption": "Search",
    "bottomright": {
      "y": 438,
      "x": 380
    },
    "label": "button"
  },
  {
    "topleft": {
      "y": 1785,
      "x": 324
    },
    "confidence": 0.94,
    "caption": "Search",
    "bottomright": {
      "y": 1823,
      "x": 387
    },
    "label": "button"
  }
  ….
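On the test-framework side, output like this can be turned into click coordinates. The sketch below shows one possible way, not our actual integration code: `find_elements`, `click_point`, and the 0.5 confidence threshold are assumptions, and `RESULTS` holds the first two entries of the output above.

```python
import json

# First two entries of the JSON output above.
RESULTS = json.loads("""[
  {"topleft": {"y": 106, "x": 85}, "confidence": 0.97, "caption": "Advanced Search",
   "bottomright": {"y": 154, "x": 1027}, "label": "header"},
  {"topleft": {"y": 393, "x": 305}, "confidence": 0.82, "caption": "Search",
   "bottomright": {"y": 438, "x": 380}, "label": "button"}
]""")


def find_elements(results, label, caption=None, min_confidence=0.5):
    """Return detections with the given label, optional caption, and enough confidence."""
    return [
        r for r in results
        if r["label"] == label
        and r["confidence"] >= min_confidence
        and (caption is None or r.get("caption") == caption)
    ]


def click_point(element):
    """Return the (x, y) center of a detection, a natural place to click."""
    tl, br = element["topleft"], element["bottomright"]
    return (tl["x"] + br["x"]) // 2, (tl["y"] + br["y"]) // 2


buttons = find_elements(RESULTS, "button", caption="Search")
print(click_point(buttons[0]))  # (342, 415)
```

The resulting point can then be handed to whatever tool performs the click on the device under test.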

Try it out yourself

There are several advantages to using computer vision instead of traditional test automation frameworks. There's no need to access the elements tree, it's relatively simple to deploy, you can use open-source products, and you can train the neural network on the elements you need to test. Finally, the entire training process won't take much time and effort. 
 

Another plus is that you can also use computer vision and neural networks for regular tasks. And you can integrate this approach with real automation cases from production. But that's a discussion for another time.
