7 lessons from debugging a test automation framework

We all come across strange code behavior that makes no sense and takes a long time to untangle. For example, my team and I went through a memorable, weeks-long ordeal figuring out why the browser-alert exception handling in our test automation framework suddenly stopped working.

Along the way, we learned some important lessons, such as the importance of event handlers, the wisdom of having a failsafe protocol, and the best methods for solving complex troubleshooting problems.

Over time, these lessons have helped us get to the bottom of issues more quickly and avoid the prolonged frustration of troubleshooting problems caused by seemingly senseless behavior.

These seven lessons could help you do the same. 


Events are important

I was using a vendor-based UI automation tool and running the application under test (AUT) on Chrome. The environments were all virtual machines, to support parallelization. When the daily batch run lit up in red one day, our investigation showed that the browser-alert event was not firing, which affected the entire batch result.

Being an ERP program, the AUT relied heavily on transactional data, and it’s common to get alerts and prompts while completing a transaction. When events are expected, they are handled in the script. But a few events aren't predictable, for several reasons.

This is why event handlers are handy. In this case, with the browser-alert functionality not working properly, the impact was much larger than anticipated. Scripts began failing left and right.

Lesson: Have event handlers in your framework to deal with unexpected events, popups, and different types of alert messages. Remember Murphy's Law: Anything that can go wrong will go wrong. So make sure to plan for that.
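One way to picture this lesson is a simple handler registry: expected events map to registered handlers, and anything unregistered falls through to a safe default instead of crashing the run. This is a minimal, tool-agnostic sketch; the event names and handler bodies are illustrative, not taken from any particular automation tool's API.

```python
# Minimal sketch of an event-handler registry for a test framework.
# Event names ("browser_alert", etc.) and handler logic are illustrative.

class EventHandlerRegistry:
    def __init__(self):
        self._handlers = {}

    def register(self, event_name, handler):
        """Map an event name (e.g. 'browser_alert') to a handler callable."""
        self._handlers[event_name] = handler

    def dispatch(self, event_name, payload=None):
        """Invoke the registered handler, or the default for unknown events."""
        handler = self._handlers.get(event_name, self._default_handler)
        return handler(payload)

    @staticmethod
    def _default_handler(payload):
        # Unknown events get dismissed and noted, rather than aborting the batch.
        return f"dismissed unexpected event: {payload}"


registry = EventHandlerRegistry()
registry.register("browser_alert", lambda text: f"accepted alert: {text}")

print(registry.dispatch("browser_alert", "Save changes?"))
print(registry.dispatch("session_timeout", "logged out"))
```

The key design point is the default handler: the framework should have a defined response even for events nobody anticipated.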

Have failsafe protocols

I knew the alert problem had to be resolved. I was concerned about the whole batch getting screwed up by one alert. That's when I adopted the rule of syncing the automation tool and the AUT before executing every script. Also, as soon as a mismatch between expected and actual AUT states was detected, the framework would stop executing the test in progress and reset the AUT.

It is common for tests to fail because of something that happened while executing the previous test, because an alert failed to alert or a popup failed to pop up, because an application is in a non-responsive state, and so on. Our syncing rule saved us a lot of grief over the years by ensuring that any problems from one test do not affect the others.

Lesson: Have a backup plan in case your safety protocols fail. In this example, browser-alert handling was the primary mechanism and the protocol of stopping the current test and resetting the AUT was our backup plan.

Lesson: It is very important that the state of the automation tool and AUT be aligned. Have a resync strategy built in after every test.
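The sync-then-run rule can be sketched as a batch runner that checks the AUT state before each test and resets after any failure, so one broken test cannot poison the next. Everything here is a placeholder under stated assumptions: the real state check and reset would call into your tool and AUT, and `"ready"` stands in for whatever "known-good state" means in your framework.

```python
# Sketch of the sync-and-reset failsafe: verify the AUT state before each
# test, reset on mismatch or failure. State checks/resets are placeholders.

def run_batch(tests, get_aut_state, reset_aut, expected_state="ready"):
    results = {}
    for name, test in tests:
        if get_aut_state() != expected_state:
            reset_aut()  # backup plan: restore a known-good state first
        if get_aut_state() != expected_state:
            results[name] = "skipped: AUT unrecoverable"
            continue
        try:
            test()
            results[name] = "passed"
        except Exception as exc:
            results[name] = f"failed: {exc}"
            reset_aut()  # don't let this failure leak into the next test
    return results


# Demo with a fake AUT whose state breaks when a test fails.
state = {"value": "ready"}

def get_state():
    return state["value"]

def reset():
    state["value"] = "ready"

def good_test():
    pass

def bad_test():
    state["value"] = "broken"
    raise RuntimeError("alert never fired")

print(run_batch([("t1", good_test), ("t2", bad_test), ("t3", good_test)],
                get_state, reset))
```

Note that t3 still passes even though t2 broke the AUT: the reset after t2's failure is exactly the "backup plan" the lesson describes.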

Write down the variables

In my quest to figure out what was going on with the alert problem, I tested all the usual suspects: the tool, different app versions, different browsers, different environments. I could only reproduce the issue with Chrome; other browsers worked fine. The testing community was silent on this issue, however; it seemed as if I was the only one seeing this problem.

I was stuck and not sure what to try next, so I went off to another team that was using a similar tool base to try out a sample test I had created for the alert. To my surprise, it worked fine. The only difference between the two environments was that one was a virtual machine and the other was a physical one. At that point, I didn't understand why that would make a difference, but at least I now had a lead.

Lesson: Don’t stick with the same experiment once you know it has failed. It helps to write down all the variables in any given experiment and try changing them one by one. This makes it easier to find patterns or identify missed permutations.
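Writing down the variables also lets you enumerate every permutation mechanically instead of relying on memory. A quick sketch, using the variables from this very story (browser, machine type, OS) as illustrative dimensions:

```python
# Sketch: enumerate all permutations of experiment variables so none
# are missed. The variable values mirror the ones from this story.
from itertools import product

variables = {
    "browser": ["chrome", "firefox"],
    "machine": ["physical", "vm"],
    "os": ["windows7", "windows8"],
}

# One dict per experiment, covering every combination of values.
experiments = [dict(zip(variables, combo))
               for combo in product(*variables.values())]

print(len(experiments))  # 2 x 2 x 2 = 8 permutations to try one at a time
```

Changing one variable at a time between runs is what eventually isolated the VM-versus-physical difference in our case.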

Your assumptions can be wrong

The VM-versus-physical machine variable didn't make sense to me. Another team member suggested that the operating system might be a factor, too, but both machines were running Windows 7. So I went back to the whiteboard and scribbled down every conceivable explanation, no matter how far-fetched, and tested a handful of them.

It turns out that when you run Chrome on a Windows 7 machine installed on a VM, the browser alerts generated are modal windows. That's not the case on a physical machine.

Lesson: Once you’ve eliminated the obvious, check your basic assumptions. With the rapid pace of change in software, your assumptions—for example, that a browser should work the same on the operating system whether running in a physical or VM environment—could be invalid.

Lesson: The automation’s execution environment is as important as your automation framework. Remain in control of every aspect of it. That will help you tweak it according to the framework’s needs.

Nothing is far-fetched in automation

After weeks of research and testing, my conclusion was that running Chrome on Windows 7 in a VM generates alerts that the automation tool finds unreadable. Changing the OS from Windows 7 to Windows 8 solved the problem. That conclusion was completely unexpected.

When levels of abstraction are added rapidly, complicated effects can result that go unnoticed. Complicating matters, automation projects are becoming more like complete software products of their own, with similar problems brought on by having lots of connected interfaces and environments.

Lesson: Nothing is bizarre in automation framework design; unexplained errors can be a common occurrence. Learn to quickly solve such problems.

Hope for the best, prepare for the worst

Usually no one likes debugging weird problems, for good reason: the process can take far more time than the coding of the feature under test. The solution: Use event handlers to reduce the number of troubleshooting problems, sync the AUT and tool states regularly, have failsafe protocols, and control your execution environment.

When you end up in the pit with a deep troubleshooting problem, write down all possible variables, don't be afraid to question your assumptions, and remember: once you crawl out of one pit, you might find yourself in another. Learning to enjoy the troubleshooting process makes a programmer's life easier.

Care to share what you learned debugging your automation framework? Let us know in the comments below.
