How to predict and prevent user story defects

In the previous installment of my "Data Science for Developers" series, I showed how developers can use market basket analysis methods to identify missed changes in source code. This time, I'll show you how to use text classification techniques to help predict which user stories are most likely to contain defects.

These techniques have been around for many years, but the tooling for using them has become so good that the effort to try them out is relatively low. What once would have been a costly experiment requiring a PhD can now be done by any motivated developer. Why not you?

The problem: Limited resources vs. endless needs

One of the most fundamental and difficult problems in testing is risk management–deciding what needs to be tested and how much effort you should apply to it. Even a modestly sized data entry form presents myriad test possibilities.

Once you consider all potential combinations of valid inputs, invalid inputs, blank entries, tab order, error handling scenarios, and so on, you rapidly encounter explosive degrees of combinatorial complexity. Practical constraints on time, resources, and computing power preclude developers from testing every possible permutation of the factors that are potentially relevant to how well the form works.

You need to make choices about what is most important to test and how thoroughly you should test it by considering both the likelihood that a certain aspect of the form will fail, and the impact such a failure would have.

When you deal with real software releases, this complexity only grows. Now your form is part of a solution with many other components, and each of those components also needs to be tested with respect to both its internal features and its interactions with other parts of the software. As such, it becomes even more important to know where to focus your efforts.

Effective test planning is a game of probabilities. Test too much and too thoroughly, and you'll waste money and move slowly. Test too little, and you'll release garbage. A well-managed test effort strikes an effective balance between these poles, doing a thorough job of testing the riskiest components while devoting less effort and attention to other areas of the product.

How to strike the right balance

The ability to strike such a balance obviously depends on your ability to accurately assess the risk of the changes being introduced. Typically, software engineers do this with some combination of experience, intuition, and heuristics. If a specific area of the software has contained many bugs before, it is likely to produce more in the future, and so the development team should more heavily scrutinize any changes that affect this area.

Also, new functionality is more prone to defects than are previously established features. That's why test cases for a release should focus disproportionately on new functionality, including targeted regression of closely related features.

Humans make complex judgments about risk all of the time. But since humans have limited memories and easily skewed perceptions, a mathematical model can help assess the probability that a given change is likely to contain defects. When you want to build complex models that help to assess the probability that something should be classified in a given category (e.g., risky or not), consider machine learning techniques.

As you do so, bear in mind the many fallibilities of such an approach.

Machine learning is not magic. It will not relieve you of the hard burdens of risk analysis.

Not all of the information you have can be readily captured in a way that learning algorithms can use, and so it will be hard for machine learning to obtain a comprehensive view of the factors relevant to risk judgments.

Yet machine learning approaches can give you a quantitatively oriented perspective on what is risky, and help you to see patterns in your data that are difficult to see otherwise.

The solution: Text classification with Naïve Bayes

There are many approaches you can use to predict software defects using machine learning. Much of the academic literature surrounding this topic today looks at properties of source code. Defect-prone modules are identified using historical defect data and statistical properties of the code base, such as the cyclomatic complexity of the methods in a class.

While this kind of analysis can produce strong models with good predictive characteristics, these models leave something to be desired in terms of practical usefulness. They reveal only the risk associated with a user story (i.e., a software requirement) after the relevant code has been touched.

If a development team has an idea that certain user stories (requirements) will have a higher risk of defects, what should their response be? They might do one or more of the following:

  • Have more experienced people work on the story
  • Plan additional exploratory testing
  • Increase targeted regression testing around those areas
  • Use pair programming
  • Engage in peer review

Some of these things are impossible or impractical to do after implementation, and so post-implementation is a terrible time to find out if you need them. Implementation-based (that is, code-based) prediction approaches make sense for building models of defect distribution. But the operational usefulness of such models is questionable for a team that has just started a two-week sprint and needs to know where to spend its time.

A more timely and practical model of defect prediction should be based on what the team will actually know at the start of the sprint. They won't yet have a structured representation of the code they will change, and so they can't feed this information into a predictive model.

However, they will have the user stories on which they are scheduled to work.

If you can make a meaningful prediction about the riskiness of particular stories based on the properties of those stories, you can enable the team to respond to that risk in time to actually mitigate it.

In the upcoming demonstration, I take a set of real user stories, some of which have been associated with defects, and use them to build a predictive model. The model will be based on the presence or absence of words in the stories. I will apply Naïve Bayes, a model learning method commonly used in text classification tasks, to help predict whether or not a user story is likely to contain defects.

Text classification principles

Spam email used to clog our inboxes and diminish our productivity. Today, however, it has been largely contained with the advent of spam filters based on machine learning techniques that use text classification methods.

Such methods treat words in the emails as predictive features. The presence of a certain word is a characteristic that the filter can use to help categorize the email as "ham" (email I want to read) or "spam" (email I don't want to read). The filter examines a large number of messages that have been characterized by humans as ham or spam, and learns to characterize new messages based on their contents. If the new messages contain a high percentage of words and phrases that have been historically associated with spam emails, then these messages are likely to be characterized as spam.

The filtering process is based on a theory of probability, the Bayes' Theorem, which can be expressed in shorthand as follows:
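In symbols, where H is the hypothesis and E is the evidence:

```latex
P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}
```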

If you are interested in why this formula holds true, here's a straightforward account of its derivation. For most people, the formula is a bit cryptic, so here's a friendlier, expanded version:
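The expanded version spells out the same formula using the name of each term (each is defined in the discussion that follows):

```latex
\text{posterior probability} =
  \frac{\text{conditional probability} \times \text{prior probability}}
       {\text{marginal likelihood}}
```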

Essentially, Bayes' Theorem deals with how the evidence about the current situation allows you to make probability judgments. You want to find out the likelihood of an important proposition, such as "My llama will die of scurvy." This is the hypothesis. You also have additional information about your situation, such as "My llama is gray." This is your evidence.

You are trying to figure out the likelihood of your llama dying given what you know, and this is called the posterior probability–your judgment of the probability that something (the hypothesis) will be true, made after you know something about your situation (the evidence you have).

To figure out the posterior probability (the likelihood of your llama dying of scurvy given what you know about it), you need to know some other pieces of information. First, you must know the likelihood of the hypothesis in general. This is called prior probability–a probability that you know before having any evidence about your current situation.

For the sake of this example, assume you know that 2% of llamas die of scurvy. You also need to know the likelihood that a scurvy-stricken llama will be gray–this is the conditional probability. Let's assume you know that out of all the llamas that have died of scurvy, 30% were gray. Finally, you need to know how prevalent gray llamas are in general–this is the marginal likelihood. Let's assume that 15% of llamas are gray.

Now you know how often llamas die of scurvy (prior probability), how often such llamas were gray (conditional probability), and how prevalent gray llamas are in general (marginal likelihood). And now you are in a position to calculate how likely a llama is to die of scurvy given the fact that it is gray (posterior probability).

Now plug the values into Bayes' theorem:
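With the example's numbers (a prior of 2%, a conditional probability of 30%, and a marginal likelihood of 15%), the calculation looks like this:

```latex
P(\text{scurvy} \mid \text{gray})
  = \frac{P(\text{gray} \mid \text{scurvy}) \times P(\text{scurvy})}{P(\text{gray})}
  = \frac{0.30 \times 0.02}{0.15}
  = 0.04
```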

When you run the numbers, you end up calculating a 4% likelihood that a gray llama will die of scurvy.

This result should make some intuitive sense–if 15% of all llamas are gray, but 30% of dead, scurvied llamas are gray, then gray llamas are dying of scurvy twice as much as they appear in the population. And so you double the rate at which the typical llama dies of scurvy (2%) to determine the rate at which gray llamas die of scurvy (4%). Poor llama–somebody get it an orange.

Obviously, this is a fanciful example, just for the sake of illustration. Returning to the subject of spam, you can use Bayes' Theorem to predict whether an email message is spam or not by performing similar calculations. You treat the frequency of words and phrases in a message as your evidence, then look at prior data about how prevalent those words were in previous spam and non-spam messages.

When you couple this data with knowledge of how often an email in general is considered spam, you have all of the elements needed to calculate the probability that a specific message is spam by using Bayes' Theorem.
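As a sketch of that calculation in R (the percentages below are invented for illustration, not real spam statistics):

```r
# hypothetical figures, for illustration only
p_spam      <- 0.20   # prior: 20% of all messages are spam
p_word_spam <- 0.25   # conditional: 25% of spam messages contain the word "winner"
p_word      <- 0.06   # marginal: 6% of all messages contain "winner"

# Bayes' Theorem: probability that a message containing "winner" is spam
p_spam_word <- (p_word_spam * p_spam) / p_word
p_spam_word   # roughly 0.83
```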

Since this theorem works well for classifying emails based on the text they contain, you can do the same for user stories, and build a model that shows how likely they are to be associated with defects.

Assumptions, methods, and tools

Below is an example of predicting defects in a set of requirements (Jira issues) from the Apache Derby project. The data used here is publicly available at https://issues.apache.org/jira/projects/DERBY. I used the Jira REST API to pull a snapshot of the data, and then processed it a bit to make the example more suitable for analysis.

Each record in the processed data set is a pipe-delimited row that includes, among other fields, the issue's summary, its description, and a has_bugs indicator.

When I processed the data, I filtered out issues that were themselves defects, then added a simple data element called "has_bugs" to indicate whether or not the issue was connected to one or more defects.

This seems reasonable to me, but I am not a member of the Derby project, and I don't understand what all of their data means. In particular, I don't know all of the circumstances in which a member of the project would connect a non-defect issue to a defect. 

If you are trying to predict defects, you would want a "has_bugs" feature to read "yes" only for issues which were not supposed to have a defect but ended up with one anyway due to some kind of mistake. In this data set, a manual review of the records shows evidence of non-defect issues that were created explicitly for the purpose of tracking already-known defects.

Obviously, there would be little value in "predicting" defects for issues that were intentionally created to track defects. So, the following analysis shows how you might use historical data to build a predictive model for defect-prone user stories. But you can't assume that this particular analysis would be practically useful for finding risky stories in the real Derby project, because we don't know enough about the rules they use for associating defects with user stories and other kinds of Jira issues.

If your projects have data tying a user story to defects that were unintentionally created in the development of that story, then the methods shown below are worth considering.

The example is written in R, a statistical computing language commonly used for data science, and I recommend installing the RStudio integrated development environment (IDE) to make your explorations with R easier. For the sake of brevity, this tutorial will assume that you have both of these already installed, and that you have explored them enough to be able to start running the code samples.

The simplest way to get started is to paste the code into the Source pane (top left of RStudio) and use the accompanying "Run" button to execute it line by line.

I will walk through the code one step at a time. Finally, I will provide all of the code at once for ease of copying. 

You can run all of the code as a single R script. The code relies heavily on a similar example from Brett Lantz's excellent book, Machine Learning with R.

Getting and preparing the data

First, download the file that contains the processed Jira-issue data from the Derby project. The shortened URL in the code points to a file in Google Drive, and the download.file() function saves it to the path specified by the filename variable. By default, this is your working directory, but you could supply a full path here if you want to change where the file goes. Notice that # is used to add comments, which the interpreter ignores.

# get the file from the web
URL <- "https://goo.gl/5mG6Lq"
filename <- "Apache_Derby_Issue_Analysis_TechBeacon.txt"
download.file(url = URL, destfile = filename)

Next, read the file contents into a data frame, which is a table-like structure in R. The stringsAsFactors argument determines whether you treat all strings of text as categorical variables, and since the text in each Jira issue is not a category but a property specific to that issue, set this parameter to FALSE.

Another variable, the has_bugs column, should be treated as a category, but you'll take care of that later.

The sep argument tells you what column separator the input file uses, and your data is pipe-delimited. Notice the <- operator, also known as the arrow operator, which is an assignment operator. Think of it as similar to an equals sign, which you could actually use to get the same effect. However, the arrow operator is more idiomatic (i.e., conventional) than the equals sign, and so I prefer it for this example.

# read the issue data into a frame
issues_raw <- read.csv(filename, stringsAsFactors = FALSE, sep = "|")

You are going to transform and clean up your data, so make a copy of it, in case you want to compare the transformed version to the original during debugging.

# make a copy that we will alter
issues_prepared <- issues_raw

Some of the rows in the source data contain the phrase "No Description" in the description field. When I was pulling this data from the Derby Jira project, my code inserted that string whenever there was no description in the original Jira issue. Since you will be doing text-based classification, I will take it out again so as not to throw off your analysis.

The code below says: "Assign a zero-length string to the description column of the issues_prepared data frame for just those rows where 'No Description' is the current value of description."

# scrub the "No Description" from the description field
issues_prepared$description[issues_prepared$description=="No Description"] <- ""

Some of the Jira issues in our data set have just a summary (a title), while others have both a summary and a description. From a text classification perspective, there probably isn't much difference between these fields, so we will combine them into one chunk of text for each issue.

# combine description and summary text together
issues_prepared$text <- paste(issues_prepared$summary,' ',issues_prepared$description)

Then, since you only need that text and the has_bugs field, you can create a new data frame that contains only the 6th and 5th columns. An R data frame can be indexed positionally using a [rows, columns] convention. If I wanted to see the value for the third column of the fourth row, I could write issues_prepared[4, 3] to indicate just that single cell. I can also give ranges or lists to both the rows and columns arguments.

If I don't supply a specification for either the rows argument or the column argument, it will default to returning all available. So, when I ask R for issues_prepared[,c(6,5)], it means that I want all rows (since I left that first argument blank) and a specific set of columns. The definition of the set of columns is created by the combining function, c(), and is called a vector in R terminology.
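A quick toy example of this indexing convention (the data frame below is made up purely for illustration):

```r
# a small data frame to demonstrate [rows, columns] indexing
toy <- data.frame(id   = 1:3,
                  name = c("a", "b", "c"),
                  flag = c(TRUE, FALSE, TRUE),
                  stringsAsFactors = FALSE)
toy[2, 1]        # a single cell: row 2, column 1 -> 2
toy[, c(3, 2)]   # all rows; columns 3 and 2, in that order
```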

# make copy that contains only the text and bug status
issues_final <- issues_prepared[,c(6,5)]

Now you want to convert the has_bugs field in your data frame to a factor–a variable that represents a category. This helps to save space because a factor is structured with a numeric index that gets expanded into text values through a lookup table. Also, many learning algorithms will assume that the class value being predicted is stored as a factor.

# convert yes/no for bugs to a factor.
issues_final$has_bugs <- factor(issues_final$has_bugs)

You verify that you have converted has_bugs into a factor using the str() function that shows the structure of a variable. Then, you examine the distribution of issues with and without bugs by using the table() function, and also calculate this distribution as percentage values.

# examine the bug indicator more carefully
str(issues_final$has_bugs)
table(issues_final$has_bugs)
prop.table(table(issues_final$has_bugs))

The results show that has_bugs is now a factor, and that roughly 20% of the issues in our data set are associated with defects.

Now it's time to start working on the textual data to prepare it for analysis. For this you are going to use a text mining package called tm.

Install the package and make it available for use with the library() function. You only need to install tm or any other package one time per system, so you can comment out the line that does the installation after the first time you run this example. The tm package will help us build a special object for text analysis called a corpus, using a function called VCorpus that uses a text vector as input.

# build a corpus (body of docs) using the text mining (tm) package
# comment below line out after first install of tm
install.packages("tm")
library(tm)
issue_corpus <- VCorpus(VectorSource(issues_final$text))

Now that our corpus is created, you need to clean it up in a variety of ways. For text-based classification, you are generally looking for patterns in basic concepts. You don't want two tokens (pieces of text) to be treated differently just because they vary in superficial characteristics, such as capitalization. You also want to take out numbers, punctuation, and white space that are more likely to add noise than signal.

In addition, you want to remove any obvious "stop words" that would also add noise. Since this is data from the Derby project, I doubt references to Derby add much predictive value, and so I have taken them out.

My manual review of the data suggests that the presence of the word "bug" may also bias the analysis (recall my earlier discussion of data-gathering conventions on this particular project), so I'm going to take that word out as well. You are also going to standardize word stems, so that word variants will all be reduced to a common root. For example, the words learn, learned, and learning would all be reduced to the common root learn.

# clean up the corpus using tm_map()
issue_corpus_clean <- tm_map(issue_corpus, content_transformer(tolower)) # make things lowercase
issue_corpus_clean <- tm_map(issue_corpus_clean, removeNumbers) # remove numbers
# set up stopwords
custom_stopwords <- c('derby','bug')
complete_stopwords <- c(custom_stopwords, stopwords()) # combine custom and built-in stopwords
issue_corpus_clean <- tm_map(issue_corpus_clean, removeWords, complete_stopwords) # remove stop words
issue_corpus_clean <- tm_map(issue_corpus_clean, removePunctuation) # remove punctuation
issue_corpus_clean <- tm_map(issue_corpus_clean, stemDocument)
# eliminate unneeded whitespace
issue_corpus_clean <- tm_map(issue_corpus_clean, stripWhitespace) 

Having done all of the above, you can compare the original corpus to the cleaned-up corpus, and see all the changes that have been made. The lapply() function applies a function to a list of things. In this case, the as.character() function will be applied to each of the first three documents (which correspond to Jira issues) in each corpus, so you can see what the document currently contains.

As expected, the case of all letters has been lowered. Numbers and punctuation have been removed. Common noise words like "for" have been removed, and words have been stemmed (e.g., "needs" has become "need").

Analyzing the text

Now that you have obtained and cleaned up the data, you can turn to analysis. Before you try to build a model that predicts defects based on the text of user stories (Jira issues), you might wonder why I would even think this is possible at all. Is there anything that would lead you to believe that there are meaningful differences between the text of the stories that are associated with defects and the text of the stories that are not?

A simple comparison word cloud can help us answer this question. You start by getting the text from the cleaned up corpus and converting it into character values. Then combine this cleaned text into a new data frame alongside the labels that tell you whether or not a story is associated with defects.

This allows you to separate the text for each type of story using the subset() function and then pack all of the words for each type into two separate documents that are part of the same corpus. Next create a TermDocumentMatrix that tells you how often a word appears in each document. The code uses the sink() function to print the contents of that matrix to a file so that you can see how it is structured. Then, after giving user-friendly names to both of the columns (documents) in the matrix, it generates a comparison word cloud.

# word cloud prep
# get the text from the cleaned up corpus
text_list = lapply(issue_corpus_clean[1:length(issue_corpus_clean)], as.character)
# add this text and the bug labels into a data frame
text_with_labels = data.frame(unlist(text_list,use.names = FALSE), issues_final$has_bugs)
# put bug and no bug info into separate vectors
bugs <- subset(text_with_labels, issues_final$has_bugs == "Yes")
no_bugs <- subset(text_with_labels, issues_final$has_bugs == "No")
# collapse all entries into one variable for both bugs and non-bugs
bugs_text = paste(bugs[,1], collapse = " ")
no_bugs_text = paste(no_bugs[,1], collapse = " ")
# now build a new corpus on these two "documents"
all_text <- c(bugs_text, no_bugs_text)
corpus = Corpus(VectorSource(all_text))
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#see what is in a TDM
options(max.print=1000000)
matrix_tdm <- as.matrix(tdm)
sink("see_tdm_contents.txt")
matrix_tdm
sink()
# now you can look at the file created
# add column names
colnames(matrix_tdm) = c("Bugs","No Bugs")
#generate comparison wordcloud
library(wordcloud)
comparison.cloud(matrix_tdm,max.words=80,random.order=FALSE)

A comparison word cloud identifies the differences between two documents, showing the words that are most characteristic of each (i.e., more commonly present in one than the other). The resulting graphic, below, suggests that stories associated with defects are more likely to contain specific words such as support, column, trigger, and issue, than stories that are not associated with defects.

As such, you have some reason to think that a text-based classification approach could yield interesting results.

Now you will create a DocumentTermMatrix (DTM) for the entirety of our cleaned corpus. The DocumentTermMatrix has a different orientation of rows and columns than the TermDocumentMatrix created above, but the goal is the same—to record how often each distinct word from the overall corpus appears in each document (i.e., Jira issue).

Since a given word will only be in some of the documents, most of the cells of the matrix will contain zeros. And so R creates a sparse matrix–only non-zero values are stored, and the zeros are inferred. Again, the code uses sink() to print the matrix to a file so you can examine its structure.

# create a document-term sparse matrix for the whole corpus
issues_dtm <- DocumentTermMatrix(issue_corpus_clean)
#see what is in a DTM
matrix_dtm <- as.matrix(issues_dtm)
sink("see_dtm_contents.txt")
matrix_dtm
sink()
# now look at file

Now you are ready to build your classification model. Start by dividing the data in your document term matrix into two data sets: a training set and a test set. The training set will be fed into the naiveBayes function to train the model. (It is called "naive" because it assumes that the features are independent of one another, even though in practice the presence of one word makes it more or less likely that other words will be seen as part of the same document.)

Then you test that model against the test set to see how it performs on previously unseen data. You randomly sample 80% of the data for your training set and use the remaining 20% as your test set. Before you take your sample, set a random number-generation seed (12345) so that the results are reproducible.

Capture the values of has_bugs for each item in the test and training sets so that you can use them later. Finally, check to make sure that our sample contains a similar proportion of defect-associated stories in the test and training sets.

# creating training and test datasets
set.seed(12345)
train_pct <- .80
train = sample(1:nrow(issues_dtm), nrow(issues_dtm) * train_pct)
# use train as row indexes to extract rows for training set
issues_dtm_train <- issues_dtm[train, ]
# negate train to remove those same rows for creating test set
issues_dtm_test  <- issues_dtm[-train, ]
# also save the labels
issues_train_labels <- issues_final[train, ]$has_bugs
issues_test_labels  <- issues_final[-train, ]$has_bugs
# check that the proportion of bugs is similar
prop.table(table(issues_train_labels))
prop.table(table(issues_test_labels))

You have about 20% defects in both the test and training sets, which is good. It means that the random sampling process did not produce obvious differences in the two sets.

Now pare down your DTMs by removing words that don't appear in the training set at least 10 times, since such words won't have much predictive value. First identify the words, then use them to select the desired columns from the training and test DTMs you already created, because each column in a DTM corresponds to a word.

# indicator features for frequent words
# save frequently-appearing terms to a character vector
issues_freq_words <- findFreqTerms(issues_dtm_train, 10)
# create DTMs with only the frequent terms
issues_dtm_freq_train <- issues_dtm_train[ , issues_freq_words]
issues_dtm_freq_test <- issues_dtm_test[ , issues_freq_words]

The model will be relatively simple, and will rely on the presence or absence of words in a document (remember: each "document" represents the text from a Jira issue) rather than how many times they appeared. So convert all of the counts in your DTMs into a "Yes" for cases where the count is greater than zero and a "No" otherwise.

Write a simple function called convert_counts to do this, and apply it to the values in your matrix.

# convert counts to a factor
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}
# apply() convert_counts() to columns of train/test data
issues_train <- apply(issues_dtm_freq_train, MARGIN = 2, convert_counts)
issues_test  <- apply(issues_dtm_freq_test, MARGIN = 2, convert_counts)
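To see what the conversion does, here is a self-contained toy run (the matrix values are made up; the function is a slightly condensed version of the one above):

```r
# apply the Yes/No conversion to a tiny 2x2 count matrix
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
m <- matrix(c(0, 2, 1, 0), nrow = 2)  # column 1 is (0, 2); column 2 is (1, 0)
apply(m, MARGIN = 2, convert_counts)  # "No"/"Yes" in column 1, "Yes"/"No" in column 2
```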

Finally, it is time to train your model on the training set using the version of Naive Bayes implemented by the e1071 library. As before, you need to do a one-time installation of this library.

Create a new classifier for Jira issues using your training matrix, issues_train, and your vector of training labels that tell you whether an issue was associated with defects or not. naiveBayes will review the "evidence" of defects presented by the prevalence of the words in each category and build a model that will help categorize future issues based on the words they contain. Take this classifier and run it against your test data set to produce a set of predictions about the test data.

# comment below line out after first install of e1071
install.packages("e1071")
library(e1071)
issue_classifier <- naiveBayes(issues_train, issues_train_labels)
#Evaluating model performance ----
issues_test_pred <- predict(issue_classifier, issues_test)

Now see how well your classifier performed by using a package called gmodels to compare the predictions to the actual test labels you saved, and generate a confusion matrix that shows how often the predictions matched the actuals. Then print out related calculations using the individual cell values from the confusion matrix to make the model's performance clearer.

# comment below line out after first install of gmodels
install.packages("gmodels")
library(gmodels)
CT <- CrossTable(issues_test_pred, issues_test_labels,
                 prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
                 dnn = c('predicted', 'actual'))
paste('overall correct: ',(((CT$t[1,1] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))
paste('% of bug predictions correct: ',(CT$t[2,2]/(CT$t[2,1] + CT$t[2,2]))*100)
paste('% of bugs correctly identified: ',(CT$t[2,2]/(CT$t[1,2] + CT$t[2,2]))*100)
paste('overall % of bugs: ',(((CT$t[1,2] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))

The rows represent the predictions made by the model, and the columns represent the actual values from the labels on our test data set. The diagonal of No/No and Yes/Yes represents the cases in which you were correct, while the opposing diagonal represents cases in which your predictions did not match reality. 

For example, you predicted that 500 of the issues in your test set would not be associated with defects, and you were correct for about 410 of those. The remaining 90 were associated with defects in practice. Overall, you were about 76% correct in your predictions of which stories would and would not be associated with a defect.

In some ways that isn't bad, but bear in mind that you could have made this number 79% by just predicting that none of the stories would be associated with a defect. So you can't pat yourself on the back for overall prediction accuracy just yet.

Thinking about what you might want to do with your data (e.g., put more attention on risky stories), you should probably ask how often you turned out to be correct when you predicted that an issue would be associated with a defect. Once this model makes such a prediction, how often is it correct? 

Recall that roughly 21% of the stories in the test set contained defects. So, if you had just identified risky stories randomly, you could expect a roughly 21% hit rate. The model did meaningfully better than this: when it predicted that a story would have a defect, it was correct roughly 38% of the time, with 30 of its 79 defect predictions being accurate.

That is far from spectacular, but it is much better than random. The model also correctly identified roughly 25% of the total defects; not particularly good, but again better than chance.
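To make the arithmetic behind these percentages concrete, here is a quick sketch in Python (rather than the article's R) that reproduces the metrics from the approximate counts quoted above. The 49 false positives are inferred from the 79 defect predictions, of which 30 were correct:

```python
# Confusion matrix laid out as in CrossTable: rows = predicted, columns = actual
t = [[410, 90],   # predicted No:  [actual No, actual Yes]
     [49,  30]]   # predicted Yes: [actual No, actual Yes]

total = sum(sum(row) for row in t)                # 579 test issues
accuracy  = (t[0][0] + t[1][1]) / total           # overall correct
precision = t[1][1] / (t[1][0] + t[1][1])         # % of bug predictions correct
recall    = t[1][1] / (t[0][1] + t[1][1])         # % of bugs correctly identified
bug_rate  = (t[0][1] + t[1][1]) / total           # overall % of bugs

print(round(accuracy * 100), round(precision * 100),
      round(recall * 100), round(bug_rate * 100))
# -> 76 38 25 21
```

These are the same formulas as the paste() lines in the R code, just written out over a hard-coded matrix so you can check them by hand.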

How could you make the model better? Note that words that never appeared in the training data for a given class (and thus have zero for prior evidence) cause multiply-by-zero problems in the naive Bayes calculation. Since zero multiplied by anything is zero, a single such new word can wipe out the evidence from every other word in a document and dramatically skew the results.

Because this issue is so common, the typical remedy is Laplacian smoothing: you inject a small value (such as 1) into the frequency counts to take the place of zero, so that new words in a document do not unduly impact the way the model characterizes the document.
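To see why this helps, here is a tiny illustrative Python sketch (word_likelihood is a hypothetical helper, not part of any library, and the counts are made up) showing how add-one smoothing keeps an unseen word from zeroing out the probability product:

```python
def word_likelihood(word_count, class_total, vocab_size, laplace=0):
    # Estimate P(word | class), optionally with Laplacian (add-k) smoothing:
    # add `laplace` to the word's count, and `laplace` per vocabulary word
    # to the denominator so the probabilities still sum to 1.
    return (word_count + laplace) / (class_total + laplace * vocab_size)

# A word never seen in the "defect" class during training
# (1,000 word occurrences in that class, 500-word vocabulary):
print(word_likelihood(0, 1000, 500))             # 0.0 -> zeroes out the product
print(word_likelihood(0, 1000, 500, laplace=1))  # small but nonzero
```

With laplace = 1, the unseen word contributes a small probability (1/1500 here) instead of annihilating all the other evidence.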

Now rebuild your classifier using Laplacian smoothing and see how it performs. This is essentially the same code as before, just with the laplace = 1 argument added when the classifier is built.

#Improving model performance ----
issue_classifier_laplace <- naiveBayes(issues_train, issues_train_labels, laplace = 1)
issues_test_pred_laplace <- predict(issue_classifier_laplace, issues_test)
CT <- CrossTable(issues_test_pred_laplace, issues_test_labels,
                 prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
                 dnn = c('predicted', 'actual'))
paste('overall correct: ',(((CT$t[1,1] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))
paste('% of bug predictions correct: ',(CT$t[2,2]/(CT$t[2,1] + CT$t[2,2]))*100)
paste('% of bugs correctly identified: ',(CT$t[2,2]/(CT$t[1,2] + CT$t[2,2]))*100)
paste('overall % of bugs: ',(((CT$t[1,2] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))

Laplacian smoothing changed the results meaningfully. Overall classification accuracy improved a bit, to 78%, though this is still slightly worse than what you would get by predicting no defects at all.

Your ability to correctly identify bugs has increased to 27%. More interestingly, you saw a big leap in the accuracy of the defects you did predict: when the model flagged an issue as likely to be associated with defects, it was correct 44% of the time. Those are good odds relative to random, and they give you some confidence that taking extra precautions around those issues, such as shifting personnel or increasing targeted regression testing, could be worthwhile.

Wrapping up

There is no magic here. Predicting defects, especially based on something like user story text, is fraught with challenges. Nonetheless, you were able to show results that were much better than random guessing. A model like this could presumably supplement the risk-oriented decision-making in your organization and would provide relevant information in time for people to act upon it.

It's worth repeating that the Derby data was just a convenient public data source that I chose for this kind of analysis. Those who understand the circumstances under which the data was collected would be in a better position to assess the practical likelihood of predicting unexpected defects with this model in the context of their project. Nonetheless, the model shows that the words in an issue can help determine whether that issue is likely to be somehow associated with a defect, even if interpreting the meaning of that connection is better left to others.

Note that I did not just happen upon the Derby project for this article. I tried the analysis with several different Apache Foundation projects, and the Derby data showed the most promising results. The data collection practices in your own project will determine whether this kind of analysis has any predictive value for you.

In closing, here is all of our code in one shot, so that you can copy and paste it easily.

# get the file from the web
URL <- "https://goo.gl/5mG6Lq"
filename <- "Apache_Derby_Issue_Analysis_TechBeacon.txt"
download.file(url = URL, destfile = filename)
# read the issue data into a frame
issues_raw <- read.csv(filename, stringsAsFactors = FALSE,sep = "|")
# make a copy that we will alter
issues_prepared <- issues_raw
# scrub the "No Description" from the description field
issues_prepared$description[issues_prepared$description=="No Description"] <- ""
# combine description and summary text together
issues_prepared$text <- paste(issues_prepared$summary,' ',issues_prepared$description)
# make copy that contains only the text and bug status
issues_final <- issues_prepared[,c(6,5)]
# convert yes/no for bugs to a factor.
issues_final$has_bugs <- factor(issues_final$has_bugs)
# examine the bug indicator more carefully
str(issues_final$has_bugs)
table(issues_final$has_bugs)
prop.table(table(issues_final$has_bugs))

# build a corpus (body of docs) using the text mining (tm) package
# comment below line out after first install of tm
install.packages("tm")
library(tm)
issue_corpus <- VCorpus(VectorSource(issues_final$text))
# clean up the corpus using tm_map()
# make things lowercase
issue_corpus_clean <- tm_map(issue_corpus, content_transformer(tolower))
issue_corpus_clean <- tm_map(issue_corpus_clean, removeNumbers) # remove numbers
# set up stopwords
custom_stopwords <- c('derby','bug')
complete_stopwords <- c(custom_stopwords, stopwords())
issue_corpus_clean <- tm_map(issue_corpus_clean, removeWords, complete_stopwords) # remove stop words
issue_corpus_clean <- tm_map(issue_corpus_clean, removePunctuation) # remove punctuation
issue_corpus_clean <- tm_map(issue_corpus_clean, stemDocument)
# eliminate unneeded whitespace
issue_corpus_clean <- tm_map(issue_corpus_clean, stripWhitespace) 
# examine the final clean corpus
lapply(issue_corpus[1:3], as.character)
lapply(issue_corpus_clean[1:3], as.character)

# word cloud prep
# get the text from the cleaned up corpus
text_list <- lapply(issue_corpus_clean[1:length(issue_corpus_clean)], as.character)
# add this text and the bug labels into a data frame
text_with_labels <- data.frame(unlist(text_list, use.names = FALSE), issues_final$has_bugs)
# put bug and no bug info into separate vectors
bugs <- subset(text_with_labels, issues_final$has_bugs == "Yes")
no_bugs <- subset(text_with_labels, issues_final$has_bugs == "No")
# collapse all entries into one variable for both bugs and non-bugs
bugs_text <- paste(bugs[,1], collapse = " ")
no_bugs_text <- paste(no_bugs[,1], collapse = " ")
# now build a new corpus on these two "documents"
all_text <- c(bugs_text, no_bugs_text)
corpus <- Corpus(VectorSource(all_text))
# create term-document matrix
tdm <- TermDocumentMatrix(corpus)
#see what is in a TDM
options(max.print=1000000)
matrix_tdm <- as.matrix(tdm)
sink("see_tdm_contents.txt")
matrix_tdm
sink()
# now you can look at the file created
# add column names
colnames(matrix_tdm) = c("Bugs","No Bugs")
#generate comparison wordcloud
# comment below line out after first install of wordcloud
install.packages("wordcloud")
library(wordcloud)
comparison.cloud(matrix_tdm,max.words=60,random.order=FALSE)


# create a document-term sparse matrix for the whole corpus
issues_dtm <- DocumentTermMatrix(issue_corpus_clean)
#see what is in a DTM
matrix_dtm <- as.matrix(issues_dtm)
sink("see_dtm_contents.txt")
matrix_dtm
sink()
# now take a look at the file created

# creating training and test datasets
set.seed(12345)
train_pct <- .80
train <- sample(1:nrow(issues_dtm), nrow(issues_dtm) * train_pct)
# use train as row indexes to extract rows for training set
issues_dtm_train <- issues_dtm[train, ]
# negate train to remove those same rows for creating test set
issues_dtm_test  <- issues_dtm[-train, ]
# also save the labels
issues_train_labels <- issues_final[train, ]$has_bugs
issues_test_labels  <- issues_final[-train, ]$has_bugs
# check that the proportion of bugs is similar
prop.table(table(issues_train_labels))
prop.table(table(issues_test_labels))

# indicator features for frequent words
# save frequently-appearing terms to a character vector
issues_freq_words <- findFreqTerms(issues_dtm_train, 10)
# create DTMs with only the frequent terms
issues_dtm_freq_train <- issues_dtm_train[ , issues_freq_words]
issues_dtm_freq_test <- issues_dtm_test[ , issues_freq_words]


# convert counts to a factor
# any count greater than zero becomes "Yes", otherwise "No"
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
# apply() convert_counts() to columns of train/test data
issues_train <- apply(issues_dtm_freq_train, MARGIN = 2, convert_counts)
issues_test  <- apply(issues_dtm_freq_test, MARGIN = 2, convert_counts)

# comment below line out after first install of e1071
install.packages("e1071")
library(e1071)
issue_classifier <- naiveBayes(issues_train, issues_train_labels)
#Evaluating model performance ----
issues_test_pred <- predict(issue_classifier, issues_test)

# comment below line out after first install of gmodels
install.packages("gmodels")
library(gmodels)
CT <- CrossTable(issues_test_pred, issues_test_labels,
                 prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
                 dnn = c('predicted', 'actual'))
paste('overall correct: ',(((CT$t[1,1] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))
paste('% of bug predictions correct: ',(CT$t[2,2]/(CT$t[2,1] + CT$t[2,2]))*100)
paste('% of bugs correctly identified: ',(CT$t[2,2]/(CT$t[1,2] + CT$t[2,2]))*100)
paste('overall % of bugs: ',(((CT$t[1,2] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))
#Improving model performance ----
issue_classifier_laplace <- naiveBayes(issues_train, issues_train_labels, laplace = 1)
issues_test_pred_laplace <- predict(issue_classifier_laplace, issues_test)
CT <- CrossTable(issues_test_pred_laplace, issues_test_labels,
                 prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
                 dnn = c('predicted', 'actual'))
paste('overall correct: ',(((CT$t[1,1] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))
paste('% of bug predictions correct: ',(CT$t[2,2]/(CT$t[2,1] + CT$t[2,2]))*100)
paste('% of bugs correctly identified: ',(CT$t[2,2]/(CT$t[1,2] + CT$t[2,2]))*100)
paste('overall % of bugs: ',(((CT$t[1,2] + (CT$t[2,2]))/(CT$t[1,1]+CT$t[1,2]+CT$t[2,1]+CT$t[2,2]))*100))

Want to know more about machine learning and software development? Drop in on my keynote presentation at the Better Software East conference in Orlando.
