Our *raison d’être* in the Data Science group at Windsor Circle is to make marketing smarter.

To that end, this article covers some metrics that we use internally and your organization can start using right now to improve your marketing decisions. We use these frequently at Windsor Circle because they are important for evaluating a specific class of machine learning algorithm, but they are useful well beyond their machine learning applications. So let’s get to it.

**Use Cases: Gift Orders and Mix Tapes**

This post comes on the heels of some recent work we’ve been doing at Windsor Circle to identify gift purchases. I’ll be using that work as an example of how these metrics look when the algorithm works well. Since that’s not always the case, we’ll contrast this with a hypothetical example in which there are severe problems with the algorithm.

In both of these scenarios the model under scrutiny is a binary classifier. Binary classifiers separate things into two mutually exclusive groups, e.g., ‘Male/Female’, ‘Red/Blue’, ‘High/Low’, ‘Failed/Passed’, etc. These types of tasks are extremely common in machine learning, but they are also part of everyday decision-making.

*Use Case 1: Predicting Gift Orders*

In the first use case, we simply want to classify orders based on whether they are likely to be gifts or not. Knowing that an order is a gift allows us to better target follow-up marketing efforts.

*Use Case 2: Rebirth of the Mix Tape*

In the second use case, we’ll follow Susan, the owner of a music shop that only sells cassette tapes. It’s been a rough couple of decades for Susan, but she’s heard the buzz about cassettes making a comeback and has decided she wants to up her marketing game. Unfortunately, cassette tapes are pretty useless without cassette tape players and only a small percent of the population, 5% in this example, still owns one.

Susan would like to identify potential customers who own cassette tape players and focus her limited marketing resources on them. She’s hired a cut-rate data consulting group to create a binary classification algorithm that will identify cassette player owners from her list of 10,000 potential customers, and they’ve just delivered the results.

**Evaluating Binary Classification Algorithms**

There are five commonly used measures for evaluating binary classification algorithms - accuracy, precision, recall, the F1 score, and the Receiver Operating Characteristic (ROC) curve. The F1 score summarizes the relationship between precision and recall, so this list can be reduced to just three key metrics.

*Ground Truth Data*

Before getting started, it’s important to note that calculating these metrics requires some ‘ground truth’ data. For example, if you’re building an algorithm to predict high versus low customer lifetime value (CLV), you’ll need a group of cases where CLV is already known that can be sent through your algorithm and used for evaluation. Many machine learning approaches require this ground truth data to build the model, but if you’ve developed an algorithm some other way you may not have needed it going in. Regardless, you need it to do any evaluation. It is also best if the test data is separate from the data used to build the algorithm; otherwise your results may not be reliable benchmarks for how the algorithm will perform on new data.
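If you’re working in code, a hold-out split can be as simple as shuffling the labelled cases and reserving a slice for evaluation. Here’s a minimal Python sketch - the function name, 80/20 split, and fixed seed are illustrative choices, not anything prescribed in this post:

```python
import random

def train_test_split(cases, test_fraction=0.2, seed=42):
    """Shuffle labelled cases and reserve a fraction for evaluation only."""
    shuffled = cases[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```

The fixed seed just makes the split reproducible; the important part is that the test cases never touch the model-building step.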

*Confusion Matrices*

Most of the calculations we’ll be doing are based on just four numbers that are very easy to derive. We’ll start by constructing a confusion matrix - a 2 x 2 table that crosses the number of actual and predicted cases in each group. Convention is that the matrix columns represent the predicted classifications and the rows represent actual (i.e., ground truth) classifications:

Table1. Confusion matrix

Cases that are **correctly** predicted to be members of the group of interest - group ‘A’ in Table 1 - are referred to as ‘true positives’ and are counted in the top-left cell. Cases that are predicted to be in group A but are actually in group B are ‘false positives’, counted in the bottom-left cell. False negatives and true negatives are placed in the top-right and bottom-right cells, respectively. Table 1 illustrates this using the common abbreviations for these groups: *tp* = true positive, *tn* = true negative, *fp* = false positive, and *fn* = false negative.
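If you have the ground truth and predicted labels side by side, tallying these four counts takes only a few lines. Here’s a Python sketch - the labels are invented purely for illustration:

```python
def confusion_matrix(actual, predicted, positive="A"):
    """Return (tp, fp, fn, tn) counts for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # correctly predicted positive
        elif p == positive and a != positive:
            fp += 1   # predicted positive, actually negative
        elif p != positive and a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # correctly predicted negative

    return tp, fp, fn, tn

actual    = ["A", "A", "B", "B", "A", "B"]
predicted = ["A", "B", "B", "A", "A", "B"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```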

*Metric 1: Accuracy*

Calculating accuracy is a natural next step once the confusion matrix is created. Accuracy is simply the proportion of correct classifications out of all classifications made:

accuracy = (*tp* + *tn*) / (*tp* + *tn* + *fp* + *fn*)

Table 2 shows the confusion matrix for our gifting algorithm. Accuracy is the sum of the correct predictions - the top-left (*tp*) and bottom-right (*tn*) cells of the table - divided by the sum of all cells in the table. Of the 5,806 classifications made by the gifting algorithm, 84% are correct; i.e., the algorithm has an accuracy of .84. (For simplicity and clarity, I prefer to present accuracy as a percentage and will do so for the rest of the article.)
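In code, the calculation is a one-liner. The individual cell counts below are hypothetical - they aren’t the gifting matrix’s actual cells - but they were chosen to sum to 5,806 and reproduce the metrics reported in this post:

```python
# Hypothetical confusion-matrix cells, consistent with the post's totals
tp, fp, fn, tn = 1800, 700, 200, 3106

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))  # 0.84
```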

Table 2. Gift Order Confusion Matrix

Although straightforward and easy to understand, accuracy can be misleading - particularly in cases where the outcome is either very rare or very common. In these situations, an algorithm can achieve a very high accuracy by classifying everyone into the largest group without really being able to discriminate between the groups at all.

To see this effect in action, let’s have a look at the results from Susan’s cassette player ownership algorithm in Table 3. Applying the model to her 10,000 potential customers produces a confusion matrix that looks like this:

Table 3. Cassette Player Ownership Confusion Matrix

Out of 10,000 predictions, the model classified 9,500 correctly. Awesome; the model has an accuracy of 95%! Time to head out for some beers, right? Not if you're Susan. This model has given her absolutely no useful information. It has just lumped all 10,000 cases into the ‘No Cassette Player’ group. She can’t accurately predict whether someone owns a cassette player and therefore can’t efficiently target potential customers.

As obvious as this error seems, accuracy is used as the sole classification metric far too often. A word of advice - never use overall accuracy as your only measure of classification performance. Use it alongside a metric that considers more types of error. The F1 score does just that.

*Metric 2: F1 score*

The F1 score is a combination of two other metrics that are interesting and useful in their own right - precision and recall.

Precision and recall both focus on the number of correct ‘positive’ classifications in an algorithm’s predictions. Remember that ‘positive’ doesn’t necessarily mean ‘good’, it’s just lingo to indicate the event of interest. In the gifting example, gift orders are the ‘positive’ outcome. In Susan’s model it is cassette player ownership.

**Precision** is the proportion of correct positive classifications out of all positive *predictions*.

**Recall** is the proportion of correct positive classifications out of all actual positive *events*.

The F1 score combines these two metrics. Specifically, it is the harmonic mean of precision and recall, and if you know your properties of Pythagorean means, well...I’m pretty sure you could beat me at bar trivia.

Precision, recall, and the F1 score all range from 0 to 1 with values closer to 1 indicating better performance.

For the gifting model our precision, recall, and F1 values are .72, .90, and .80, respectively. All very acceptable values. However, Susan’s model doesn’t fare so well. In fact, the precision, recall, and F1 metrics are all 0. Although Susan’s model had a better overall accuracy, looking at these additional metrics illustrates it has some serious deficiencies.
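Here’s a Python sketch of these calculations. The gifting counts are hypothetical cell values chosen to reproduce the reported .72/.90/.80; note the guards against division by zero, which is exactly the situation a degenerate model like Susan’s produces:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = tp/(tp+fp), recall = tp/(tp+fn), F1 = their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gifting-model counts, consistent with the reported metrics
p, r, f1 = precision_recall_f1(tp=1800, fp=700, fn=200)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.72 0.9 0.8

# Susan's model never predicts 'owns a cassette player' (tp = 0),
# so all three metrics collapse to 0 despite its 95% accuracy.
print(precision_recall_f1(tp=0, fp=0, fn=500))  # (0.0, 0.0, 0.0)
```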

*Metric 3: Receiver Operating Characteristic Curve*

Another common measure for evaluating binary classifiers is the Receiver Operating Characteristic (ROC) curve. (The esoteric-sounding name is a vestige of the method's origins in evaluating the performance of WWII radar operators.)

A ROC curve shows the relationship between the true positive rate (recall) and the false positive rate across the entire range of classification thresholds an algorithm could use. Note that some machine learning methods assign each case a predicted probability that can be converted to a binary classification; other approaches simply provide the nominal classifications. A ROC curve requires the predicted probabilities and therefore can’t be calculated from the nominal classifications alone. The process for creating a ROC curve is beyond the scope of this post, but it’s something you should receive (or ask for) when handed a binary classification algorithm.

Figure 1 shows the ROC curve for our gifting model. The farther the blue line bends away from the red diagonal line the better. There is almost a right angle in the blue line, so we’re doing well.

To more succinctly summarize the information in ROC curves, we usually talk about the area of the chart under the blue line - the Area Under the Curve (AUC). More space there is better. AUC has a maximum value of 1; the value of .88 for the gift prediction approach indicates an algorithm that discriminates well between the two groups.
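Although plotting the full curve is beyond this post’s scope, AUC itself can be computed directly from the predicted probabilities. One common route is the rank-based (Mann-Whitney) formulation: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal Python sketch, with invented scores and labels, and ties ignored for brevity:

```python
def auc(labels, scores):
    """Rank-based AUC: labels are 0/1, scores are predicted probabilities."""
    pairs = sorted(zip(scores, labels))  # sort cases by predicted score
    # Sum of the (1-based) ranks of the positive cases
    pos_rank_sum = sum(rank for rank, (_, label)
                       in enumerate(pairs, start=1) if label == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(round(auc(labels, scores), 2))  # 0.89
```

A perfect ranking gives 1.0, and a coin-flip model - like Susan’s - hovers around 0.5.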

Figure 1: Gift Prediction ROC Curve

Figure 2 shows the ROC curve for Susan’s model and illustrates what the chart looks like for a very poorly performing algorithm. In a ROC curve chart, the straight diagonal line running from the bottom-left to the top-right corner indicates the performance expected from a model that does no better than randomly classifying cases. In other words, the ROC curve for Susan’s model shows that it does no better than classifying cases by flipping a coin (or a Def Leppard tape).

Figure 2: Cassette Player Prediction ROC Curve

**Wrapping Up**

The goal of this post has been to walk through some standard - and simple to calculate - metrics for evaluating binary classification algorithms. These methods are applicable wherever an algorithm classifies cases into two mutually exclusive options, whether that algorithm is based on machine learning or tea leaves.

In the next post we’ll dive a little deeper into how to use this information to make informed decisions that affect revenue. Til then.