In Q4 of 2016 Windsor Circle released a Predicted Customer Value (PCV) module based on the "Buy Til You Die" approach. At the time, the model was vetted on a subset of historical data. Now that it has been in the field for close to a year, we have an opportunity to use current data to evaluate its real-world effectiveness. To do so, we use predictions generated using data through January 1, 2017 and evaluate how well those predictions held up over the next 6 months, through July 1, 2017.
Based on the analysis described below, the PCV model is able to predict future spend and churn at very high rates (83% and 86% accuracy, respectively).
Correlations between Predicted Outcomes and Observed Outcomes
A simple gauge of the association between two continuous variables is a Pearson correlation coefficient.
Correlation coefficients provide a straightforward preliminary check of the PCV model performance. When evaluating how well the models predict the intended outcomes, we hope for a strong positive correlation between the predicted and actual values.
We consider the association between three pairs of predicted and actual metrics: predicted and actual total spend, predicted and actual average order value (AOV), and predicted and actual number of orders. Across roughly 150 clients, we see average correlations of 0.66, 0.76, and 0.65 for these metrics, respectively. These values suggest a strong positive relationship between our predictions and the customer behavior observed in subsequent months.
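As a rough illustration of this check, the sketch below computes a Pearson correlation for each client and then averages across clients. The client names and spend values are made up for the example, and the `pearson` helper is a plain textbook implementation, not the code used in the analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: (predicted, actual) six-month spend per customer, by client.
clients = {
    "client_a": ([120.0, 40.0, 0.0, 310.0, 75.0], [100.0, 55.0, 10.0, 280.0, 60.0]),
    "client_b": ([20.0, 35.0, 50.0, 65.0], [25.0, 30.0, 70.0, 60.0]),
}

# One correlation per client, then a simple unweighted average across clients.
per_client = {name: pearson(pred, actual) for name, (pred, actual) in clients.items()}
average_r = sum(per_client.values()) / len(per_client)
print(per_client, average_r)
```

Averaging per-client coefficients weights every client equally regardless of size; that matches the "average correlation across clients" framing here, but it is an assumption about how the averages were formed.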
Correlations are a good quick check of our model performance, but we can’t stop there. With a few minor adjustments to our data we can use three standard performance metrics - accuracy, precision, and recall - to evaluate how well our model predicts certain categories of our outcome variables.
Binary Accuracy, Precision, and Recall
When dealing with binary variables - that is, variables that can take only two values such as on/off, purchased/didn’t purchase, churned/not churned - classification metrics allow for a more intuitive interpretation of model performance [1]. The most straightforward model validation metric is simple accuracy - out of all of our classifications, what percent of the time were we correct? However, in some cases it’s possible for a model to have very high accuracy but do a really bad job of discriminating between possible outcome categories. As such, it’s good practice to consider two other metrics as well - precision and recall.
Below is a quick refresher on these metrics. For detailed information on accuracy, precision, and recall check out this previous post.
But first... Confusion Matrices
All three metrics are based on just four numbers that are very easy to derive. We’ll start by constructing a confusion matrix — a 2 x 2 table that crosses the number of actual and predicted cases in each group. Convention is that the matrix columns represent the predicted classifications and the rows represent actual (i.e., observed) classifications:
Table 1. Confusion Matrix
                Predicted A     Predicted B
    Actual A    tp              fn
    Actual B    fp              tn
Cases that are correctly predicted to be members of the group of interest (group "A" in this example) are referred to as "true positives" and are counted in the top left cell. Cases that are predicted to be in group A but are actually in group B are "false positives" and are counted in the bottom left cell. False negatives and true negatives are placed in the top and bottom right cells, respectively. Table 1 uses common abbreviations for these groups: tp = true positive, tn = true negative, fp = false positive, and fn = false negative.
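The four cells can be tallied directly from paired actual and predicted labels. The following is a minimal sketch that assumes labels are coded 1 for the group of interest (group A) and 0 otherwise; the function name and the sample labels are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels, where 1 marks the group of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # predicted A, actually A
        elif p:
            fp += 1      # predicted A, actually B
        elif a:
            fn += 1      # predicted B, actually A
        else:
            tn += 1      # predicted B, actually B
    return tp, fp, fn, tn

# Hypothetical labels for eight cases.
actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
print(confusion_counts(actual, predicted))  # (4, 1, 1, 2)
```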
Once the confusion matrix is created, calculating accuracy is a natural next step. Accuracy is simply the proportion of correct classifications out of all classifications made. Stated differently, accuracy is the sum of the correct predictions - the top-left (tp) and bottom-right (tn) cells of the table - divided by the sum of all cells in the table:
$$\text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn}$$
Although straightforward and easy to understand, accuracy can be misleading in cases where the outcome is either very rare or very common. In these situations, an algorithm can achieve a very high accuracy by classifying everyone into the largest group without really being able to discriminate between the groups at all.
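To make the pitfall concrete, here is a small sketch with invented numbers: 1,000 customers of whom 950 churn, scored by a "model" that labels everyone as churned. The counts are hypothetical and chosen only to show the effect.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all classifications that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical: 950 of 1,000 customers churn, and every customer is predicted to churn.
tp, fp, fn, tn = 950, 50, 0, 0

print(accuracy(tp, fp, fn, tn))  # 0.95
# tn is 0: not a single non-churner was identified, despite 95% accuracy.
```

The high accuracy here comes entirely from the size of the larger group, which is why precision and recall are worth checking alongside it.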
Precision and Recall
Precision and recall both focus on the number of correct classifications for just the category of interest, e.g., purchase, churn, etc. Precision is the proportion of correct classifications out of all cases that were predicted to be in the category of interest:
$$\text{precision} = \frac{tp}{tp + fp}$$
In the context of churn prediction, precision is the proportion of cases that actually churned, out of those we predicted would churn.
Recall is the proportion of cases correctly predicted to be in the category of interest out of all cases that were actually in the category of interest:
$$\text{recall} = \frac{tp}{tp + fn}$$
In the context of churn prediction, recall is the proportion of cases correctly predicted to have churned, out of all those who had actually churned.
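Both definitions translate into one-line functions of the confusion matrix cells. The sketch below uses invented churn counts; returning 0.0 when a denominator is zero is a convention chosen for this example, not something specified in the analysis.

```python
def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical churn counts: 100 predicted churners, 120 actual churners, 80 in both groups.
tp, fp, fn = 80, 20, 40

print(f"precision = {precision(tp, fp):.0%}")  # precision = 80%
print(f"recall    = {recall(tp, fn):.0%}")     # recall    = 67%
```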
For simplicity we convert precision and recall to percentages. Average accuracy, precision, and recall for PCV churn predictions across all clients are 86%, 99%, and 83%, respectively.
Remember from our correlation analyses that some of our features of interest are continuous - they have many more than two values. We can apply the metrics above to these features by first making them binary. To do that, we create new indicator features that identify cases in the top quintile of predicted values and in the top quintile of actual values.
Total spend is often a feature of interest, so we will focus on it. For each client, if a case has a predicted value in the top quintile of predicted spend, it is assigned a value of "1" in our new binary, prediction-based feature; all other cases are assigned a value of "0." If a case has an observed spend in the top quintile of spend values, it is assigned a "1" in our new binary, actual spend-based feature; all other cases are assigned a value of "0." This re-mapping allows us to apply the classification metrics discussed above to our total spend variable. In doing so, we observe an average classification accuracy of 83% and precision and recall of 60% and 59%, respectively. On average, when we predict whether a case will be in or out of the top spend quintile over the next 6 months, we are correct 83% of the time.
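The re-mapping described above can be sketched for a single client as follows. The spend values are invented, and flagging the top 20% by rank (with ties broken by original order) is one reasonable reading of "top quintile"; the exact cutoff rule used in the analysis is not stated.

```python
def top_quintile_flags(values):
    """Return 1 for cases ranked in the top 20% of values, else 0."""
    n_top = max(1, len(values) // 5)
    # Indices of the n_top largest values; the sort is stable, so ties keep original order.
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top = set(ranked[:n_top])
    return [1 if i in top else 0 for i in range(len(values))]

# Hypothetical predicted and observed six-month spend for ten customers of one client.
predicted_spend = [500, 20, 0, 130, 45, 310, 5, 60, 220, 15]
actual_spend    = [420, 0, 30, 90, 50, 150, 0, 80, 600, 10]

pred_flag = top_quintile_flags(predicted_spend)
actual_flag = top_quintile_flags(actual_spend)

# Tally the confusion matrix cells from the two indicator features.
pairs = list(zip(actual_flag, pred_flag))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)

print("accuracy :", (tp + tn) / len(pairs))  # 0.8
print("precision:", tp / (tp + fp))          # 0.5
print("recall   :", tp / (tp + fn))          # 0.5
```

Because both indicators flag the same number of cases in this sketch, false positives and false negatives are equal, so precision and recall coincide.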
- When predicting whether customers will churn during the following six months, average accuracy, precision, and recall across all clients are 86%, 99%, and 83%, respectively. These metrics indicate a very well-performing churn prediction model.
- When predicting whether a customer’s total spend will be in the top spend quintile during the following six months, average accuracy, precision, and recall across all clients are 83%, 60%, and 59%, respectively. On average, when we predict whether a case will be in or out of the top spend quintile over the next 6 months, we are correct 83% of the time.
1 - For simplicity, this discussion focuses on two of the most important features — predicted churn and predicted total spend — but the results are representative of all of the PCV module outputs.