The typical retailer’s relationship with her customers is that they can show up to buy something at any time, and she has no way to tell when they have stopped for good. Customer churn, in her world, is guessed, not observed. Contrast this with a cell phone service provider, whose customers pay their bill on the same date every month and call to cancel the service before they stop paying.
Unfortunately, the typical churn models you can google are better suited to the latter. In fact, the canonical public data set for churn model demonstrations is one of 3,333 cell phone customers, with actual observed customer lifetimes from acquisition to contract cancellation and some details about their call and text activity in the meantime. With this kind of training data you can model the probability of churn in a variety of ways that are useful to anybody who sells anything on a subscription basis (newspapers, credit card providers, CSA boxes, etc.). This is a good overview of three popular options, but it’s not going to be much use to Windsor Circle clients: they are typical retailers.
The Google-friendly acronym for the type of churn problem that we’re trying to solve is BTYD (Buy ’Til You Die; the less macabre name is Buy ’Til You Defect). So, we don’t have observed churn, but we do have a guess: as time goes by after the last time we saw a customer, the probability that this customer is dead to us grows from somewhere close to 0 to somewhere close to 1. That’s the churn score. The trick is to find a way to tie it somehow to things we know about our customers, so that we can give each of them an estimate of this score based on observed behavior.
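To make the idea concrete before any real modeling: a toy sketch, assuming every customer dies at a single constant weekly hazard. The function name and the rate are invented for illustration; the models discussed below improve on this by letting the rate vary per customer.

```python
import math

def toy_churn_score(weeks_since_last_purchase, weekly_dropout_rate=0.05):
    """Toy churn score: the probability that a customer with a constant
    weekly dropout hazard has died by now, given how long ago we last saw
    them. The 0.05 rate is an arbitrary illustrative value."""
    return 1.0 - math.exp(-weekly_dropout_rate * weeks_since_last_purchase)
```

The score starts near 0 right after a purchase and climbs toward 1 as the silence drags on, which is exactly the shape of the guess described above.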
Two Flavors of a Solution
An online retailer’s order log holds the usual data that all marketers use for RFM (recency, frequency, monetary value) segmentation: it tells us when each customer was acquired, when they bought whatever they bought so far, and how long ago we saw them last.
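For concreteness, here is a minimal sketch of boiling an order log down to the per-customer quantities these models consume (frequency of repeat purchases, recency, and customer age, all in weeks). The tiny order log and field names are made up.

```python
from datetime import date

# Hypothetical order log: (customer_id, order_date) pairs.
orders = [
    ("alice", date(2015, 4, 3)), ("alice", date(2015, 5, 9)),
    ("alice", date(2015, 6, 20)), ("bob", date(2015, 4, 15)),
]
observation_end = date(2015, 7, 1)

def rfm_summary(orders, observation_end):
    """Per customer: frequency = number of repeat purchases, recency =
    weeks from first to last purchase, T = weeks from first purchase to
    the end of the observation window."""
    by_customer = {}
    for cid, d in orders:
        by_customer.setdefault(cid, []).append(d)
    summary = {}
    for cid, dates in by_customer.items():
        dates.sort()
        first, last = dates[0], dates[-1]
        summary[cid] = {
            "frequency": len(dates) - 1,
            "recency": (last - first).days / 7.0,
            "T": (observation_end - first).days / 7.0,
        }
    return summary

summary = rfm_summary(orders, observation_end)
```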
Order log data can be thought of as drawn randomly from two probability mixtures: one which describes the purchasing behavior while active, and one which describes the dropout behavior. The “random” part in each customer’s own transaction stream does not need much explaining: think of the gaps between your own grocery store trips. The “mixture” part is simply a way to acknowledge that different people buy at different rates while active. If the average purchase rates of customers – not just each customer’s own actual purchases – are also drawn from a probability distribution, you have a mixture process. Also, people tend to churn randomly – I might stop going to this particular store tomorrow, or the next week – but at different baseline rates: some are more loyal than others (in other words, you can be loyal and buy rarely). That’s the other mixture.
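This generative story can be written down in a few lines. The sketch below simulates one customer under assumed gamma mixtures for both the purchase rate and the dropout rate (essentially the Pareto/NBD story); the parameter values are arbitrary, chosen only to make the simulation run.

```python
import random

random.seed(0)

def simulate_customer(weeks=52, r=1.0, alpha=8.0, s=0.7, beta=12.0):
    """One customer drawn from the two-mixture process:
    - purchase rate lam ~ Gamma(r, 1/alpha): heterogeneous buying rates
    - dropout rate  mu  ~ Gamma(s, 1/beta): heterogeneous loyalty
    While alive, gaps between purchases are Exponential(lam); the customer
    dies after an Exponential(mu) lifetime and buys nothing after that."""
    lam = random.gammavariate(r, 1.0 / alpha)
    mu = random.gammavariate(s, 1.0 / beta)
    lifetime = random.expovariate(mu)
    t, purchases = 0.0, []
    while True:
        t += random.expovariate(lam)
        if t > min(lifetime, weeks):
            break
        purchases.append(t)
    return purchases, min(lifetime, weeks)
```

Note the two separate draws: `lam` captures “different people buy at different rates while active,” and `mu` captures “some are more loyal than others.”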
If you already know that there are some established ways to model this twofold heterogeneity in order log data, and you have already heard of the Pareto/NBD model and the BG/NBD model, then this blog post is for you. Their specifics, by the way, are explained nicely here and here and they are worth going over, but they are not the point of this blog post.
The point of this blog post is this: the Pareto/NBD model has a reputation for being hard to estimate, and the sources above both recommend BG/NBD as a more tractable alternative. What would Windsor Circle do?
It’s nice that BG/NBD is simple enough that you could implement it in Excel – as shown here – but this is not a consideration for us. What we care about is how well the two perform when compared to each other. The plot below – using real data from a client of ours – suggests that they’re equally good, so you should go with what’s easier:
In this plot we observe new customers acquired throughout 2015 Q2, then followed for a little over a year. The actual count of weekly total transactions is shown by the jagged light green line, and its general pattern should make intuitive sense: there is a peak at the end of 2015 Q2, when we stop adding new customers to our data set, and then the weekly transaction counts taper off toward an eventual zero, at which point all customers in this particular cohort have churned.
The goal is to train a churn model using the time interval between week 0 and the vertical gray line. At that time we will give each customer a score between 0 and 1, equal to the probability that they’ve churned. This probability can be predicted by either the Pareto/NBD model or the BG/NBD model and so far they look like equally good candidates: their predicted transaction rates – shown in the two darker shades of green – approximate the actual numbers equally well.
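To make that “score between 0 and 1” concrete: BG/NBD, for one, has a closed-form probability that a customer is still alive (from Fader, Hardie & Lee’s model note), which flips directly into a churn score. A sketch, with purely hypothetical fitted parameter values:

```python
def bgnbd_p_alive(x, t_x, T, r, alpha, a, b):
    """BG/NBD P(alive | x repeat purchases, last one at week t_x,
    observed for T weeks), given fitted parameters (r, alpha) of the
    purchase-rate gamma mixture and (a, b) of the dropout-probability
    beta mixture. A customer with no repeat purchases (x == 0) is alive
    with certainty under BG/NBD, because dropout can only happen
    immediately after a purchase."""
    if x == 0:
        return 1.0
    odds_dead = (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + odds_dead)

# Churn score = 1 - P(alive); the fitted parameters here are invented.
churn = 1.0 - bgnbd_p_alive(x=4, t_x=20.0, T=52.0, r=0.25, alpha=4.0, a=0.8, b=2.4)
```

As you would hope, the score falls as the last purchase gets more recent (larger `t_x` for the same `x` and `T`).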
We stop at the gray line because we want a sanity check for our churn probability estimates. We know we can’t actually observe churn; retailers don’t get that option. But there should at least be a negative relationship between the estimated churn probability and the actual transactions observed during the period after the gray line. In other words, a good model will predict a high probability of churn for customers who indeed make no transactions in this holdout period (so called because we didn’t show it to the model; we held it out, so we can see how well the model does when confronted with data it hasn’t seen).
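A crude version of that sanity check fits in a few lines. The threshold and the example numbers below are invented for illustration; a rank correlation between score and holdout count would be the fuller test.

```python
def holdout_sanity_check(churn_scores, holdout_counts, threshold=0.5):
    """Crude check: customers scored above `threshold` (predicted
    churners) should average fewer holdout-period transactions than
    those scored below it. Returns None if either group is empty."""
    high = [c for s, c in zip(churn_scores, holdout_counts) if s >= threshold]
    low = [c for s, c in zip(churn_scores, holdout_counts) if s < threshold]
    if not high or not low:
        return None
    return sum(high) / len(high) < sum(low) / len(low)

# Hypothetical churn scores and holdout transaction counts:
ok = holdout_sanity_check([0.9, 0.8, 0.2, 0.1], [0, 1, 3, 5])
```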
So, which model wins by this measure?
Clearly, Pareto/NBD wins. Never mind that the dark green line dips below 0; it’s just a LOESS smoother whose only job is to show us the general trend, so it’s OK if it wiggles a bit. The relationship that matters is the downward slope of this line: Pareto/NBD guesses a higher churn probability for the people who make no transactions in the holdout period, and a gradually lower one for people who turn out not to have churned. You might say that this inverse relationship follows the strength of the evidence: the model gives people ever lower churn probability estimates as the evidence that they haven’t churned yet, measured by actual transaction counts, grows stronger.
On the other hand, you might also argue that this “strength of the evidence” metaphor doesn’t really apply here, because churn is a binary outcome. Somebody who makes even one transaction in the holdout period has not churned, so any churn probability guessed to be higher than zero is equally bad, regardless of the number of transactions observed. That would be true, strictly speaking. But look at the alternative: BG/NBD predicts fairly uniformly low churn probabilities for everybody, regardless of how many transactions they’ve made in the follow-up period. Would you trust this model more? This is not a rhetorical question. You might reason that it’s perfectly possible that people haven’t churned (so we correctly estimate low probabilities of churn) even if we don’t see them in the follow-up period. They might come back later.
OK, you could do this, but we won’t. Our reasoning is that if we ignore the relationship between the observed transaction counts in the holdout period and the churn prediction – as choosing BG/NBD over Pareto/NBD would require – then we are left with no acceptable gauge for checking whether our churn predictions are any good. Because we can’t think of a better yardstick than performance in a holdout period, we choose Pareto/NBD over BG/NBD.
It gets even better. Our evidence from several clients shows that this inverse relationship between the Pareto/NBD churn probability prediction and the actual observed transactions in the holdout period is robust even in the presence of high noise. Below is such an example: the plot of a cohort acquired in Q4 of 2004; end-of-year buyers may be Christmas shoppers or Black Friday discount hunters, an especially fickle bunch.
You can see that both Pareto/NBD and BG/NBD struggle here, and you might be tempted to conclude that they’re equally hopeless. The actual weekly transaction counts are too erratic to be well approximated by four parameters, which is all either of these models has to work with. And yet:
Again Pareto/NBD wins. At Windsor Circle, we will favor its churn predictions over those of BG/NBD. But, since BG/NBD is easier to compute and tends to converge on non-degenerate parameter estimates when Pareto/NBD does not, we do it both ways and pick Pareto/NBD when feasible, unless the evidence shows that BG/NBD is better (hasn’t happened yet).