Predicting Customer Response to Starbucks Promotional Offers
Detailed Walk-Through of a Udacity Data Scientist Nanodegree Capstone Project using Supervised Machine Learning with Python, Pandas and Scikit-learn
I must admit it’s not the coffee, but the data, that got me interested in Starbucks. As a non-coffee drinker I have never made a purchase at Starbucks in my whole life, but lots of people obviously do: the global coffee chain operates over 30,000 locations in more than 70 countries and achieved a revenue of $26.5 billion in 2019. One key success factor is Starbucks’ effective use of data science; a blogger recently described the company as not a coffee business, but a data tech company.
The machine learning project below builds on this approach: it is based on data from a simulated customer test provided by Starbucks in cooperation with Udacity. The data set depicts how customers make purchasing decisions and how these decisions are influenced by promotional offers. While the task for this project is not exactly prescribed, my approach is to use the data to build a supervised machine learning model that predicts customer response to different types of promotional offers, based on demographic customer data and offer features.
The aim of this blog post is to showcase the whole data science process involved in this task in a beginner-friendly way, from business understanding to model deployment, loosely following the Cross Industry Standard Process for Data Mining (CRISP-DM) framework. The real challenge, as you will see, was interpreting and wrangling the data in a meaningful way rather than the actual modelling, which was made easy by the user-friendly and well-documented API of Scikit-learn. The blog post focuses on these considerations and their results, while the full code is available as a Jupyter Notebook on GitHub.
In this section I will use Pandas to provide a brief overview of the given data, while at the same time performing some preliminary data cleaning operations (e.g. converting data types, dropping missing values).
During a 30-day test period, ten different offers were sent out to users of the Starbucks rewards app program. The test included three types of offers: buy-one-get-one (BOGO), discount, and informational. The offers were delivered via four different channels: web, email, mobile, and social. The test data contains logs of more than 300,000 time-stamped events, including receiving offers, opening offers, completing offers and making purchases. In addition, demographic data is provided for the 17,000 participating customers.
The data consists of three JSON files:
- portfolio.json — meta data about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed
Let’s have a closer look at each of these data sets:
Offers sent during 30-day test period (10 offers x 6 fields):
- id (string) — offer id
- offer_type (string) — BOGO, discount, informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings) — web, email, mobile, and social
Data cleaning operations performed:
- Creating more self-explanatory ids.
- Converting categorical features in the offer_type and channels columns into one-hot encoding (original offer_type column preserved temporarily for easier data visualization in EDA).
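The two cleaning steps above could look roughly like the following sketch. The toy rows, the offer_name column and the "type_n" naming scheme are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd

# Toy stand-in for portfolio.json; rows and values are made up for illustration.
portfolio = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "offer_type": ["bogo", "discount", "informational"],
    "difficulty": [10, 7, 0],
    "reward": [10, 3, 0],
    "duration": [7, 10, 3],
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email", "mobile", "social"],
        ["web", "email"],
    ],
})

# More self-explanatory ids: offer type plus a running number per type (assumed scheme).
portfolio["offer_name"] = (
    portfolio["offer_type"]
    + "_"
    + (portfolio.groupby("offer_type").cumcount() + 1).astype(str)
)

# One-hot encode the list-valued channels column: one 0/1 column per channel.
channel_dummies = (
    portfolio["channels"].str.join("|").str.get_dummies().add_prefix("channel_")
)

# One-hot encode offer_type, keeping the original column for later plots.
type_dummies = pd.get_dummies(portfolio["offer_type"], prefix="type", dtype=int)

portfolio_clean = pd.concat(
    [portfolio.drop(columns="channels"), channel_dummies, type_dummies], axis=1
)
print(portfolio_clean)
```

Joining each channel list into a single delimited string first makes it possible to use the string-based get_dummies, which avoids looping over the rows.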
Resulting data frame:
It’s important to keep in mind that for informational offers both difficulty and reward are always zero.
Rewards program users (17,000 users x 5 fields):
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (O for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income
Data cleaning operations performed:
- Dropping missing values. The data set contains NaN values in columns gender, age (encoded as 118) and income. The NaN values always occur in the same rows, resulting in 2,175 out of 17,000 customers (12.8 %) without any demographic data. As the main target of this project is to build a predictive model based on demographic data and there is no meaningful method for imputing these data, I will drop these rows.
- The date when a customer created an app account is encoded in became_member_on as an integer. To make the values easier to compare and to use them as an input feature for my ML model, I will convert them into membership_days, defined as the time delta between the date when the respective customer joined and the date when the most recent customer joined (as no date for the 30-day test is given).
- Converting categorical features in gender into one-hot encoding (temporarily keeping the original column).
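A minimal sketch of these three steps, using a made-up four-row profile. It assumes that became_member_on is an integer of the form YYYYMMDD; the customer ids and values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for profile.json; the integer date format YYYYMMDD is an assumption.
profile = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "gender": ["F", None, "M", "O"],
    "age": [55, 118, 34, 41],
    "became_member_on": [20170715, 20170212, 20180726, 20160101],
    "income": [112000.0, np.nan, 56000.0, 70000.0],
})

# Age 118 is the placeholder for a missing age; turn it into a real NaN first.
profile["age"] = profile["age"].replace(118, np.nan)
profile_clean = profile.dropna(subset=["gender", "age", "income"]).copy()

# Parse the integer date and measure membership relative to the newest member.
joined = pd.to_datetime(profile_clean["became_member_on"].astype(str), format="%Y%m%d")
profile_clean["membership_days"] = (joined.max() - joined).dt.days

# One-hot encode gender while keeping the original column for visualizations.
gender_dummies = pd.get_dummies(profile_clean["gender"], prefix="gender", dtype=int)
profile_clean = pd.concat([profile_clean, gender_dummies], axis=1)
print(profile_clean)
```

With this definition the most recently joined customer gets membership_days = 0, and every other customer gets the number of days they joined earlier.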
Resulting data frame:
Distribution of demographic customer data:
Event log (306,648 events x 4 fields):
- event (str) — record description (transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test, the data begins at time t=0
- value (dict of strings) — either an offer id or transaction amount depending on the record
Data cleaning operations performed:
- Converting value (dict of strings) into separate columns offer_id, amount and reward. This will inevitably create some NaN values, which are not a problem here, as these columns will not be used as input features for the ML model.
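Expanding the dictionaries might be sketched as follows. The key names in the toy records, including the two spellings "offer id" and "offer_id", are assumptions about the raw file, as is the renaming of events to underscore style.

```python
import pandas as pd

# Toy stand-in for transcript.json; dictionary keys are assumed, values are made up.
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c1", "c1"],
    "event": ["offer received", "offer viewed", "transaction", "offer completed"],
    "time": [0, 6, 12, 12],
    "value": [
        {"offer id": "a1"},
        {"offer id": "a1"},
        {"amount": 12.5},
        {"offer_id": "a1", "reward": 10},
    ],
})

# Normalise key spellings so that "offer id" and "offer_id" end up in one column.
normalised = transcript["value"].apply(
    lambda d: {key.replace(" ", "_"): val for key, val in d.items()}
)

# Expand the dicts into columns; keys missing from a record become NaN.
values = pd.DataFrame(normalised.tolist(), index=transcript.index).reindex(
    columns=["offer_id", "amount", "reward"]
)

transcript_clean = pd.concat([transcript.drop(columns="value"), values], axis=1)
# Underscore-style event names are easier to filter on later.
transcript_clean["event"] = transcript_clean["event"].str.replace(" ", "_")
print(transcript_clean)
```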
Resulting data frame:
Distribution of event types in transcript:
One important fact to keep in mind is that time in transcript is measured in hours while the duration in portfolio is measured in days (a duration of 7 days corresponds to 168 hours). Neglecting these different time units when merging information from both data frames would badly distort the results.
Data Preparation: Accumulating Success Indicator Data for Offers
The main goal in this segment is to define a success indicator for each individual promotional offer that can be used as target label for the ML model later on.
For building a ML model based on the outcomes of each offer received during the test, the structure of the transcript data frame with individual records for all four event types is not optimal. To make it easier to navigate the data, I will create a new data frame offers that contains only the rows for offer_received events, and subsequently add further data for these offers.
1.) Scrutinizing the Experiment Design
This results in a rather basic data frame, which nevertheless reveals some important insights into the experiment setup (yes, data understanding isn’t over yet):
1. All offers were sent out at six distinct points in time during the test phase.
2. At each of these points in time, a roughly equal number of all ten different offers was distributed.
3. The number of offers each customer received varies between one and six.
4. The same customer might have received the same offer several times (up to five times) during the test.
5. At most one offer was distributed to each individual customer at each point in time.
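The five observations above can be checked with a few group-by operations. The sketch below builds the offers data frame from a tiny made-up transcript; the column name time_received is an assumption.

```python
import pandas as pd

# Tiny made-up transcript with the value column already expanded into offer_id.
transcript = pd.DataFrame({
    "person": ["c1", "c2", "c1", "c2", "c1", "c1"],
    "event": ["offer_received", "offer_received", "offer_viewed",
              "offer_received", "offer_received", "transaction"],
    "time": [0, 0, 6, 168, 168, 170],
    "offer_id": ["bogo_1", "discount_1", "bogo_1", "bogo_1", "bogo_1", None],
})

# Keep only the offer_received events as the basis of the offers data frame.
offers = (
    transcript.loc[transcript["event"] == "offer_received",
                   ["person", "offer_id", "time"]]
    .rename(columns={"time": "time_received"})
    .reset_index(drop=True)
)

# 1. Distinct points in time at which offers were sent out.
send_times = sorted(offers["time_received"].unique().tolist())
# 2. Number of offers of each kind per send time.
offers_per_wave = pd.crosstab(offers["time_received"], offers["offer_id"])
# 3. Number of offers received per customer.
offers_per_customer = offers.groupby("person").size()
# 4. How often the same customer received the same offer.
repeats = offers.groupby(["person", "offer_id"]).size()
# 5. Maximum number of offers per customer and send time.
max_per_wave = offers.groupby(["person", "time_received"]).size().max()

print(send_times, offers_per_customer.to_dict(), repeats.max(), max_per_wave)
```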
2.) Determining Offer Success
The next step is to accumulate data on whether an offer was successful or not. This is actually not as trivial as it seems.
The first idea would be to simply look for offer_completed events in the transcript data frame. As there is no unique id for the individual offer itself (offer_id refers only to the ten different offer variations from portfolio) and the same kind of offer might be sent out multiple times to one and the same customer, I will assign to each offer_received event all offer_completed events with the same offer_id that were recorded for the same customer_id within the duration period of the offer.
There is, however, one weakness in the experimental design by Starbucks that makes it impossible to unequivocally allocate every offer_completed event to an offer_received event. Because the duration of each offer varies between 72 and 240 hours, the intervals between sending out offers range from 72 to 168 hours, and the same offer can be distributed to the same customer several times in a row, the duration intervals of the same kind of offer can overlap for the same customer. So if, for example, a customer received a discount_1 offer with a duration of 240 hours at hour 336 and then again at hour 408, and there is an offer_completed event for this user_id and offer_id in hour 450, which offer was actually completed? With the given data we simply do not know. However, these events are pretty rare, so I will simply consider each offer as completed whenever there is a completion event for this offer_id and user_id within its duration.
Furthermore, an offer_completed event is registered whenever a customer makes purchases that reach the difficulty level of the respective offer, no matter if they ever viewed the offer. In contrast, we are only interested in customers whose purchase was induced by the offer. And while it is impossible to unequivocally determine whether a promotional offer really caused a purchase or whether the customer would have made the purchase anyway, it is clear that a purchase was not induced by an offer if the customer did not view the offer beforehand.
To determine whether an offer was successful or not, I will therefore pull both the view time and completion time for every offer and add them as new columns to the offer data frame (whenever there are such events within the duration of the offer), and then define an offer as successful whenever it was both completed within its duration time and the offer was viewed before completion.
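The definition above could be implemented along the following lines. This is a sketch on made-up data: the helper first_event_time and the column names are hypothetical, and taking the earliest view and completion inside the window is one possible reading of the rule. Note the conversion of duration from days to hours.

```python
import pandas as pd

# Made-up offers (duration in days, as in portfolio) and events (time in hours).
offers = pd.DataFrame({
    "person": ["c1", "c2", "c3"],
    "offer_id": ["bogo_1", "bogo_1", "discount_1"],
    "time_received": [0, 0, 168],
    "duration": [7, 7, 10],
})
events = pd.DataFrame({
    "person": ["c1", "c1", "c2", "c3", "c3"],
    "event": ["offer_viewed", "offer_completed", "offer_completed",
              "offer_completed", "offer_viewed"],
    "offer_id": ["bogo_1", "bogo_1", "bogo_1", "discount_1", "discount_1"],
    "time": [6, 30, 48, 200, 210],
})

# Convert the duration from days to hours before comparing with event times.
offers["time_expires"] = offers["time_received"] + offers["duration"] * 24


def first_event_time(offers, events, event_name):
    """Earliest time of event_name per offer row, restricted to the offer's duration."""
    candidates = offers.reset_index().merge(
        events[events["event"] == event_name], on=["person", "offer_id"]
    )
    in_window = candidates["time"].between(
        candidates["time_received"], candidates["time_expires"]
    )
    first = candidates[in_window].groupby("index")["time"].min()
    return first.reindex(offers.index)


offers["time_viewed"] = first_event_time(offers, events, "offer_viewed")
offers["time_completed"] = first_event_time(offers, events, "offer_completed")

# Successful = completed within the duration and viewed no later than the completion.
offers["success"] = offers["time_completed"].notna() & (
    offers["time_viewed"] <= offers["time_completed"]
)
print(offers)
```

In this toy example only the first offer counts as successful: the second was completed without ever being viewed, and the third was viewed only after it had been completed.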
With these new columns it is possible to determine whether an offer was successful, according to the definition above (I also added a column with information whether there is demographic data available):
At this point, I thought I had generated success information for each offer, but a quick data visualization revealed that I actually missed one important aspect:
As I defined success based on offer completion, and there was no offer completion recorded for informational offers, all these offers are currently labelled as not successful. Instead, informational offers should be labelled as successful if any purchase was made after the offer was viewed and within the duration of the offer. Therefore, we have to re-evaluate all informational offers that have been viewed within their duration period:
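A sketch of this re-evaluation on made-up data. Identifying informational offers by the prefix of their offer_id and the column names time_viewed and time_expires are assumptions.

```python
import pandas as pd

# Made-up informational offers, all currently labelled as not successful.
offers = pd.DataFrame({
    "person": ["c1", "c2", "c3"],
    "offer_id": ["informational_1"] * 3,
    "time_viewed": [10.0, 20.0, None],
    "time_expires": [72, 72, 72],
    "success": [False, False, False],
})
transactions = pd.DataFrame({
    "person": ["c1", "c2", "c3"],
    "time": [30, 90, 40],
    "amount": [4.5, 8.0, 3.0],
})

# Only informational offers that were actually viewed can become successful.
is_info = offers["offer_id"].str.startswith("informational")
viewed_info = offers[is_info & offers["time_viewed"].notna()]

# Pair each such offer with all transactions of the same customer.
candidates = viewed_info.reset_index().merge(transactions, on="person")

# A purchase counts if it happened after the view and before the offer expired.
hit = candidates["time"].between(candidates["time_viewed"], candidates["time_expires"])
successful_rows = candidates.loc[hit, "index"].unique()
offers.loc[successful_rows, "success"] = True
print(offers)
```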
The result looks much more reasonable:
3.) Merge Offers with Demographic Customer Data and Portfolio Data
After creating a success column as target label for all offers, the next step will be to merge in further demographic customer data and offer features that will be used as input features for the ML model, resulting in one single data set with all relevant information. For this, I will only use offers with demographic data available, resulting in 66,501 offer records:
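The merge itself is a pair of joins. In the sketch below (toy data, assumed column names) the inner join on the cleaned profile is what drops offers sent to customers without demographic data, which is also why the success rate changes.

```python
import pandas as pd

offers = pd.DataFrame({
    "person": ["c1", "c1", "c2", "c3"],
    "offer_id": ["bogo_1", "discount_1", "bogo_1", "discount_1"],
    "success": [True, False, False, True],
})
# c2 has no demographic data and is therefore missing from the cleaned profile.
profile = pd.DataFrame({
    "id": ["c1", "c3"],
    "age": [55, 34],
    "income": [112000.0, 56000.0],
    "membership_days": [376, 0],
})
portfolio = pd.DataFrame({
    "offer_id": ["bogo_1", "discount_1"],
    "difficulty": [10, 7],
    "reward": [10, 3],
    "duration": [7, 10],
})

df = (
    offers
    # Inner join: keeps only offers whose customer has demographic data.
    .merge(profile.rename(columns={"id": "person"}), on="person", how="inner")
    # Left join: attaches the offer features to every remaining offer.
    .merge(portfolio, on="offer_id", how="left")
)

rate_all = offers["success"].mean()
rate_with_demographics = df["success"].mean()
print(len(df), rate_all, rate_with_demographics)
```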
The data frame still contains a few redundant columns, like gender (as the same information is encoded in ‘gender_M’ and ‘gender_O’), which are kept temporarily for easier data visualization.
Dropping offers for customers without demographic data available also increased the overall success rate from 39.86 % to 43.02 %, caused by a very low success rate of only 18.37 % for customers without demographic data available (probably there are many non-active customers in this group).
With this success rate, I would consider the data set balanced enough to not use any methods for further balancing the data (like undersampling, oversampling, or SMOTE) before fitting the ML model later on, as each of these methods comes with its own trade-offs.
Correlations between Input Features and Promotional Offer Response
Before starting the actual modelling, it’s worth plotting a few data visualizations to further scrutinize the correlations between certain features — both demographic customer features and offer features — and promotional offer success.
There are quite a lot of interesting insights to draw from these visualizations, e.g. on the correlation between offer type, gender and success. For some reason, BOGO offers seem to be less attractive for male Starbucks customers. It’s also striking how success rates differ within the same offer type (e.g. between discount_1 and discount_2), which might be caused by other offer features like channel, difficulty, duration, or reward. In addition, it’s worth having a closer look at the influence of membership time and income level on offer success.
As the main focus of this project is on building a predictive model, I will not comment on the results in detail here, but rather let the data speak for itself and leave it up to you to draw further conclusions:
1. Splitting into Training and Testing Data
Before starting the model building, it’s time for some final data preparation by dropping redundant columns, resulting in a data frame df with one target label (success) and 14 numerical input features for the model. For model training and evaluation, the data frame is first converted into two different numpy arrays for input features (X) and target labels (y) and then split into training data (70%) and testing data (30%):
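A minimal sketch of this step with a random stand-in for df (14 made-up feature columns plus success). The fixed random_state and the stratified split are my additions for reproducibility, not something the description above prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for the prepared data frame: 14 numerical features and a 0/1 label.
rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(200, 14)),
    columns=[f"feature_{i}" for i in range(14)],
)
df["success"] = rng.integers(0, 2, size=200)

# Separate input features and target label as numpy arrays.
X = df.drop(columns="success").to_numpy()
y = df["success"].to_numpy()

# 70 % training data, 30 % testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```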
2. Model Selection
The next important decision is the selection of a machine learning algorithm. Predicting offer success as defined above is a typical supervised classification task, providing a wide choice of different algorithms available in Scikit-learn. I will implement several of them to compare their performance on the given data, but first we should keep in mind one important thing regarding the real-world application of our model:
Predicting customer response to promotional offers based on demographic data with absolute certainty is impossible, even with perfect data and a perfect model, simply because in real life even one and the same customer will react differently to one and the same offer on different occasions (depending on factors like weather, mood, or current location). So for a real-world application, it would be much better to predict the probabilities of successful offer responses rather than just predicting success or not. Therefore, I will not use any classification algorithms that do not (directly) support probability estimation, such as Support Vector Machines.
Based on these considerations, I have chosen six different ML algorithms for performance comparison:
3. Model Building
For optimizing the hyperparameters of each model, I will use GridSearchCV, which runs cross-validation on every combination in a given parameter grid. GridSearchCV is used in combination with Pipeline, so that the StandardScaler is fitted only on the training portion of each cross-validation fold. This is important, because standardizing the whole training data set first and then running cross-validation in the grid search would cause data leakage: the scaler would already have seen the validation folds.
As a first step, the parameter grid for each classification algorithm is defined in a dictionary, which is used for looping through all six classification algorithms to:
- Instantiate and fit the model using GridSearchCV and Pipeline
- Make predictions
- Print out a classification report for model evaluation
- Save the trained model as pkl file
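The loop described above might be sketched like this. The synthetic data, the two classifiers shown, their parameter grids and the file location are placeholders chosen to keep the example fast; they are not the grids used in the project.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the offer data (14 features, binary label).
X, y = make_classification(n_samples=300, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Classifier name -> (estimator, parameter grid). The grids are placeholders.
classifiers = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"clf__max_depth": [3, 5]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"clf__n_estimators": [20, 50]},
    ),
}

fitted = {}
for name, (estimator, grid) in classifiers.items():
    # The scaler sits inside the pipeline, so it is re-fitted on each training fold.
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)

    y_pred = search.predict(X_test)
    print(name, search.best_params_)
    print(classification_report(y_test, y_pred))

    # Save the best pipeline (scaler plus classifier) as a pkl file.
    model_path = Path(tempfile.gettempdir()) / f"{name}.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(search.best_estimator_, f)
    fitted[name] = search
```

Parameter names in the grid carry the prefix of the pipeline step (here clf__), which is how GridSearchCV knows which step a parameter belongs to.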
Comparing the classification reports for the different algorithms shows that with accuracy scores between 0.624 and 0.692 the performance results for all six classifiers are relatively close to each other.
The best-performing algorithms for this task are all models based on decision trees, with ensemble methods like Gradient Boosting and Random Forest delivering only slightly better accuracy scores than a basic decision tree. The price for these slightly better results is a significantly longer run time for training the model with both ensemble methods. For this rather manageable data set, however, this aspect doesn’t carry much weight, so I will use the best-performing Gradient Boosting classifier for deploying the model later on.
Taking a closer look at the individual classification reports also reveals that all classifiers, except Naive Bayes, have better F1-scores for the majority class (‘no success’) than for the minority class (‘success’). This observation raised my suspicion that the slightly imbalanced data set might be a problem. To be on the safe side, I rebuilt the model with various resampling methods (undersampling, oversampling, and SMOTE) as the first step in the pipeline (an Imblearn Pipeline in this case).
Classification report with undersampling:
But while the scores for the minority class slightly improved after resampling for most classifiers, the price was poorer scores for the majority class. In the end, all resampling methods provided slightly worse overall accuracy scores, so I decided to stick with the original data and not perform any resampling operations.
Taking into consideration the fact that customer response to promotional offers depends on many more factors than simply demographic customer data and offer features (as discussed above), I would consider an accuracy rate of almost 70 percent a quite satisfying result for this classification task and would make the case for deploying the model.
Before implementing the model, it’s worth taking a look at the importance of the individual input features, using a model inspection technique called permutation importance (permutation_importance in Scikit-learn), which calculates the decrease in model score when the values of a single feature are randomly shuffled.
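A minimal sketch of the technique on synthetic data; the data set, model settings and number of repeats are placeholders, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with six features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature ten times on the held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Rank features from most to least important by their mean score decrease.
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing the importances on the test data rather than the training data shows which features the model relies on when generalizing, not just which ones it memorized.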
The plot reveals that the most important single input features for the model are whether an offer was distributed via social media and rewards programme membership time, which is in accordance with the correlations between input features and offer response as depicted above. Other important features are income and age on the customer side as well as reward and duration on the offer side.
Successfully building a machine learning model is nice, but we should also keep in mind the practical application. How could the marketing team at Starbucks profit from this model? One very basic possible application would be to build a function that predicts success probabilities for all promotional offers for a given customer ID:
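Such a function could be sketched as follows. The name predict_offer_success, the reduced feature set and the throwaway model trained on random data are all hypothetical stand-ins; in practice the trained pipeline and the full 14 input features would be used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Reduced, made-up feature set for illustration.
customer_cols = ["age", "income", "membership_days"]
offer_cols = ["difficulty", "reward", "duration"]

profile = pd.DataFrame(
    {"age": [25, 60], "income": [40000.0, 90000.0], "membership_days": [100, 900]},
    index=["c1", "c2"],
)
portfolio = pd.DataFrame(
    {"difficulty": [10, 7, 0], "reward": [10, 3, 0], "duration": [7, 10, 3]},
    index=["bogo_1", "discount_1", "informational_1"],
)

# Throwaway model fitted on random data, standing in for the trained classifier.
rng = np.random.default_rng(0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(rng.normal(size=(200, 6)), rng.integers(0, 2, size=200))


def predict_offer_success(customer_id, model, profile, portfolio):
    """Return the predicted success probability of every offer for one customer."""
    customer = profile.loc[customer_id, customer_cols]
    # One row per offer, each combined with the same customer features.
    features = portfolio[offer_cols].copy()
    for col in customer_cols:
        features[col] = customer[col]
    X = features[customer_cols + offer_cols].to_numpy(dtype=float)
    # Column 1 of predict_proba is the probability of the positive class (success).
    proba = model.predict_proba(X)[:, 1]
    return pd.Series(proba, index=portfolio.index, name=customer_id)


probs = predict_offer_success("c1", model, profile, portfolio)
print(probs)
```

Because the function only needs offer features and not an offer_id, passing a portfolio with additional hypothetical rows would score new offers in the same way.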
This function could e.g. be used to provide the marketing team with a formatted output of success probabilities for all offers for a given list of customer IDs. In the example given below, all promotional offers with a success probability above a defined threshold are marked green and the offer with the highest success probability for each customer is highlighted:
Because the model takes offer features from portfolio as input features, but not the offer_id itself, we can also use the model to predict the success probability of new offers with different feature values:
The function could also be used to automate the distribution of potentially successful offers to customers. The highlighting above indicates that simply sending each customer the offer with the highest individual success probability would drastically limit the variation of offers distributed. A better approach could be to define a certain threshold and send each customer a randomly chosen offer with a success probability above this threshold (or no offer at all if none is above the threshold).
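The threshold idea could look like the following sketch. The function name choose_offer, the default threshold of 0.5 and the example probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd


def choose_offer(probabilities, threshold=0.5, rng=None):
    """Pick a random offer whose success probability exceeds the threshold.

    Returns None if no offer is above the threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    eligible = probabilities[probabilities > threshold]
    if eligible.empty:
        return None
    return str(rng.choice(eligible.index.to_numpy()))


# Made-up success probabilities for one customer.
probabilities = pd.Series(
    {"bogo_1": 0.72, "discount_1": 0.64, "informational_1": 0.31}
)

chosen = choose_offer(probabilities, threshold=0.5, rng=np.random.default_rng(1))
nothing = choose_offer(probabilities, threshold=0.9)
print(chosen, nothing)
```

Choosing uniformly among all eligible offers keeps the variety of distributed offers higher than always picking the single best one; weighting the choice by probability would be a possible refinement.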
There are of course lots of advanced, more valuable applications possible, but I think these examples already demonstrate the potential real-world use of the model.
In the post above I described the complete data science process involved in using demographic customer data, meta data on promotional offers, and transaction data records from a 30-day experiment provided by Starbucks to build a machine learning model for predicting customer response to promotional offers. The post highlights the importance of data understanding and wrangling in this process. The project was successfully completed by training a supervised classifier using gradient boosting, an ensemble method based on decision trees, which turned out to be the best algorithm for this task and predicts customer response with an accuracy rate of almost 70 percent. Given that even one and the same customer will react differently to one and the same offer, this is a quite satisfying result. I also drafted a few possible real-world applications for the model.
Model Limitations and Possible Improvements
From a business perspective there is one major caveat with the rather technical definition of success applied here, making it less attractive to actually implement the model: A promotional offer was defined as successful as soon as it was both viewed within its duration and a transaction above the offer’s threshold was conducted between the viewing event and the expiry of the offer. Some demographic groups, however, will make purchases even if they don’t receive an offer, so in terms of business figures it would not make sense to send them a “spend 10 dollars, get 2 dollars off” offer if they would have spent 10 dollars anyway.
A more business oriented definition of success would be to define offers as successful only when the amount spent after viewing an offer is greater than the sum of the amount the customer would usually spend during the same period of time and the reward the user got for completing the offer. I actually pursued this approach first, but eventually abandoned it due to two limitations in the experimental design:
- As there is no data provided on what customers usually spend, we would have to pull this information from the transaction data for the 30-day test period. In this context it is important to take into account only those transactions not influenced by any offers when calculating the usual customer spending, to avoid circular reasoning. However, the time periods during the test that were not influenced by any offers turned out to be so limited for many customers that it was impossible to calculate reliable average spending.
- Even if we had reliable average purchase data for all customers, we would still face the problem of overlapping offer duration intervals discussed above, which in this case also concerns offers of different types. As a major proportion of transactions is conducted within the duration of multiple offers, there is no way to unambiguously attribute an increase in spending to a certain offer viewed. To calculate the monetary effect of distinct offers, it would therefore be necessary to implement a different experimental design with non-overlapping duration intervals.
Apart from these obstacles, one could also argue that a purely monetary business definition of success is problematic, as it could tempt the company to not send any promotional offers to its most loyal patrons, assuming that ‘they would make the purchase anyway’. In the long run, this is probably not the best way to promote customer loyalty.
This project was developed as part of my Udacity Data Scientist Nanodegree. Which also means: I’m a learner, not an expert (with an academic background in humanities, not CS). Constructive feedback is much appreciated. Or get in touch on LinkedIn.