Which customers are more likely to respond to a bank’s marketing campaigns?


A quick demonstration on business consulting with data science

Audience

The intended audience for this blog post is marketers who have read the earlier post on the 5-step data science consulting framework and are keen to learn more about how such projects are actually implemented. We will use the caret package in R for a quick demonstration.

Overview

The data set can be downloaded from the UCI Machine Learning Repository. It consists of 41,188 customer records from the direct marketing campaigns (phone calls) of a Portuguese banking institution, with the variables below:

Client: age, job, marital, education, default status, housing, and loan

Campaign: last contact type, last contact month of year, last contact day of the week, and last contact duration

Others: number of contacts performed in the current campaign, number of days that have passed since the client was last contacted, number of contacts performed before this campaign, outcome of the previous campaign, and whether the client has subscribed to a term deposit
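Assuming the bank-additional-full.csv file from the UCI repository (which is semicolon-separated), a minimal loading step might look like this; the file name is taken from the repository and the variable name mydata matches the snippets below:

```r
# Load the UCI bank marketing data; the file uses semicolons as separators
mydata <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)
dim(mydata)  # expect 41,188 rows
```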

Step 1: Business Problem

The Portuguese bank has seen a revenue decline and would like to know what actions to take. After investigation, we found that the root cause is that their clients are not making term deposits as frequently as before. Term deposits allow a bank to hold onto funds for a fixed period of time, so the bank can invest in higher-yield financial products to make a profit. In addition, banks have a better chance of persuading term-deposit clients to buy other products, such as funds or insurance, to further increase revenue. As a result, the bank would like to identify existing clients who are more likely to subscribe to a term deposit and focus its marketing efforts on them.

Step 2: Analytics Objective

We will take a classification approach to predict which clients are more likely to subscribe to term deposits.

Step 3: Data Preparation

Let us first see the summary of our data set.

Screenshot of the data set summary
summary(mydata)
str(mydata)

Next, we will visualize the relationship between each variable and y (whether or not a client has subscribed to a term deposit) using box plots. We found that only duration shows a clear separation between the two classes.

Box chart of Duration versus y
Box chart of Age versus y
library(ggplot2)
p_duration <- ggplot(mydata, aes(factor(y), duration)) + geom_boxplot(aes(fill = factor(y)))

We will also need to create dummy variables, since some inputs such as job are categorical. We can do this manually with ifelse, as shown below.

for(level in unique(mydata$job)){
  mydata[paste("job", level, sep = "_")] <- ifelse(mydata$job == level, 1, 0)
}

Or we can do this with the dummyVars function in caret. Note that dummyVars only builds the encoder; predict applies it to the data.

dummies <- dummyVars(y ~ ., data = mydata)
mydata_dummies <- predict(dummies, newdata = mydata)

In addition, since the variables are on different scales, scaling is recommended, though tree-based models do not usually require it to perform well. We must split the data set into training and testing sets before scaling: if we scaled both sets together, information from the testing set would leak into the preprocessing and bias our performance estimates.

set.seed(1)
training_size <- floor(0.80 * nrow(mydata))
train_ind <- sample(seq_len(nrow(mydata)), size = training_size)
training <- mydata[train_ind, ]
testing <- mydata[-train_ind, ]
preProcValues <- preProcess(training, method = c("center", "scale"))
scaled.training <- predict(preProcValues, training)
scaled.testing <- predict(preProcValues, testing)

We can observe that y is highly skewed: we have many non-subscribers, and subscribers account for only about 11.7% of all clients. As a result, resampling is recommended. Here we will demonstrate four popular resampling techniques on the training data set: under-sampling, over-sampling, SMOTE, and ROSE.

library(DMwR)  # provides SMOTE
library(ROSE)  # provides ROSE

# Note: these snippets assume the response column has been renamed from y to Class
down_training  <- downSample(x = scaled.training[, -ncol(scaled.training)], y = scaled.training$Class)
up_training    <- upSample(x = scaled.training[, -ncol(scaled.training)], y = scaled.training$Class)
smote_training <- SMOTE(Class ~ ., data = scaled.training)
rose_training  <- ROSE(Class ~ ., data = scaled.training, seed = 2)$data

Finally, we are done with the data preparation for model training. Note that for simplicity we have omitted some pre-processing steps, such as outlier removal and identification of correlated variables.

Step 4: Model Development

We will train the model with CART (classification and regression trees), since we value interpretability over raw predictive power here. We will use repeated cross-validation to train the model, and instead of accuracy we will use ROC (area under the ROC curve, the closer to 1 the better) as the evaluation metric.

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
orig_fit <- train(Class ~ ., data = training,  # baseline on the original, imbalanced training set
                  method = "rpart",
                  metric = "ROC",
                  trControl = ctrl)

After training the same model on each of the resampled data sets, we will compare their performance in terms of ROC. Over-sampling (up) has the best performance, with a mean ROC of 0.8078530.
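The comparison can be sketched by fitting the same CART specification to each resampled training set and collecting the cross-validation results with caret's resamples function; the sketch below reuses the data frames and the ctrl object defined earlier:

```r
# Fit the same rpart model on each version of the training data
model_list <- list(
  original = train(Class ~ ., data = scaled.training, method = "rpart",
                   metric = "ROC", trControl = ctrl),
  down  = train(Class ~ ., data = down_training,  method = "rpart",
                metric = "ROC", trControl = ctrl),
  up    = train(Class ~ ., data = up_training,    method = "rpart",
                metric = "ROC", trControl = ctrl),
  smote = train(Class ~ ., data = smote_training, method = "rpart",
                metric = "ROC", trControl = ctrl),
  rose  = train(Class ~ ., data = rose_training,  method = "rpart",
                metric = "ROC", trControl = ctrl)
)

# Collect and compare cross-validated ROC across the five models
resamps <- resamples(model_list)
summary(resamps)
```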

Screenshot of the CART model performance on test data

We will visualize and interpret the CART model. It is observed that indeed duration has the most impact on subscriptions.

Screenshot of the top CART model

After converting the scaled duration back to the original units, we find the threshold call time is 473 seconds, or approximately 8 minutes. In other words, if a call lasts longer than 8 minutes, the chance of subscribing is 84%. In addition, if a call lasts between 205 and 473 seconds and the contact method is unknown, the chance of subscribing is 64%.
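Because duration was centered and scaled with preProcess, the split values in the fitted tree are in standard-deviation units; converting them back to seconds just reverses the transformation. A sketch, where the scaled split value is an illustrative placeholder to be read off the tree:

```r
# preProcess stores the training mean and standard deviation per column
dur_mean <- preProcValues$mean["duration"]
dur_sd   <- preProcValues$std["duration"]

# original seconds = scaled value * sd + mean
scaled_split <- 0.85  # hypothetical value read off the fitted tree
scaled_split * dur_sd + dur_mean
```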

Step 5: Performance Testing

Since the most significant variable is call duration, we need to find more information about the successful calls, such as the sales representatives who conducted the calls and the recorded conversations, in order to create strategies to make the calls last longer.

We might also want to look into why contact method has been recorded as unknown for some of the clients, rather than telephone or cellphone, in order to derive more meaningful insight.

We also need to keep in mind that correlation does not always imply causation, and there might be other hidden reasons for a client to subscribe. For example, a long call duration might simply be the result of interested clients asking questions or setting up their deposits over the phone. We will therefore also need to run small-scale A/B tests to see whether the subscription rate increases significantly with longer call durations.

Final Thoughts

That sums up the blog post. Although we do not end with fully actionable insights, it may still help shed light on how to solve business problems with data science.

Sometimes in data science we are modeling behaviors (the what) rather than motivations (the why). That is, we know that longer call durations are associated with a higher chance of subscribing, but we cannot know exactly why, or what the clients' motivations are. In other words, our CRM databases record what customers are doing, not what they are thinking.

In order to create adequate business strategies, sometimes we still need to leverage qualitative research such as interviews and focus groups.
