This is an engineering blog post written for engineers and technical managers interested in applying machine learning to improve predictive analytics within their organizations.  Connect with the author: @n_ranjan or LinkedIn.

Summary:

We used regularized linear regression to learn the parameters of a prediction model that takes 11 lead characteristics as input and outputs the lead-to-opportunity velocity (LTOV: the number of days a lead takes to convert to an opportunity).  Our model predicted LTOV values with 23.64% less error than the “average method” described in more detail below.

Context:

A key question we are trying to answer for marketing departments is: given a set of leads, how much revenue will result from them and how will it be distributed across future months?  For example, here is how we forecast how much revenue will result in April 2014 from webinar leads created in February 2014 (similarly, we can also forecast how much revenue will result from the February leads in March or May):

R = L * LTO% * OTD% * DA, where

R = Revenue in April 2014 from Feb 2014 webinar Leads

L = Number of webinar Leads created in Feb 2014 such that (Created Date + LTOV + OTDV) is a date in April 2014

LTOV = Average number of days (Velocity) for webinar Leads to convert to Opportunities

OTDV = Average number of days (Velocity) for webinar Opportunities to convert to Deals

LTO% = Lead to Opportunity conversion rate for webinar Leads

OTD% = Opportunity to Deal conversion rate for webinar Opportunities

DA = Average webinar Deal Amount
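To make the formula concrete, here is a minimal sketch in Python; all input values below are made up for illustration, and the variable names are ours rather than anything from our production code:

# Revenue forecast sketch: R = L * LTO% * OTD% * DA (illustrative values only).
num_leads = 250            # webinar leads created in Feb 2014 whose (Created Date + LTOV + OTDV) falls in April 2014
lto_rate = 0.12            # lead to opportunity conversion rate for webinar leads
otd_rate = 0.30            # opportunity to deal conversion rate for webinar opportunities
avg_deal_amount = 25000.0  # average webinar deal amount, in dollars

revenue_april_2014 = num_leads * lto_rate * otd_rate * avg_deal_amount
print(revenue_april_2014)  # 225000.0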

For example, the LTOV value is calculated by averaging the LTOV values of all webinar leads created in the past 6 months.  In other words, we assume that the average LTOV value in the past is a good predictor of the LTOV value in the future.  This is a good first step, but it predicts the same LTOV value (the average over the past 6 months) for every future webinar lead and thus ignores any lead characteristics that might influence the predicted LTOV value.

We wanted to use machine learning techniques to see if we could use past data and relevant lead characteristics to predict future values of LTOV, OTDV, LTO%, OTD%, and DA better than the average method.

We chose to focus on LTOV values first and this article presents the results of our efforts.

Method:

Given a lead L, we assume that its LTOV value is:

LTOV(L) = (P0 * X0) + (P1 * X1) + (P2 * X2) + (P3 * X3) + … + (P14 * X14)

where

P0, P1, … P14 are parameters of the model,

X1, X2, … X11 are lead characteristics of L (number of campaign touches, number of campaign responses, etc.),

X0 = 1,

X12 = X1^2,

X13 = X2^2,

X14 = X3^2.

We can think of X0 to X14 as the lead’s 15-element feature vector.
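To make the feature mapping concrete, here is a minimal sketch in Python/NumPy of how a lead’s feature vector could be assembled; the helper name feature_vector and the use of NumPy are our own choices for illustration, not a description of our production code:

import numpy as np

def feature_vector(raw_characteristics):
    # raw_characteristics: the 11 lead characteristics X1..X11
    # (number of campaign touches, number of campaign responses, etc.).
    x = np.asarray(raw_characteristics, dtype=float)
    squared = x[:3] ** 2                          # X12 = X1^2, X13 = X2^2, X14 = X3^2
    return np.concatenate(([1.0], x, squared))    # X0 = 1, then X1..X11, then X12..X14 (15 elements)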

We are given a data set of leads that have converted to opportunities in the past 9 months.

We divide the data set into two equally sized randomly sampled sets to create our training set and test set.

m = number of leads in the training set = number of leads in the test set.

From the training set:

1) We create a matrix X_train with m rows and 15 columns such that the i’th row of the matrix, X_train(i), contains the 15-element feature vector for the i’th lead, L(i).

2) We create a vector y_train whose i’th element is the LTOV value of the i’th lead, L(i).

Similarly, we create the matrix X_test and the vector y_test from the test set.
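Here is a sketch of the random 50/50 split and matrix assembly, reusing feature_vector from the sketch above and assuming leads_raw is an (n x 11) array of raw lead characteristics and ltov_days is the corresponding vector of observed LTOV values (both hypothetical names):

rng = np.random.default_rng(0)
n = leads_raw.shape[0]
idx = rng.permutation(n)
train_idx, test_idx = idx[: n // 2], idx[n // 2 :]

X_all = np.vstack([feature_vector(row) for row in leads_raw])   # shape (n, 15)
X_train, y_train = X_all[train_idx], ltov_days[train_idx]
X_test, y_test = X_all[test_idx], ltov_days[test_idx]
m = X_train.shape[0]   # number of leads in the training set (and in the test set)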

We use the normal equations method to compute the parameter vector, [P0, P1, … , P14], that minimizes the following cost function, J, on the training set:

J = 1 / (2 * m) * { Sum(i = 1 to m) [ (LTOV( X_train(i) ) – y_train(i))^2 ] + lambda * (P1^2 + P2^2 + … + P14^2) }

where

lambda = the regularization parameter.
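Translated directly into code, the cost function looks like this (a sketch; note that P0, the intercept parameter, is excluded from the regularization term):

def regularized_cost(X, y, P, lam):
    # J = 1 / (2m) * { sum of squared prediction errors + lambda * (P1^2 + ... + P14^2) }
    residuals = X @ P - y
    return (residuals @ residuals + lam * (P[1:] @ P[1:])) / (2 * len(y))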

Using the normal equations method:

[P0, P1, … P14] = (X_train^T * X_train + lambda * D)^-1 * X_train^T * y_train

where

D = 15 x 15 identity matrix with the (1, 1) element replaced with 0.
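A minimal NumPy sketch of the regularized normal equation, continuing the sketches above (np.linalg.solve is used instead of forming the matrix inverse explicitly, which is the more numerically stable choice):

def fit_normal_equation(X, y, lam):
    # Solve (X^T * X + lambda * D) * P = X^T * y, where D is the identity matrix
    # with its first diagonal element set to 0 so that the intercept P0 is not regularized.
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

P = fit_normal_equation(X_train, y_train, lam=9000.0)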

Using the P0 to P14 parameters, we calculate the training set error, J_train, and the test set error, J_test:

J_train = 1 / (2 * m) * Sum(i = 1 to m) [ (LTOV( X_train(i) ) – y_train(i))^2 ],

J_test = 1 / (2 * m) * Sum(i = 1 to m) [ (LTOV( X_test(i) ) – y_test(i))^2 ]
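And the corresponding error calculations (again a sketch; as in the formulas above, the regularization term is left out when reporting J_train and J_test):

def cost(X, y, P):
    # J = 1 / (2m) * sum of squared prediction errors.
    residuals = X @ P - y
    return residuals @ residuals / (2 * len(y))

J_train = cost(X_train, y_train, P)
J_test = cost(X_test, y_test, P)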

We use the following 16 values for lambda: 30000, 15000, 11000, 10000, 9000, 8000, 3000, 1000, 300, 100, 50, 30, 20, 10, 3, 1.  For each value of lambda, we calculate the training set error, J_train, and the test set error, J_test, as shown in the table below.  We choose the lambda value (9000) that yields the lowest test set error, 882.07; a code sketch of this sweep follows the table.

Lambda    Training Set Error    Test Set Error
30,000    801.43                905.16
15,000    784.28                885.43
11,000    778.22                882.53
10,000    776.53                882.20
9,000     774.76                882.07
8,000     772.89                882.19
3,000     760.31                888.56
1,000     746.91                893.52
300       723.68                888.36
100       698.77                886.42
50        687.60                892.84
30        682.18                899.92
20        679.12                905.71
10        675.63                914.44
3         673.02                924.47
1         672.51                928.72
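Here is a sketch of the lambda sweep that produced the table above, reusing fit_normal_equation and cost from the earlier sketches:

lambdas = [30000, 15000, 11000, 10000, 9000, 8000, 3000, 1000, 300, 100, 50, 30, 20, 10, 3, 1]
results = []
for lam in lambdas:
    P = fit_normal_equation(X_train, y_train, lam)
    results.append((lam, cost(X_train, y_train, P), cost(X_test, y_test, P)))

best_lam, _, best_J_test = min(results, key=lambda r: r[2])   # lambda with the lowest test set error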

Next we compare our LTOV model’s performance with the average method.

Average LTOV in training set = 30.8229 days

Using the average LTOV value from the training set, we compute the test set error, J_test_with_avg:

J_test_with_avg = 1 / (2 * m) * Sum(i = 1 to m) [ (30.8229 – y_test(i))^2 ] = 1155.2

So, our model achieves a test set error of 882.07, while the average method’s test set error is 1155.2.

This shows that our model predicts LTOV values with 23.64% (((1155.2 – 882.07) / 1155.2) * 100%) less error than the average method.
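For completeness, a sketch of the baseline comparison, continuing the code above:

avg_ltov = y_train.mean()                                                # 30.8229 days in our data
J_test_with_avg = ((avg_ltov - y_test) ** 2).sum() / (2 * len(y_test))
improvement = (J_test_with_avg - best_J_test) / J_test_with_avg * 100    # ~23.64% in our data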

Limitations:

1) We tested 25 different combinations of manually chosen lead characteristics and chose the set of characteristics that yielded the lowest test set error.  More automated feature selection methods could search the space of lead characteristic combinations more exhaustively, and it is possible that some other set of lead characteristics would yield an even lower test set error.

2) Our data set was relatively small and specific to a single customer’s data.  The set of lead characteristics we chose for this customer might not generalize well to other customers.

3) We did not use a cross-validation set when we tried various values of lambda.  Before deploying this model in production, we should divide the dataset into a training set (50%), a cross-validation set (25%), and a test set (25%), choose the lambda value that yields the minimum cross-validation set error, and use the parameter vector calculated with that lambda value to compute the model’s final test set error (sketched below).
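For example, a sketch of that procedure, continuing the earlier code (with a 50/25/25 random split):

idx = rng.permutation(n)
n_train, n_cv = n // 2, n // 4
train_idx = idx[:n_train]
cv_idx = idx[n_train : n_train + n_cv]
test_idx = idx[n_train + n_cv :]

# Choose lambda by cross-validation error, then report the test set error once with that lambda.
cv_results = []
for lam in lambdas:
    P = fit_normal_equation(X_all[train_idx], ltov_days[train_idx], lam)
    cv_results.append((lam, cost(X_all[cv_idx], ltov_days[cv_idx], P)))
best_lam = min(cv_results, key=lambda r: r[1])[0]
P = fit_normal_equation(X_all[train_idx], ltov_days[train_idx], best_lam)
final_J_test = cost(X_all[test_idx], ltov_days[test_idx], P)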

Conclusion:

We were able to verify our intuition that machine learning techniques (specifically, regularized linear regression) can do better than the average method for predicting LTOV values.  A reduction in test set error of 23.64% is a significant improvement and encourages us to apply a machine learning based approach to predictions of the other variables (OTDV, LTO%, OTD%, and DA) that comprise the revenue prediction formula.

Next Steps:

We can apply the above regression approach to predict OTDV and DA values.  We can apply classification approaches like logistic regression, neural networks, and support vector machines to identify leads with a high probability of conversion to opportunities and, consequently, predict LTO% values.

The same classification approaches can help us identify opportunities with a high probability of conversion to deals and, thus, predict OTD% values.   Then, we can plug the values of these variables (LTOV, OTDV, LTO%, OTD%, DA) into the revenue prediction formula above and compute the predicted revenue from a particular set of leads.
