IMB 621

Kiran R, Doctoral Student, Indian Institute of Management Lucknow, Arunabha Mukhopadhyay, Associate Professor, Indian Institute of

Management Lucknow, and U. Dinesh Kumar, Professor of DSIS, Indian Institute of Management Bangalore prepared this case for class

discussion. This case is not intended to serve as an endorsement, source of primary data, or to show effective or inefficient handling of decision

or business processes.

Copyright © 2017 by the Indian Institute of Management Bangalore. No part of the publication may be reproduced or transmitted in any form or

by any means – electronic, mechanical, photocopying, recording, or otherwise (including internet) – without the permission of Indian Institute of

Management Bangalore.

MACHINE LEARNING ALGORITHMS TO DRIVE CRM

IN THE ONLINE E-COMMERCE SITE AT VMWARE

KIRAN R, ARUNABHA MUKHOPADHYAY AND U DINESH KUMAR

For the exclusive use of M. Abouzahra, 2019.

This document is authorized for use only by Mohamed Abouzahra in 2019.

On February 25, 2016, in the VMWare (VMW) office in Silicon Valley, next to Stanford University, in

the sprawling 100+ acre green campus in Palo Alto, California, winter had just ended and it was warm

weather, as great as it possibly could be in February. In his office cabin in building Hilltop E, Michael

Butler, the global head of the store business of VMW, was in discussion with Parag Girish Chitalia, the

global leader for advanced analytics and data sciences. Michael and Parag were discussing how to drive

more revenue from the Workstation business in the VMW store. The VMW store was the online portal of

VMW (store.vmware.com), where end-customers could purchase certain products of VMW such as

Fusion and Workstation online. The store was similar to any e-commerce site, with a home page, category

pages, product detail pages, add-to-cart pages, a checkout page, and an order confirmation page.

Fusion helped end-customers and businesses run Windows on top of Mac machines, whereas Workstation

helped customers run Mac on top of Windows machines. Since many customers would like to have both

Windows and Mac operating systems on their computers, VMW store received many visitors to its

website. Data on customers’ usage of the VMW store was collected to understand consumer behavior. With

rich behavioral data of the VMW website, Michael Butler was keen to see how the data sciences and

analytics team could be leveraged to drive further Workstation sales as it was a key product in the

competitive business environment.

ABOUT VMWare

VMware (VMW) has been a Palo Alto-headquartered software company that reported revenues of USD 6.57 billion in

2015, up 9% from 2014. VMW has been one of the most profitable software companies in history with

GAAP net income of approximately USD 1 billion in 2015. Cash flows were healthy as well, with free

cash flow of USD 1.56 billion generated in 2015 (Exhibit 1). Founded in 1998 by Stanford Professors Diane

Greene and Mendel Rosenblum, the company was headed by Pat Gelsinger in 2016 and had more than

18,000 employees worldwide. VMW has been the industry leader in virtualization business with more

than 80% market share. Virtualization is about using software to virtualize hardware – for example, the

same central processing unit (CPU) can be shared by multiple users using the VMW software.

Virtualization brings about great savings in costs to IT departments of companies and VMW has been the

industry leader by a distance in this space with market share several times that of its nearest competitors.

VMW garnered its revenues from three streams namely software defined data center (vSphere – for

computing virtualization, NSX – for software defined networking & security, VSAN – for storage

virtualization), end-user computing (Airwatch – for mobile computing, Horizon – enterprise desktop,

Fusion, Workstation), and cloud (Private cloud vCloud Air).

Michael Butler was in charge of the store (Exhibit 13) business powered by Fusion and Workstation

products. Parag had joined VMW in 2014 to set up the advanced analytics and data sciences team called

Analytics Community of Excellence under the Information Innovation Center/Enterprise Information

Management organization. The team comprised data scientists and analysts hired from premier institutes

in India such as the Indian Institutes of Technology (IITs) and Indian Institutes of Management (IIMs) and

from around the world such as Georgia Tech and Stanford. Ravi Kondapalli was the lead data scientist in

the data sciences innovations team powering Parag’s team. Ravi, an NIT Warangal graduate with master’s

degrees from both Georgia Tech and IIM Bangalore, had more than 15 years of experience in the industry.

Driving Higher Workstation Revenues from the Store

The primary objective of the meeting between Parag and Michael was to discuss how Parag’s newly

formed data sciences group could assist in increasing store revenues with focus on key products starting

with Workstation. Michael started the meeting by saying:

Workstation forms the bulk of the purchases for our online store/e-commerce business for

which we have both individual consumers and businesses as our customers. Growing

revenues this year will be a challenge as there is no new version of Workstation planned.

In a software business, renewals via upgrade to a latest software version form a major

portion of the revenue and this year will be a challenge. I would like to understand how

we can leverage data sciences and advanced analytics to target new Workstation

customers, up-sell to existing customers, and cross-sell to customers that do not have

Workstation.

Parag shared some macro-level data on Workstation sales that Ravi, his lead data scientist in the data

sciences innovations team, had compiled. Workstation revenues had doubled in the last 8 years

(Exhibit 2) and formed a significant portion of the Overall Store Bookings (Exhibit 3). Different

versions of Workstation had been launched over the years. Workstation 6 was launched in 2007 and the

latest versions of the Workstation product were Workstation 12 and Workstation 12 Player. A significant

portion of VMW Workstation customers upgraded to higher versions of Workstation. In Exhibit 4, each

cell x_ij in the table denotes the number of customers that upgraded from the Workstation version in row i

to the Workstation version in column j. There was an opportunity in that a large part of

the customer base had not yet upgraded to the latest versions of Workstation. The store was visited by

approximately 7 million visitors annually of whom approximately 2 million viewed some page related to

Workstation products. However, only around 1.6 million visitors out of the 7 million were identifiable

with an e-mail id (Exhibit 5). The visitor data contained rich clickstream/digital data that was housed in a

Hadoop big data environment that the analytics team leveraged continually for their analysis. Apart from

this visitor behavior, all previous purchases (if any) by the e-mail ids were stored in the Greenplum data

warehouse. Greenplum, owned by Pivotal, is a massively parallel processing (MPP) database positioned as an

alternative to Teradata, Oracle, and other data warehouses. Online–offline integration for the de-anonymized

visitors was possible with “e-mail id” as the common inter-linking key.
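The online–offline integration described above can be pictured with a minimal Python sketch. It is not the team's actual pipeline: the record layouts, field names (`email`, `page_views`, `amount`), and sample rows are all hypothetical, and the real data lived in Hadoop and Greenplum rather than in memory. The sketch only shows the idea of aggregating both sources under a normalized e-mail id and skipping anonymous visitors.

```python
from collections import defaultdict


def integrate_by_email(clickstream_rows, purchase_rows):
    """Aggregate online (clickstream) and offline (purchase) records per e-mail id."""
    profiles = defaultdict(
        lambda: {"page_views": 0, "num_orders": 0, "total_bookings": 0.0}
    )
    for row in clickstream_rows:
        email = row.get("email")
        if not email:
            continue  # anonymous visitor: no key to link on
        profiles[email.strip().lower()]["page_views"] += row["page_views"]
    for row in purchase_rows:
        key = row["email"].strip().lower()
        profiles[key]["num_orders"] += 1
        profiles[key]["total_bookings"] += row["amount"]
    return dict(profiles)


# Hypothetical sample records, for illustration only.
clicks = [
    {"email": "a@example.com", "page_views": 5},
    {"email": None, "page_views": 3},
    {"email": "A@example.com ", "page_views": 2},
    {"email": "b@example.com", "page_views": 1},
]
orders = [
    {"email": "a@example.com", "amount": 249.0},
    {"email": "c@example.com", "amount": 149.0},
]
profiles = integrate_by_email(clicks, orders)
```

Each resulting profile is one candidate row for modeling; visitors without an e-mail id drop out, which is why only the de-anonymized subset could be scored in the first cut.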

Parag made the following key points to lay the groundwork for a discussion on analytics engagement.

• Workstation was going to be an important driver of the overall store revenues.

• There was untapped opportunity in the form of the old Workstation customers that had not yet

upgraded to the latest version of Workstation, presenting an opportunity for up-sell.

• There were a large number of visitors to the online store, including those that had bought other

store products, presenting an opportunity for cross-sell.

• The data sciences and analytics team had access to rich sets of information about the customers

and also the potential customers, including their digital footprint (online) and their purchase

history (offline).

Being a sales leader, Michael liked the key points. He got straight to the point:

We definitely have a great Customer Relationship Management (CRM) opportunity here

in the form of up-sell, cross-sell and targeting. These present multiple challenges that I

and my management team will go into in detail. For example: While I can drive

incremental sales with coupons, I would want to give the coupons only to those customers

that are most likely to buy and not indiscriminately to all.

Can your team provide me with the list of the email ids most likely to purchase our latest

products Workstation 12 or Workstation 12 Player in the next 3 months so that my team

can target these email ids?

Parag immediately proposed a propensity model as a quick win. A propensity model rank-orders e-mail

ids or customers in decreasing order of their likelihood to purchase. His advanced analytics team, powered

by the data sciences innovations team, had delivered great results in the past using these models.

This propensity model could leverage the online and offline data for the e-mail ids and rank-order them

using machine learning techniques.
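Rank-ordering itself is simple once each e-mail id has a score. The sketch below assumes the scores already exist (they would come from whichever classifier is eventually chosen); the function name, the e-mail ids, and the score values are hypothetical.

```python
def rank_by_propensity(scores, top_n):
    """Return the top_n e-mail ids in decreasing order of propensity score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [email for email, _ in ranked[:top_n]]


# Hypothetical propensity scores (estimated probability of purchase).
scores = {
    "a@example.com": 0.82,
    "b@example.com": 0.10,
    "c@example.com": 0.55,
    "d@example.com": 0.31,
}
target_list = rank_by_propensity(scores, top_n=2)
```

How large `top_n` should be is a business decision rather than a modeling one, since every targeted e-mail id carries a coupon cost.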

Michael said:

That’s awesome Parag! I only believe things that cause an increase in my sales! If you

can create such a list, I will be happy to execute via one of the digital marketing channels

(email with coupon, re-targeting on other websites, social targeting) or by targeting on

our website.

I will believe your list only when the cash machine rings up Workstation sales and when I

can measure the upside scientifically.

Michael was a technology geek and would only believe things once they were scientifically proven. Parag

said he would get back to Michael with a propensity scored list within a couple of weeks. The presence of

Ravi in the team gave Parag the confidence to suggest two weeks.

PROPENSITY MODEL DEVELOPMENT

Through e-mail, Parag briefed Ravi, the lead data scientist, to be ready with what it would take to build a

propensity model and also to brainstorm on the overall data sciences plan that was to be

presented to Michael. They had a detailed telephone conversation the next day.

Ravi set the baseline for the discussion:

This is an example of a binary classification problem, where the visitor either buys or

does not buy Workstation. The target variable will be whether a visitor to the site buys

Workstation in the next few months. The value of that target can be either 0 or 1, making

it a classical binary classification problem.

Ravi went through a deck which highlighted the following challenges.

• On which entity should we build the propensity model? As Exhibit 5 shows,

only about 1.6 million out of about 7 million visitors had an e-mail id.

• What sampling strategy should we use: random sampling, time-based

sampling, or stratified random sampling?

• What data sciences and machine learning techniques should we try out in this instance?

• What cross-validation or training-validation technique should we use in order to have an estimate

of how the model would perform in the real world?

Ravi’s recommendations to Parag were as follows:

• Since we need a quick win, we should model at the e-mail id level for the first cut. Longer term, we have to

think of analytical approaches to target those without an e-mail id.

• There is no one right way to perform cross-validation. In this instance, we should do time-based

cross-validation. In this method, we simulate the real world by aggregating data up to a period and

then predicting for the next period.

o For example: Say we need to predict who will buy during April–June 2016. In this instance:

▪ For training, we could aggregate data up to September 2015 and predict the Workstation

buyers during October–December 2015.

▪ For validation, we could aggregate data up to December 2015 and compare the

predictions against actual Workstation buyers during January–March 2016.

▪ For scoring, we could aggregate the data up to March 2016.

• We could try any 2-class classifier such as Naïve Bayes, Logistic Regression, Decision Tree, or

machine learning algorithms such as Random Forest, Gradient Boosting, etc. We could compare

the lift curves of different models to see which one would work best.

• We could use the lift numbers on the validation set to obtain an estimate of real-world performance.

Ravi further explained the time-based cross-validation using the following conceptual diagram.

Ravi said he could build the model in a couple of weeks.
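The time-based scheme in Ravi's example can be sketched in Python as follows. The window dates come from the example above; everything else (the event tuple layout, the event kinds `visit` and `workstation_purchase`, the single `visits` feature, and the sample events) is an assumption made for illustration, not the team's actual feature set.

```python
from datetime import date

# Windows from the example: features are aggregated up to a cut-off date,
# and the label is a Workstation purchase in the following quarter.
WINDOWS = {
    "train": {"features_up_to": date(2015, 9, 30),
              "target": (date(2015, 10, 1), date(2015, 12, 31))},
    "validate": {"features_up_to": date(2015, 12, 31),
                 "target": (date(2016, 1, 1), date(2016, 3, 31))},
    "score": {"features_up_to": date(2016, 3, 31), "target": None},
}


def build_rows(events, features_up_to, target=None):
    """Build one row per e-mail id seen on or before the cut-off date."""
    rows = {}
    for email, day, kind in events:
        if day <= features_up_to:
            row = rows.setdefault(
                email, {"visits": 0, "label": None if target is None else 0}
            )
            if kind == "visit":
                row["visits"] += 1
    if target is not None:
        start, end = target
        for email, day, kind in events:
            in_window = start <= day <= end
            if kind == "workstation_purchase" and in_window and email in rows:
                rows[email]["label"] = 1
    return rows


# Hypothetical events: (e-mail id, date, event kind).
events = [
    ("a@example.com", date(2015, 8, 1), "visit"),
    ("a@example.com", date(2015, 11, 15), "workstation_purchase"),
    ("b@example.com", date(2015, 9, 10), "visit"),
    ("b@example.com", date(2015, 12, 20), "visit"),
    ("b@example.com", date(2016, 2, 5), "workstation_purchase"),
    ("c@example.com", date(2016, 1, 10), "visit"),
]
train = build_rows(events, **WINDOWS["train"])
validate = build_rows(events, **WINDOWS["validate"])
score = build_rows(events, **WINDOWS["score"])
```

The point of the design is that no information from the target quarter leaks into the features, so the validation result mimics what would happen when the scored list is used in April–June 2016.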

DATA DESCRIPTION

In order to build a detailed propensity model, Ravi collected data from 2008 to 2016. A stratified sample

of 100,000 de-anonymized customers was used (provided in a separate spreadsheet). He aggregated data

at an e-mail id level to come up with a set of features across online and offline (Exhibit 6), which could

be used for model building. Sample training data is shown in Exhibit 7 with the variable names in

Exhibit 8.

DATA ANALYSIS

To understand which features were important, Ravi’s team examined odds ratios of the target variable

against each of the features. Odds ratio is explained in Exhibit 9. The key findings are shown in

Exhibit 10. Odds ratio greater than 1 indicates that the feature is favorable towards purchase and odds

ratio less than 1 indicates the opposite. A higher odds ratio would indicate a higher degree of favorability.
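As a worked illustration of the odds ratio defined in Exhibit 9, the following Python sketch counts the four cells of the 2x2 table for a binary feature and applies the formula (d/c)/(b/a) = ad/bc. The counts are hypothetical and are not taken from the case data.

```python
def contingency(pairs):
    """Count the 2x2 cells from (feature, target) pairs, both coded 0/1.

    a: feature=0, target=0    b: feature=0, target=1
    c: feature=1, target=0    d: feature=1, target=1
    """
    a = b = c = d = 0
    for feature, target in pairs:
        if feature == 0 and target == 0:
            a += 1
        elif feature == 0 and target == 1:
            b += 1
        elif feature == 1 and target == 0:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Odds ratio = (d/c) / (b/a) = ad / bc, as defined in Exhibit 9."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts: 1,000 customers without the feature, 500 with it.
pairs = [(0, 0)] * 900 + [(0, 1)] * 100 + [(1, 0)] * 300 + [(1, 1)] * 200
a, b, c, d = contingency(pairs)
ratio = odds_ratio(a, b, c, d)  # (900 * 200) / (100 * 300)
```

Here the odds of purchase are 200/300 with the feature and 100/900 without it, so the feature would be read as strongly favorable towards purchase.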

OBJECTIVES

The final objective was to leverage data sciences and analytics for targeting, up-sell and cross-sell to

customers in the online store, thereby increasing customer value. The immediate need was a propensity-to-buy

model that could produce the set of top customers that Michael and his team should target.

At this point, Ravi had the following questions in mind.

• What feature selection techniques could he use?

• If he were to use the standard techniques (logistic regression or decision tree) and any one

advanced technique (random forest, neural network, support vector machine, gradient

boosting, ...), how would the lift curves appear?

• Based on the lift curve, how should he communicate the potential opportunity from the model to

Michael?

• Could other approaches yield incremental lift, for example, clustering

before classification?
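A cumulative lift table of the kind Ravi had in mind can be computed as in the sketch below. It assumes a validation set of (score, label) pairs is already available; the function name, the number of bins, and the ten sample records are hypothetical. Lift at a given depth is the purchase rate among the top-scored customers divided by the overall purchase rate.

```python
def lift_table(scored, n_bins=10):
    """Cumulative lift and capture rate by score-ranked bin.

    scored: list of (score, label) pairs with label coded 0/1.
    Assumes at least n_bins records and at least one buyer.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_buyers = sum(label for _, label in ranked)
    base_rate = total_buyers / total
    table = []
    for k in range(1, n_bins + 1):
        cutoff = total * k // n_bins  # number of customers targeted so far
        buyers = sum(label for _, label in ranked[:cutoff])
        table.append({
            "bin": k,
            "targeted": cutoff,
            "buyers": buyers,
            "cum_lift": (buyers / cutoff) / base_rate,
            "capture": buyers / total_buyers,
        })
    return table


# Hypothetical validation scores and actual outcomes.
scored = [
    (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
    (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.0, 0),
]
table = lift_table(scored, n_bins=5)
```

In this toy example, targeting the top 20% of e-mail ids captures two of the three buyers, a lift of about 3.3 over untargeted selection; that "capture at depth" reading is usually the easiest way to explain a lift curve to a sales leader.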

Having built several propensity models at VMW, Ravi knew that sales teams liked Whitebox models.

Whitebox models are models whose workings can be explained to the sales teams. For example:

Customer X is more likely to upgrade if the support for the older version is coming to an end OR if a

compelling newer version is being launched. Sales leaders are not comfortable with just getting a list that

works. They also want to know why the list worked. The question on Ravi’s mind was also how best to

explain the characteristics of a Workstation buyer to the business.

At the same point, Parag had a further list of questions to discuss with Ravi once the model was fully

built.

• How should Parag and Ravi arrive at the number of e-mail ids that Michael should target?

o Remember that the e-mails were to be sent with a coupon. Sending too many could hurt the

margins.

o Should this list be different for different marketing channels?

• How do we interpret the results for business decision making?

• While lift is an analytics or internal validation measure, what marketing intervention should Parag

suggest to Michael so that there can be a scientific measurement of the return on investment to

the store business from the exercise?

o Can we conduct some form of Control–Test experiment to quantify the upside? If yes, how

should the experiment be set up?
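One possible way to set up such a Control–Test experiment is sketched below in Python. It is an illustration under stated assumptions, not the approach the team took: the top-scored e-mail ids are split at random into a test group that receives the coupon and a held-out control group that does not, and the difference in purchase rates is checked with a one-sided two-proportion z-test. The 20% control fraction, the group sizes, and the buyer counts are all hypothetical.

```python
import math
import random


def split_test_control(emails, control_fraction=0.2, seed=42):
    """Randomly hold out a control group; the rest receive the campaign."""
    shuffled = list(emails)
    random.Random(seed).shuffle(shuffled)
    n_control = int(len(shuffled) * control_fraction)
    return shuffled[n_control:], shuffled[:n_control]  # (test, control)


def two_proportion_z(buyers_test, n_test, buyers_control, n_control):
    """Uplift in purchase rate with a one-sided z-test (test > control)."""
    p_test = buyers_test / n_test
    p_control = buyers_control / n_control
    pooled = (buyers_test + buyers_control) / (n_test + n_control)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_control))
    z = (p_test - p_control) / std_err
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return p_test - p_control, z, p_value


# Hypothetical campaign: 1,000 top-scored e-mail ids split 80/20.
emails = [f"user{i}@example.com" for i in range(1000)]
test_group, control_group = split_test_control(emails)

# Hypothetical outcomes after the campaign window.
uplift, z, p_value = two_proportion_z(
    buyers_test=240, n_test=8000, buyers_control=40, n_control=2000
)
```

Because both groups are drawn from the same scored list, the difference in purchase rates isolates the effect of the coupon; multiplying the uplift by the test-group size and the average order value, net of coupon cost, gives a return-on-investment figure that can be reported to the business.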

Parag was also thinking about how he should set up an executive deck to summarize the results and

measurement plan to Michael. At the same time, he was wondering about the overall value proposition

that he could drive for the VMW store using analytics and data sciences.

Exhibit 1

VMW Financials

Source: VMW Publicly available annual report: http://d1lge852tjjqow.cloudfront.net/CIK-0001124610/67b316e9-d82e-4848-ade6-

e046775865be.pdf

VMW Q4 2015 Earnings Call: http://s2.q4cdn.com/112802898/files/doc_financials/2015/q4/Q4-15_earnings_w_tables_final.pdf

Exhibit 2

Workstation Revenues over the Years [1]

Source: Bookings Data (masked)

[1] All numbers are directional and for illustration purposes only. The data shared is masked and only illustrative of real data. This has been done

to maintain confidentiality.

Exhibit 3

Workstation as a Proportion of Store Bookings* (masked)

Source: Bookings Data (masked)

Exhibit 4

Cross-Sell Behavior of Workstation (masked)

Source: Bookings Data (masked)

Rows: version upgraded from. Columns: version upgraded to (WS = Workstation). A dash marks an empty cell.

From \ To                 WS 6    WS 7    WS 8    WS 9   WS 10   WS 11   WS 12   WS 12 Player
Workstation 6            97593   10842    6604    5213    4420    2602    2179            109
Workstation 7                -   97431   24005   19858   15939    9376    8050            293
Workstation 8                -       -   67588   24326   21903   12319   10408            311
Workstation 9                -       -       -   65935   23648   15326   11683            373
Workstation 10               -       -       -       -   68998   18294   16665            508
Workstation 11               -       -       -       -       -   45851   13535            485
Workstation 12               -       -       -       -       -       -   41650            623
Workstation 12 Player        -       -       -       -       -       -       -           5139

Exhibit 5

De-anonymized/Anonymized Store Visitor Funnel (masked)

Source: Bookings Data (masked)

17 MM (# of Visitors to VMW Store)

5 MM (# of Visitors to Workstation in Store)

~4 MM (store.vmware.com visitors with email id)

~1.7 MM (Unique emails of Personal Desktop Buyers)

~500K (Unique emails of Workstation buyers)

Exhibit 6

List of Feature Buckets

Digital & Non-Digital Feature Engineering (Offline + Online)

Dimension: Inputs (Metrics for the Dimension)

Digital Data
  Store Products: Workstation, Fusion, vSphere, vCenter, vSOM, Horizon, vRealize
  Event Wise: Activation, Download, Registration, Page Views, Cart Add/Remove/View, Checkout, Purchase, Form Success, Form Abandon, Buy Now etc.
  Referrer Type / Marketing Channel: Internal, Paid Search, Email, Social Network, Search Engines etc.; Paid/Organic Vehicle Data
  Search Engine Wise: Google, Bing, Yahoo, MSN, YOL etc.
  OS/Browser Wise: OS like Android, iOS, Linux, Mobile iOS, OS X, Windows OS, Windows Mobile; Browser like Apple, Blackberry, Google, Dolphin, Microsoft, AOL etc.
  De-anonymization Features: DemandBase Data, IDM Data

Non-Digital Data
  Revenue History
  Responses/Campaign Features
  Marketing Channel Share
  Other Products Bought

Source: Bookings Data (masked)

Exhibit 7

Training Data

Data with 100,000 rows can be downloaded from the following link:

http://hrm.iimb.ernet.in/iimb/download/IMB_621.htm

Source: Bookings Data (masked)

Exhibit 8

Sample Feature Names

Variable: Meaning

Train_period_workstation_purchase_flag: Outcome variable (whether the customer purchased Workstation (coded as 1) or not (coded as 0))
fswk_booking_pct: Share of Fusion and Workstation bookings
total_bookings_amount: Total bookings from this customer
personal_desktop_booking_pct: Share of Personal Desktop bookings
tot_windows_visits: Total no. of visits to vmware.com webpage from Windows OS
days_since_first_personal_desktop_purchase_date: Length of relationship with VMW w.r.t. Personal Desktop products
ftr_growth_personal_desktop_13_14: Growth in 'Personal Desktop' product bookings from 2013 to 2014
num_orders: Total no. of lifetime orders this customer placed with VMW
num_order_lines: Total no. of lifetime order lines this customer placed with VMW
ftr_growth_personal_desktop_14_15: Growth in 'Personal Desktop' product bookings from 2014 to 2015
idm_total_no_of_day_visits_to: Total no. of visits to MyVMware Portal (required for customers to interact with VMWare support)
ftr_growth_personal_desktop_12_13: Growth in 'Personal Desktop' product bookings from 2012 to 2013
tot_osx_visits: Total no. of visits to vmware.com webpage from OSX OS
tot_apple_browser_visits: Total no. of visits to vmware.com webpage from Apple Safari Browser
idm_no_of_day_visits_to_home_page: Total no. of visits to MyVMware Portal Home page
tot_microsoft_browser_visits: Total no. of visits to vmware.com webpage from Microsoft Internet Explorer Browser
tot_store_page_views: Total no. of views to VMW Store Page
idm_no_of_day_visits_to_download_page: Total no. of visits to MyVMware Portal Download Page
tot_page_views: Total vmware.com page views
tot_first_touch_direct_views: Total no. of page views by marketing channel
idm_no_of_day_visits_to_info_page: Total no. of visits to MyVMware Portal Info Page
idm_no_of_day_visits_to_license_page: Total no. of visits to MyVMware Portal License Page
tot_first_touch_natural_search_views: Total no. of page views by marketing channel
gu_num_of_employees: Total no. of employees in the customer company as per DNB data
tot_google_browser_visits: Total no. of visits to vmware.com webpage from Google Chrome Browser

idm_no_of_day_visits_to_eval_page: Total no. of visits to MyVMware Portal Eval Page
tot_visits: Total vmware.com page visits
purchase_events: Total vmware.com purchase events
tot_mozilla_browser_visits: Total no. of visits to vmware.com webpage from Mozilla Firefox Browser
tot_last_touch_direct_views: Total no. of page views by marketing channel
tot_first_touch_internal_views: Total no. of page views by marketing channel
tot_page_views_l90d: Total vmware.com page views in last 90 days
ftr_growth_vsom_14_15: Growth in 'vSOM' Bookings from 2014 to 2015
tot_last_touch_natural_search_views: Total no. of page views by marketing channel
num_any_campaign_responses: No. of responses from this customer for all VMW campaigns
tot_last_touch_internal_views: Total no. of page views by marketing channel
tot_visits_l90d: Total vmware.com visits in last 90 days
ftr_growth_enterprise_desktop_13_14: Growth in 'Enterprise Desktop' product bookings from 2013 to 2014

Source: Data Analysis

Exhibit 9

Odds Ratio Explanation

               Target = 0    Target = 1
Feature = 0        a             b
Feature = 1        c             d

Odds for feature = 1 is defined as d/c

Odds for feature = 0 is defined as b/a

Odds ratio = (d/c)/(b/a) = da/bc

Source: Data Analysis

Exhibit 10

Sample Averages of Features versus Target Variable

Source: Data Analysis

Exhibit 11

Purchasers of Workstation as of End of Each Quarter from 2013 (data masked)

Quarter   No. of Workstation Buyers

13Q1 2784

13Q2 2300

13Q3 3020

13Q4 4198

14Q1 2480

14Q2 2530

14Q3 1878

14Q4 3808

15Q1 2988

15Q2 2582

15Q3 3370

15Q4 4164

16Q1 2726

16Q2 2264

16Q3 2340

16Q4 1194

Source: Bookings Data (masked)

Exhibit 12

Sample Purchase Paths on E-commerce

Home Page → Product Detail Page → Cart → Checkout → Purchase

Source: VMWare

Exhibit 13

About the VMW Store

The store sells many products of which Fusion and Workstation are key to helping run Windows

on Mac and Mac on Windows, respectively. It is an e-commerce site in the truest sense and is

frequented for purchases both by consumers and businesses. The link to the store is provided

here: http://store.vmware.com/store/vmware/en_US/home

The store is a collection of pages. A sample purchase path for a user is indicated in Exhibit 12.

This is by no means the only path and there could be several paths but is shown to indicate how

the visitors purchase on the site.

Source: VMWare
