In our previous article, we introduced IBM Watson Studio and discussed helping our CMO and marketing team better utilize limited resources with advanced analytics. In this article, we will reveal our findings.

Project Background

Our CMO has asked us to identify the specific prospect segments that historically convert into paying customers, so that advertising spend can focus on the right prospects. Once we understood the best prospects to target, marketing tasked us with determining which types of social media reach them most effectively on a minimal content marketing budget. They also wanted to understand what makes a social media post go viral.

Finding Top Prospect Segments to Target

To analyze prospect data, we begin by exporting contact data from our Salesforce CRM. We then combine contact data with third-party demographic data sources and prepare it for machine learning using IBM Data Refinery. Data Refinery is a visual data preparation solution that allows anyone to interactively discover, cleanse, and transform data. It includes over 100 built-in operations for simple transformations and preparation processes – no coding needed.
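
The combining step described above happens visually in Data Refinery, with no code. For readers who prefer to see the idea spelled out, here is a minimal plain-Python sketch of the same kind of join. The field names, the sample rows and the choice of `postal_code` as the join key are all assumptions made for illustration; they are not the actual schema used in this project.

```python
# Hypothetical CRM export and third-party demographic rows (illustrative only).
contacts = [
    {"contact_id": 1, "name": "A. Rivera", "postal_code": "33601"},
    {"contact_id": 2, "name": "B. Chen", "postal_code": "98101"},
    {"contact_id": 3, "name": "C. Okafor", "postal_code": "00000"},
]
demographics = [
    {"postal_code": "33601", "median_income": 52000, "avg_commute_miles": 7.5},
    {"postal_code": "98101", "median_income": 81000, "avg_commute_miles": 4.2},
]


def left_join(left, right, key):
    """Attach matching right-hand fields to each left-hand row (left join on key)."""
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        merged = dict(row)
        # Contacts without a demographic match keep only their own fields.
        for field, value in lookup.get(row[key], {}).items():
            if field != key:
                merged[field] = value
        joined.append(merged)
    return joined


enriched = left_join(contacts, demographics, "postal_code")

# Contacts that found no demographic match are worth reviewing before modeling.
unmatched = [row["contact_id"] for row in enriched if "median_income" not in row]
```

A left join keeps every contact, so unmatched prospects show up as missing values rather than silently disappearing from the dataset.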

After loading our data, we quickly review profiles of each attribute and look for data quality issues.

Data Refinery Profile View

To prepare our dataset for machine learning, we add descriptive bins for numeric variables and fix errors using built-in charts, statistics and menu options.

Data Refinery Data Prep View
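
Binning a numeric variable into descriptive labels, as described above, can be sketched in a few lines of Python. The cut points and labels here are assumptions chosen for illustration, not the bins used in this project, and `make_binner` is a hypothetical helper name.

```python
from bisect import bisect_right


def make_binner(edges, labels):
    """Return a function mapping a number to a descriptive bin label.

    edges must be sorted ascending; a value equal to an edge falls into
    the bin that starts at that edge.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")

    def to_bin(value):
        return labels[bisect_right(edges, value)]

    return to_bin


# Illustrative cut points only.
age_bin = make_binner([30, 45, 60], ["Under 30", "30-44", "45-59", "60+"])

ages = [24, 30, 44, 45, 72]
binned = [age_bin(age) for age in ages]
```

Descriptive bins like these tend to make tree and rule output easier for business readers to follow than raw numeric thresholds.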

Since we are using sensitive, personally identifiable data, we also publish our prepared dataset to the IBM Watson Knowledge Catalog with automatic, dynamic masking of personally identifiable information (PII) attributes, and we apply governance policies that limit discovery and usage of the prospect dataset.

IBM Knowledge Catalog
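
The catalog applies masking dynamically, based on policy, when someone accesses the data. The sketch below is not that mechanism. It is a much simpler static stand-in that replaces PII values with a salted hash before a dataset is shared, just to show what "masked" attributes look like. The field list, the salt and the `mask_record` helper are all hypothetical.

```python
import hashlib

# Hypothetical list of attributes treated as PII.
PII_FIELDS = {"name", "email"}


def mask_record(record, salt):
    """Return a copy of record with PII fields replaced by a short salted hash."""
    masked = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            # A truncated digest keeps rows distinguishable without exposing the value.
            masked[field] = digest[:12]
        else:
            masked[field] = value
    return masked


prospect = {"name": "A. Rivera", "email": "a.rivera@example.com", "age": 40}
masked_prospect = mask_record(prospect, salt="demo-salt")
```

Because the same input and salt always produce the same token, masked records can still be joined or counted, while the original names and email addresses stay hidden.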

Now we are going to build and evaluate machine learning models in Watson Studio. We start with a binary classifier since our target, “Buyer”, is a binary yes/no variable.

IBM Watson Studio Machine Learning Model Configuration

We train and validate the model using a 60% train, 20% test and 20% holdout split, and the results are not compelling. We noted an Area Under ROC of 0.688. Since 0.5 corresponds to random guessing and 1.0 to a perfect classifier, this model sits only about 0.19 above chance.
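
Watson Studio handles the split and the scoring for us, but both ideas are easy to sketch in plain Python. The functions below are illustrative and are not Watson Studio's implementation: `split_60_20_20` shuffles rows into the three partitions, and `roc_auc` computes Area Under ROC as the probability that a randomly chosen buyer receives a higher score than a randomly chosen non-buyer.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows and return (train, test, holdout) partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 6 // 10
    test_end = n * 8 // 10
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


def roc_auc(labels, scores):
    """Area Under ROC via pairwise comparison; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


train, test, holdout = split_60_20_20(range(100))

# A model that scores every prospect identically cannot rank buyers: AUC is 0.5.
chance_auc = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

The pairwise version is quadratic in the number of rows, which is fine for a demonstration but too slow for large datasets.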

Next, we try a Logistic Regression model and let Watson automatically prepare the dataset for us. This time we get a slightly better Area Under ROC of 0.706. It is still not a strong model, so we keep experimenting.

This time we spin up IBM® SPSS® Modeler Flow Editor to develop predictive models. Here we have more control over the specific algorithms used, the input parameters and the output. We try a Decision List algorithm from the library of available machine learning options since the output of a Decision List is easy for anyone to understand.

IBM® SPSS® Modeler Flow Editor
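
Part of what makes a decision list easy to read is that it is just an ordered set of if-then rules, where the first rule that matches decides the outcome. The sketch below shows that structure in Python. The rules themselves are invented for illustration and are not the rules the Decision List node produced for this project.

```python
# Each entry: (description, predicate, prediction). Rules are hypothetical.
DECISION_LIST = [
    ("No car and short commute",
     lambda p: p["cars"] == 0 and p["commute_miles"] <= 5, "Yes"),
    ("One car and under 45",
     lambda p: p["cars"] == 1 and p["age"] < 45, "Yes"),
]
DEFAULT_PREDICTION = "No"


def classify(prospect):
    """Return (prediction, rule description) from the first matching rule."""
    for description, predicate, prediction in DECISION_LIST:
        if predicate(prospect):
            return prediction, description
    return DEFAULT_PREDICTION, "Default rule"


likely_buyer = classify({"cars": 0, "commute_miles": 3, "age": 50})
unlikely_buyer = classify({"cars": 2, "commute_miles": 20, "age": 50})
```

Returning the rule description alongside the prediction keeps each result explainable to a non-technical audience.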

Looking through the results, we identify several segments to target that have a 77% purchase probability. This is better than our two previous models. Let’s try one more model type, and get these insights ready to share with marketing.

Marketing Prospect Segments to Target

For our last machine learning model, we select a C5.0 Tree Model. Reviewing the results, we learn this model seems to perform the best. The C5.0 Tree Model is also straightforward to explain to our marketing stakeholders. The first key finding is obvious, but it also confirms the model works: Cars and Age are the most important attributes for finding the prospects most likely to buy a bike. We also see that Commute Distance is relevant.

C5.0 Tree Diagram Ranked Feature Importance

Diving deeper into our predictive tree model, we can find more specific business rules. Here we identify the ideal segments for marketing to target and the segments to ignore. This will help them use their limited advertising budget wisely.

C5.0 Tree Diagram

Peeking through the C5.0 Tree Model Decision Rules listing, we see the category, record counts, percentage and rule confidence level. Some of these business rules exceed a 90% likelihood of purchase. These are fabulous findings. We then highlight the high conversion segments and add those to a presentation for our CMO.

C5.0 Tree Diagram Decision Rules
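
To make the rule listing less abstract, here is a small sketch of how a record count, a percentage and a confidence figure could be computed for a single rule. It assumes confidence is simply the fraction of matching records that are buyers. SPSS Modeler may calculate its reported confidence differently, and the records and the rule below are made up for the example.

```python
def rule_summary(records, predicate, target="buyer"):
    """Summarize one rule: matching count, share of all records, buyer fraction."""
    matched = [r for r in records if predicate(r)]
    if not matched:
        return {"count": 0, "share_of_records": 0.0, "confidence": None}
    buyers = sum(1 for r in matched if r[target] == "Yes")
    return {
        "count": len(matched),
        "share_of_records": len(matched) / len(records),
        "confidence": buyers / len(matched),
    }


# Hypothetical prospects, not the project dataset.
records = [
    {"cars": 0, "age": 34, "buyer": "Yes"},
    {"cars": 0, "age": 41, "buyer": "Yes"},
    {"cars": 0, "age": 58, "buyer": "No"},
    {"cars": 2, "age": 37, "buyer": "No"},
    {"cars": 1, "age": 29, "buyer": "Yes"},
]

no_car_rule = rule_summary(records, lambda r: r["cars"] == 0)
```

Reading count and confidence together matters: a rule with very high confidence that covers only a handful of records is a weaker basis for ad spend than a slightly less confident rule covering thousands.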

Content Marketing: What Drives Viral Posts

Now that we know whom to target, the next question marketing needs answered is how to reach these segments cost-effectively with social media. Last year our marketing team noticed that social media engagement and post shares declined. They were not alone. According to BuzzSumo, a tool used to analyze what content performs best for any topic or competitor, in 2015 approximately 50% of randomly selected posts received 8 shares or fewer. That number dropped to 4 shares or fewer in 2017.

50% of social media posts receive 4 shares or fewer
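
A statement such as "50% of posts received 4 shares or fewer" is a statement about the median, not the average. The sketch below uses made-up share counts to show why the distinction matters when a few posts go viral. The numbers are purely illustrative and are not BuzzSumo data.

```python
from statistics import mean, median


def share_at_or_below(share_counts, threshold):
    """Fraction of posts whose share count is at or below threshold."""
    return sum(1 for s in share_counts if s <= threshold) / len(share_counts)


# Hypothetical share counts for ten posts, including one viral outlier.
share_counts = [0, 1, 2, 3, 4, 4, 9, 15, 40, 1200]

typical_shares = median(share_counts)
average_shares = mean(share_counts)
fraction_low = share_at_or_below(share_counts, 4)
```

In this toy sample the median is 4 shares while the mean is far higher, pulled up by a single viral post, so the median is the better summary of what a typical post achieves.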

To be seen by prospects, marketing needs an amplification plan that includes advertising spend. Knowing what social media content gets shared escalated from a “nice to know” to a “need to know.” To learn more about what drives a viral post on social media platforms, our team downloaded and analyzed a public dataset on News Popularity.

For this analysis, we began with an effortless Watson Studio Automatic Model. After naming the model, we picked Automatic, clicked Next and assigned our News Popularity dataset. Since the target in News Popularity is the numeric count of article shares, we opted to use a Regression model. The input attributes in this dataset include number of pictures, videos, day of the week, article length, industry, sentiment and so forth.

Watson Studio Auto-Model

Auto-Model Regression Results

Linear Regression Ranked Feature Importance

Ultimately, we learned that social media posts with many links, more than 12 images, a video or more than three keywords consistently performed better than other posts. It makes sense that posts containing a high number of images might go viral. Think about an article covering a natural disaster, catastrophe or storm. Those posts usually do have a high number of photos and indeed are shared and viewed by many more people than an ordinary news story.

Posts with many images, keywords or a video were consistently shared more than other social media posts.

Another key finding was that grumbly articles performed better than happy ones! Posts with avg_negative_polarity were shared more often than posts with avg_positive_polarity. Last but not least, weekend posts were less likely to get shared. If you think about all the people that surf the social web at work, that insight also seems intuitive.
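
A quick way to sanity-check a finding like the weekend effect is to compare average shares across groups. The sketch below does that in plain Python. The column names `is_weekend` and `shares` are assumptions about the dataset's layout, and the rows are invented for the example, so the output says nothing about the real results.

```python
from collections import defaultdict
from statistics import mean


def mean_shares_by(rows, key):
    """Average the 'shares' value for each distinct value of key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["shares"])
    return {value: mean(shares) for value, shares in groups.items()}


# Hypothetical posts; 1 marks a weekend post, 0 a weekday post.
posts = [
    {"is_weekend": 0, "shares": 1200},
    {"is_weekend": 0, "shares": 1800},
    {"is_weekend": 1, "shares": 900},
    {"is_weekend": 1, "shares": 1100},
]

averages = mean_shares_by(posts, "is_weekend")
```

A group comparison like this only describes the data. The regression model is what estimates each attribute's effect while accounting for the others.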


Now that marketing is armed with the right prospect segments to target and knows what specifically makes a social media post more likely to be shared, we will annotate our baseline results and share our findings using Watson Studio collaboration capabilities. You can add collaborators at the project level, giving team members across the enterprise governed access to the project data sources, analytical notebooks, predictive models and other assets.

If you’d like to learn more about Watson Studio, please review the following recommended resources.

This post was brought to you by IBM Watson Studio. I received compensation to write this post but all opinions expressed are my own.