IBM Watson Studio has come a long way since I first tested IBM Data Science Experience in November 2016. The new Watson Studio delivers a more collaborative, enterprise-quality data science experience by serving a wider audience and providing choice, depth, and breadth of powerful functionality. If you love programming, you can keep on coding in Watson Studio Notebooks. If you prefer visual authoring, Watson Studio Dashboards, Data Refinery and Visual Modeling should delight you. In this article, we will briefly review several areas of Watson Studio and create two machine learning models to help marketing better utilize limited resources. In a future article, we will reveal our results and look at Deep Learning and other capabilities.
Watson Studio is a single environment made up of a comprehensive suite of tools on IBM Cloud that expedites the end-to-end analytics, data science and AI workflow. From an enterprise data catalog that allows organizations to efficiently store and find curated datasets to numerous analytics tools for preparing data, to creating machine learning, deep learning or optimization models, and authoring dashboards, this solution delivers. Integrated community examples provide rapid quick start solutions and inspire new data science initiatives. Let’s dive in and explore it. Watson Studio enables multidisciplinary teams across the organization to collaborate. In our scenario, the analytics, data science and web development teams will be collaborating with marketing.
After signing into IBM Cloud and selecting Watson Studio from the extensive portfolio of available services, a welcome screen appears. From here, you can create a new project or jump right into a specific Watson Studio area: Refine Data, New Notebook, Deep Learning, New Modeler Flow and New Model. We will create a new Complete project type that contains numerous tools within it for our tests.
After setting up our project, we load data using the Add to Project option. Watson Studio allows import of local files, connections to databases, streaming data and publicly shared files in the linked community assets.
For our tests, we will upload two CSV files to help marketing find the best prospects and channels to focus on for advertising spend. The CMO has asked us to find specific segments where they had the most success converting and how to best reach them with minimal investment through social media. One file contains collected data on News Popularity for improving content marketing engagement rates. The other file has demographic information on current buyers. After our data gets loaded into Watson Studio, the analytical fun begins! Now we can begin visually exploring it in the Dashboard tool, Jupyter notebooks or RStudio.
Visually Exploring Data with Dashboards
Watson Studio includes an interactive visual Dashboard tool for data discovery. To begin creating a Dashboard, you connect to your data asset and select a template. For our tests, we will pick the free-form single page view.
Now that we have a blank canvas, we will begin selecting columns in our data set to visually investigate. Let’s begin with Age. To see distribution and spread of values, we pick a bar chart and show counts by age value.
With the chart type selected, we drag-and-drop fields into a chart builder. To get counts, we need to summarize age values and pick the count function from the list of available options within the chart builder options.
Peeking at the preview, we do see a nice distribution of ages. We can continue to review income, cars and gender by region. Notably, dashboard charts get automatically, contextually filtered. As you click on a bubble, bar, pie chart slice or filter, all data is auto-filtered for you – no dashboard coding or mapping required. Now, we will follow the same exploratory process with the News Popularity data.
Dashboards are ideal for getting a quick overview of your data and collaborating across teams in an enterprise but they don’t provide needed statistical insight for properly preparing machine learning model data. After getting familiar with our two datasets and confirming a few questions with the other project stakeholders, we will evaluate the statistical qualities, possible outliers and other issues in an analytical notebook.
Analyzing Data with Notebooks
Watson Studio comes with pre-installed open source RStudio and Jupyter interactive analytical notebooks. Data science libraries and frameworks such as Spark MLlib, scikit-learn, XGBoost, TensorFlow, Keras, Caffe, and PyTorch are also baked in, along with IBM's proprietary SPSS algorithms and prescriptive analytics decision optimization APIs. Jupyter notebooks support R, Python or Scala kernels for ad-hoc analysis or scheduled runs.
For our tests, we will explore our two datasets using a Jupyter notebook. After selecting our BikeBuyers CSV file from the list of data assets in our project library, we see automatically generated Python code for loading that data into a pandas dataframe. By simply clicking on the Run button, we can now preview our data.
Watson Studio libraries contain powerful statistical operations to gain a deeper understanding of the size, distribution and shape of our dataset. This information provides essential insight into the types of data prep activities that we might want to perform before building machine learning models. Here is an example of applying describe and boxplot functions on our data.
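Although Watson Studio generates the data-loading code for you, the describe and boxplot calls themselves look roughly like this in any Jupyter notebook (the columns below are hypothetical stand-ins for the BikeBuyers data, not its exact schema):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import pandas as pd

# Hypothetical stand-in for the BikeBuyers dataframe loaded above
df_data_1 = pd.DataFrame({
    "Age": [25, 32, 41, 38, 29, 55, 47, 33],
    "Income": [30000, 45000, 62000, 58000, 41000, 250000, 70000, 52000],
    "Gender": ["M", "F", "F", "M", "M", "F", "M", "F"],
})

# Summary statistics: count, mean, std, min, quartiles, max per numeric column
print(df_data_1.describe())

# A boxplot reveals spread and potential outliers (e.g., the 250,000 income)
ax = df_data_1.boxplot(column="Income")
```

The describe output alone answers most of the size, distribution and shape questions mentioned above.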
From a skim of dataset stats, we learn that we have a 1,000-record file with a mixed set of categorical and numerical attributes. Since certain algorithms can only handle categorical or numerical values, we will need to add more columns to this file for encoding categorical values and binning numerical values to expand the range of possible machine learning algorithms that we can test.
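Both transformations are quick in pandas. This sketch uses hypothetical columns rather than the real BikeBuyers schema:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "Income": [30000, 45000, 62000, 95000],
})

# One-hot encode the categorical column for algorithms that need numeric input
encoded = pd.get_dummies(df, columns=["Gender"], prefix="Gender")

# Bin the numeric column for algorithms that prefer categorical input
# (the band boundaries here are illustrative, not from the source data)
encoded["IncomeBand"] = pd.cut(
    df["Income"],
    bins=[0, 40000, 70000, float("inf")],
    labels=["Low", "Mid", "High"],
)
print(encoded)
```

Adding these derived columns, rather than replacing the originals, keeps both representations available so every candidate algorithm has input it can consume.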
Delving a little further into our dataset, boxplots reveal income outliers that might need additional handling in data prep operations. If the outliers can occur in the real world, you probably should not remove them but instead transform them during the data prep step. Keep in mind that outliers will influence machine learning algorithms that rely on numeric input values. Untreated outliers will negatively impact predictive model performance in those cases.
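One widely used transform is capping (winsorizing) values at the standard interquartile-range fences instead of dropping them. A sketch with made-up income figures:

```python
import pandas as pd

income = pd.Series([30000, 45000, 62000, 58000, 41000, 250000, 70000, 52000])

# Standard IQR rule: values beyond 1.5 * IQR from the quartiles are outliers
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap rather than drop, since very high incomes do occur in the real world
capped = income.clip(lower=lower, upper=upper)
```

The 250,000 value is pulled down to the upper fence, so it no longer dominates distance- or weight-based algorithms while the record itself is retained.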
Preparing Data with Data Refinery
Although many data scientists love to code, visual point-and-click data prep is a wonderful alternative for the rest of us. To swiftly prepare your data in an intuitive manner, select Data Refinery from the Add to Project menu. Data Refinery simplifies complex data wrangling steps and provides scripting support for numerous dplyr R library operations.
After launching Data Refinery, select a dataset for cleansing or shaping on the Data tab. For data profiling and visualization, click on the Profile and Visualizations tabs. Here you can see the same types of information that we looked at in the Jupyter notebook – no coding needed.
When you are ready to start prepping your data for machine learning, you visually click a column and an operation to apply to it. Data Refinery includes a wide variety of cleansing, organizing and shaping functions. Here are a few of the most commonly used ones:
- Calculate and Math
- Convert column type
- Filter or Remove
- Sort ascending or descending
- Convert column value to missing
- Remove duplicates, empty rows
- Replace missing values, substring
- Concatenate, Join
- Split column
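For readers who prefer code, several of the operations above have close pandas equivalents (a rough sketch with made-up data; Data Refinery itself scripts against the dplyr R library):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray", "Bob Ray", None],
    "spend": ["120", "85", "85", "40"],
})

df["spend"] = df["spend"].astype(int)                            # Convert column type
df = df.drop_duplicates()                                        # Remove duplicates
df["name"] = df["name"].fillna("unknown")                        # Replace missing values
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)  # Split column
df = df.sort_values("spend", ascending=False)                    # Sort descending
```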
Note that as you prep your data, Data Refinery keeps track of every applied operation as a step. After each step, you can preview the results. If the results look incorrect, you can easily edit or modify the step.
When you’ve completed prepping your dataset in Data Refinery, you can run the data flow on your entire data set or schedule it to run later. Data Refinery flow job status, results and output of your work gets saved to your project for use in other services. After BikeBuyers is ready, you can work on News Popularity.
Building Machine Learning Models
Now that we have two machine-learning-ready datasets, let's try building a model. In Watson Studio, we'll add a Model to our project. This time we will begin with News Popularity since it is a numeric prediction model.
We’ll begin with an effortless Watson Studio Automatic Model. After defining a name and selecting Automatic, click Next and select our News Popularity dataset.
Keep in mind that News Popularity’s target value forecasts article shares. The input attributes include number of pictures, videos, day of the week, article length and so on. Unlike our BikeBuyers dataset, all of News Popularity’s input data is numeric. Seeing as we have all numeric input values and a forecast type problem to solve, we will pick Regression for the model technique to try.
On the Regression screen, a configurable setting for training, test and holdout data is displayed. For our test, we will use the default settings: 60/20/20. That is all we need to do to generate a Regression model. Watson Studio automatically does the rest when you click Next.
When the Regression model finishes, you will see common evaluation performance results returned. If you want to save the model and run it on a new dataset to get predicted article share estimates, click the save button.
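Under the hood, a 60/20/20 split plus a regression fit and evaluation looks roughly like this in scikit-learn. This is an illustration on synthetic data standing in for News Popularity, not Watson Studio's internal code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 numeric inputs (think images, videos, length, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, 1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# 60% train, then split the remaining 40% evenly into test and holdout
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_test, X_hold, y_test, y_hold = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print("holdout MAE:", mean_absolute_error(y_hold, model.predict(X_hold)))
```

The holdout set stays untouched during training and tuning, so its error is the honest estimate of how the model will perform on new articles.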
For control and insight into the entire machine learning modeling process, you will use Watson Studio’s visual model flow tools, also known as the IBM SPSS® Modeler Flow Editor. Realistically, this is where most experienced pros will build predictive models. Currently, Watson Studio supports the industry-standard PMML v2.0-4.2 for exchanging models between different solutions. Another nice capability is a Test API input form that lets you experiment with input values and instantly see the predicted output values. Without further delay, let’s build a model with it.
Using a straightforward drag-and-drop approach, you will load your data, sample it, transform attributes, apply machine learning tasks, and evaluate model performance. For this test, we will use our BikeBuyers dataset that has a mix of numeric and categorical input variables. We give the flow a name, choose Modeler Flow and IBM SPSS runtime. We’ll save the awesome Neural Network Modeler flow type for Part 2.
Here we will see a canvas for designing our machine learning workflow. Starting with Importing tasks on the left side of the screen, we pick Data Asset and assign BikeBuyers to it. Then we use a task to transform data types, apply a convenient Auto Data Prep task and select a machine learning algorithm to apply and evaluate.
As we explore the available tasks, several extremely helpful ones stand out. The first one is Synthetic Minority Over-Sampling (SMOTE). This task is fabulous for fraud and other use cases where your dataset only has a few instances of the key scenario to predict. It resamples your dataset for you. Another related task is a balancing operation – it also resamples your data automatically for optimal machine learning results.
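The core idea behind SMOTE can be sketched in a few lines: synthesize new minority-class points by interpolating between a minority sample and one of its nearest minority neighbors. This is a simplified illustration of the technique, not Watson Studio's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Minimal SMOTE: new points lie on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)       # idx[:, 0] is the point itself
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))    # pick a random minority sample
        j = idx[i, rng.integers(1, k + 1)]   # pick one of its true neighbors
        gap = rng.random()                   # random position along the segment
        new_points.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(new_points)

# 20 minority examples in 2-D; synthesize 80 more to balance a 100/20 dataset
minority = np.random.default_rng(1).normal(loc=5.0, size=(20, 2))
synthetic = smote_sketch(minority, n_new=80)
```

Because the synthetic points interpolate between real minority examples rather than duplicating them, the classifier sees a denser but still plausible minority region.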
Yet another cool capability is the Auto Data Prep task – it eliminates frustration from using incorrect data types with algorithms. If you run data that has not been optimized for an algorithm, you’ll get annoying errors that can be tricky to resolve without domain knowledge of the algorithm data type nuances. The Automatic Data Prep task performs the following prep steps for you.
- Extract the first n records from the input data as a sample and determine whether string categories exceed the maximum specified number.
- Handle missing values (apply the MissingFields transformer): treat missing string values as a separate category and substitute the mean for missing numeric values.
- Perform category encoding (apply the CatEncode estimator).
- For each string field, perform string indexing (apply StringIndexer).
- Group all numeric fields into a separate vector (apply VectorAssembler).
- For each numeric field, check whether it is categorical (apply VectorIndexer); if so, replace its actual values with zero-based category indices, otherwise keep the field as is.
- Group all fields generated by StringIndexer and VectorIndexer into a separate vector field.
- Filter out the temporary fields generated by StringIndexer and VectorIndexer, and normalize the feature column returned by the category encoding operation.
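That sequence closely resembles a standard preprocessing pipeline. Here is a rough scikit-learn equivalent of the same steps (missing-value handling, category encoding, normalization) on hypothetical columns; it illustrates the pattern, not the SPSS implementation:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Gender": ["M", "F", None, "F"],
    "Age": [25, None, 41, 33],
    "Income": [30000, 45000, 62000, 52000],
})

prep = ColumnTransformer([
    # String features: missing values become their own category, then get encoded
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder()),
    ]), ["Gender"]),
    # Numeric features: impute with the mean, then normalize
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), ["Age", "Income"]),
])

features = prep.fit_transform(df)
```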
Now that our data should be good to go, let’s build a predictive model. For our non-technical marketing stakeholders, we select a Decision List algorithm from the library of available machine learning options, since its output is easy for anyone to understand when creating marketing campaign business rules, unlike trying to explain a Neural Network to the CMO.
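scikit-learn has no Decision List estimator, but a shallow decision tree printed as if/then rules yields a similarly explainable artifact. A sketch on synthetic data with made-up feature names:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for BikeBuyers; limiting depth keeps the rules human-readable
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print if/then rules that marketing can read directly
print(export_text(tree, feature_names=["age", "income", "cars", "children"]))
```

The depth cap is the explainability lever: a handful of branches translates directly into campaign targeting rules, where a deep tree or neural network would not.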
Continuing the machine learning model building process, we configure task parameters and finish by adding a model evaluation task. Then we run the entire flow and review the results.
Looking over the Decision List model results, we find that this simple model is not a strong one. However, it is still useful to the business. We can share our list of learned scenario probabilities to help marketing better target prospects and optimize marketing spend.
Since our first Decision List model only has a small lift over random chance, we will want to continually improve upon it by tweaking the input variables and adding more informative attributes to it – engineer more features. We can use Watson Studio tools to evaluate where errors occur to incrementally improve on our initial design. For now, our initial Decision List model will be used as a performance baseline.
To operationalize machine learning models in Watson Studio and make them usable by marketing, we can create a web service, run batch predictions and download a scored file, or enter values into an online form.
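A minimal sketch of the batch-scoring route, using a stand-in model and hypothetical prospect columns rather than the Watson Machine Learning API:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in model; in practice you would load the model saved in Watson Studio
train = pd.DataFrame({"age": [25, 60, 33, 51], "income": [30, 90, 45, 80]})
labels = [0, 1, 0, 1]
model = LogisticRegression().fit(train, labels)

# Batch-score a new prospect file and write out buy probabilities for marketing
prospects = pd.DataFrame({"age": [28, 58], "income": [35, 85]})
prospects["buy_probability"] = model.predict_proba(prospects)[:, 1]
prospects.to_csv("scored_prospects.csv", index=False)
```

The scored file can then be handed to marketing as-is, or the same predict call can sit behind a web service for on-demand scoring.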
This completes our first foray into the latest and greatest version of Watson Studio. Admittedly, there is much more to it than you saw here. Interested in our results? Tune in to Part 2 to see what we found, how we presented the results, and put Watson Studio machine learning to work for the CMO and marketing team to help optimize limited funding and resources.
For More Information
In this Solution Review, we introduced Watson Studio Dashboards, Notebooks, Data Refinery and Machine Learning tools. In our next article in this series, we will explore other areas. In the meantime, if you’d like to learn more about Watson Studio, please review the following recommended resources.
- IBM Watson Studio: ibm.biz/watsonstudio
- IBM Watson Studio online docs
- IBM Watson Studio solution brief
- Deep Learning webinar
This post was brought to you by IBM Watson Studio. I received compensation to write this post but all opinions expressed are my own.