DataRobot is the world’s most advanced automated machine learning platform. It empowers data analysts and data scientists to rapidly find key insights and hidden data patterns, and to make better predictions faster. With unmatched ease of use – no complicated math or scripting required – DataRobot automates the training and evaluation of numerous predictive models in parallel, delivering more accurate predictions and easier model deployment at scale. In this article, I will briefly introduce and walk through a tour of the DataRobot automated machine learning solution.
I have been following DataRobot for several years now. To be completely transparent with you, DataRobot is one of the most impressive solutions that I have reviewed in a long time. The DataRobot automated machine learning platform expedites predictive model building, training, evaluation, and deployment. Using drag-and-drop and point-and-click guided menu options, users with all degrees of data science experience can build predictive models simply and quickly with automated machine learning.
Machine learning life-cycle steps that used to take me weeks or months of effort can now be completed in hours.
What makes DataRobot truly unique is the baked-in model blueprints and best practices that were designed by some of the world’s leading data scientists. DataRobot’s built-in optimizations and safeguards allow analytical talent of all skill levels – from business analysts to highly experienced data scientists – to safely apply machine learning models to properly prepared data.
Domain knowledge and best practices designed by the world’s leading data scientists have been uniquely baked in.
DataRobot’s extensive libraries of algorithms are also quite impressive. It supports popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, and TensorFlow. DataRobot streamlines model development by performing a parallel heuristic search for the best model or ensemble of models based on the characteristics of the data and the prediction target. By cost-effectively evaluating a near-infinite number of combinations of data transformations, features, algorithms, and tuning parameters in parallel across a large cluster of servers, DataRobot delivers the best predictive model in the shortest amount of time.
Take a Tour
To get started with DataRobot, you will log in and load a prepared dataset. To learn how to properly prepare data for DataRobot, please refer to this article, webinar, and complimentary white paper on that topic. Although DataRobot has some data cleansing, preparation, and transformation capabilities, a dedicated data wrangling tool is usually recommended for advanced data preparation.
Loading and Profiling Data
DataRobot currently supports uploading csv, tsv, dsv, xls, xlsx, sas7bdat, bz2, gz, zip, tar, and tgz file types, and reading data from a variety of enterprise databases via JDBC database connectivity. Directly loading data from production databases for model building allows you to quickly train and retrain models. It also eliminates the need to export data to a file for ingestion into DataRobot.
DataRobot’s JDBC connector supports virtually any database that provides a JDBC driver – meaning most databases in the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, and Hadoop Hive are the most commonly used.
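For reference, JDBC connection URLs for those commonly used drivers typically follow the patterns below (hosts, ports, and database names are placeholders, not DataRobot-specific values):

```
jdbc:postgresql://dbhost:5432/mydb
jdbc:oracle:thin:@dbhost:1521:orcl
jdbc:mysql://dbhost:3306/mydb
jdbc:redshift://mycluster.abc123.us-west-2.redshift.amazonaws.com:5439/dev
jdbc:sqlserver://dbhost:1433;databaseName=mydb
jdbc:hive2://dbhost:10000/default
```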
After you load your data, DataRobot performs exploratory data analysis, detecting each column’s data type and showing the number of unique and missing values along with the mean, median, standard deviation, minimum, and maximum. This information is helpful for getting a sense of the dataset’s shape and distribution.
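To make the column profile concrete, here is a minimal sketch of the same summary statistics computed by hand for one numeric column (the column name and values are made up for illustration):

```python
import statistics

# Toy numeric column with missing values (None), standing in for one
# column of an uploaded dataset; the values are made up for illustration.
age = [34, 41, None, 29, 41, 52, 37, None]

present = [v for v in age if v is not None]

profile = {
    "missing": age.count(None),
    "unique": len(set(present)),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "std_dev": statistics.stdev(present),
    "min": min(present),
    "max": max(present),
}
print(profile)
```

Summaries like this are exactly what you should sanity-check before trusting any downstream model.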
Selecting a Prediction Target
Next, you will select a prediction target (what you are trying to predict) from the uploaded dataset and click the big “Start” button to begin training models in Autopilot mode. Note: if you have dates in your dataset, you might also see time-aware modeling settings.
If you want to customize the model building process, you can modify a variety of advanced parameters, optimization metrics, feature lists, transformations, partitioning and sampling options with the Show Advanced Options link. For more control over which models DataRobot runs, there are manual and quick-run options.
Once the modeling process begins in DataRobot, the platform further analyzes the data to create an Importance column. This Importance grading provides a quick cue for identifying the variables most influential on your chosen prediction target.
On this screen, visual plots reveal relationships between each feature and the target variable. There are also options to drill down on variables to view distributions, add features, and apply basic transformations.
Reviewing Automated Modeling Results
DataRobot’s autopilot searches through hundreds or thousands of possible combinations of algorithms, pre-processing steps, features, transformations, and tuning parameters. It then uses supervised learning algorithms to analyze the data and identify predictive relationships. Autopilot is ideal for smart data exploration, finding key influencing variables and patterns. After it completes, you will be shown a Leaderboard of top-ranking predictive models you can explore further.
To examine the ranked predictive models, you click on a model name and are shown a variety of options to Understand, Describe, Evaluate, and Predict. Popular exploratory capabilities here include the Feature Impact rankings, Model X-Ray, Prediction Explanations, and Word Cloud. All of these help explain what drives a model’s predictions.
Feature Impact measures how much each feature contributes to the model’s overall accuracy and predictions (for example, values in the Age and Commute Distance columns may have a significant effect on whether an individual will purchase a bike). Feature Impact highlights which columns you should explore further. This information alone can be valuable in guiding an organization to focus on what matters most.
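The general technique behind feature-impact rankings is permutation importance: shuffle one column, and measure how much the model’s accuracy drops. Here is a minimal from-scratch sketch of that idea, using a toy rule-based “model” and made-up data (this is the generic technique, not DataRobot’s exact implementation):

```python
import random

random.seed(0)

# Toy data: each row is (age, noise); the label depends only on age.
# All values and the "model" below are made up for illustration.
rows = [(25, 0.3), (31, 0.9), (38, 0.1), (45, 0.7),
        (52, 0.5), (58, 0.2), (63, 0.8), (70, 0.4)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # e.g. 1 = bought a bike

def predict(row):
    # Stand-in for a trained model: it ignores the noise column entirely.
    return 1 if row[0] < 40 else 0

def accuracy(rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(col, n_shuffles=200):
    """Average accuracy drop after randomly shuffling one column."""
    base = accuracy(rows, labels)
    drops = []
    for _ in range(n_shuffles):
        values = [r[col] for r in rows]
        random.shuffle(values)
        shuffled = [tuple(values[j] if i == col else v for i, v in enumerate(r))
                    for j, r in enumerate(rows)]
        drops.append(base - accuracy(shuffled, labels))
    return sum(drops) / n_shuffles

imp_age = permutation_importance(0)    # shuffling age hurts accuracy
imp_noise = permutation_importance(1)  # shuffling noise changes nothing
print(imp_age, imp_noise)
```

Because the toy model never looks at the noise column, its importance comes out as exactly zero, while the age column shows a clear accuracy drop.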
The Model X-Ray chart displays more details on a per-feature basis—a feature’s effect on the overall prediction—depicting how a model “understands” the relationship between each variable and the target. It surfaces the specific values within each column that are likely large factors in determining whether someone will purchase a bike or not. This information is also useful for understanding where the model makes errors and for tuning inputs.
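Per-feature effect charts of this kind are typically computed as partial dependence: hold one feature at a fixed grid value for every row, average the model’s predictions, and sweep across the grid. A minimal sketch with a toy model and made-up data (illustrating the general technique, not DataRobot’s exact computation):

```python
# Toy rows: (age, commute_miles); all values are made up for illustration.
rows = [(25, 1.2), (38, 0.4), (47, 3.1), (55, 0.9), (62, 2.5)]

def predict(row):
    # Stand-in for a trained model scoring purchase likelihood.
    age, commute_miles = row
    return (1.0 if age < 50 else 0.2) * (0.9 if commute_miles < 2 else 0.5)

def partial_dependence(col, grid):
    """Average prediction with one feature pinned to each grid value."""
    curve = []
    for value in grid:
        preds = [predict(tuple(value if i == col else v for i, v in enumerate(r)))
                 for r in rows]
        curve.append(sum(preds) / len(preds))
    return curve

age_curve = partial_dependence(0, [30, 40, 50, 60])
print(age_curve)  # drops sharply once the pinned age crosses 50
```

Plotting such a curve per feature shows, at a glance, the value ranges where the model’s predictions change most.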
Prediction Explanations reveal the reasons why DataRobot generated individual predictions. They provide a qualitative indicator of variable effect on individual predictions. This particular feature helps business analysts better understand DataRobot’s models and back up decisions with detailed reasoning.
Prediction Explanations identify specific values that drive target outcomes.
Diving deeper, DataRobot’s Insights tab provides more graphical representations of your model. There are tree-based variable rankings, variable effects to illustrate the magnitude and direction of a feature’s effect on a model’s predictions, hotspots, anomaly detection, text mining charts, and a word cloud of keyword relevancy.
The Word Cloud tab provides a graphic of the most relevant words and short phrases in a word-cloud format. The tab is only available for models trained with data that contains unstructured text. Here is an example from a different healthcare related dataset.
In the Describe tab, you can view the end-to-end model blueprint containing details of the specific tasks and algorithms DataRobot uses to run the model. Each blueprint step links out to detailed model documentation that facilitates knowledge sharing and training. You can also review the size of the model and how long it ran.
After you build a set of models, you can then evaluate and select which one is best to use for prediction. You can refer to the model Leaderboard to view a ranked list of models with summary performance information, charts, and graphs. To estimate possible model performance, the Evaluate options include the industry standard Lift Chart, ROC Curve, Accuracy over Time, Confusion Matrix, and Advanced Tuning. There are also options for measuring models by Learning Curves, Speed versus Accuracy, and Comparisons. The interactive model evaluation charts are very detailed, but don’t require a steep learning curve to understand what they convey. Business analysts and citizen data scientists will be able to easily figure out which model should perform best for a given use case.
DataRobot model evaluation and validation helps assess model accuracy. There are several industry standard methods available for validating models including, but not limited to, train-validation-holdout partitioning and k-fold cross-validation.
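The mechanics of k-fold cross-validation can be sketched in a few lines; here a trivial mean-predictor stands in for a real model, and the data is made up for illustration:

```python
# k-fold cross-validation: partition the rows into k folds, fit on k-1
# folds, score on the held-out fold, and average the scores.
data = [(x, 2 * x + 1) for x in range(10)]  # made-up (feature, target) pairs

def k_fold_splits(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

def mean_absolute_error(train_idx, valid_idx):
    # "Model": predict the mean target seen in training (a trivial baseline).
    mean_y = sum(data[j][1] for j in train_idx) / len(train_idx)
    return sum(abs(data[j][1] - mean_y) for j in valid_idx) / len(valid_idx)

scores = [mean_absolute_error(tr, va) for tr, va in k_fold_splits(len(data), 5)]
print(sum(scores) / len(scores))  # cross-validated MAE of the baseline
```

The key property—every row is validated exactly once, and never used to train the fold that scores it—is what makes the averaged score an honest estimate.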
You can immediately put your DataRobot model findings to work with Predict options. Here you can upload a new dataset to DataRobot to be scored and downloaded. You also have an option to download all the DataRobot charts if you want to create a presentation or report of your findings.
Actionable DataRobot output can be used for exploration, making decisions, creating presentations, or integrating predictions.
Every model built in DataRobot is immediately ready for deployment. DataRobot API options allow you to integrate predictions into apps, reports, or business processes. There are also options to export scoring code for applications where API scoring is not an option.
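As a rough illustration of API-based scoring, the snippet below constructs (but does not send) an authenticated JSON prediction request with only the standard library. The host, endpoint path, deployment ID, and payload shape are illustrative assumptions, not DataRobot’s documented API—consult the platform’s API reference for the real contract:

```python
import json
import urllib.request

# Hypothetical values -- the host, path, deployment ID, and payload
# shape are illustrative assumptions, not DataRobot's documented API.
API_TOKEN = "YOUR_API_TOKEN"
DEPLOYMENT_ID = "abc123"
url = f"https://example.datarobot.com/deployments/{DEPLOYMENT_ID}/predictions"

records = [{"Age": 34, "Commute Distance": "0-1 Miles"}]  # rows to score

req = urllib.request.Request(
    url,
    data=json.dumps(records).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send the request and return the
# predictions; here the request is only constructed, not sent.
```

The same pattern—POST a batch of records, get scored rows back—is how predictions typically get embedded into apps, reports, or business processes.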
DataRobot can automatically generate model documentation – a detailed report containing an overview of the model development process, with full insight into the model assumptions, limitations, performance and validation detail. This feature is ideal for organizations in highly regulated industries that have compliance teams that need to review all aspects of a model before it can be put into production. Of course, having this degree of transparency into a model has clear benefits for organizations in any industry.
Custom Models with Jupyter
Although DataRobot builds hundreds of predictive models “out of the box” using a vast set of diverse, best-in-class algorithms, there may be times when you want to test your own custom Python or R models in DataRobot. To use custom models with DataRobot, Jupyter Notebook integration is available.
User-built models get added into the Leaderboard rankings so you can see how they compare to other DataRobot-built models.
For More Information
In this week’s Solution Review, I have barely scratched the surface of DataRobot’s capabilities. There is so much more for you to explore. If you would like to learn more about automated machine learning, please review the following recommended resources or contact a DataRobot expert.
- DataRobot Website
- DataRobot Test Drive
- DataRobot Courses
- DataRobot Webinars
- DataRobot Blog
- DataRobot White Papers
- Moving from BI to Machine Learning
- Data Prep for Machine Learning