This past week has felt like a former Microsoft Product/Program Management reunion! I was honored today with a Qlik Sense briefing from the respected business intelligence mind Donald Farmer. A Qlik Sense review blog is coming soon. A few days earlier I also had the pleasure of meeting up with the always fun and enjoyable Bruno Aziza, who is now the CMO at Alpine Data Labs. Alpine recently released version 4.0 of Chorus, a big data advanced analytics platform. Let’s dig in and take a look at it.
Chorus is an in-cluster/in-database data mining platform that supports the entire CRISP-DM lifecycle and leverages the parallel-processing capabilities of existing enterprise data platforms, including MPP databases and Hadoop. It is designed to operate within Hadoop data environments as well as Greenplum, Oracle, PostgreSQL, and Netezza database engines. Although I have not come across Alpine in my past data mining projects, I did note that they list a few impressive customers such as Morgan Stanley, GE Capital, and Visa. They are also working with predictive analytics industry leaders like Zementis, supporting the latest Predictive Model Markup Language (PMML) standards to operationalize predictive analytic models within business processes.
One of the differentiating characteristics I noted in my review was the ability to analyze data where it resides. I am not 100% certain how that feature works just yet, since I tested the online upload option that currently supports files up to 100MB. The Chorus model development environment is a web-based application that, like many other data mining offerings, includes data profiling, analytic workflows/ETL for data preparation, and scheduled jobs for keeping mining models fresh and up to date. (Tip: I noted that the web application did not play nicely with Internet Explorer, so I ended up using Safari for my testing.) Chorus also has predictive model collaboration features. Collaboration is something I am now seeing in a variety of modern data mining solutions, allowing business subject matter experts to chime in on model results, process changes, and so on.
After setting up a trial account, you are presented with a workspace where you can upload data sets, analyze them, and collaborate on mining models with other team members. The Chorus trial also includes a few functional examples and data sets you can play with to get started, along with the standard video tutorial. (Tip: Mute your computer if tutorial video music annoys you.) If you already know how to develop data mining models, you should be able to get up and running in no time without having to watch the tutorial. The user interface seemed totally straightforward to me, so there was no learning curve per se. If you do need help, Alpine offers data scientist assistance on demand, which is a great perk. They showcase a cute Periodic Table of available features and also have excellent online documentation. The documentation not only describes the functionality but also includes supplemental data mining training material on how the models work, along with principles and guidelines.
For my evaluation, I uploaded the classic Microsoft Bike Buyer data set. The data types in the uploaded CSV file were detected automatically. I began by running the Analyze option, which provides data profiling information such as statistics about data set contents, row counts, and common values. I also explored a few of the data profiling visualizations. As I explored the data, a summary audit log of my activity was tracked.
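To give a feel for the kind of output that profiling step produces, here is a small sketch in Python with pandas. This is not Chorus code, and the rows are invented stand-ins for the Bike Buyer data, but it computes the same flavor of statistics: row counts, per-column summaries, and the most common values.

```python
import pandas as pd

# Hypothetical stand-in rows for the Bike Buyer data set,
# just to illustrate the kind of profile an Analyze step reports.
df = pd.DataFrame({
    "Age": [32, 45, 27, 51, 38, 45],
    "Income": [52000, 81000, 39000, 97000, 61000, 81000],
    "BikeBuyer": ["Yes", "No", "Yes", "No", "Yes", "Yes"],
})

row_count = len(df)                        # row count
stats = df[["Age", "Income"]].describe()  # min/max/mean/std per numeric column
common = df["BikeBuyer"].value_counts()   # most common values

print(row_count)          # 6
print(stats.loc["mean"])
print(common.idxmax())    # Yes
```

A real profiler would add histograms, null counts, and distinct-value counts on top of these basics, but this is the core of what "statistics about data set contents" means.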
Next I went ahead and created a Workflow. Although my Workflow was super simple and not worthy of showcasing, you can get quite sophisticated with the available library of Chorus operators for Data Extraction, Transformation, Sampling, Algorithms, Model Validation, and Tools, and there are even options for running SQL and Pig code. As for Workflow algorithms, there was an array of the most popular ones, such as k-means, neural networks, naive Bayes, principal components, and support vector machines. I stumbled on a lovely little feature that provides additional data profiling, summary statistics, correlations, and frequency analysis in the Workflow by right-clicking on my data set operator. Right-click menus appear to be available for most of the operators and contain nuggets of useful functionality.
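For readers who haven't worked with this style of tool, a Chorus-like workflow chains operators much the way a scikit-learn script chains steps. The sketch below is my own rough analogue, not Alpine's implementation: synthetic data stands in for an uploaded data set, and sampling, scaling, and k-means stand in for the Sampling, Transformation, and Algorithms operators.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))  # stand-in for an uploaded data set

# Sampling operator: take a random 100-row subset without replacement.
sample = data[rng.choice(len(data), size=100, replace=False)]

# Transformation operator: standardize each column.
scaled = StandardScaler().fit_transform(sample)

# Algorithm operator: cluster the prepared data with k-means (k chosen arbitrarily).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

print(model.cluster_centers_.shape)  # (3, 3): three centers in three dimensions
```

In Chorus you assemble the equivalent chain visually by dragging operators onto a canvas rather than writing code, which is a large part of its appeal for analysts.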
Once my Workflow was developed, I ran it and could optionally visualize the results or publish them to the collaborative web workspace. To validate model accuracy, the typical data mining options are available, including ROC curves, matrices, and lift. When a good model is created, you can keep it up to date, or feed new data sets through it for predictions, using the job scheduling features.
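If ROC and lift are new to you, the following sketch shows what those validation numbers actually measure. Again, this is illustrative scikit-learn code on a synthetic classification problem, not anything pulled from Chorus itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data and a held-out test set.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

# Area under the ROC curve: 0.5 is random, 1.0 is perfect ranking.
auc = roc_auc_score(y_te, scores)

# Lift in the top decile: response rate among the 10% highest-scoring
# cases divided by the overall response rate.
top = np.argsort(scores)[::-1][: len(scores) // 10]
lift = y_te[top].mean() / y_te.mean()

print(round(auc, 3), round(lift, 2))
```

A lift above 1.0 means the model concentrates positive cases at the top of its ranking, which is exactly what a Bike Buyer-style targeting model is for.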
All in all, the experience from data set loading and profiling through workflow development, model validation, and publishing felt a bit like RapidMiner, if you have ever used that solution. Since the data mining and predictive analytics market is getting fairly crowded with many similar-looking solutions, it is more important than ever to understand what truly makes a vendor's offering unique. From my conversation, I understood the differentiators to be the lack of data movement (which I did not test) and the Hadoop Pig-specific features, though I may have missed something. The on-call data scientist help was fairly unique, but most big data analytics tools that I review, including Microsoft's new Azure ML, offer free assistance. I am not sure the on-call data scientist selling point is really a differentiator these days.
Alpine Data Labs Chorus 4.0 was easy to use and fun to evaluate. Even though I was not a fan of the tutorial video music (which my business analyst husband liked), the Chorus solution was pleasant and appears solid. If you’d like additional information on this specific solution, please reach out to Bruno and his team at Alpine.