Self-Service ETL, Data Quality, Cleansing, Prep and other Analytic Data Goodies
As promised, I wanted to share some of the options out there that I have been evaluating for self-service ETL, data quality, and cleansing. In this post I will briefly highlight the Microsoft Data Explorer Excel Add-In, Data Wrangler, LavaStorm Analytics, Alteryx, and Microsoft Data Quality Services. There are actually quite a few of these tools out there! A great list of more options is available on KD Nuggets, my favorite data mining site. Data prep is probably the most important aspect of getting a great predictive or analytic data model.
Lately there has been a lot of good Microsoft Excel MVP buzz about the Data Explorer Preview, an add-in for Excel 2010 and 2013. Data Explorer seems to be a Microsoft light version of an Oracle Endeca-like offering. Data Explorer pricing and version support are not yet announced – the preview is free. After the Rob Collie Excel 2013 Power Pivot rant, I wanted to be clear that Data Explorer pricing/version details are totally unknown to me right now. (Digression: also, Microsoft is not evil – wow, Rob’s blog post caused a huge uproar. Microsoft is not a charity – they just happen to do a lot of charitable things and build low-cost, less feature-rich, for-the-masses tools. Microsoft is a great company that makes a positive difference in the world. Yes, I am biased.) Data Explorer is absolutely worth checking out if you want easy, self-service ETL functionality within Excel. It has a simple, non-technical wizard interface for loading, combining, and transforming both structured and unstructured data into Excel, and the learning curve is minimal. The Data Explorer team did a great job of covering common data loading, prep, and tweaking tasks from the most popular data and web data sources. They even have a social media connector and a big data connector. Notably, instead of using SQL to query and tweak data, Data Explorer introduces a new language called “M”. “M” seems pretty powerful; you can read up on it in the M docs. Additional resources to ramp up on Data Explorer include an introductory tutorial and a more advanced one on combining data sources.
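If you think in code, the load-combine-transform steps the Data Explorer wizard records (and that “M” expresses natively) look roughly like this pandas sketch. To be clear, this is my own hypothetical illustration – the table and column names are made up, and Data Explorer generates “M” rather than Python:

```python
import pandas as pd

# Hypothetical source tables standing in for two imported data sources.
sales = pd.DataFrame({
    "region_id": [1, 1, 2],
    "units": [10, 5, 8],
    "unit_price": [2.0, 2.0, 3.0],
})
regions = pd.DataFrame({
    "region_id": [1, 2],
    "region": ["West", "East"],
})

# Combine the two sources, derive a new column, then filter -
# the same shape of work the wizard walks you through click by click.
merged = sales.merge(regions, on="region_id", how="left")
merged["revenue"] = merged["units"] * merged["unit_price"]
west = merged[merged["region"] == "West"]

print(west[["region", "revenue"]])
```

The point is not the code itself but that each wizard step maps to one small, inspectable transformation – which is exactly what makes the tool approachable for non-developers.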
Another tool in this self-service ETL space is LavaStorm Analytics. I really liked what this group offers, but I confess that now we are closer to a power-user or enterprise ETL developer user experience. The LavaStorm user interface is still much simpler than the enterprise ETL tools I have used before. The introductory YouTube video gave me a quick understanding of the tool’s capabilities and let me begin building ETL solutions the same day. I liked the richness of the LavaStorm Analytics Engine functionality, the ETL component reusability, the pre-packaged analytic libraries for incorporating predictive models and statistical data analysis, and the out-of-the-box R task/R integration. You do need to script the R yourself in the R task right now (bummer), but I am hopeful that will improve in the future. The ETL component reusability is a big plus, since analytic routines can be complex and are often shared across numerous analytic projects. I am not doing this great self-service ETL tool justice in a one-paragraph overview. Download it, watch the free online video training, and play with it for yourself. They have a free desktop version, several levels of paid desktop versions, and server versions starting at $4,500. They also have other product offerings around specific analytic use cases such as fraud, spend, and optimization.
Alteryx seems to be the “king” of the self-service ETL tools I reviewed, and it also had the highest pricing at ~$45,000 per seat and $15,000 for the personal edition. Alteryx was a highlighted vendor in the 2013 Gartner BI Magic Quadrant report. Alteryx is super feature-rich and comes bundled with prepackaged industry data sources like MOSAIC profiles, TomTom geospatial data, Dun & Bradstreet, Experian, and other in-tool geographic, demographic, and business data. So the high price is really a bundled price for both the industry analytic data sources and the self-service ETL tool. Alteryx has bulk data loaders for Salesforce, Amazon S3, SharePoint, Teradata, Oracle, MongoDB, Hive/Hadoop, and many other structured and unstructured data sources. They offer a read-write, hosted analytics application wrapper framework on their website. They also have nice R integration that does not require any R coding, plus an R code task option if you want or need R code within your self-service ETL flows. I was not able to test the R functions myself since I had a version conflict with my local open-source R installation. Alteryx’s predictive product manager was kind enough to spend time with me covering the R features in depth, and also shared that they are working on a vNext iteration that will support side-by-side installs of various R flavors. What I felt were Alteryx’s true strengths are its deep location/geospatial analytics and the prepackaged industry data sources for mash-ups. Powerful, common analytic tasks – for example, plotting the 15-minute drive-time area around a specific store location on a map – are very hard to geocode and then visualize with other tools; in Alteryx it is easy. They also just added a Tableau TDE destination that lets Alteryx ETL data be sent easily to Tableau for deep visual, interactive analytics – see the image above for the type of analytic solution this awesome combination empowers.
Alteryx seems to be used by larger enterprise analytic groups – it is not for the occasional one-off ETL project or for pure ETL work. If you just need ETL, there are much cheaper tools like LavaStorm Analytics, SSIS, or Pentaho.
Last but not least, if you just want data cleansing and your company already owns SQL Server 2012, don’t forget about SQL Server Data Quality Services (DQS) and the Data Quality Services add-in for Excel. I am going to copy the description directly from the official overview: “SQL Server Data Quality Services (DQS) is a knowledge-driven data quality product. DQS enables you to build a knowledge base and use it to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and de-duplication of your data. DQS enables you to perform data cleansing by using cloud-based reference data services provided by reference data providers. You can also perform data quality processes by using the DQS Cleansing component in Integration Services and Master Data Services (MDS) data quality functionality, both of which are based on DQS.” Now, this option is not really self-service ETL, but I felt the need to mention it here since many self-service ETL projects really are basic data cleansing projects, where DQS fits very well. There are tons of free videos and resources to learn how to get started with and use DQS.
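If you are curious what the standardization and de-duplication tasks DQS automates look like in plain code, here is a minimal hand-rolled Python sketch. The records and the tiny “knowledge base” mapping are entirely hypothetical – in DQS you build and curate this knowledge interactively rather than hard-coding it:

```python
# Hypothetical customer records with inconsistent state values and a duplicate.
records = [
    {"name": "Acme Corp", "state": "Calif."},
    {"name": "acme corp", "state": "CA"},
    {"name": "Globex", "state": "New York"},
]

# A tiny stand-in for a DQS knowledge base: known variants -> canonical value.
state_kb = {"calif.": "CA", "california": "CA", "ca": "CA",
            "new york": "NY", "ny": "NY"}

def standardize(rec):
    """Return a copy of the record with its state value standardized."""
    rec = dict(rec)
    rec["state"] = state_kb.get(rec["state"].strip().lower(), rec["state"])
    return rec

cleaned = [standardize(r) for r in records]

# De-duplicate on a normalized key (lowercased name + standardized state).
seen, deduped = set(), []
for rec in cleaned:
    key = (rec["name"].lower(), rec["state"])
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

print(deduped)
```

Even this toy version shows why a curated knowledge base matters: the quality of the cleansing depends entirely on the variant-to-canonical mappings, which is exactly the asset DQS helps you build and reuse.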