I recently did a fun TDWI Practical Predictive Analytics presentation on how to get started with data mining and predictive analytics using a wide variety of tools including RapidMiner, R, Predixion, Weka, Statistica, SAS, IBM, Microsoft, Oracle, Spotfire, Tableau, and Alteryx. This blog summarizes some of the presentation’s key points and reviews what I liked about the various products shown.
I can’t believe it, but it has actually been a little over eleven years since I first implemented a truly predictive project. I was working for a retailer in Hawaii while my husband was in the U.S. Navy supporting Operation Enduring Freedom overseas. Retailers historically are quite savvy in statistics and predictive. At the time, I was working alongside a Harvard University intern who was both brilliant and inspiring. He came to me with an algorithm called Item Based Collaborative Filtering (a Market Basket Recommender) and asked that I run it on the Oracle database servers that I was managing as part of my role back then. Intrigued by my conversations with him and by the results of our predictive projects, I dived deeper into SQL Server data mining, statistics with SAS, and other analytic areas. I later enrolled in a two-year Data Mining Post-Graduate certificate program at the University of California, San Diego. From that point on I was hooked. Later on I implemented check fraud detection, healthcare payment estimation, customer segmentation, insurance agent classification, and a few other predictive projects. In general, predictive projects seem few and far between. I literally beg to get these projects since I totally love them. For those of you just starting out, here is how you can get started learning this awesome science, along with some of my lessons learned. Enjoy! …and please ping me if you ever do have one of these projects and want some extra assistance.
What is predictive analytics? It is an area of statistics focused on capturing relationships between explanatory variables and predicted variables from past occurrences and using them for prediction. Often predictive analytics involves Data Mining, automatically discovering interesting patterns in data. Predictive analytics can be applied to unknowns in the Past, Present or Future. The accuracy and usability of your models depend on the level of analysis and the quality of your assumptions. Companies use predictive analytics to improve decision making, and because in a world of exploding data there is waaaaaaaaay too much data and too many variables to analyze manually or with traditional statistical techniques. Traditional analytics and statistical methods tend to break down on complex non-linear, multi-variable combinations. Predictive analytics also provides a strategic competitive advantage. Most Fortune 500 companies are already using predictive analytics today. In the future, we will see much more predictive embedded into business processes and applications, even in small companies.
One way to get started with learning data mining and predictive analytics is to download free open-source software, buy a couple of books, take a course, get a mentor, and start learning hands-on with your own data. You don’t need to have big data or be a Data Scientist, PhD, or Statistician to learn the basics. I do feel that some specialty courses and an understanding of foundational statistics are important. A few of my favorite books are shown below.
Another great place to get started is KDnuggets. KDnuggets is a “goldmine” of resources and a true data mining classic. Don’t let the ugly, old, ’90s website design fool you – this site is FANTASTIC.
Once you are ready to get started, you will follow what the industry calls the CRISP-DM (Cross-Industry Standard Process for Data Mining) process. This involves choosing an appropriate Business Question or Problem to predict, then gathering, understanding, and preparing your data for predictive algorithms. Preparing data is time consuming and a bit of an art and a science. Predictive algorithms use “flattened” input data sets, typically one row per case with one column per attribute; you do not just point them at a data warehouse or data mart to get decent results. After the data is prepared in a data mining friendly format, you load it into the analytics tool of your choice or use “in-database” predictive algorithms. From there you identify the true predictive influencers, evaluate various data mining models, and further transform and iteratively experiment with variables until you have a good predictive model to share, deploy, or integrate into reporting or application logic.
When evaluating predictive models, scoring typically means squared error for estimation models or percent correctly classified for classification models. To choose the best model, keep the business problem in mind and what the errors, false positives and false negatives, really mean. Typically models are judged on estimation or classification Gain/Lift, ROI, or ROC Curves. To deploy predictive models, you might encode the rules into application logic, use industry standard PMML, program predictive queries for smart KPIs or forecast reports, or apply the models within ETL processes.
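To make the scoring measures above a little more concrete, here is a minimal Python sketch of squared error, percent correctly classified, and lift. The function names, the 0.5 cutoff, and the sample data are all made up for illustration; the tools reviewed in this blog compute these measures for you, and they may define lift over deciles rather than a fixed top-N as assumed here.

```python
from typing import Dict, Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts


def percent_correct(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Percent of cases where the predicted class matches the actual class."""
    c = confusion_counts(actual, predicted)
    return 100.0 * (c["tp"] + c["tn"]) / len(actual)


def lift_at_top(actual: Sequence[int], scores: Sequence[float], top_n: int) -> float:
    """Response rate among the top_n highest-scored cases divided by the overall rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


# Hypothetical campaign: 1 = customer responded, score = model's predicted probability.
responded = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.35, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

print(confusion_counts(responded, predicted))
print(percent_correct(responded, predicted))
print(lift_at_top(responded, scores, top_n=3))
```

Looking at the confusion counts alongside percent correct is the point: two models with the same accuracy can make very different kinds of mistakes, and the business cost of a false positive is rarely the same as the cost of a false negative.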
Some of the lessons learned from real-world predictive projects include:
- Do not overlook the critical importance of properly choosing, preparing, cleansing, transforming, and sampling data to train and develop high performing models
(Think MONEYBALL for your business)
- Use principal component analysis or other attribute reduction techniques to reduce variables to avoid “overfitting” predictive models
- Be sure to partner with the business process subject matter experts to make sure all relevant aspects are captured so as not to “under-fit” or incorrectly design a model
- Choose a predictive modeling algorithm that can be effectively used and deployed within the business process and does not just look cool in a slide deck
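As a rough illustration of the attribute reduction lesson above, here is a small pure-Python sketch that finds the first principal component of a data set using power iteration on the covariance matrix. This is only one simple way to compute it, chosen to avoid dependencies; in practice you would use the PCA feature of your analytics tool or a numerical library. The helper names and the tiny data set are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def center(rows: Matrix) -> Matrix:
    """Subtract each column's mean so every attribute is centered on zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance_matrix(centered: Matrix) -> Matrix:
    """Sample covariance (n - 1 denominator) between every pair of attributes."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov: Matrix, iterations: int = 200) -> List[float]:
    """Power iteration: repeatedly multiply by cov and renormalize.

    Assumes the starting vector is not orthogonal to the dominant direction.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def explained_variance_ratio(cov: Matrix, component: List[float]) -> float:
    """Share of total variance captured along the given unit-length direction."""
    d = len(cov)
    along = sum(component[i] * cov[i][j] * component[j]
                for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    return along / total


# Made-up data: the second attribute is roughly twice the first, the third barely varies.
rows = [[2.0, 4.1, 1.0], [3.0, 6.0, 0.9], [4.0, 8.2, 1.1], [5.0, 9.9, 1.0], [6.0, 12.1, 1.0]]
centered = center(rows)
cov = covariance_matrix(centered)
pc1 = first_principal_component(cov)
# One derived column that could stand in for the three original attributes.
reduced = [sum(x * c for x, c in zip(row, pc1)) for row in centered]

print(pc1)
print(explained_variance_ratio(cov, pc1))
```

When one component carries nearly all of the variance, as in this contrived example, three input columns can be replaced by a single derived column, which leaves the model fewer variables to overfit on.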
For a deeper review of these concepts and more, please refer to the shared Practical Predictive Analytics presentation posted at SlideShare.
Some of the tools mentioned at the TDWI meeting included RapidMiner, R, Predixion, Weka, Statistica, SAS, IBM, Microsoft, Oracle, SAP, Spotfire, Tableau, Alteryx, and Knime. There are many more than these in the market today. Here is my quick take on these solutions.
RapidMiner: Free and one of the most popular predictive analytic tools in the world according to the KDnuggets 2013 Poll Results. It has deep and compelling functionality across the entire life-cycle of predictive analytic model development. Initially, learning and designing workflows can be cumbersome and confusing. There are some videos but the best resource I found was Matthew North’s book, Data Mining for the Masses, shown above.
R and Rattle: Free and seriously one of the most popular, most used statistics and predictive platforms. Most of the major BI and database vendors have added R wrappers into their solutions to leverage a sea of available community developed predictive algorithms. R is a must-know for anyone seriously interested in an analytics career. Typically, if you need enterprise scale R, you will want the Revolution Analytics R flavor due to limitations in free, open source R.
Predixion: Founded by former Microsoft predictive analytics program manager/engineering expert Jamie MacLennan, Predixion has a wonderful Excel Add-In and other predictive offerings. I was really sad to see Jamie leave Microsoft but I am thrilled by the raging success of Predixion. Predixion grew 622% in popularity on the KDnuggets poll in 2013 – they are doing exceptionally well. Predixion has a fantastic set of add-ins for Excel that layer on top of Microsoft’s base predictive offering in SQL Server Analysis Services. They are also up to date with their PMML, unlike Microsoft’s base offering. They have a cloud solution and several pre-cooked industry solutions. If you are using Microsoft predictive, Predixion is one of the must-see solutions.
Weka: Free and one of the classic data mining tools used by long-term data miners. This was the tool I used throughout the University of California, San Diego program courses and it is covered in depth within the classic Witten and Frank Data Mining text shown above. Weka has features across the entire life-cycle of data mining from preparation to evaluation, plus some visualization and workflow. Pentaho recently added Weka into their mix of tools.
Statistica: One of the more mature and popular for-fee vendors in this space, StatSoft Statistica has an entire suite of products and industry specific solutions including but by no means limited to ERP, SAP certified solutions, Teradata, ETL, Process Control, Quality, Text Mining, and many, many others… far too many to list here.
SAS Enterprise Miner and SAS JMP: SAS is quite popular and the best of the best… If you have it, I admit that I am jealous! Many educational institutions, including the analytics program that I would most LOVE to attend, NC State Analytics, use SAS in their courses. Seriously, SAS is a market leading for-fee vendor in this space. SAS solutions are high cost but also come with a sea of depth in robust, highly scalable predictive functionality. SAS Enterprise Miner is a true, full CRISP-DM life-cycle data mining solution offering. SAS JMP is not really a predictive modeling focused tool; it is more of a statistical survey, experimenting, and evaluation tool, though it does contain decision trees, neural networks, and some regression algorithms. SAS also has robust in-database processing options via SAS/ACCESS that can leverage Base SAS, SAS_PUT(), and SAS Scoring features within databases such as Teradata, Oracle, Aster, Netezza, Greenplum, DB2, and Hadoop via a SAS Embedded Process.
IBM SPSS Modeler: SPSS is also quite popular and one of the best of the best. SPSS Modeler is also a true, full CRISP-DM life-cycle data mining solution offering. IBM has exceptional depth in text analytics: they applied the research behind their innovative natural language speaking “Watson” to their Semantic and Text Mining features, making IBM a clear industry leader in those specific applications. SPSS is another vendor that is often used in educational courses and has long been used by government agencies, top Fortune 500 companies, and other high profile groups.
Microsoft Data Mining: Microsoft has had data mining algorithms embedded within Analysis Services at least since the SQL Server 2000 days. I recall playing with the Poisonous Mushroom Decision Trees demos way back when learning predictive for retail scenarios. Although this offering did not get significantly enhanced in the latest SQL Server 2012 release, the Excel Add-Ins have continued to be upgraded: they are now 64-bit and work with the latest versions of SQL Server and with both Excel 2010 and 2013. If you have basic data mining needs and want to start somewhere really easy, the Excel Data Mining Add-Ins are a no-brainer. You can also easily deploy and use Microsoft data mining models with DMX predictive queries in your applications or reports. I have a nice public live demo of combining Microsoft data mining models and DMX predictive queries with Tableau here. Many Microsoft data mining customers opt to upgrade those solutions to use Predixion.
Oracle Data Mining: Oracle Data Mining is a component of the Oracle Advanced Analytics Option for Oracle database. This is an “in-database” predictive solution offering with full life cycle development. The free Oracle Data Miner GUI is an extension to Oracle SQL Developer that enables working directly with data inside the database, exploring data graphically, and building and evaluating multiple data mining models. Like other “in-database” predictive solutions, developers can use native SQL APIs to deploy and use predictive models within business intelligence, reports, or applications. Oracle can also integrate with SAS via SAS/ACCESS.
SAP Lumira with Predictive Analysis and SAP HANA PAL: SAP entered the predictive market late last year with a new offering that combines a visual discovery/analytics tool, SAP Lumira, with a Predictive Analysis tool. The Predictive Analysis tool snaps into the visual tool and allows for full life-cycle model development. It currently has limited out of the box models but does have an R wrapper that can be configured, opening up the wealth of R algorithms. The current toolsets are immature but SAP is releasing frequently to catch up to other vendors. SAP HANA PAL, the Predictive Analysis Library, is another “in-database” predictive solution offering and is quite nice. There are some great videos here to learn how to get started with the SAP HANA PAL.
Teradata Warehouse Miner and Aster: Teradata also has an “in-database” predictive solution offering with full life cycle development. This solution has statistics, data profiling, transformation, data reduction functions, model management, and scoring. Teradata and Aster both can integrate with SAS via SAS/ACCESS, allowing SAS_SCORE and other predictive capabilities to be available as a database UDF for easy predictive model integration.
Spotfire: TIBCO Spotfire also has predictive modeling capabilities, though the focus of the software is typically perceived as being more on the visual discovery/analytics side than the CRISP-DM model development process. For many years, the pharmaceutical industry has been using Spotfire in drug research and development. Base Spotfire has limited predictive options. More sophisticated Spotfire flavors allow you to visualize existing predictive models from other applications and also create new predictive models with TIBCO Enterprise Runtime for R (TERR), R, and S+. One of the nice features of Spotfire is the ability to combine or mash-up predictive model results with other data sets to visually examine what-if analysis. Here is a demo video showcasing this concept.
Tableau: It is no secret that I am a huge fan of Tableau and I have already posted several blogs this year on Visualizing R Models in Tableau and on using Tableau with Microsoft DMX. Another popular combination is using Alteryx Predictive R with Tableau since a Tableau TDE destination component was added in Alteryx this year. Tableau is currently not a true data mining or predictive modeling development tool but rather a solution to visualize and explore results. There are some native Tableau, out-of-the-box, predictive features around Trending and Forecasting that do not require R or any other stats programs. There are also multi-pass aggregation capabilities, easily overlooked, that can be used for some predictive processing scenarios. The Tableau Trending features cover linear and nonlinear model types including Linear, Logarithmic, Exponential and Polynomial modeling.
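For readers new to these trend model types, here is a small Python sketch of how linear, logarithmic, and exponential trends can each be fitted with ordinary least squares by transforming x or y first. This is a generic textbook approach and not a description of how Tableau computes its trend lines internally; the function names and sample values are invented for illustration, and polynomial fitting is left out to keep it short.

```python
import math
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_logarithmic(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a + b * ln(x) by regressing y on ln(x). Requires every x > 0."""
    return fit_line([math.log(x) for x in xs], ys)


def fit_exponential(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * exp(b * x) by regressing ln(y) on x. Requires every y > 0."""
    intercept, slope = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope


# Made-up monthly sales that grow by a roughly constant percentage.
months = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [110.0, 121.0, 133.0, 146.0, 161.0, 177.0]

print(fit_line(months, sales))
print(fit_exponential(months, sales))
```

Which form fits best depends on how the measure grows: a constant amount per period suggests linear, a constant percentage suggests exponential, and quick early growth that levels off suggests logarithmic.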
Alteryx: Although I think of Alteryx as the king of Self-Service ETL tools, it also has a nice array of predictive analytics features and an R algorithm wrapper that can be used by mainstream business analysts. I do not think of Alteryx as a true CRISP-DM data mining model development tool but you can certainly build, apply, and evaluate predictive models in a self-service ETL process rather easily with it. Since Alteryx comes with an awesome library of exceptional, high value data sets such as MOSAIC, Dun & Bradstreet, demographics, and geospatial, it may make a lot of sense to use it for predictive analytics in retail, marketing, and other use cases that reference those data sets.
Knime: Free and user-friendly visual workbench for the entire predictive analysis process: data access, data transformation, initial investigation, model development, visualisation and reporting. Although I have not used this tool myself, another respected TDWI member mentioned it and after looking at it I agree that it should be added to this list of favorite predictive modeling tools.
That finally concludes my high-level overview of the TDWI Practical Predictive presentation I did last week. I also have a few other blogs on this topic including: Predictive Analytics with Tableau, How To Predictive in Tableau, What-If Analytic Simulation Options with Excel and SSAS, Visualizing R Models in Tableau, and Predictive Analytics with Mahout. If you are interested in learning more about this fascinating and powerful topic, please don’t hesitate to contact me directly.