Earlier this week I attended and presented at the TDWI Big Data Analytics Solution Summit. It was interesting to hear the other presenters, talk to groups using big data, and get hands-on with messy XML big data stored on a Cloudera Hadoop cluster for my big data visual analytics with Tableau demo. Before this event, I had last tooled around with big data over a year ago, after Denny Lee, Saptak Sen, and a few others inspired me to do so while I was at Microsoft, but I confess I felt overwhelmed by all the strange terminology (Pig, Hive, Sqoop) and the high level of effort needed to be functional with it. Since then the ease of using big data has vastly improved. Today most big data vendors offer connectors such as the Hive ODBC driver, and market-leading visual analytics tools like Tableau add user-friendly big data function libraries on top of those drivers, making the power of big data analytics accessible to a much wider audience of analysts and developers. As a result, this technology is now widespread. If you are a business intelligence professional, you need to learn what big data is and how to work with it. In this post, I will share how simple it is to do big data analytics with Tableau and cover some cool, value-added benefits of using the Tableau-specific features with unstructured, messy big data objects, which are a common pain point with many other visual analytics tools.
To get started with Tableau and big data, you need a big data source. Tableau can connect to quite a few of them, including Hortonworks Hadoop Hive, MapR Hadoop Hive, Cloudera Hadoop, Cassandra, Hadapt, Karmasphere, and Google BigQuery. If you don’t have one of these, most of the vendors offer virtual machines you can download with a cluster already set up for learning. Cloudera Hadoop is the one I am using. Once you have a big data source, you need the related Hive ODBC driver installed on each machine running Tableau Desktop or Tableau Server, and you need to make sure Hive is set up. Hive enables SQL-like queries on Hadoop file systems. More about Hive, what it is, and how it works can be found here. I have a 64-bit Windows 7 OS and found that setting up the ODBC data source was a little tricky. On a 64-bit OS I had to run “C:\WINDOWS\SysWOW64\odbcad32.exe” from the command line, which opens the 32-bit ODBC Data Source Administrator, to get the right ODBC user interface that could see the Hive driver and let me create a DSN to use in Tableau. Once I had that figured out, I could point to the DSN in the Tableau data source connection window, choose my connection type and schema, and start having some real fun visualizing the big data. Creating big data visualizations in Tableau was no different than with any other data source type: it was a pleasant, drag/drop, visual data discovery experience.
Tableau shines in big data visualization. There is no programming required, and there is also no waiting for super slow, long-running MapReduce queries if you choose to use and schedule extracts, as most groups do in real-world use cases after initial exploration and experimentation. In the 2012 TCC keynote demos, Christian Chabot showcased how Tableau could render over 800,000 data points with ease, blowing away other data visualization vendors that often fail to render more than a thousand data points and use uncontrollable data sampling techniques to hide those limitations. In a world of big data, the ability to easily see and analyze patterns, outliers, or exceptions can be the difference between struggling to survive and thriving in your industry. I talked about this strategic competitive advantage during my TDWI Solution Summit demo.
I also discussed how working with big data sources often means working with messy data, unstructured data, JSON, and XML files that can be exceptionally challenging to analyze directly on a Hadoop cluster. Most other visual analytics tools require programming or ETL before you can analyze this type of data. The key point I showed was the low level of effort required with Tableau in these common big data scenarios! I showed how easy it was to do drag/drop visual analysis of XML files over a live, direct Hadoop connection, using Tableau’s enhanced big data driver libraries for XML, JSON, and other string functions that simplify text processing, unpacking nested data, performing data transformations, and processing URLs. Tableau’s big data driver features also support pass-through of Hadoop UDFs and Java programs. This is another very important point to keep in mind, because one of the benefits of using open source is that you can freely leverage the work of open source developers around the world, which further eases big data analytics. In Tableau, you can call Hadoop UDFs, often built as JAR files, by adding the JAR file location to the Initial SQL (add JAR /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u1.jar; add FILE /mnt/hive_backlink_mapper.py;). For more information on this point, refer to Hive CLI.
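To give a feel for the XML, JSON, and URL handling described above, here is a minimal HiveQL sketch of the kind of query that could sit behind such a view. The table web_logs, its columns (xml_payload, json_payload, referrer_url), and the XPath and JSON paths are hypothetical names I made up for illustration; xpath_string, get_json_object, and parse_url are built-in Hive functions. The comments are there for explanation only and may need to be stripped before pasting into Tableau.

```sql
-- Hypothetical table, columns, and paths; substitute your own.
-- xpath_string:    pulls a single value out of an XML string
-- get_json_object: pulls a single value out of a JSON string
-- parse_url:       extracts one part (here the host) of a URL
SELECT
  xpath_string(xml_payload, '/order/customer/name') AS customer_name,
  get_json_object(json_payload, '$.device.os')      AS device_os,
  parse_url(referrer_url, 'HOST')                   AS referrer_host
FROM web_logs
```

Each expression returns an ordinary string column, so the results behave like any other field once they reach Tableau.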
Another thing you can do with Tableau Initial SQL is performance tune the Hadoop MapReduce query settings. To force more query parallelism, you can lower the threshold of data set size required for a single unit of work (set mapred.max.split.size=1000000;). You can also tell Hive to take advantage of clustered fields, also called “bucketed fields”, to improve big data query join performance (set hive.optimize.bucketmapjoin=true;). Another Tableau Initial SQL big data optimization technique is to consider the shape of your data. Big data is often unevenly distributed, and Map/Reduce tasks can then lead to system hot spots where a small number of compute nodes get slammed with most of the computation work. The following Tableau Initial SQL setting informs Hive that the data may be skewed and that it should take a different approach to formulating Map/Reduce jobs (set hive.groupby.skewindata=true;).
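Put together, the three settings above could be entered as a single Initial SQL block along these lines. This is only a sketch: the values are the ones quoted in this post and not tuned recommendations, and the comment lines are my own annotations, which may need to be removed depending on how your driver handles comments.

```sql
-- Smaller maximum split size (in bytes) means more map tasks, so more parallelism.
set mapred.max.split.size=1000000;
-- Let Hive use bucketed (clustered) fields when performing joins.
set hive.optimize.bucketmapjoin=true;
-- Warn Hive that GROUP BY data may be skewed so it plans the jobs differently.
set hive.groupby.skewindata=true;
```

It is worth testing each setting on its own against your cluster, since the right choices depend on your data volume and how the tables were created.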
For ad-hoc big data queries, on-the-fly ETL, or data cleansing, Tableau Custom SQL can be used. In real-world implementations, a LIMIT clause is often added to Custom SQL statements during development and later removed when the Tableau views are deployed into production. There are many other tips and tricks for using Tableau with big data. I hope the above provides a taste of some of the things you can easily do to start getting up to speed. More information on this topic is available in the Tableau knowledgebase and forums.
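As a rough example of the development-time LIMIT pattern mentioned above, a Custom SQL statement might look something like the sketch below. The table raw_page_views and its columns are hypothetical names used only for illustration, and the row cap of 10000 is an arbitrary choice. The comment lines are explanatory and may need to be stripped before pasting into Tableau.

```sql
-- Hypothetical table and columns; light cleanup done on the fly.
-- Development version: the LIMIT keeps queries fast while building views.
-- Remove the LIMIT line before deploying the views to production.
SELECT
  lower(trim(page_url)) AS page_url,
  cast(hits AS INT)     AS hits
FROM raw_page_views
WHERE page_url IS NOT NULL
LIMIT 10000
```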