After my last super lengthy blog, I thought a short and sweet post on how to get started with big data was in order. I have been meaning to write about this topic since July 2013, so it is long overdue. At the beginning of each year I review my goals and update them. Big data was an area that I wanted to dig into last year, and I did! I was fortunate to work with Big Data Visual Analytics early in February, where the Hadoop cluster environment was already installed and configured and the data was loaded. I simply had to set up my data connection and create queries and dashboards. However, that was far too easy. I wanted to dig deeper, as engineers do, to learn HDFS, see MapReduce in action, script Pig code and truly understand what all the silly big data terms like ZooKeeper, Flume, Oozie, Sqoop and so on mean. To do that, I wanted my own local Hadoop playground.

To set up a local Hadoop playground, I downloaded the wonderful Hortonworks Sandbox VM for Oracle VirtualBox. I was literally up and running in about 30 minutes. It sounded too good to be true, but it really is that easy. After setting up the VM, I was able to walk through a myriad of tutorials: uploading a data file into HDFS, creating a table with HCatalog to store loaded data, viewing and creating HiveQL queries with Beeswax, developing and running Pig scripts, running MapReduce jobs and more.
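To make the MapReduce piece a little more concrete, here is a minimal sketch of the classic word-count job written for Hadoop Streaming in Python, roughly the kind of exercise the Sandbox tutorials walk you through. The file names mapper.py and reducer.py and any paths are placeholders I picked for illustration, not something prescribed by the tutorials.

```python
#!/usr/bin/env python
# mapper.py -- read raw text from stdin and emit one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Streaming delivers the mapper output sorted by key, so we can
# sum consecutive counts and emit one "word<TAB>total" line per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

From the Sandbox command line you would launch this with hadoop jar against the Hadoop Streaming jar that ships with the distribution, passing -mapper mapper.py, -reducer reducer.py and HDFS paths for -input and -output, which is a nice way to watch a real MapReduce job run end to end.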

[Screenshot: browsing HDFS data with Power Query]

I was also able to explore my local Hadoop VM instance with Excel, Power Query, Tableau, SAP Lumira, MicroStrategy Desktop and other front-end BI tools by installing the Hive ODBC drivers to enable connections, which let me learn the entire big data analytics process from cradle to grave.
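If you want a quick sanity check of the ODBC path before firing up a BI tool, a few lines of Python with pyodbc will do it. This is just a sketch: the DSN name and the table are placeholders for whatever you configured when you installed the Hive ODBC driver and whatever data you loaded into the Sandbox.

```python
# Hedged sketch: query Hive in the local Sandbox over ODBC via pyodbc.
# "Hortonworks Hive DSN" and sample_table are placeholders -- substitute
# the DSN you created for the Hive ODBC driver and a table you loaded.
import pyodbc

conn = pyodbc.connect("DSN=Hortonworks Hive DSN", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT * FROM sample_table LIMIT 10")
for row in cursor.fetchall():
    print(row)
conn.close()
```

The same DSN is what Excel, Tableau and the other front-end tools point at, so once this returns rows the BI connections usually come together without much drama.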

Since I am an avid reader, I also checked my ACM Safari and Books 24×7 subscriptions to see what Hadoop books were available. Although I do love ebooks, I prefer buying physical technical books to add to my library when I find one that I really like. To fast-track learning Hadoop technologies, I invested in a few O’Reilly books: Hadoop: The Definitive Guide, Programming Pig and Programming Hive. I did look into attending Strata, but I can’t justify a ~$4,000+ training investment right now for my tiny business. If you do have the opportunity to go to a big data conference and have training budget to spend, Strata would be my #1 pick. The content looks fantastic, and I have heard from peers that it is the best one to attend. There are also a ton of courses, both free and paid, about Hadoop and related big data technologies.

I hope this little article helps someone else out there. Big data is becoming a mainstream data technology. Analysts, BI pros and DBAs alike should have a general idea of what Hadoop is, when and why it would be used, and how it all works to keep their skills up to date.