Next up in the Oceans of Data series covering Apache Spark, I will explore analyzing big data with SparkR, MLlib and H2O for data science and machine learning. In previous articles, I introduced the rise of Apache Spark in Part 1 and began analyzing data in Part 2 with Spark SQL.

Data Scientist Workbench

For analytics enthusiasts who love to learn new technologies in a “hands-on” manner, I found a wonderful free data scientist workbench online. I totally love it and warn you that it is addictive! The data scientist workbench includes OpenRefine for data preparation, Jupyter and Zeppelin analytics notebooks, the RStudio IDE pre-cooked into an Apache Spark cluster, Seahorse for visual Spark machine learning, and text analytics. There are free step-by-step tutorials and free classes. If you have time over the holiday break, enjoy these awesome analytics gifts.

Introduction to SparkR

Recently I presented Big Data Analytics with SparkR for the PASS Business Analytics community. The complete session recording is now available and the related deck is also posted on SlideShare.

SparkR is an R package that provides a light-weight front end to Apache Spark, a cluster computing framework for large-scale data processing. Spark is best known for its ability to cache large data sets in memory, and its default API is simpler than MapReduce.

SparkR is multi-threaded and can execute work across many machines. It uses Spark’s distributed computation engine to let us run large-scale data analysis from an R shell. As of Spark 1.6.1, SparkR includes a distributed dataframe implementation that is a key component. Concurrently, RStudio embraced Spark in its latest release, and we now see multiple vendors cooking RStudio directly into their big data offerings, such as IBM Watson DSX, allowing us to carry our existing R skills into the big data analytics world with a minimal learning curve.

There are several ways to work with SparkR. Spark-land sessions start life with a context. You can launch a SparkR shell much like the Spark shell; the difference is that the SparkR shell starts with the SparkContext and SQLContext already created for you. You can also use the RStudio integrated development environment or analytics notebooks including Jupyter and Zeppelin.
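
For instance, here is a minimal sketch of setting up that context from RStudio or a plain R session, assuming a local Spark 1.6.x install with SPARK_HOME set; the SparkR shell does this for you automatically.

```r
# Minimal sketch: load SparkR from a local Spark 1.6.x install (SPARK_HOME assumed set)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Create the SparkContext and SQLContext that a SparkR shell would provide for you
sc <- sparkR.init(master = "local[*]", appName = "SparkR-demo")
sqlContext <- sparkRSQL.init(sc)
```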

Dataframes are a fundamental data structure used for data processing in R and SparkR. They are conceptually similar to tables in a relational database. Dataframes build on the distributed parallel capabilities offered by Spark RDDs, described in the previous Spark series article.

SparkR’s distributed dataframe implementation supports R operations like selection, filtering, aggregation, and so on for large data set analysis. Dataframes can also be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing local R data frames.

There are several ways to create a dataframe in SparkR, as shown in the sketch after this list.

  • Convert a local R dataframe into a SparkR dataframe
  • Create one from raw data, such as a list of lists
  • Load one from an external data source (covered next)
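
Here is a minimal sketch of the first two approaches, assuming the sqlContext created earlier and using the built-in mtcars data set; the variable names are only illustrative.

```r
# Convert a local R data frame (mtcars) into a SparkR dataframe
carsDF <- createDataFrame(sqlContext, mtcars)
head(carsDF)

# Build a dataframe from raw data: a list of lists plus an explicit schema
raw <- list(list("Apache Spark", 2009L), list("SparkR", 2015L))
schema <- structType(structField("project", "string"),
                     structField("year", "integer"))
rawDF <- createDataFrame(sqlContext, raw, schema)
```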

Creating a dataframe from raw data requires a schema structure. One method of creating dataframes from data sources is read.df. This method takes the SQLContext, the path of the file to load, and the type of data source. SparkR supports reading JSON natively, and through Spark packages you can find data source connectors for popular file formats like CSV.
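
A hedged sketch of both cases follows; people.json and flights.csv are hypothetical local files, and the CSV example assumes the spark-csv package was included when launching SparkR.

```r
# JSON is supported natively; the schema is inferred from the file
peopleDF <- read.df(sqlContext, "people.json", source = "json")
printSchema(peopleDF)

# CSV via the spark-csv Spark package
# (e.g., launch with --packages com.databricks:spark-csv_2.10:1.5.0)
flightsDF <- read.df(sqlContext, "flights.csv",
                     source = "com.databricks.spark.csv",
                     header = "true", inferSchema = "true")
```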

When using SparkR you may run into method name conflicts with base R or other loaded packages. To work around them, fully qualify method names using the :: operator.
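
For example, filter is defined by both SparkR and other common packages, so a qualified call like the sketch below avoids ambiguity (carsDF is the dataframe created above).

```r
# SparkR::filter makes it explicit that we want the SparkR method,
# not stats::filter or dplyr::filter
fastCars <- SparkR::filter(carsDF, carsDF$hp > 200)
head(fastCars)
```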

To select columns in SparkR, use select with a list of column names or selectExpr with a string containing a SQL expression. SparkR also provides a number of functions that can be applied directly to columns during data processing and aggregation. To reference a column, use the R $ syntax, e.g., cars$mpg.
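
A short sketch on the carsDF dataframe from earlier; the kpl alias is just an illustrative derived column.

```r
# select with column references using the $ syntax
head(select(carsDF, carsDF$mpg, carsDF$cyl))

# selectExpr with SQL expression strings, including a computed column
head(selectExpr(carsDF, "mpg", "mpg * 0.425 as kpl"))
```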

You can also perform selection and filtering with standard SQL queries against a SparkSQL dataframe. To do that, register the dataframe as a temporary table and query it with the SparkR sql function and the SQLContext, for example SELECT mpg FROM cars WHERE cyl > 4.
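
A minimal sketch, continuing with carsDF and the sqlContext from earlier:

```r
# Register the dataframe as a temporary table so it can be queried with SQL
registerTempTable(carsDF, "cars")
bigEngines <- sql(sqlContext, "SELECT mpg, cyl FROM cars WHERE cyl > 4")
head(bigEngines)
```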

SQL queries should be comfortable for most analytics professionals. If not, stop reading this article and start learning how to query data with SQL. SparkR provides the following SQL aggregate functions that you can apply in dataframe operations: avg, min, max, sum, count, countDistinct, and sumDistinct. User-defined functions are also available as of Spark 2.0.
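
For example, a grouped aggregation over carsDF might look like the following sketch.

```r
# Group by number of cylinders and apply several aggregate functions
head(agg(groupBy(carsDF, carsDF$cyl),
         avg(carsDF$mpg),
         countDistinct(carsDF$gear),
         sumDistinct(carsDF$hp)))
```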

R projects such as dplyr for data preparation and shaping can also be used for complex data manipulation tasks on Spark dataframes. SparkR also supports distributed machine learning using MLlib.

SparkR for Machine Learning

Machine learning identifies or infers patterns in data sets. A machine learning algorithm takes data as input and it produces a model as output. The model can be used to make predictions and score data for reporting or app integration.

Within SparkR for machine learning, there are Spark ML, Spark MLlib, and other projects such as H2O Sparkling Water. Spark MLlib is a module, library, and extension of Apache Spark that provides machine learning algorithms on top of Spark’s RDD abstraction. Spark ML is a module that provides distributed machine learning algorithms on top of Spark dataframes.

  • Spark MLlib contains the original API built on top of RDDs
  • Spark ML is a higher-level API built on top of dataframes

When developing machine learning with Spark, you create a pipeline, a sequence of stages where each stage is either a transformer or an estimator. A transformer is an algorithm that transforms one dataframe into another; for example, an ML model is a transformer that turns a dataframe of features into a dataframe with predictions appended. An estimator is an algorithm that is fit on a dataframe to produce a transformer. We also have parameters: all transformers and estimators share a common API for specifying parameters, essentially the algorithm settings.
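
SparkR itself exposes a small piece of this through its glm function, which acts as an estimator; the fitted model then behaves like a transformer. A minimal sketch using carsDF from earlier:

```r
# Estimator step: fit a Gaussian GLM (linear regression) on the Spark dataframe
model <- glm(mpg ~ wt + cyl, data = carsDF, family = "gaussian")
summary(model)

# Transformer step: the fitted model appends a prediction column to a dataframe
predictions <- predict(model, carsDF)
head(select(predictions, "mpg", "prediction"))
```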

The rsparkling extension package provides bindings to H2O’s distributed machine learning algorithms via sparklyr, giving you access to the machine learning routines in the Sparkling Water Spark package. It also provides a few simple conversion functions for transferring data between Spark dataframes and H2O frames.

A typical machine learning pipeline with rsparkling includes the following steps, sketched in code after the list.

  1. Perform SQL queries via the sparklyr dplyr interface.
  2. Use the sdf_* and ft_* functions to generate new columns, or partition your data set into training, validation and test dataframes.
  3. Convert training, validation and/or test dataframes into H2O frames.
  4. Choose an appropriate H2O machine learning algorithm for modeling.
  5. Evaluate model fit, lift and apply it to make predictions with new data.
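
A hedged end-to-end sketch of those steps is below; it assumes a local Spark connection, uses mtcars as a stand-in data set, and may need the rsparkling.sparklingwater.version option set to match your Spark version.

```r
library(sparklyr)
library(rsparkling)   # may require options(rsparkling.sparklingwater.version = "...")
library(h2o)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Step 1: query the Spark dataframe through the dplyr interface
mtcars_tbl %>% filter(hp > 100) %>% head()

# Step 2: partition into training and test Spark dataframes
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 42)

# Step 3: convert the Spark dataframes into H2O frames
training_h2o <- as_h2o_frame(sc, partitions$training)
test_h2o    <- as_h2o_frame(sc, partitions$test)

# Steps 4 and 5: fit an H2O GLM and score the held-out data
fit  <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = training_h2o)
pred <- h2o.predict(fit, newdata = test_h2o)
```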

H2O is impressive. Check out the web site and GitHub projects. H2O was written from scratch in Java and integrates seamlessly with Apache Hadoop and Spark to solve challenging big data analytics problems. It provides an easy-to-use web-based design experience and interfaces with R, Python, Java, Scala, and JSON via APIs. One awesome feature I appreciate is the visual model evaluation. Real-time data scoring is also pretty cool.

That wraps up this article on SparkR. Stay tuned for more deep dives using these incredible open source big data analytics technologies.