Powered by Apache Spark, Seahorse is an open-source visual framework for data science pipelines. Seahorse’s compelling value proposition is that it lets users quickly and easily build high-throughput, stream-capable Spark data processing applications without writing a single line of code. Creating Spark applications with Seahorse is as easy as dragging and dropping operations on a canvas. I previously mentioned Seahorse in the Oceans of Data series article, Big Data Analytics with Spark Part 3. In this article, we will introduce Seahorse basics.
Data Scientist Workbench
For analytics enthusiasts who love to learn new technologies in a “hands-on” manner, you can play with Seahorse on the free online Data Scientist Workbench or download the Mac, Linux or Windows app from deepsense.ai. Data Scientist Workbench also includes OpenRefine for data preparation, the Jupyter and Zeppelin analytics notebooks, and the RStudio IDE, all pre-configured against an Apache Spark cluster. To fast-track newbies, the Data Scientist Workbench apps come with step-by-step tutorials and data sets.
Seahorse’s web-based visual interface presents Spark applications as graphs of operations – a workflow. A typical Seahorse session consists of three steps: adding operations to the workflow, executing the part of it that’s already been created and exploring the results of the execution. After a workflow has been constructed, it can be exported and deployed as a standalone Spark application on production clusters. You can also use an embedded Python Notebook to interactively analyze data.
Using Seahorse, analytics and data science pros can create complex dataflows and machine learning projects without knowing Spark’s internals or writing code. If you do need or want custom code, you can write your own Seahorse transformation in Python or R. Seahorse includes the following key capabilities:
- Create Apache Spark applications in a visual way using a web-based editor.
- Connect to any cluster (YARN, Mesos, Spark Standalone) or use the bundled local Spark.
- Use the Seahorse Library to easily work with local files.
- Use Spark’s machine learning algorithms.
- Define custom operations with Python or R.
- Explore data with a Jupyter notebook using Python or R from within Seahorse, sharing the same Spark context.
- Export workflows and run them as batch Apache Spark applications using the Batch Workflow Executor.
For data science projects, it offers a wide array of feature-engineering operations and predictive algorithms:
- Grid Search Hyper Optimization
- PCA Dimensionality Reduction
- Chi-Squared Feature Selection
- Regression with AFT Survival Regression, Decision Tree Regression, GBT Regression, Isotonic Regression, Linear Regression and Random Forest Regression
- Clustering with K-Means or LDA
- Classification with Decision Tree Classifier, GBT Classifier, Logistic Regression, Multilayer Perceptron Classifier, Naive Bayes and Random Forest Classifier
- ALS Recommendation
Your First Workflow
To get started in Seahorse, create a new workflow by clicking the New Workflow button, or clone one of the examples to see what is possible. Once you have opened a workflow, press the “START EDITING” button. This initiates the Seahorse Spark application responsible for executing your workflow, and you should be good to go.
Here’s a quick peek at Seahorse’s palette of operations and a workflow. There are dozens of machine learning algorithms and data transformations available. On this screen, you author workflows by simply dragging operations onto the canvas, defining connections between them and configuring their parameters.
Seahorse supports data sources of the following types:
- External File – a file accessible via HTTP, HTTPS or FTP
- Library File – a file uploaded to the Seahorse File Library
- HDFS – a file on the Hadoop Distributed File System
- JDBC – a relational database
- Google Spreadsheet
You can read data into Seahorse using Input/Output > Read DataFrame. Drag and drop that operation from the toolbox on the left to the canvas, or just right-click on the canvas. Now click on Read DataFrame and set its parameters in the right-hand side panel. Select a data source to load, then run the operation by clicking the Run button in the top menu. Now that you have loaded data, you can apply transformation or machine learning operations to it.
After you create a workflow that you want to roll into production, you can use the Seahorse Batch Workflow Executor. It is an Apache Spark application that allows you to execute standalone workflows. This functionality facilitates integrating Seahorse with other data processing systems and managing the execution of workflows outside of the Seahorse Editor.
Keep in mind that if you are using the free Data Scientist Workbench environment, Seahorse performance will be significantly limited. I ran into trouble on a Saturday night – when usage should be light and response times fast – with a continuous re-connection error while setting up a new Seahorse workflow. If you truly want to evaluate Seahorse performance, I recommend downloading the deepsense.ai app and installing it on a production-quality Spark cluster.
To Learn More
With a point-and-click interface, Seahorse makes big data analytics with Spark easy. For more information on Seahorse by deepsense.ai, check out the resources below.
- DeepSense.ai website: https://deepsense.ai/
- Online docs: http://seahorse.deepsense.ai/
- Operations: http://seahorse.deepsense.ai/operations.html
- Basic example: http://seahorse.deepsense.ai/basic_examples.html
- Advanced example: http://seahorse.deepsense.ai/casestudies/income_predicting.html