Recently I completed a guided “hands-on” evaluation of Infoworks, a “no code” big data engineering solution that expedites and automates Hadoop and cloud workflows. Within four hours of logging in, I had created Hadoop data ingestion pipelines, built analytical transformations and designed an OLAP cube with delightful, visual point-and-click simplicity. Notably, I do know how to build data warehouses and OLAP cubes. However, I do not consider myself to be “Hadoop-savvy”. I have merely studied big data concepts by playing with Cloudera, Hortonworks, Spark, and other big data sandboxes.

With that background in mind, let’s dive into a brief overview of Infoworks. If you are interested in learning how to automate big data pipelines, analytics and machine learning projects with this solution, please join me in an upcoming webinar.

Infoworks Webinar

Eliminate Big Data Headaches

Historically, Hadoop has been an extremely difficult platform to master. Early Hadoop adopters struggled to find the scarce talent needed to code and stitch numerous point solutions together to keep the immature ecosystem running. Programming data workflows with Java, Apache Sqoop, Pig, and other Hadoop tools and scripting languages was painful and problematic. As the number of data sources and pipelines grew, so did the expensive, time-consuming challenges.

Inspired by their experience implementing big data at Google and Zynga, the Infoworks founders embarked on a journey to eradicate Hadoop complexity with automation and machine intelligence. Although there are numerous point solutions in the market today, I do not know of any other unified, “all-in-one” solution with comprehensive, end-to-end big data analytics automation across the entire life-cycle.

Infoworks is the only unified, “all-in-one” solution with comprehensive end-to-end big data analytics automation

What makes Infoworks stand out from the crowd is the effortless “no code” visual UX design of complex big data pipelines, savvy Hadoop workload management, support for pause and continue, automated retries, parallel resource execution, end-to-end lineage, audit trails and export of pipelines to avoid lock-in.

Infoworks clients have rolled out complex, large-scale projects in days versus months with minimal resources

In one pilot project, Infoworks ingested and synchronized 60 data sources, built 10 data models and six cubes in two days. When compared to previous Hadoop big data analytics processes, many early adopters have achieved significant project cost, resource and time savings. The following table is a summary of known real-world “before and after” Infoworks results.

Infoworks Before and After

Getting Started with Infoworks

In this “hands-on” walk-through, we will ingest sales data from an Oracle database and combine it with weather data extracted from the internet. Note that I am a very experienced data integration and data warehousing expert in the relational data world. However, I am less knowledgeable when it comes to applying those skills in a Hadoop, Spark or big data cloud environment. So, while not a complete newbie in the world of big data, I am not an expert either. This is typical of the skill sets in most organizations, so this walk-through should be representative of the experience of anyone who has a similar background.

Add a Data Source

To begin setting up a project in Infoworks, you’ll configure a domain (workspace) and data source for automated schema, data type, pattern and metadata crawling scans.

Infoworks data source

Note that Infoworks has internally developed optimized native drivers that are designed specifically for continuous synchronization and dynamic adaptation to both data and schema changes in Hadoop. An SDK is also available for third-party custom connector development.

Ingest Data Source Metadata

After you have a data source set up, the Infoworks Autonomous Data Engine will scan the metadata and infer data relationships. It currently supports streaming, batch and incremental data synchronization.

Infoworks ingestion logs

Data Ingestion

When you are ready to ingest data, Infoworks can be configured to run high-speed, parallel loads. Granular full and incremental load options are available on each table along with “intelligent data sample previews”. Other ingestion administration options include YARN queue naming, scheduling, and helpful data validation and reconciliation capabilities for data load troubleshooting.

Infoworks tables
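
To make the validation and reconciliation idea concrete, here is a minimal Python sketch of the kind of check such a feature might run after a load: compare row counts and an order-independent checksum between source and target. This is only an illustration under my own assumptions, not how Infoworks implements it; the names `table_fingerprint` and `reconcile` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Tuple

Row = Tuple[object, ...]


def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row_count, checksum) where the checksum ignores row order."""
    count = 0
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Adding the per-row hashes (mod 2**256) gives the same total
        # regardless of the order in which rows arrive.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


def reconcile(source_rows: Iterable[Row], target_rows: Iterable[Row]) -> Dict[str, object]:
    """Compare a source extract with the loaded target (hypothetical helper)."""
    src_count, src_sum = table_fingerprint(source_rows)
    tgt_count, tgt_sum = table_fingerprint(target_rows)
    return {
        "source_count": src_count,
        "target_count": tgt_count,
        "counts_match": src_count == tgt_count,
        "checksums_match": src_sum == tgt_sum,
    }
```

A mismatch in either field tells you where to start troubleshooting: differing counts point to missing or duplicated rows, while equal counts with differing checksums point to changed values.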

Data and Schema Synchronization

Having defined a data source and how you want to load that data, you then select a Hadoop Hive schema and HDFS cluster destination. Keep in mind that Infoworks will robustly handle continuously changing data from the source systems using log-based and query-based methods. This is one of Infoworks’ key innovations: a proprietary, high-performance “merge” process for Hadoop.

Infoworks ingestion logs

Infoworks’ incremental data synchronization process delivers several advantages over using a Sqoop approach. Native protocols designed by Infoworks minimize loads on source systems and allow detailed tuning, log-based ingestion, and Change Data Capture (CDC). The proprietary Continuous Merge performs a high-speed merge of incremental records into the base tables in Hadoop, thereby overcoming the immutability of HDFS data. The merge process maintains a history of all inserts, updates and deletes, and provides Slowly Changing Dimensions Type 2 support automatically.
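
Continuous Merge itself is proprietary, so I cannot show how it works internally. The general idea of merging incremental change records into a base table, however, can be sketched in a few lines of Python. Everything below is an assumption for illustration: the `(operation, key, row)` change format and the `apply_changes` function are hypothetical and are not part of any Infoworks API.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
Change = Tuple[str, int, Row]  # (operation, primary key, row payload)


def apply_changes(base: Dict[int, Row], changes: Iterable[Change]) -> Dict[int, Row]:
    """Apply CDC-style changes in order and return a NEW snapshot.

    The base snapshot is never modified in place, which mirrors the
    constraint that files on HDFS are immutable once written.
    """
    merged = dict(base)
    for op, key, row in changes:
        if op in ("insert", "update"):
            merged[key] = row       # last change for a key wins
        elif op == "delete":
            merged.pop(key, None)   # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return merged


base = {1: {"city": "Boston"}, 2: {"city": "Austin"}}
changes = [
    ("update", 1, {"city": "Denver"}),
    ("delete", 2, {}),
    ("insert", 3, {"city": "Miami"}),
]
snapshot = apply_changes(base, changes)
```

A production merge on Hadoop would do this at scale by rewriting only the affected partitions or files; the sketch shows just the record-level logic.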

To see what is happening behind the scenes in Infoworks or to troubleshoot issues, authors can view a summary report or detailed ingestion logs of every command.

Data Transformation

Infoworks Data Transformation

To enable and accelerate self-service big data preparation with Hive, Impala or Spark, the Infoworks Autonomous Data Engine provides interactive, drag-and-drop tasks with support for SQL-based and other transformations. The “smart” suggestion-based, graphical UX has been specifically designed to help non-Hadoop engineers succeed in designing efficient pipelines.

Big data pipeline transformations and workflows are easier than ever to build

For slowly changing dimension (SCD) requirements, Infoworks automatically handles the complexity: authors merely check a setting on a source table column that needs to retain history. Once set, SCD data and schema changes are automatically retained in current and historical Hadoop tables.

Infoworks SCDs
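
For readers who have not built one by hand, the following Python sketch shows what Type 2 slowly changing dimension handling involves, which is the bookkeeping that the checkbox described above saves you from writing. It is a generic textbook illustration, not Infoworks code; the `DimRow` class, its field names and `apply_scd2` are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    key: int
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[DimRow], key: int, new_value: str, as_of: date) -> List[DimRow]:
    """Return a new history list with a Type 2 change applied."""
    current = [r for r in history if r.key == key and r.valid_to is None]
    if current and current[0].value == new_value:
        return list(history)  # value unchanged, so no new version is needed

    result = []
    for row in history:
        if row.key == key and row.valid_to is None:
            # Close out the old version instead of overwriting it.
            result.append(replace(row, valid_to=as_of))
        else:
            result.append(row)
    # Append the new current version.
    result.append(DimRow(key, new_value, as_of))
    return result


history = [DimRow(1, "Boston", date(2020, 1, 1))]
history = apply_scd2(history, 1, "Denver", date(2021, 6, 1))
```

After the call, the Boston row is closed with an end date and the Denver row becomes the current version, so queries can reconstruct the dimension as of any point in time.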

OLAP Cubes and Advanced Analytics on Hadoop

To empower the masses with big data insights in the tools they already know and love, Infoworks authors can create and populate OLAP cubes using ORC or Parquet storage. Just as with Infoworks’ visual data prep UX, authors can design OLAP cubes with automatic creation of time series data, hierarchies and custom measures on Hadoop with drag-and-drop ease. Infoworks also automatically optimizes target data models for query performance with cube aggregation designs to provide sub-second response times.

Infoworks Model
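
The reason cube aggregation designs can deliver fast responses is that totals are computed ahead of time for combinations of dimensions, so a query becomes a lookup rather than a scan. The Python sketch below illustrates that general technique on a tiny in-memory fact table. It is an assumption-laden illustration, not a description of how Infoworks chooses or stores its aggregations; `build_aggregates` and the sample columns are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def build_aggregates(
    facts: List[dict], dimensions: Sequence[str], measure: str
) -> Dict[Tuple[str, ...], Dict[tuple, float]]:
    """Pre-compute the sum of `measure` for every subset of `dimensions`."""
    aggregates = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for fact in facts:
                # Group each fact by its values for this dimension subset.
                totals[tuple(fact[d] for d in dims)] += fact[measure]
            aggregates[dims] = dict(totals)
    return aggregates


facts = [
    {"region": "East", "year": 2017, "sales": 10.0},
    {"region": "East", "year": 2018, "sales": 20.0},
    {"region": "West", "year": 2018, "sales": 30.0},
]
cube = build_aggregates(facts, ["region", "year"], "sales")
east_total = cube[("region",)][("East",)]   # answered by lookup, no scan
grand_total = cube[()][()]
```

Pre-computing every subset grows exponentially with the number of dimensions, which is why real cube engines select only a subset of aggregations to materialize.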

The visual workflow designer includes easy point-and-click options for distributed orchestration requirements. With the workflow capability, production ops teams can easily design, monitor and manage production workflows. Workflows can be paused and restarted, while individual steps can be automatically retried to recover from infrastructure or data issues. Orchestrated tasks can include Infoworks or non-Infoworks tasks, which may run on- or off-cluster.

Infoworks ingestion
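
Automatic retries of individual workflow steps are something you would otherwise script yourself. As a rough illustration of what that involves, here is a minimal Python sketch of retrying a step with exponential backoff. It is not Infoworks code; `run_with_retries` and its parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


calls = []


def flaky_step() -> str:
    """Fails twice, then succeeds, to simulate a transient cluster issue."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "loaded"


result = run_with_retries(flaky_step, max_attempts=3, base_delay=0.0)
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can substitute a function that records the delays instead of actually waiting.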

Reporting applications such as Tableau, Excel, Power BI, Qlik and TIBCO Spotfire can connect to Infoworks OLAP cubes and query them with SQL, R, ODBC, JDBC or a programmatic API. For advanced analytics and predictive modeling use cases, the Spark ML library is available. Infoworks also supports implementation of custom Scala, Python and other data science libraries.

Security, Administration and Governance

Infoworks founders are acutely aware of complex and manually intensive Hadoop administration tasks. To solve commonly known security and administration pains, this solution comes pre-packaged with a rich set of role-based user authentication, orchestration, governance, lineage, monitoring, fault tolerance, notification and encryption capabilities for data in motion and at rest.

Not only does Infoworks help you accelerate the development of your data pipelines, it can help you properly schedule and run big data pipelines in an operational environment with confidence. Another valuable capability is automatic job restart on failed processes. The company also supports optional integration with popular Hadoop utilities such as Apache YARN and Ranger.

Infoworks Lineage

Infoworks Logical Architecture

Now let’s review the Infoworks architecture. Using a containerized model, the Infoworks solution can be run on most Hadoop flavors either on premises or in your favorite cloud environment. Currently Infoworks supports AWS EMR, Microsoft HDInsight, Hortonworks, Impala, MapR, default Apache and Cloudera.

Infoworks Architecture

To avoid host platform lock-in, Infoworks provides exports of big data pipeline definitions. One Infoworks customer successfully migrated three data sources, six to eight big data pipelines, two OLAP cubes, and 12 Tableau reports from one cloud vendor (Microsoft Azure) to another (Google Cloud Platform) in just one day using the export and migrate capability.

To avoid underlying host platform lock-in, pipeline definitions can be exported

Overall Impressions of Infoworks

In my professional opinion, Infoworks is ideal for skilled data warehousing and business intelligence professionals who want to create a Hadoop data lake or deliver analytics on top of one. The familiar steps and visual design user experience are much like creating classic OLAP cubes with dimensional schemas. Developing Hadoop (or cloud) big data analytics pipelines with Infoworks was much easier than writing the Sqoop, Pig and Java routines that I tried in the past. I also noted deeper, more user-friendly, end-to-end administrative functionality than in other offerings in this space that I have tested.

If you are a Hadoop big data engineer managing a sea of complicated, buggy code, adoption of this solution can shift your efforts away from tedious scripting tasks to delivering higher value analytical insights. Although there are point solutions for different areas of big data analytics, Infoworks’ all-in-one, “appliance-like” approach is compelling. The easy-to-use, unified offering simplifies and speeds up development while also giving you better control, governance, and deeper data lineage views than you’d get with competing offerings.

To Learn More

In this article, I briefly introduced Infoworks and shared what I think makes it a truly unique solution. For more information about Infoworks, please join me in the upcoming webinar, contact an Infoworks representative or review the following recommended resources.