Data of all shapes and sizes is rapidly pouring in from an increasing number of sources. To keep up with exponential data growth, leading organizations are turning to "ingest, store, analyze later" design patterns built on data lakes. Data lakes are more agile and flexible than traditional relational data management systems. They are also easier than ever to spin up as a cloud service.

No matter how tempting…don't just dive into the deep end of a data lake. Learn how to swim first. Without descriptive metadata and proper maintenance, your lovely data lake can quickly turn into a data swamp. No one wants to be known for allowing or creating a big data mess. Just as with the data warehouse and self-service BI initiatives of the past, you will want role-based access controls, governance policies, and organization of stored data, along with metadata, logging, auditing, and monitoring of activities.

I do expect demand for big data engineers to swell in the future. In my analytics industry research activities, I am hearing desperate pleas for big data professionals. Specifically, I get requests for “hands-on” experience orchestrating big data pipelines and managing production data lake environments. Several groups expressed that they must have folks with real-world experience. Why?

Datameer Atlanta

Please join me and several early adopters in a free data lake event on August 22, 2017 to find out. We’ll be sharing guidance and lessons learned to help you understand what it takes for your data lake to be a success. In addition to “brain food”, I understand there will be snacks, beverages and networking opportunities for Atlanta data pros.

At "Learn to Swim in a Data Lake You Trust," we will address major issues challenging data-driven enterprises today. Data lakes, built on the promise of a schema-less structure, still require well-defined processes and policies, such as cleansing, quality checking, and governance, to ensure that users trust the data they look to monetize. During this event, we will also explore the structure of a modern data lake, as well as best practices to avoid turning your data lake into a swamp of mistrust.

What Is a Data Lake?

“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format,” said Nick Heudecker, research director at Gartner. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”

Data lakes can complement your existing data warehouse in an enterprise information strategy. If you’re already using a data warehouse, or will be implementing one, a data lake can be used as a source for both structured and unstructured data.

A data lake should support the following capabilities:

  • Collecting and storing any type of data, at any scale
  • Securing and protecting the stored data
  • Searching and finding the relevant stored data
  • Enabling new types of ad-hoc big data analysis to be run
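The "searching and finding" capability above depends on descriptive metadata: if you can only find data by scanning it, you are already building a swamp. A deliberately simplified sketch of a metadata catalog follows; the paths, field names, and tags are illustrative assumptions, not taken from any particular product:

```python
# A toy metadata catalog: each object stored in the lake gets
# descriptive metadata so it can be found later without scanning
# the underlying data itself.
catalog = [
    {"path": "raw/clickstream/2017-08-01.json", "source": "web",
     "format": "json", "tags": ["clickstream", "marketing"]},
    {"path": "raw/sensors/2017-08-01.csv", "source": "iot",
     "format": "csv", "tags": ["telemetry"]},
]

def find_by_tag(catalog, tag):
    """Locate stored objects by descriptive metadata, not by content."""
    return [entry["path"] for entry in catalog if tag in entry["tags"]]

print(find_by_tag(catalog, "clickstream"))
# ['raw/clickstream/2017-08-01.json']
```

Real deployments delegate this job to a catalog service rather than a hand-rolled list, but the principle is the same: no object lands in the lake without metadata describing what it is, where it came from, and who may use it.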

For analysis or use in other systems, raw data in the lake can be given structure at query time using schema-on-read approaches. Note that current analytics tools might not be optimized for unstructured data. To make big data useful and valuable for the business, you need to consider how best to deliver information, i.e., make big data business-ready.
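Schema-on-read means records keep their original format in the lake, and structure is applied only when the data is read. A minimal sketch in plain Python (the records, field names, and schema are illustrative assumptions):

```python
import json

# Raw, semi-structured records as they might land in a data lake:
# fields vary from record to record, and types are not enforced.
raw_records = [
    '{"user": "ana", "amount": "19.99", "ts": "2017-08-01"}',
    '{"user": "bob", "amount": 5, "region": "us-east"}',
    '{"user": "cho"}',
]

# The schema lives with the reader, not the storage: for each desired
# field, a target type and a default for missing values.
schema = {"user": (str, ""), "amount": (float, 0.0)}

def read_with_schema(lines, schema):
    """Apply structure while reading, not while writing (schema-on-read)."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record.get(field, default))
               for field, (cast, default) in schema.items()}

rows = list(read_with_schema(raw_records, schema))
print(rows[0])
# {'user': 'ana', 'amount': 19.99}
```

The same raw records could later be read with a different schema (say, one that also pulls out `region`) without rewriting anything in the lake; that flexibility is the core appeal of schema-on-read over transform-before-load.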

“Make big data business-ready.”

Avoid the Data Swamp

In a fantastic article, Kiran Donepudi notes that “it takes a lot of justification and validation for technology momentum like a data lake to build up and sustain. Most of the implementations start with great excitement and enthusiasm but they eventually slow down… slowing down is largely linked to lack of business sponsorship and broken data governance process.” Gartner also warns of the same issues.

Another risk is data security. Data ingested into a data lake with no oversight increases the risk of data exposure. Since data lake technologies are rapidly evolving, mitigating current gaps may require a combination of people, processes, and tools.

Peers that have implemented cloud data lakes cite similar challenges and risks. These projects seem to start out fine. Over time, several of my peers ended up reverting to old technology: classic ETL processes with SSIS loading traditional data warehouses. That is sad. Don’t let that happen to you.

Data lake technology can provide immense value in a digital era where big data empowers organizations with a superior competitive advantage via artificial intelligence and analytics. Be proactive. Learn how to keep your data lake a pristine, clear, and invaluable strategic asset from those who have done it successfully before you.