In a digital world where data lives everywhere, enterprise data catalogs are an invaluable asset in your information architecture. Over the past two years, I have mentioned data catalogs as a way to enhance self-service BI governance, improve data quality and prepare for the GDPR. In this article, I’ll share why data catalogs have evolved from a “nice-to-have” to a “must-have”. I’ll also share tips on how to select the best data catalog for your organization, based on research gathered by data catalog industry leader Waterline Data.

Why Buy a Data Catalog?

Ten years ago, reporting tools were complex. Only a few technical resources were granted access to known data sources to develop reports for the masses to consume. Data organization, governance and security were controllable.

Today, reporting tools are simple to use and widely available. Data-driven cultures empower the masses with unprecedented self-service access to data sources. Concurrently, growing numbers of cloud apps, IoT and the digital transformation of processes are rapidly increasing the number of available data sources. Privacy regulation has also been evolving, making it progressively more difficult for organizations to effectively secure and govern their data.

To address modern data management needs, data catalog solutions like Waterline Data have become a true necessity. While a data catalog might have been considered a nice-to-have in the past, you can’t afford to be without one now, given the wide variety of data compliance regulations, including the upcoming GDPR enforcement that begins in May 2018. Fines for violations can reach the greater of €20 million or 4% of annual global turnover (revenue). The clock is ticking.
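To make that ceiling concrete, here is a minimal sketch of the arithmetic stated above. The function name and the choice to work in whole euros are illustrative assumptions, and the result is only the upper bound described in this article, not a prediction of any actual fine.

```python
FLAT_CAP_EUR = 20_000_000   # EUR 20 million, as quoted above
TURNOVER_PERCENT = 4        # 4% of annual global turnover, as quoted above


def max_fine_eur(annual_global_turnover_eur: int) -> int:
    """Return the greater of the flat cap or 4% of turnover (whole euros)."""
    turnover_based = annual_global_turnover_eur * TURNOVER_PERCENT // 100
    return max(FLAT_CAP_EUR, turnover_based)


# A company with EUR 100 million turnover: 4% is EUR 4 million, so the flat cap applies.
small = max_fine_eur(100_000_000)
# A company with EUR 1 billion turnover: 4% is EUR 40 million, which exceeds the flat cap.
large = max_fine_eur(1_000_000_000)
```

For any organization with more than €500 million in annual global turnover, the percentage-based figure is the larger of the two.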

How to Select a Data Catalog

As you begin evaluating data catalogs, you’ll find a wide variety of solutions that resemble a data catalog but do not fulfill common requirements. You’ll also see solutions that ideally would be integrated with a real data catalog. To help you decipher what is and what is not a data catalog, here is an overview.

Figure: How a Data Catalog Works

A good data catalog serves as a searchable business glossary of data sources and common data definitions gathered from automated data discovery, classification, and cross-data source entity mapping. The catalog is populated automatically either by analyzing data values and applying complex algorithms to tag the data, or through scanning jobs or APIs that collect metadata from tables, views, and stored procedures.

Figure: Data Catalog Tagging
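To illustrate the idea of tagging data by analyzing its values, here is a deliberately simple sketch. It is not how Waterline Data or any other vendor implements tagging: the patterns, the `suggest_tags` function, and the 80% threshold are all assumptions made for illustration, and plain regular expressions stand in for the more complex algorithms a real catalog would use.

```python
import re

# Hypothetical value patterns; a real catalog would use far richer profiling.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def suggest_tags(values, threshold=0.8):
    """Suggest a tag when at least `threshold` of the non-empty values match its pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            tags.append(tag)
    return tags


# Sample values from a hypothetical column; the empty string is ignored.
column_values = ["ann@example.com", "bob@example.org", "", "cy@example.net"]
suggested = suggest_tags(column_values)
```

Working from the values rather than the column name means a column called `field_7` can still be tagged as containing email addresses, which is why value analysis matters for poorly documented sources.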

Automatically populated data catalogs also allow crowdsourced human contributions such as ratings, annotations, versioning, and documentation. Data catalog administrators should be able to assign data source owners, subject matter experts, stewards and consumers using role-based policies. Governed, audited data catalog access is also needed.

Figure: Data Catalog Ratings and Reviews

Data catalog solutions foster search and efficient reuse of existing data in popular business intelligence, self-service data preparation and data discovery tools. They also provide insight into end-to-end data lineage.

Figure: Data Lineage

Top 10 Data Catalog Capabilities

Waterline Data interviewed data catalog customers to learn which capabilities were most beneficial. The top 10 list below summarizes the findings from that research.

  1. Automated, intelligent population of the catalog
    To efficiently scan and load metadata gathered from hundreds or thousands of potential data sources, automation is a necessity. Good data catalogs also use artificial intelligence and machine learning to intelligently profile, tag and populate objective metadata about the quality of the data.
  2. Crowdsourced curation of tags with machine learning feedback
    Data catalogs should allow human review and stewardship of automatically populated tags. Artificial intelligence powered catalogs should apply machine learning to automatically learn from human content additions to continually improve the accuracy of automated tagging.
  3. Crowdsourced ratings and reviews
    Data catalogs should support rating data sources, adding comments, and providing information about data sets for context.
  4. Ability to ensure tagging and metadata freshness
    To keep the data catalog current, new and incremental scans should update existing metadata. This is especially vital with sensitive data or data that is subject to regulatory compliance policies.
  5. Enterprise scalability
    Look for data catalogs that are natively developed on big data technology such as Spark, Solr or cloud infrastructure to scale service across the entire enterprise. Data catalogs need to be able to manage a wide variety of data source types (relational, semi-structured or unstructured) residing anywhere (on-premises, cloud, hybrid) and scale with your growing data landscape.
  6. Open APIs for integration with a wide variety of tools
    Data catalogs should be able to easily integrate with existing business glossaries, data prep, data discovery, business intelligence tools and line of business applications through open APIs.
  7. Search
    Look for a data catalog that uses a search engine like Solr that has been proven to scale so all users can search the data catalog for data-driven decision making.
  8. Data catalog as a “platform”
    New initiatives on data security, governance, data rationalization and consent management applications will need to be built on top of data catalogs to adapt to changing data legislation. Ensure your data catalog vendor has a vision for these types of use cases.
  9. Data lineage
    Data catalogs need to provide an overview of where data originated and where it was sent. To address common data lineage gaps, the data catalog should automatically discover and suggest missing lineage between data sets to ease manual entry of missing data lineage chains.
  10. Data protection
    To govern and protect sensitive data, look for data catalogs that support data masking along with role-based, granular data security.
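As a rough illustration of the last capability, the sketch below combines data masking with a role check. Everything in it is hypothetical: the `mask_value` and `view_row` helpers, the role names, and the rule of leaving the last two characters visible are assumptions for the example, not the behavior of any particular catalog product.

```python
def mask_value(value: str, visible: int = 2) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def view_row(row, sensitive_columns, role, unmasked_roles):
    """Return a copy of `row`, masking sensitive columns unless the role is allowed to see them."""
    if role in unmasked_roles:
        return dict(row)
    return {
        col: mask_value(str(val)) if col in sensitive_columns else val
        for col, val in row.items()
    }


# Hypothetical record and roles.
row = {"name": "Ada", "iban": "DE89370400440532013000"}
analyst_view = view_row(row, {"iban"}, role="analyst", unmasked_roles={"steward"})
steward_view = view_row(row, {"iban"}, role="steward", unmasked_roles={"steward"})
```

In this sketch the analyst sees the `iban` column masked, while the steward sees the original value. The set of sensitive columns would typically come from the catalog's own tags, which is one reason accurate, fresh tagging matters for data protection.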

About Waterline Data

Waterline Data is an industry-leading data catalog vendor. Top Fortune 500 companies and financial, government and healthcare institutions have already implemented Waterline Data’s Smart Data Catalog solution.

Waterline Data automatically builds a rich catalog of data assets that enables data scientists, business analysts, and data stewards to find, understand, and govern data to accelerate data discovery and ensure compliance. Waterline Data Fingerprinting™ technology combines big data analysis, machine learning and human curation to automatically catalog data and data lineage at scale. The Waterline Data catalog can be integrated with existing business glossaries, data preparation, self-service BI and line of business applications. It is also certified with Apache Atlas (used by Hortonworks) and Cloudera Navigator to share metadata, tags, and data lineage.

Figure: Waterline Data is an ideal solution for GDPR.

Waterline Data is an ideal solution for governing sensitive GDPR data sources. It enables data stewards to review and curate automatic suggestions, data profiles, and data quality metrics. It also captures data lineage from other systems and infers data lineage when it isn’t already documented.

Support for masking sensitive data is also available. For an additional level of data security, Waterline Data can automatically share tags with access control tools such as Apache Ranger, Cloudera Sentry or others via a REST API.

For More Information

To learn more about enterprise data catalog solutions, GDPR legislation, and lessons learned from early adopters, please review the following resources.