An Introduction To Data Catalogs The Future of Data Management
Data Catalogs
Research Sponsored by
By Dave Wells
Introduction
The difficulties of data management have intensified at a steady pace over the past several years. The management complexities of big data, cloud hosting, self-service analytics, and data science can’t be ignored. Effective data management has become a top priority for most organizations, but getting there is challenging. A data catalog has an essential role in overcoming these challenges.
Data catalogs were introduced to help data analysts find and understand data. Before data catalogs, most data analysts worked blind – without visibility into existing data sets, their contents, or their quality and usefulness. As a result, analysts spent much of their time finding data, understanding data, and recreating data sets that already existed. Data catalogs were designed to address these issues.
From modest beginnings as a means to manage data inventory and expose data sets to analysts, the data catalog has grown in functionality, popularity, and importance. Modern data catalogs still meet the needs of data analysts, but have expanded their reach. They are now central to data stewardship, data curation, and data governance. Data catalogs touch nearly everyone who works with data.
Success with data cataloging begins with fundamental knowledge of data catalog basics. You’ll need to understand the what and why of data cataloging, the role and purpose of data curation, how data catalogs are a game-changer for metadata management, and the importance of collaboration and crowdsourcing. Ultimately, you’ll need to plan for and drive data catalog adoption – getting all data stakeholders to participate in curation and cataloging processes and practices.
Starting the Data Cataloging Journey
Data catalogs have quickly become a core component of modern data management. Organizations with successful data catalog implementations see remarkable changes in the speed and quality of data analysis, and in the engagement and enthusiasm of people who need to perform data analysis. By contrast, organizations without a data catalog often have these questions: What is a data catalog? Why do we need a data catalog? What does a data catalog do? These are all good questions and a logical place to start your data cataloging journey.
Data Catalog Defined
A Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate the fitness of data for intended uses.
This brief definition makes several points about data catalogs – data management, searching, data inventory, and data evaluation – but all depend on the central capability to provide a collection of metadata.
Data catalogs have become the standard for metadata management in the age of big data and self-service analytics. The metadata that we need today is more expansive than metadata in the BI era. A data catalog focuses first on datasets (the inventory of available data) and connects those datasets with rich information to inform people who work with data. Figure 1 illustrates the typical metadata subjects contained in a data catalog.
Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource. People metadata describes those who work with data – consumers, curators, stewards, subject matter experts, etc. Search metadata supports tagging and keywords to help people to find data. Processing metadata describes transformations and derivations that are applied as data is managed through its lifecycle. Supplier metadata is especially important for data acquired from external sources, informing about sources and subscription or licensing constraints. We’ll look more closely at catalog metadata in Chapter 3: Data Catalogs and Metadata Management.
Figure 1. Data Catalog Metadata Subjects (datasets, people, searching, processing, suppliers)
Why Do We Need a Data Catalog?
The data management benefits of a data catalog become apparent by reflecting on the value of metadata and the capabilities that are created with comprehensive metadata. The greatest value, however, is often seen in the impact on analysis activities. We work in an age of self-service analytics. IT organizations can’t provide all of the data needed by the ever-increasing numbers of people who analyze data. But today’s business and data analysts are often working blind, without visibility into the datasets that exist, the
© Eckerson Group 2019 www.eckerson.com
INTRODUCTION TO DATA CATALOGS 6
contents of those datasets, and the quality and usefulness of each. They spend too much time finding and understanding data, often recreating datasets that already exist. They frequently work with inadequate datasets resulting in inadequate and incorrect analysis. Figure 2 illustrates how analysis processes change when analysts work with a data catalog.
and perform data preparation and analysis efficiently and with confidence. It is common to shift from 80% of time spent finding data and only 20% on analysis to 20% finding and preparing data with 80% for analysis. Quality of analysis is substantially improved and organizational analysis capacity increases without adding more analysts.
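The 80/20 shift described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 40-hour week and the team of five analysts are assumed figures, not numbers from this report. Only the 20% and 80% analysis shares come from the text.

```python
# Illustrative arithmetic only: the hours and team size below are assumptions.
HOURS_PER_WEEK = 40   # assumed working week per analyst
TEAM_SIZE = 5         # assumed number of analysts


def analysis_hours(analysis_share: float, analysts: int = TEAM_SIZE) -> float:
    """Weekly team hours spent on analysis rather than on finding data."""
    return analysis_share * HOURS_PER_WEEK * analysts


before = analysis_hours(0.20)  # without a catalog: 20% of time on analysis
after = analysis_hours(0.80)   # with a catalog: 80% of time on analysis

print(f"before: {before:.0f} h/week, after: {after:.0f} h/week")
print(f"capacity multiplier: {after / before:.1f}x")
```

Under these assumptions the same team has four times as many hours for analysis, which is the sense in which analysis capacity can increase without adding more analysts.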
Figure 2. Analysis Without and With a Data Catalog
• Data Access – The path from search to evaluation and then to data access should be a seamless user experience with the catalog knowing access protocols and providing access directly or interoperating with access technologies. Data access functions include access protections for security, privacy, and compliance sensitive data.
A robust data catalog provides many other capabilities including support for data curation and collaborative data management, data usage tracking, intelligent dataset recommendations, and a variety of data governance features.
More Than Shared Databases
Data curation is a term that has recently become a common part of data management vocabulary. Data curation is important in today’s world of data sharing and self-service analytics, but I think it is a frequently misused term. When speaking and consulting I often hear people refer to data in their data lakes and data warehouses as curated data, believing that it is curated because it is stored as shareable data. Curating data involves much more than storing data in a shared database.
What is Curation?
Let’s set data aside for a moment and consider the meaning and the activities of curating. The word “curated” is used frequently today. The traditional use of the word is associated with collections of artifacts in a museum and works of art in a gallery. More recently we’ve started to use the term to describe managed collections of many kinds such as curated content at a website, curated music and videos available through streaming services, and curated apps through download services. Wired.com has described Apple’s App Store as “curated computing.”
Curation is the work of organizing and managing a collection of things to meet the needs and interests of a specific group of people. Collecting things is only the beginning. Organizing and managing are the critical elements of curation – making things easy to find, understand, and access.
What is Data Curation?
If “curated” describes collections of things that are selected and managed to meet the needs of a specific group, then “curated data” is a collection of datasets that is selected and managed to meet the needs and interests of a specific group of people. Note that the focus here is datasets – files, tables, etc. – that can be accessed and analyzed. The distinction between “collections of data” and “collections of datasets” is subtle but significant.
Data curation, then, is the work of organizing and managing a collection of datasets to meet the needs and interests of a specific group of people. Collecting datasets is only the beginning. That is what we do when we store data in data warehouses or data lakes. But organizing and managing are the essence of data curation. Making datasets easy to find, understand, and access is the purpose of data curation – a purpose that demands well-described datasets. Data curation is a metadata management activity and data catalogs are essential data curation technology.
Who Are the Data Curators?
A typical organization has many people doing data curation work (see figure 3) with varying degrees of responsibility and time commitment. Everyone who works with data has the opportunity to curate by sharing their knowledge and experiences. Crowdsourcing of tribal knowledge is an important part of curation practice. Collaborative data management is a necessity in the self-service world and knowledge sharing is the first step in creating collaborative culture. Curation collaborators will be large in number with a modest level of responsibility and time commitment.
Domain curators have subject expertise in specific data domains such as customer, product, finance, etc. Domain curators record and share data domain knowledge that helps data analysts to understand the nature of the data that they work with. The number of domain curators is substantially smaller than the number of collaborative curators, with a greater level of responsibility and time commitment.
Figure 3. [labels: Collaborative Curators; number of people]
What About Data Stewards?
I am frequently asked about the differences between data curators and data stewards: Are they two names for the same role? Can data stewards be your data curators? Why do we need both stewards and curators? These are good
The roles of data steward and data curator are related and somewhat overlapping. Stewards and curators working together is a combination that maximizes the value of data across all use cases from enterprise reporting to analytics and data science. Stewardship and curation are both metadata management activities and data governance roles. Data curation and data cataloging are important elements of modern data governance. They are complementary disciplines that are both essential in the age of self-service analytics.
Ultimately, data curation is a metadata management activity, and data cataloging is metadata management technology. But both approach metadata very differently from metadata management practices of the past. Chapter 3: Data Catalogs and Metadata Management looks at metadata management in greater depth.
Single Source for Shared Metadata
Recall that we previously defined a data catalog as “a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate the fitness of data for intended uses.” Although accurate, this definition overlooks one very important point: The data catalog serves as a resource of shared metadata. Everyone who has knowledge about data can share it through the catalog, and anyone seeking knowledge about data can find it in the catalog.
From modest beginnings as a means to manage data inventory and expose data sets to analysts, the data catalog has grown in functionality, popularity, and importance. Modern data catalogs – originated to help data analysts find and evaluate data – continue to meet the needs of analysts, but they have expanded their reach. They are now central to data analysis, data stewardship, data curation, and data governance – all metadata-dependent activities.
A New Approach to Metadata Management
It seems that everyone wants data management but most want to avoid metadata management. The distaste for metadata management is an artifact of past metadata approaches with disparate metadata collected by a variety of tools using proprietary formats and without integration. Metadata management in the BI era was painful, but we can’t avoid the reality that metadata is essential to data management. Just as you need data about finances for effective financial management, you need data about data (metadata) for effective data management. You can’t manage data without metadata.
As data management becomes more complex with data lakes, big data, self-service analytics, and data science, the role of metadata changes and the importance of metadata increases exponentially. Metadata that is current, accurate, and readily accessible is an imperative. Metadata disparity is not workable and metadata management as an afterthought is hazardous. We must actively manage metadata, and a data catalog is the right tool for the job. The data catalog has become the new gold standard for metadata and a cornerstone of data curation.
Metadata in the Age of Self-Service
The real value of metadata is found in the answers it can provide. People who depend on data have questions about trustworthiness, latency, lineage, sensitivity, preparation, and much more. Sometimes they want to find others who know or have worked with the data to get a human perspective. And they need to know about access, privacy and security constraints, cost, etc. Robust metadata ranging from data set names and properties to usage, access, licensing, and subject experts is the key to answering the many questions that data users and data managers will ask.
In today’s self-service world, metadata is essential for three distinct groups of data management stakeholders:
• Data consumers need metadata to help them find data for reporting, analysis, and data science work, and to evaluate that data to ensure that they work with the right datasets.
• Data curators need metadata to observe data usage, understand the needs and interests of data consumers, and effectively manage the collection of shared data.
• Data governors (owners and stewards) need metadata to identify and protect sensitive data, trace data lineage, and establish trust in data.
Figure 4. Metadata in a Catalog
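The three stakeholder views described above can be sketched as queries over one shared set of metadata records. This is a minimal illustration, not a vendor schema: the `CatalogEntry` fields, the function names, and the sample datasets are all hypothetical, chosen only to mirror the metadata the text mentions (names, tags, subject experts, lineage, sensitivity, licensing, usage).

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one dataset (not a real product schema)."""
    name: str
    location: str                                              # lake path or warehouse table
    tags: list[str] = field(default_factory=list)              # search metadata
    subject_experts: list[str] = field(default_factory=list)   # people metadata
    lineage: list[str] = field(default_factory=list)           # upstream dataset names
    sensitive: bool = False                                     # privacy / compliance flag
    license_note: str = ""                                      # supplier or licensing constraints
    usage_count: int = 0                                        # usage tracking


# Made-up sample entries for illustration.
catalog = [
    CatalogEntry("customer_orders", "warehouse.sales.orders",
                 tags=["customer", "sales"], subject_experts=["ana"],
                 lineage=["crm_export"], usage_count=42),
    CatalogEntry("crm_export", "lake/raw/crm/",
                 tags=["customer"], subject_experts=["raj"],
                 sensitive=True, license_note="internal only", usage_count=7),
]


def find_by_tag(entries, tag):
    """Data consumer view: names of datasets carrying a search tag."""
    return [e.name for e in entries if tag in e.tags]


def most_used(entries):
    """Data curator view: dataset names ordered by observed usage."""
    return [e.name for e in sorted(entries, key=lambda e: e.usage_count, reverse=True)]


def sensitive_with_downstream(entries):
    """Data governor view: sensitive datasets and the datasets derived from them."""
    return {e.name: [d.name for d in entries if e.name in d.lineage]
            for e in entries if e.sensitive}


print(find_by_tag(catalog, "customer"))
print(most_used(catalog))
print(sensitive_with_downstream(catalog))
```

The point of the sketch is that all three groups read the same records. Each function only asks a different question of them, which is what a single source of shared metadata makes possible.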
Chapter 4: Collaboration and Crowdsourcing
People and Culture in Data Cataloging
struggle. The key to data-driven success and maturity is data culture, and strong data culture begins with participation. Getting people at all levels from chief data officer to self-service data consumer to actively participate in data management activities is a barrier to building a strong and healthy data culture. A data catalog can be the catalyst that helps to break through the barrier with collaboration and crowdsourcing.
Collaboration is central to data-driven culture, creating an environment where no data stakeholders work in isolation, and where working together and sharing knowledge and experience is the norm. A robust and full-featured data catalog encourages collaboration and crowdsourcing with capabilities such as ratings, reviews, annotations, and deprecations. This is the human side of data cataloging that breaks down organizational silos and fosters a culture of sharing – knowledge sharing, data sharing, process sharing (data preparation), and analysis sharing. (See figure 5.) The data catalog becomes the centerpiece connecting people, data, and use cases in a way that improves both speed and quality of analysis.
Actively sharing knowledge, data, and experiences elevates data literacy and competencies of everyone involved. Working together exposes every individual to new information and different perspectives, often generating new ideas and sometimes sparking innovation.
Figure 5. [labels: Data Catalog; Metadata; Knowledge Sharing; Analysis Sharing; Data Sharing; Dataset Searching; Data Understanding; Collaboration; Data Curation; Data Warehouse; MDM / RDM; SaaS Applications; Legacy Systems; Business Analytics; Business Intelligence; Data Science; Reporting]
Why Collaboration and Crowdsourcing – An In-The-Trenches View
Everyone with a role in data management and everyone with data knowledge has the opportunity and responsibility to collaborate in the processes and activities that make a data catalog valuable and informative. Data consumers, data curators, and data governors must all participate to create a culture of data sharing, metadata sharing, and knowledge sharing.
Analysis and Reporting: Finding the right data for a self-service reporting or analysis project is typically a difficult and time-consuming task filled with unanswered questions. Users of data have questions about quality, trustworthiness, latency, lineage, and more. Sometimes they want to find others who know or have worked with the data to get a human perspective. Through collaboration the network of people willing to share their data knowledge rapidly expands. The effect is amplified with a data catalog that identifies data stewards, data coaches, data subject matter experts, and frequent users of datasets.
Data Curation: Data curation, as previously described, is the work of organizing and managing a collection of datasets to meet the needs and interests of those who work with data. Recall the earlier description of three levels of data curators – lead, domain, and collaborative. Collaborative curators are the largest group, sharing and formalizing tribal knowledge and posting reviews and ratings to share their experiences when working with data. Crowdsourcing of tribal knowledge enriches catalog metadata and elevates the user experience for everyone who works with data. Crowdsourced knowledge from people who have worked with the data, consumer reviews, and usage tracking metadata help to evaluate and select the best-fit datasets for each unique analysis and reporting use case. Collaboration within and among the three levels of curators is an effective way to supercharge the richness and value of catalog metadata.
Data Governance: Adoption of self-service analytics has challenged conventional data governance practices. The top-down, command-and-control governance techniques of the past are at odds with the agility and autonomy interests of the self-service community. In the self-service world, collaborative data governance is an emerging and important practice. We must govern with the belief that most people want to do the right thing. The primary role of governance is to help them to know what is the right thing. Participation and collaboration are essential to fulfilling that role. (See figure 6.)
The data catalog is a core component of collaborative data governance. It provides a single point of reference for everyone who works with data. Everyone from chief data officers to self-service consumers sees the same metadata, and all have the opportunity to share their knowledge, experiences, and perspectives about data. Crowdsourced, participative data governance is a natural fit for self-service organizations.
Figure 6. Rethinking Data Governance. Old Style Command-and-Control Data Governance: Define Policies, Monitor Compliance, Enforce Policies. Modern Collaborative Data Governance: Facilitate Policy Definition; Prevention, Intervention, Enforcement; Foster Policy Compliance.
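The crowdsourced signals described above (ratings, reviews, deprecations, and usage tracking) can be combined to suggest best-fit datasets. The sketch below is one simple way to do that, under stated assumptions: the dataset names, the 1-to-5 rating scale, and the ranking rule (mean rating first, usage count as tie-breaker, deprecated datasets excluded) are all hypothetical and are not taken from any particular catalog product.

```python
from statistics import mean

# Hypothetical crowdsourced feedback: (dataset, reviewer, rating on an assumed 1-5 scale)
reviews = [
    ("orders_clean", "ana", 5),
    ("orders_clean", "raj", 4),
    ("orders_raw", "ana", 2),
    ("orders_2017", "lee", 5),
]
usage_counts = {"orders_clean": 120, "orders_raw": 30, "orders_2017": 3}
deprecated = {"orders_2017"}  # flagged by a curator as superseded


def rank_datasets(reviews, usage_counts, deprecated):
    """Rank non-deprecated datasets by mean rating, then by usage count."""
    ratings = {}
    for dataset, _reviewer, rating in reviews:
        ratings.setdefault(dataset, []).append(rating)
    candidates = [
        (mean(scores), usage_counts.get(name, 0), name)
        for name, scores in ratings.items()
        if name not in deprecated
    ]
    candidates.sort(reverse=True)  # highest rating first, usage breaks ties
    return [name for _avg, _uses, name in candidates]


print(rank_datasets(reviews, usage_counts, deprecated))
```

Note how the deprecation flag overrides a high rating: `orders_2017` has the best score but is left out, which is the kind of judgment a curator contributes that ratings alone would miss.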
Chapter 4: Collaboration and Crowdsourcing discussed the importance of participation by all data stakeholders as a key to getting maximum value from your data catalog. Many organizations, however, find data catalog adoption – getting people to participate – to be among the biggest challenges to data catalog success. Adoption is challenging, but understanding the causes of resistance and developing an adoption plan help to overcome those challenges.
the Adoption Challenges
When planning for implementation, the human and cultural dimensions of data cataloging are often overlooked or subordinated to the process and technology dimensions. A typical data catalog implementation process begins by defining the business and technical case, proceeds through technology selection and installation, then moves on to data discovery and populating the metadata catalog. (See figure 7.) This build-it-and-they-will-come approach fails to engage people to actively use the catalog.
Figure 7. Data Catalog Implementation. Steps: business case & technical case; tool selection; create & configure new data catalog; discover & catalog datasets; confirm, complete & enrich metadata; maintain current & complete metadata; use the catalog.
The final step – use the catalog – often doesn’t happen at the level expected for a variety of reasons. (See figure 8.) Predominant among those reasons:
• New Methods as a Barrier – It is human nature to be anchored by “the way that we’ve always done it.” The shift to new ways of doing things pushes people away from the familiar and comfortable. Self-service data consumers may resist the data catalog and continue to rely on personal networks and tribal knowledge because it is what they know how to do. Using the data catalog requires them to learn new things, which can seem time-consuming and disruptive for busy people.
• Culture Shift – Data cataloging is most successful in a culture of data sharing, knowledge sharing, and collaboration. Behaviors such as “my data” mentality, territorialism, and knowledge hoarding are signs of an unhealthy culture that is a barrier to becoming a data-driven organization. A healthy data culture encourages collaboration and sharing, and discourages the unhealthy behaviors. Participation is a key element of data culture – participation at all levels. Leadership visibly invests in data management and in growing data literacy throughout the organization. Staff are encouraged and incentivized to access and analyze data and to share their knowledge about working with data and share the insights that they derive from data.
• Data Literacy – Many line-of-business people have responsibilities that depend on data analysis but have not been trained to work with data. The skills to get from data to useful information – data selection, data understanding, data preparation, data analysis, data visualization, and data storytelling – are not native and natural for them. Their tendency is to do just enough data work to get by, and to do that work primarily in Excel spreadsheets. The data catalog and access to abundant data often feel more like a hazard than an opportunity to these people.
• Motivation – Changing how you work and learning to use the data catalog can seem intimidating, time-consuming, or simply outside your comfort zone. Most people will resist change until they see how it benefits them personally. What’s-in-it-for-me (WIFM) is a typical response, especially when asked to do new things such as participate in metadata crowdsourcing and post ratings and reviews of datasets. WIFM is a major influence in resistance to data sharing, resistance to knowledge sharing, reluctance to participate in collaborative curation, and reluctance to post ratings and reviews.
Figure 8. [the implementation steps of figure 7, annotated “This is the easy part.”, “This is a bit more difficult.”, and “This is the hard part!”, with barriers: New Methods – Breaking away from “the way we’ve always done it”; Data Literacy – The skills to get from data to useful information; Motivation – Using the catalog isn’t hard but motivating people may be difficult]
Closing Thoughts
Data catalogs are positioned to be an enduring part of the future of data management. They fill critical roles for data analysis, data curation, data governance, and data science. Effective use of a data catalog increases effectiveness and value derived from all of the other tools in your data and analytics technology stack. Data preparation, data analysis, and data science tools all see marked ROI increases when coupled with data cataloging. To realize the benefits of data cataloging, begin with the business and technical case – know the what and why of data cataloging. Then put data curation practices into action to manage metadata, and encourage collaboration and crowdsourcing to enrich the metadata. Systematically and incrementally expand the reach of the data catalog, ultimately extending to all data consumers and stakeholders. With this approach to data cataloging you’ll experience real business impact through increased capacity for data analysis, accelerated analysis, and improved quality and reliability of analysis results.
About Alation
Alation, the data catalog company, is building a data-fluent world by changing
the way people find, understand, trust, use and reuse data. The first to bring
a data catalog to market, Alation combines machine learning and human
collaboration to bring confidence to data-driven decisions. More than 150
organizations, including eBay, Exelon, Munich Re and Pfizer, leverage the Alation
Data Catalog. Headquartered in Silicon Valley, Alation is funded by Costanoa
Ventures, DCVC (Data Collective), Harmony Partners, Icon Ventures, Salesforce
Ventures, and Sapphire Ventures.
For more information, visit alation.com.