Data Warehousing - Metadata Concepts
Data Warehousing - Metadata Concepts
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book. In other words, we can say that metadata is the summarized data that leads us to
Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data
warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this metadata, additional metadata is also created for time-stamping
Categories of Metadata
Business Metadata − It has the data ownership information, business definition, and changing policies.
Technical Metadata − It includes database system names, table and column names and sizes, data types and
allowed values. Technical metadata also includes structural information such as primary and foreign key
is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on
it.
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is
different from the warehouse data, yet it plays an important role. The various roles of metadata
This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps in decision support system for mapping of data when data is transformed from operational
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized data.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −
Definition of data warehouse − It includes the description of structure of data warehouse. The description is
defined by schema, view, hierarchies, derived data definitions, and data mart locations and contents.
Business metadata − It contains has the data ownership information, business definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data
is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on
it.
Data for mapping from operational environment to data warehouse − It includes the source databases and
their contents, data extraction, data partition cleaning, transformation rules, data refresh and purging rules.
summarizing, etc.
The importance of metadata can not be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also
enforces the definition of business terms to business end-users. With all these uses of metadata,
it also has its challenges. Some of the challenges are discussed below.
Metadata in a big organization is scattered across the organization. This metadata is spread in spreadsheets,
Metadata could be present in text files or multimedia files. To use this data for information management
There are no industry-wide accepted standards. Data management solution vendors have narrow focus.
very high. Before data marting, make sure that data marting strategy is appropriate for your
particular solution.
In this step, we determine if the organization has natural functional splits. We look for
departmental splits, and we determine whether the way in which departments use information
tend to be in isolation from the rest of the organization. Let's have an example.
Consider a retail organization, where each merchant is accountable for maximizing the sales of
a group of products. For this, the following are the valuable information −
As the merchant is not interested in the products they are not dealing with, the data marting is a
subset of the data dealing which the product group of interest. The following diagram shows
The merchant could query the sales trend of other products to analyze what is happening to the sales.
Note − We need to determine the business benefits and technical feasibility of using a data
mart.
We need data marts to support user access tools that require internal data structures. The data in
such structures are outside the control of data warehouse but need to be populated and updated
on a regular basis.
There are some tools that populate directly from the source system but some cannot. Therefore
additional requirements outside the scope of the tool are needed to be identified for future.
Note − In order to ensure consistency of data across all access tools, the data should not be
directly populated from the data warehouse, rather each tool must have its own data mart.
There should to be privacy rules to ensure the data is accessed by authorized users only. For
example a data warehouse for retail banking institution ensures that all the accounts belong to
the same legal entity. Privacy laws can force you to totally prevent access to information that is
Data marts allow us to build a complete wall by physically separating data segments within the
data warehouse. To avoid possible privacy problems, the detailed data can be removed from the
data warehouse. We can create data mart for each legal entity and load it via data warehouse,
Data marts should be designed as a smaller version of starflake schema within the data
warehouse and should match with the database design of the data warehouse. It helps in
data warehouse. Summary tables help to utilize all dimension data in the starflake schema.
Network Access
Although data marts are created on the same hardware, they require some additional hardware
and software. To handle user queries, it requires additional processing power and disk storage.
If detailed data and the data mart exist within the data warehouse, then we would face additional
Note − Data marting is more expensive than aggregations, therefore it should be used as an
Network Access
A data mart could be on a different location from the data warehouse, so we should ensure that
the LAN or WAN has the capacity to handle the data volumes being transferred within the data
The extent to which a data mart loading process will eat into the available time window depends
on the complexity of the transformations and the data volumes being shipped. The
Network capacity.