Draft

ESSnet on Data Warehousing Memo 1(8)


Statistics Sweden 2011-05-1306-10
Lars-Göran Lundell

Some Metadata Definitions


The purpose of this paper is to initiate a discussion on what metadata are
essential to and specific to a statistical data warehouse. A reasonable starting
point is to establish some basic definitions. To that end the Internet and the
bookshelves were searched for metadata and data warehousing related
information. In particular Internet sites set up by national and international
organisations working with statistics and/or standards were searched. Most
of the results shown below come from the statistics sites of Eurostat, OECD
and UNECE and from the standards organisations NISO and ISO. Detailed
search results have been compiled in Annex 1.

Metadata and data


Metadata and statistical metadata
General definitions of metadata can be found in many books and many sites
on the Internet. Most of them are very short and simple. The most commonly
used generic definition states that:

• Metadata are data about data

There are some variations on the theme, e.g. claiming that metadata should
(or must) be structured or formalised. Perhaps somewhat unexpectedly the
sources that have a relation to statistics give definitions that are even shorter
and vaguer than some of the general purpose sources. The OECD definition
of statistical metadata is for example simply:

• Statistical metadata are data about statistical data

This definition will obviously cover all kinds of documentation with some
reference to any type of statistical data and is applicable to metadata that
refer to data stored in a statistical data warehouse as well as any other type of
data store.

Data and statistical data


Since the definition of metadata shows that they are just a special case of
data, we need a reasonable definition of data as well. A derivative from a
number of slightly varying definitions would be:

• Data are qualitative and/or quantitative information collected through
observation

Just as for statistical metadata, several definitions of statistical data
can be found. OECD provides this definition:


• Statistical data are data from a survey or administrative source used
to produce statistics

For statistical data warehouse purposes this definition has to be slightly
revised:

• Statistical data are data from one or several surveys and/or
administrative sources used to produce statistics

Metadata categories
Metadata may describe many different aspects of data. Hence metadata can
be categorised along a number of overlapping dimensions.
Consequently, each metadata item normally belongs to several categories.

Active vs. passive metadata


Traditionally, metadata have been seen as a documentation of an existing
object or a process, such as a statistical production process, that is running
or has already finished – i.e. the result of a task most often carried out as the
last, even optional step of the production process. This indicates a passive,
recording role, which is useful for documenting, e.g., the methods used to
plan and carry out a survey or the quality achieved for the final results.

Passive metadata will become more active if they are used as input for
planning, e.g., a new survey round or a new similar statistics product. The
term active metadata should, however, be reserved for metadata that are
operational. Active metadata may be regarded as an intermediate layer
between the user and the data, which can be used by humans or computer
programmes to search, link, retrieve or perform other operations on data.
Thus active metadata may contain rules or code (algorithmic metadata).
Some authors use the term active only for such algorithmic metadata, i.e.
those that can be interpreted or executed at runtime to support
metadata-driven processes, calling all other non-passive metadata semi-active.

Passive metadata are used as documentation in all statistics production
regardless of storage environment. In a statistical data warehouse active
metadata must be available in what is often called the metadata layer.

Suggested definitions:

• Active metadata are metadata stored and organised in a way that
enables operational use, manual or automated, for one or more
processes (GSBPM)
• Passive metadata are any metadata that are not active
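
The distinction can be illustrated with a minimal Python sketch. It is not taken from any of the sources; the class VariableMetadata, the function invalid_records and the example variable are hypothetical names chosen for illustration. The value domain is used operationally to check records (an active role), while the free-text description is only read by humans (a passive role).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableMetadata:
    """Hypothetical metadata record for one variable in the warehouse."""
    name: str
    value_domain: frozenset  # allowed codes, used operationally (active role)
    description: str         # free-text documentation (passive role)


def invalid_records(records, meta):
    """Return the records whose value for meta.name is outside the value domain.

    The check is driven entirely by the metadata: changing the value domain
    changes the behaviour without touching this function.
    """
    return [r for r in records if r.get(meta.name) not in meta.value_domain]


# Assumed example variable and code list, for illustration only.
sex = VariableMetadata(
    name="sex",
    value_domain=frozenset({"1", "2"}),
    description="Sex of the respondent as reported in the survey.",
)

records = [{"id": 1, "sex": "1"}, {"id": 2, "sex": "9"}, {"id": 3}]
rejected = invalid_records(records, sex)
print([r["id"] for r in rejected])  # records with an unknown or missing code
```

In this sketch the same metadata item serves both roles, which matches the observation that passive metadata become more active once they are used as input to a process.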


Structured vs. free-form textual metadata


As mentioned above some authors claim that metadata must be structured, or
formalised. The opposite would probably be metadata in a completely free
form. In practice all metadata probably follow some kind of structure, which
may be more or less strict. At one end we have completely and strictly
formalised metadata, meaning that only pre-determined codes or numerical
information from a pre-determined domain may be used. At the other end we
find a loose structure, e.g. a set of chapters, subdivisions, headings, etc. that
may be mandatory or optional and whose contents may adhere to some rules
or may be entered in a completely free form (text, diagrams, etc.).

Strictly structured metadata are obviously well suited for use in an active
role, but there is no simple, unambiguous mapping between active and
structured, and passive and free-form, respectively.

Since active metadata are vital to building an efficient statistical data
warehouse it follows that in that environment metadata should also be well
structured, whenever possible.

Suggested definitions:

• Structured metadata are metadata stored and organised according to
standardised codes, lists and hierarchies
• Textual metadata are metadata that contain descriptive information
using formats ranging from completely free-form to semi-structured
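
As a rough illustration of the difference, the following Python sketch stores one metadata item in structured form (a code from a hierarchical code list) and one in textual form (a free-form note). The code list, its codes and labels, and the function ancestors are all invented for this example and do not come from any real classification.

```python
# Hypothetical hierarchical code list: code -> (label, parent code or None).
REGION_CODES = {
    "R": ("Whole country", None),
    "R1": ("Region 1", "R"),
    "R11": ("District 11", "R1"),
    "R2": ("Region 2", "R"),
}


def ancestors(code, code_list):
    """Walk up the hierarchy from a code to the root.

    Raises KeyError if the code is not in the pre-determined list, which is
    exactly the strictness that makes structured metadata usable by software.
    """
    path = []
    parent = code_list[code][1]
    while parent is not None:
        path.append(parent)
        parent = code_list[parent][1]
    return path


# Structured metadata: only codes from the list are allowed.
structured_item = {"variable": "region", "code": "R11"}

# Textual metadata: free-form, meant for a human reader.
textual_item = {
    "variable": "region",
    "note": "Region of residence at year end; see the survey documentation.",
}

print(ancestors(structured_item["code"], REGION_CODES))
```

The structured item can be aggregated or validated automatically, whereas the textual note can only be displayed or searched as text.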

Reference vs. structural metadata


Most sources define two main categories of metadata, most often called
business and technical metadata. The distinction between those two varies
between the authors, but a generalised definition could be that business
metadata help the user understand, interpret and evaluate the contents, the
subject matter, the quality, etc, of the data, and technical metadata help the
user find and access the data by providing attributes such as names and
descriptions of files, tables, columns, fields, etc.

In the “statistical sources” the terms business and technical metadata are
rarely used. Several different synonyms can be found for business metadata,
e.g. conceptual or logical, but the most commonly used is reference
metadata. Instead of technical metadata the term structural metadata is
preferred. The definitions remain.


Structural/technical metadata can quite easily be represented as structured
and active, while more work is required to make reference/business
metadata active by storing them in a structured way.

Other similar categorisations are sometimes used, e.g. the term
administrative metadata (cf. NISO) for a subset of structural metadata
covering metadata that handle users’ rights to access and utilise data (rights
management metadata) and metadata specifically for archiving purposes
(preservation metadata).

Suggested definitions:

• Reference metadata are metadata that describe the contents and
quality of the data in order to help the user understand and evaluate
them (conceptually)
• Structural metadata are metadata that help the user find, identify,
access and use the data (physically)

Process metadata
Information on an operation, such as start and end times, result status code,
number of records processed, resources used, etc., is a specific type of
metainformation. This kind of metadata is known under several names, such
as process metadata, process data, process metrics and paradata. These data may
contain either expected values or actual outcomes. In both cases they are
primarily intended for planning – in the latter case by evaluating finished
processes in order to improve recurring or similar ones. Process metadata
should be structured to facilitate computer-aided evaluation.

Suggested definition:

• Process metadata are metadata that describe the expected or actual
outcome of one or more processes using evaluable and operational
metrics
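
A minimal Python sketch of how such process metadata might be captured is given below. The class ProcessMetadata, the function run_with_process_metadata and the field names are assumptions made for this example; they only mirror the items listed above (start and end times, status, number of records processed, expected versus actual outcome).

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessMetadata:
    """Hypothetical structured record of one process run."""
    process_name: str
    expected_records: int               # expected value, set at planning time
    start_time: Optional[float] = None  # actual outcome, seconds since epoch
    end_time: Optional[float] = None
    records_processed: int = 0
    status: str = "planned"

    @property
    def duration(self):
        """Elapsed seconds, or None if the process has not finished."""
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time


def run_with_process_metadata(name, records, step, expected_records):
    """Apply step to every record and return the process metadata of the run.

    Errors are recorded in the status field instead of being re-raised, so
    that a failed run still leaves evaluable metadata behind.
    """
    meta = ProcessMetadata(name, expected_records)
    meta.start_time = time.time()
    try:
        for record in records:
            step(record)
            meta.records_processed += 1
        meta.status = "ok"
    except Exception:
        meta.status = "failed"
    finally:
        meta.end_time = time.time()
    return meta


meta = run_with_process_metadata(
    "load_survey", [1, 2, 3], lambda r: None, expected_records=3
)
print(meta.status, meta.records_processed == meta.expected_records)
```

Because the record is structured, expected and actual values can be compared by a program, which is the kind of computer-aided evaluation mentioned above.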
Quality metadata
Quality metadata may be read either as metadata on the quality of the data
or as metadata of high quality. Both interpretations are relevant to statistics
production and data warehousing.

Keeping track of, maintaining and perhaps raising the quality of the data in
the warehouse is an important governance task that requires support from
metadata. Quality information should be available in different forms and
serve several purposes: to describe the quality achieved (e.g. how a survey
was carried out, or what the outcome was), or to measure the outcome (a
contribution to the process metadata). The main objective of the former is to
serve the end users of the data, while the latter primarily supports governance
and future improvements. Hence quality metadata may be seen as a different
dimension that cuts through all the others.

Metadata quality is obviously a very important issue, and it should be high,
within the restrictions of reasonable cost-benefit analysis. Inferior metadata
quality may lead to unnecessary misinterpretations of the data contents or
even result in completely useless data.

Suggested definition:

• Quality metadata are any kind of metadata that contribute to the
description or interpretation of the quality of data.

Metadata structures
Several sources claim that the data warehouse needs a central system where
its metadata are registered and logically stored, a metadata registry. This
registry will make it easier to handle identification, check for duplicates,
ensure consistency, etc. It is, however, a logical matter; a centralised
metadata registry does not imply that metadata are physically stored in a
centralised system.
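
The idea of a logically central registry over physically distributed metadata can be sketched in a few lines of Python. The class MetadataRegistry, its methods and the example identifiers and locations are hypothetical; the sketch is not modelled on ISO/IEC 11179 or on any existing product.

```python
class MetadataRegistry:
    """Hypothetical logical registry.

    It records where each metadata item is physically stored; it does not
    hold the metadata themselves.
    """

    def __init__(self):
        self._entries = {}  # normalised identifier -> physical location

    @staticmethod
    def _normalise(identifier):
        # Assumed rule: identifiers are compared ignoring case and
        # surrounding whitespace, so trivial variants count as duplicates.
        return identifier.strip().lower()

    def register(self, identifier, location):
        key = self._normalise(identifier)
        if key in self._entries:
            raise ValueError(f"duplicate metadata identifier: {identifier!r}")
        self._entries[key] = location

    def locate(self, identifier):
        return self._entries[self._normalise(identifier)]


registry = MetadataRegistry()
# The two items live in different (invented) physical systems.
registry.register("variable/income", "db://survey_system/variables")
registry.register("classification/region", "file://shared/codelists/region.csv")

print(registry.locate("Variable/Income"))
```

Registration is the single point where identification and duplicate checks happen, while the items stay in whatever system produced them.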

The term metadata repository is also frequently used, particularly when
discussing metadata in relation to data warehousing. In this case the
distinction between logical and physical matters seems less clear. The
repository is logically centralised, but while some also advocate a centralised
physical solution, based on some form of central “metadatabase”, others
prefer coordinated, physically distributed systems. This means that a
metadata registry may be seen as a subset of a metadata repository, or as a
narrower definition.

A third commonly used term is the metadata layer. A data warehouse is
often described as consisting of several parts that serve separate functions,
sometimes called layers. The metadata layer may in this case be interpreted
as a synonym for either the metadata registry or the metadata repository,
depending on the exact definitions being used.

Metadata collection and usage


The metadata lifecycle is commonly described as divided into the following
three basic phases:

1. Collection
Metadata should be captured as early as possible in the production
process. The sources vary. Collection of some types of metadata can
and should be automated. When data are entered into the data
warehouse basic metadata must already exist in a correct form.

2. Maintenance
Metadata must be up to date at all times. Processes must be in place
to capture changes and synchronize metadata with the changing
architecture.

3. Deployment
Metadata must be available to users in the right form and with the
right tools.

Collection of metadata should be automated whenever possible. This means
that, e.g., metadata that exist in the sources, such as administrative data files
used as input, should be used directly or in a derived form.

Another way of simplifying metadata collection is to use what already exists.
Reuse and inherit are common keywords in metadata literature. One of the
major advantages of using metadata is that duplicate and “near-duplicate”
data can be revealed and avoided. Reusing data and metadata saves
resources and increases efficiency and quality. Revealing, e.g., variables having
almost, but not quite, the same definitions can improve harmonisation and
comparability. The data harmonisation that will be enabled by metadata
harmonisation is a vital task for the data warehouse – possibly the most
important and at the same time one of the most difficult ones.
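
One simple way to reveal variables with almost, but not quite, the same definitions is to compare the definition texts pairwise. The Python sketch below uses SequenceMatcher from the standard library module difflib; the function near_duplicates, the threshold of 0.8 and the example definitions are assumptions made for illustration, and a real harmonisation effort would need subject-matter review of every candidate pair.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(definitions, threshold=0.8):
    """Return (name_a, name_b, similarity) for definitions that are similar
    but not identical.

    definitions maps a variable name to its textual definition. The
    threshold is an arbitrary illustrative choice, not a recommended value.
    """
    found = []
    for (name_a, text_a), (name_b, text_b) in combinations(
        sorted(definitions.items()), 2
    ):
        a, b = text_a.lower(), text_b.lower()
        if a == b:
            continue  # exact duplicates are a separate, simpler case
        similarity = SequenceMatcher(None, a, b).ratio()
        if similarity >= threshold:
            found.append((name_a, name_b, round(similarity, 2)))
    return found


# Invented variable definitions from two imaginary sources.
definitions = {
    "income_before_tax": "Total income before tax during the calendar year",
    "income_after_tax": "Total income after tax during the calendar year",
    "employees": "Number of employees at the local unit",
}

for name_a, name_b, similarity in near_duplicates(definitions):
    print(name_a, name_b, similarity)
```

Text similarity only flags candidates. Whether two flagged variables should be merged, kept apart or harmonised remains a decision for the people responsible for the definitions.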

Different user categories need different metadata and have different
requirements. End users want to use metadata to easily and correctly find and
interpret the data they need. Data stewards want an inventory of what is
stored in the data warehouse. Analysts want to compare the data sources.
Programmers want to make sure that they use the standard names. These are
just a few examples of metadata usage. The use ranges from detailed and
operational to overview and descriptive.

Metadata standards
Standards for metadata have been discussed for many years, but still have not
developed very far. The most successful effort is probably ISO/IEC 11179,
Metadata registries, which is a standard on the conceptual level. Several
national statistical institutes (NSIs) have based their metadata systems on
that standard.

The Common Warehouse Metamodel (CWM) is a specification for
modelling metadata for data warehouses. The standard is supported by the
Object Management Group, which in turn is supported by several major
software companies.


DDI, the Data Documentation Initiative, is an XML based standard
specification for documentation of social science data. It is supported by an
international alliance.

SDMX, Statistical Data and Metadata eXchange, is also based on XML.

Several NSIs are currently cooperating on the development of a Generic
Statistical Information Model (GSIM), which includes the Common
Reference Model (CRM), and is linked to the Generic Statistical Business
Process Model (GSBPM). The work is led by ABS.

Metadata for statistical data warehouses


“Metadata is the DNA of the data warehouse, defining its elements and how
they work together. [...] Metadata plays such a critical role in the architecture
that it makes sense to describe the architecture as being metadata driven.”¹

Panos Vassiliadis² of the University of Ioannina, Greece, summarizes well
the requirements of data warehouse metadata. They should include
information on:

1. the contents of the data warehouse, their location and their structure
2. the processes that take place in the data warehouse
3. the implicit semantics of data along with any other kind of data that
helps the end user exploit the information of the warehouse
4. the infrastructure and physical characteristics of components and the
sources of the data warehouse
5. security, authentication, and usage statistics that help the
administrator tune the operation of the data warehouse as appropriate

The metadata categories described earlier in this paper are general. Some
sources mention metadata categories specific to the data warehouse
environment, e.g. ETL metadata (for the “Extract–Transform–Load”
process), but these all seem to be subsets or mere renamings of the categories
already defined.

Looking at the categories, and keeping in mind the specific demands of a
statistics production environment, it is possible to assess which categories
play special roles in building and maintaining a statistical data warehouse
(SDW).

• SDW requires active metadata. The number of objects (variables,
value domains, etc.) stored makes it necessary to provide the users
(persons and software) with active assistance in finding and processing
the data.

• SDW requires structured metadata. The number of metadata items
will be large and the requirement for metadata to be active makes it
necessary to structure the metadata very well.

• SDW requires structural metadata. Active metadata must, at least in
part, be structural.

• Process metadata are vital to an SDW. Since the data warehouse
supports many concurrent users it is very important to keep track of
usage, performance, etc. In a data warehouse that has been less than
perfectly designed one user’s choice of tool or operation could impair
the performance for other users. An analysis of process metadata can
be an input to correcting this anomaly.

This does not mean that the remaining metadata categories should be
disregarded, but that they are used and needed in a statistical data warehouse
in the same way as in any statistics production environment.

¹ Kimball, The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley, 2008, p. 117
² Vassiliadis, Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009

Annex 1
1(4)

Metadata related terms

Sources
• Wikipedia: direct quotations from Wikipedia and from its sources
http://en.wikipedia.org/wiki/Metadata
http://en.wikipedia.org/wiki/Data_warehouse
• ISO (International Organization for Standardization), ISO/IEC 11179 Metadata registries (MDR), http://metadata-stds.org/11179/
• NISO (National Information Standards Organization), Understanding Metadata, http://www.niso.org/publications/press/UnderstandingMetadata.pdf
• UNECE Metadata Common Vocabulary, MCV (Draft, March 2006), http://circa.europa.eu/Public/irc/dsis/metadata/library?l=/metadata_forces/force_meeting_092007/mtf-6-mcv-anxpdf/_EN_1.0_&a=d
• UNECE, Terminology on Statistical Metadata (2000), http://www.unece.org/stats/publications/53metadaterminology.pdf
• UNECE, Guidelines for the modeling of statistical data and metadata (1995), http://www.unece.org/stats/publications/metadatamodeling.pdf
• OECD, Glossary of Statistical Terms, http://stats.oecd.org/glossary/

| Term | Wikipedia | ISO | NISO | OECD, UNECE (Metadata Common Vocabulary) | UNECE (Terminology, Guidelines) |
| --- | --- | --- | --- | --- | --- |
| Metadata | Data providing information about one or more aspects of the data, such as: means of creation of the data; purpose of the data; time and date of creation; creator or author of data; placement on a computer network where the data was created; standards used | Data that defines and describes other data | Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information. Metadata can describe resources at any level of aggregation. It can describe a collection, a single resource, or a component part of a larger resource | Data that defines and describes other data. Metadata are data that describe other data, and data become metadata when they are used in this way. | Data and other documentation that describes objects in a formalized way |
| Statistical metadata | | | | Data about statistical data. Comprises data and other documentation that describes objects in a formalised way. Provides information on data and about processes of producing and using data. | Metadata describing statistical data |
| Data | Qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements [...] or observations [...]. | Re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing | | Characteristics or information, usually numerical, that are collected through observation. | The physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means. |
| Statistical data | | | | Data from a survey or administrative source used to produce statistics | Data that are collected and/or generated by statistics in process of statistical observations or statistical data processing |
| Structural metadata | Describe the structure of computer systems such as tables, columns and indexes. (Bretherton & Singley) (Technical) Defines the objects and processes from a technical perspective [...] like tables, fields, data types, indexes [...] (Kimball) | | Indicate how compound objects are put together, e.g., how pages are ordered to form chapters | Act as identifiers and descriptors of the data. They are used to identify, use, and process data matrixes and data cubes, e.g. names of columns or dimensions of statistical cubes. | |
| Reference metadata | (Guide) Help humans find specific items. (Bretherton & Singley) (Business) Describes the contents [...] in user accessible terms [...] what data you have, where it comes from, what it means, [...] (Kimball) | | (Descriptive) Describe a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords | Describe the contents and the quality of the statistical data. Should include conceptual, methodological and quality metadata | |
| Administrative metadata | | | Provide information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets [...]: rights management metadata, which deals with intellectual property rights, and preservation metadata, which contains information needed to archive and preserve a resource. | | |
| Process metadata | Describes the results of various operations [...] start time, end time, CPU seconds used [...] (Kimball) | | | | |
| Metadata item | | Instance of a metadata object | | An instance of a metadata object. It has associated attributes. It can have a distinct status: mandatory, conditional and optional. | A group of characters describing the data and treated as metadata unit |
| Metadata usage | Data virtualization, statistics and census services, data warehousing | | Discovery and organisation of electronic resources, interoperability, integration, identification, archiving. | | |
| Algorithmic metadata | | | | | Include the algorithms as such behind statistical procedures, including procedures for statistical analysis; descriptions of the algorithms |
| Metadata layer | [data warehouse] The data dictionary – This is usually more detailed than an operational system data dictionary. | | | The layer in the reference model for standardization in statistics used to denote the set of attributes related to statistical metainformation | |
| Metadata registry | A central location in an organization where metadata definitions are stored and maintained in a controlled method. Metadata registries are used whenever data must be used consistently within an organization or group of organizations. | Information system for registering metadata (MDR) | Provides information on the definition, origin, source, and location of data [...] at many levels, including schemes, usage profiles, metadata elements, and code lists for element values. It provides an integrating resource for legacy data, acts as a lookup tool for designers of new databases, and documents each data element. | An information system for registering metadata. Registration accomplishes three main goals: identification, provenance, and monitoring quality. [...] It manages the semantics of data. | |
| Metadata repository | A data dictionary [...] a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." | | | A logically central statistical metadata repository that allows for the query, editing, and managing of metadata. Such a system provides a mechanism for looking up information about statistical products as well as their design, development, and analysis. | (Metadata holding) A logical or physical set of metadata (e.g. database) stored together with its description (e.g. schema) |