Unit - 1 Introduction To Data Warehousing
Data Warehouse :
• A data warehouse is constructed by integrating data from
multiple heterogeneous sources.
• A data warehouse is a database, which is kept separate from the
organization's operational database.
• A data warehouse is not updated frequently.
• It possesses consolidated historical data, which helps the
organization to analyze its business.
• A data warehouse helps executives to organize, understand, and
use their data to make strategic decisions.
Data Warehousing :
• Data warehousing is the process of constructing and using a
data warehouse.
• Data warehousing involves data cleaning, data integration, and
data consolidation.
Features and Characteristics of a Data Warehouse :
• Subject oriented
• Integrated
• Time variant
• Nonvolatile
Subject Oriented −
• A data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations.
• These subjects can be product, customers, suppliers, sales, revenue,
etc.
• A data warehouse does not focus on the ongoing operations, rather it
focuses on modelling and analysis of data for decision making.
Integrated −
• A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc.
• This integration enhances the effective analysis of data.
Time Variant −
• The data collected in a data warehouse is identified with a
particular time period.
• The data in a data warehouse provides information from the
historical point of view.
Non-volatile −
• Non-volatile means the previous data is not erased when new
data is added to it.
• A data warehouse is kept separate from the operational database,
so frequent changes in the operational database are not
reflected in the data warehouse.
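The time-variant and non-volatile properties can be illustrated with a minimal sketch. The names `warehouse_sales` and `load_snapshot` are hypothetical and a plain Python list stands in for a warehouse table; the point is only that each load appends rows stamped with a time period and never overwrites earlier rows.

```python
from datetime import date

# Hypothetical warehouse table: a list of rows, each stamped with its load date.
warehouse_sales = []

def load_snapshot(rows, load_date):
    """Append a batch of rows; existing rows are never updated or deleted."""
    for row in rows:
        warehouse_sales.append({**row, "load_date": load_date})

# Two loads of the same product with different prices.
load_snapshot([{"product": "pen", "price": 10}], date(2023, 1, 1))
load_snapshot([{"product": "pen", "price": 12}], date(2023, 2, 1))

# Both versions are kept, so the price history can be analyzed.
history = [
    (row["load_date"].isoformat(), row["price"])
    for row in warehouse_sales
    if row["product"] == "pen"
]
print(history)
```

An operational database would typically keep only the latest price; here the older row survives the second load.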
Operational Database :
• The Operational Database is the source of information for the
data warehouse. It includes detailed information used to run the
day to day operations of the business.
• The data changes frequently as updates are made and reflects the
current value of the last transaction.
• Operational database management systems, also called OLTP
(Online Transaction Processing) databases, are used to manage
dynamic data in real time.
Difference between Data Warehouse and Operational Database :
Stage Area –
Since the data extracted from the external sources does not follow a
particular format, it must be validated before it is loaded into the
data warehouse. For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data sources.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after it has been
transformed into the standard format.
Data Warehouse –
After cleansing, the data is stored in the data warehouse, which
acts as the central repository. It holds the integrated data together
with its metadata, and the data marts are populated from it.
In this top-down approach, the data warehouse stores the data in its
purest (most detailed) form.
Data Marts –
A data mart is also part of the storage component. It stores the
information of a particular function of an organisation that is
handled by a single authority.
An organisation can have as many data marts as it has functions.
We can also say that a data mart contains a subset of the data
stored in the data warehouse.
Data Mining –
Data mining is the practice of analyzing the large volumes of data
held in the data warehouse. Data mining algorithms are used to find
hidden patterns in the database or data warehouse.
The top-down approach is defined by Inmon: the data warehouse is the
central repository for the complete organisation, and data marts are
created from it after the complete data warehouse has been built.
Advantages of Top-Down Approach –
• Since the data marts are created from the data warehouse, this
approach provides a consistent dimensional view across data marts.
• This model is also considered the most robust against business
changes, which is why big organisations prefer to follow this
approach.
• Creating a data mart from the data warehouse is easy.
Top tier:
• The top tier is the client layer.
• This tier holds the tools used for high-level data analysis,
querying, reporting, and data mining.
Data Warehouse Models :
From the perspective of data warehouse architecture, we
have the following data warehouse models −
• Virtual Warehouse
• Data mart
• Enterprise Warehouse
Virtual Warehouse :
• A set of views over operational databases is known as
a virtual warehouse.
• It is easy to build a virtual warehouse, but doing so
requires excess capacity on the operational database
servers.
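A virtual warehouse can be sketched with an ordinary SQL view. The example below is illustrative only: it uses Python's built-in `sqlite3` module, and the `orders` table and `sales_by_customer` view are hypothetical names. The view stores no data of its own; every query against it runs on the operational table, which is why spare capacity is needed there.

```python
import sqlite3

# In-memory database standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 30.0)],
)

# The "virtual warehouse" is only a view; no data is copied.
conn.execute(
    """
    CREATE VIEW sales_by_customer AS
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, total FROM sales_by_customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 150.0), ('bob', 80.0)]
```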
Data Mart :
o Data mart contains a subset of organization-wide data.
o This subset of data is valuable to specific groups of an organization.
o In other words, we can claim that data marts contain data specific to
a particular group. For example, the marketing data mart may
contain data related to items, customers, and sales. Data marts are
confined to subjects.
o Points to remember about data marts −
• Windows-based or Unix/Linux-based servers are used to
implement data marts. They are implemented on low-cost
servers.
• The implementation cycle of a data mart is measured in short
periods of time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may become complex in the long
run if its planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data
warehouse.
• Data marts are flexible.
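The idea that a data mart holds a department-specific subset of the warehouse data can be sketched as follows. This is a toy example using `sqlite3`; the `warehouse_facts` table, its columns, and the `marketing_mart` table are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical organization-wide warehouse table.
conn.execute(
    "CREATE TABLE warehouse_facts "
    "(department TEXT, item TEXT, customer TEXT, sales REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?, ?)",
    [
        ("marketing", "pen", "alice", 20.0),
        ("marketing", "book", "bob", 35.0),
        ("finance", "audit", "carol", 500.0),
    ],
)

# The marketing data mart keeps only the rows and columns marketing needs.
conn.execute(
    """
    CREATE TABLE marketing_mart AS
    SELECT item, customer, sales
    FROM warehouse_facts
    WHERE department = 'marketing'
    """
)

mart_rows = conn.execute(
    "SELECT item, customer, sales FROM marketing_mart ORDER BY item"
).fetchall()
print(mart_rows)  # [('book', 'bob', 35.0), ('pen', 'alice', 20.0)]
```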
Enterprise Warehouse :
• An enterprise warehouse collects all the information about the
subjects spanning an entire organization.
• It provides enterprise-wide data integration.
• The data is integrated from operational systems and external
information providers.
• Its size can vary from a few gigabytes to hundreds of
gigabytes, terabytes or beyond.
ETL Process in Data Warehouse :
• ETL stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from
various data source systems, transforms it in the staging area
and then finally, loads it into the Data Warehouse system.
1. Extraction:
• Data is extracted from various source systems into the staging
area. The sources can be in different formats, such as relational
databases, NoSQL stores, XML and flat files.
• The data should be extracted into the staging area first and not
directly into the data warehouse, because the extracted data is in
different formats and may also be corrupted. Loading it directly
could damage the data warehouse, and rollback would be much more
difficult. This makes extraction one of the most important steps of
the ETL process.
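A minimal sketch of extraction into a staging area is given below. It assumes two hypothetical sources, a CSV flat file and a JSON export, both inlined as strings so the example is self-contained. Records that fail a basic check are set aside instead of being passed on, which is the reason for staging data before it reaches the warehouse.

```python
import csv
import io
import json

# Hypothetical source extracts: a flat file (CSV) and a JSON export.
csv_source = "id,name,country\n1,Asha,U.S.A\n2,Ben,India\n"
json_source = (
    '[{"id": 3, "name": "Chen", "country": "America"},'
    ' {"id": null, "name": "???"}]'
)

def extract_to_staging():
    """Read both sources into one staging list; reject unusable records."""
    staging, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_source)):
        staging.append({**row, "source": "flat_file"})
    for record in json.loads(json_source):
        if record.get("id") is None:  # corrupted record: no identifier
            rejected.append(record)
            continue
        staging.append({**record, "source": "json_export"})
    return staging, rejected

staging, rejected = extract_to_staging()
print(len(staging), "staged,", len(rejected), "rejected")
```

The staged rows are still in their raw source form (for example, the CSV ids are strings); bringing them to one standard format is left to the transformation step.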
2. Transformation:
• In this step, a set of rules or functions is applied to the extracted
data to convert it into a single standard format. It may involve the
following tasks:
o Filtering – loading only certain attributes into the data
warehouse.
o Cleaning – filling NULL values with default values, mapping
U.S.A, United States and America to USA, etc.
o Joining – joining multiple attributes into one.
o Splitting – splitting a single attribute into multiple attributes.
o Sorting – sorting tuples on the basis of some attribute (generally
the key attribute).
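The five transformation tasks can be sketched in a few lines of Python. The record layout (`full_name`, `city`, `country`, `fax`) and the default value `UNKNOWN` are assumptions chosen for illustration; a real ETL tool would take such rules from its configuration.

```python
# Cleaning rule from the text: several spellings map to one standard value.
COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(rows):
    """Apply cleaning, splitting, joining, filtering and sorting to staged rows."""
    out = []
    for row in rows:
        # Cleaning: replace missing values with a default, standardize names.
        country = row.get("country") or "UNKNOWN"
        country = COUNTRY_ALIASES.get(country, country)
        # Splitting: one attribute becomes two.
        first_name, _, last_name = row["full_name"].partition(" ")
        # Joining: two attributes become one.
        address = f"{row['city']}, {country}"
        # Filtering: only the attributes the warehouse needs are kept
        # (the "fax" attribute is dropped).
        out.append(
            {
                "id": row["id"],
                "first_name": first_name,
                "last_name": last_name,
                "address": address,
            }
        )
    # Sorting: order tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])

staged = [
    {"id": 2, "full_name": "Ben Roy", "city": "Pune", "country": None, "fax": "n/a"},
    {"id": 1, "full_name": "Asha Rao", "city": "Austin", "country": "U.S.A", "fax": "n/a"},
]
clean = transform(staged)
print(clean)
```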
3. Loading:
• In this step, the transformed data is finally loaded into the
data warehouse.
• Some systems load data into the data warehouse very frequently,
while others load it at longer but regular intervals.
• The rate and period of loading depend solely on the requirements
and vary from system to system.
• The ETL process can also use pipelining, i.e. as soon as some data
is extracted, it can be transformed, and during that period new
data can be extracted.
• Similarly, while the transformed data is being loaded into the data
warehouse, the already extracted data can be transformed.
[Figure: block diagram of the pipelining of the ETL process]
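In code, the overlap between the stages can be sketched with Python generators. This is a simplified, single-threaded illustration rather than a parallel pipeline: it only shows that each record flows through extract, transform and load without waiting for the whole batch. The `events` log and the function names are hypothetical.

```python
events = []  # records the order in which the stages run

def extract(source):
    for record in source:
        events.append(("extract", record))
        yield record

def transform(records):
    for record in records:
        result = record.upper()
        events.append(("transform", result))
        yield result

def load(records, warehouse):
    for record in records:
        events.append(("load", record))
        warehouse.append(record)

warehouse = []
load(transform(extract(["a", "b"])), warehouse)

# The first record is loaded before the second one is even extracted.
for event in events:
    print(event)
```

A production ETL tool would typically run the stages concurrently; generators are used here only because they make the record-by-record hand-off easy to see.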
ETL Tools:
The most commonly used ETL tools are :
• Sybase
• Oracle Warehouse Builder
• CloverETL
• MarkLogic
What is Metadata?
• Metadata is simply defined as data about data.
• The data that is used to represent other data is known as metadata.
• For example, the index of a book serves as metadata for the contents
of the book.
• In other words, we can say that metadata is the summarized data that
leads us to detailed data.
• In the context of a data warehouse, we can define metadata as follows :
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
Categories of Metadata :
Metadata falls into three broad categories −
• Business Metadata − It has the data ownership information,
business definition, and changing policies.
• Technical Metadata − It includes database system names,
table and column names and sizes, data types and allowed
values. Technical metadata also includes structural
information such as primary and foreign key attributes and
indices.
• Operational Metadata − It includes currency of data and
data lineage. Currency of data means whether the data is
active, archived, or purged. Lineage of data means the
history of how the data was migrated and which
transformations were applied to it.
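The three categories can be pictured as fields of a single metadata record. The sketch below uses Python dataclasses; the class names, field names and sample values are assumptions for illustration and do not correspond to any particular metadata tool.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    name: str
    data_type: str
    is_primary_key: bool = False

@dataclass
class TableMetadata:
    # Business metadata
    owner: str
    business_definition: str
    # Technical metadata
    table_name: str
    columns: list = field(default_factory=list)
    # Operational metadata
    currency: str = "active"  # active, archived, or purged
    lineage: list = field(default_factory=list)  # migration/transformation history

def primary_keys(meta):
    """Return the names of the primary key columns of a table."""
    return [c.name for c in meta.columns if c.is_primary_key]

sales_meta = TableMetadata(
    owner="Sales department",
    business_definition="One row per completed sale",
    table_name="sales_fact",
    columns=[
        ColumnMetadata("sale_id", "INTEGER", is_primary_key=True),
        ColumnMetadata("amount", "REAL"),
    ],
    lineage=["extracted from orders", "country names standardized"],
)
print(primary_keys(sales_meta), sales_meta.currency)
```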
Role of Metadata :
• Metadata has a very important role in a data warehouse.
• Its role is different from that of the warehouse data itself.
The various roles of metadata are explained below :
• Metadata acts as a directory.
• This directory helps the decision support system to locate
the contents of the data warehouse.
• Metadata helps the decision support system map data as it
is transformed from the operational environment to the
data warehouse environment.
• Metadata helps in summarization between current
detailed data and highly summarized data.
• Metadata also helps in summarization between lightly
summarized data and highly summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
Metadata Repository :
A metadata repository is an integral part of a data warehouse system.
It contains the following metadata −
Definition of data warehouse − It includes the description of structure
of data warehouse. The description is defined by schema, view,
hierarchies, derived data definitions, and data mart locations and
contents.
Business metadata − It contains the data ownership information,
business definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of how the data was migrated and
which transformations were applied to it.
Data for mapping from operational environment to data
warehouse − It includes the source databases and their contents,
data extraction, data partitioning, cleaning and transformation
rules, and data refresh and purging rules.
Algorithms for summarization − It includes dimension algorithms
and data on granularity, aggregation, summarizing, etc.
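As an illustration of summarization at different levels of granularity, the sketch below rolls daily figures up to months and years. The sample data and the `summarize` helper are hypothetical; in a real warehouse, the repository would record which aggregation rule produced each summary table.

```python
from collections import defaultdict

# Hypothetical detailed data: (ISO date, sales amount).
daily_sales = [("2023-01-05", 100), ("2023-01-20", 50), ("2023-02-03", 70)]

def summarize(rows, granularity):
    """Sum amounts per day, month or year, based on the ISO date prefix."""
    prefix_len = {"day": 10, "month": 7, "year": 4}[granularity]
    totals = defaultdict(int)
    for day, amount in rows:
        totals[day[:prefix_len]] += amount
    return dict(totals)

monthly = summarize(daily_sales, "month")  # lightly summarized
yearly = summarize(daily_sales, "year")    # highly summarized
print(monthly)  # {'2023-01': 150, '2023-02': 70}
print(yearly)   # {'2023': 220}
```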
Challenges for Metadata Management :
• Metadata helps in driving the accuracy of reports, validates data
transformation, and ensures the accuracy of calculations.
• Metadata also enforces the definition of business terms for business
end users.
• Alongside these uses, metadata also brings challenges :
• Metadata in a big organization is scattered across the
organization. This metadata is spread in spreadsheets,
databases, and applications.
• Metadata could be present in text files or multimedia files. To
use this data for information management solutions, it has to be
correctly defined.
• There are no industry-wide accepted standards. Data
management solution vendors have a narrow focus.
• There are no easy and accepted methods of passing metadata.