Unit - 1 Introduction To Data Warehousing


Unit – 1

Introduction to Data
Warehousing
Data Warehouse :
• A data warehouse is constructed by integrating data from
multiple heterogeneous sources.
• A data warehouse is a database, which is kept separate from the
organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the
organization to analyze its business.
• A data warehouse helps executives to organize, understand, and
use their data to take strategic decisions.
Data Warehousing :
• Data warehousing is the process of constructing and using a
data warehouse.
• Data warehousing involves data cleaning, data integration, and
data consolidation.
Features and Characteristics of a Data Warehouse :
• Subject oriented
• Integrated
• Time variant
• Nonvolatile
Subject Oriented −
• A data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations.
• These subjects can be product, customers, suppliers, sales, revenue,
etc.
• A data warehouse does not focus on ongoing operations; instead, it
focuses on the modelling and analysis of data for decision making.

Integrated −
• A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc.
• This integration enhances the effective analysis of data.
Time Variant −
• The data collected in a data warehouse is identified with a
particular time period.
• The data in a data warehouse provides information from the
historical point of view.

Non-volatile −
• Non-volatile means the previous data is not erased when new
data is added to it.
• A data warehouse is kept separate from the operational database,
and therefore frequent changes in the operational database are not
reflected in the data warehouse.
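The time-variant and non-volatile properties can be illustrated with a
minimal sketch. The names used here (operational, warehouse,
load_snapshot) are hypothetical and chosen only for illustration; a real
warehouse would use database tables rather than Python objects.

```python
from datetime import date

# Operational store: one value per account, overwritten in place.
operational = {"acct-1": 100}

# Warehouse store: an append-only list of time-stamped snapshots.
warehouse = []

def load_snapshot(as_of):
    """Append the current operational values to the warehouse, tagged with a date."""
    for account, balance in operational.items():
        warehouse.append({"account": account, "balance": balance, "as_of": as_of})

load_snapshot(date(2024, 1, 31))
operational["acct-1"] = 250          # the operational value is overwritten
load_snapshot(date(2024, 2, 29))     # the warehouse only gains a new row

# The warehouse keeps both periods; the operational store keeps only the latest.
history = [(row["as_of"].isoformat(), row["balance"]) for row in warehouse]
print(history)
```

Earlier rows are never modified or erased (non-volatile), and every row
carries the period it describes (time variant).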
Operational Database :
• The Operational Database is the source of information for the
data warehouse. It includes detailed information used to run the
day to day operations of the business.
• The data changes frequently as updates are made and reflects the
current value of the last transaction.
• Operational database management systems, also called OLTP
(Online Transaction Processing) databases, are used to manage
dynamic data in real time.
Difference between Data Warehouse and Operational Database :

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides summarized and multidimensional view of data. | It provides detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.
Why a Data Warehouse is Separated from
Operational Databases :
A data warehouse is kept separate from operational databases for the
following reasons −
• An operational database is constructed for well-known tasks and
workloads such as searching particular records, indexing, etc. In
contrast, data warehouse queries are often complex and
present a general form of data.
• Operational databases support concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms are
required for operational databases to ensure robustness and
consistency of the database.
• An operational database query performs both read and modify
operations, while an OLAP query needs only read-only access
to stored data.
• An operational database maintains current data. On the
other hand, a data warehouse maintains historical data.
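The difference between the two query styles can be sketched with
Python's built-in sqlite3 module. The orders table and its columns are
assumptions made for this example, and a single in-memory database
stands in for both systems only to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
    [("north", 2022, 100.0), ("north", 2023, 150.0),
     ("south", 2022, 80.0), ("south", 2023, 120.0)],
)

# OLTP-style work: read and modify one particular record, located by its key.
conn.execute("UPDATE orders SET amount = 110.0 WHERE id = 1")
one_order = conn.execute("SELECT amount FROM orders WHERE id = 1").fetchone()

# OLAP-style work: a read-only query that scans and aggregates many records.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(one_order, totals)
```

The first query touches a single row and changes it; the second reads
every row and changes nothing, which is why the two workloads are
usually served by separate systems.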
Why is data warehousing important?
Data warehousing is an increasingly important business intelligence
tool, allowing organizations to:
• Ensure consistency :
o Data warehouses are programmed to apply a uniform
format to all collected data, which makes it easier for
corporate decision-makers to analyze and share data
with their colleagues.
o Standardizing data from different sources also reduces
the risk of error in interpretation and improves overall
accuracy.
• Make better business decisions :
o Successful business leaders develop data-driven
strategies and rarely make decisions without consulting
the facts.
o Data warehousing improves the speed and efficiency of
accessing different data sets and makes it easier for
corporate decision-makers to guide the business and
marketing strategies that set them apart from their
competitors.
• Improve their bottom line :
o Data warehouse platforms allow business leaders to
quickly access their organization's historical activities
and evaluate initiatives that have been successful —
or unsuccessful — in the past.
o This allows executives to see where they can adjust
their strategy to decrease costs, maximize efficiency
and increase sales to improve their bottom line.
Need of Data Warehousing :
• Better business intelligence for end users
• Reduction in time to locate, access, and analyze the
information
• Strategic advantage over operational database
• Faster time-to-market for products and services
• Replacement of older, less-responsive decision support systems
• Hold historical data
• Improved query performance
• Storing data from multiple sources in a particular format
Data Warehouse Applications :
• Banking Industry
• Finance Industry
• Insurance
• Consumer Goods Industry
• Government and Education
• Healthcare
• Hospitality Industry
• Manufacturing and Distribution Industry
• The Retailers
• Services Sector
• Telephone Industry
• Transportation Industry
Define Data Warehouse Architecture :
• Data warehouse architecture is the design of an organization's
data storage framework. A data warehouse architecture takes
information from raw sets of data and stores it in a structured
and easily digestible format.
• There are 2 approaches for constructing data-warehouse:
o Top-down approach
o Bottom-up approach
Top-down approach:
External Sources –
An external source is any source from which data is collected,
irrespective of the type of data. The data can be structured,
semi-structured, or unstructured.

Stage Area –
Since the data extracted from the external sources does not follow a
particular format, it must be validated before it is loaded into the
data warehouse. For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after it has been
transformed into the standard format.
Data-warehouse –
After cleansing, the data is stored in the data warehouse, which
acts as the central repository. It actually stores the metadata, and
the actual data gets stored in the data marts.
In this top-down approach, the data warehouse stores the data in its
purest form.
Data Marts –
A data mart is also a part of the storage component. It stores the
information of a particular function of an organisation that is
handled by a single authority.
An organisation can have as many data marts as it has functions.
We can also say that a data mart contains a subset of the data
stored in the data warehouse.
Data Mining –
Data mining is the practice of analyzing the large volume of data
present in the data warehouse. It is used to find the hidden patterns
present in the database or the data warehouse with the help of data
mining algorithms.
This approach is defined by Inmon as – the data warehouse is a central
repository for the complete organisation, and data marts are created
from it after the complete data warehouse has been created.
Advantages of Top-Down Approach –
• Since the data marts are created from the data warehouse, they
provide a consistent dimensional view of the data.
• This model is also considered the strongest model for handling
business changes. That is why big organisations prefer to follow
this approach.
• Creating a data mart from the data warehouse is easy.

Disadvantages of Top-Down Approach –


• The cost and time taken in designing and maintaining it are very
high.
Bottom-up approach:
• First, the data is extracted from external sources (the same as
in the top-down approach).
• Then, the data goes through the staging area (as explained above)
and is loaded into data marts instead of the data warehouse. The
data marts are created first and provide reporting capability. Each
data mart addresses a single business area.
• These data marts are then integrated into the data warehouse.
• This approach is given by Kimball as – data marts are created
first and provide a thin view for analysis, and the data warehouse
is created after the complete data marts have been created.
Advantages of Bottom-Up Approach –
• As the data marts are created first, reports are generated
quickly.
• We can accommodate more data marts here, and in this way the
data warehouse can be extended.
• Also, the cost and time taken in designing this model are
comparatively low.

Disadvantage of Bottom-Up Approach –


• This model is not as strong as the top-down approach, because the
dimensional view of the data marts is not as consistent as it is there.
Data Warehouse Architecture :
• Data Warehouse architecture is based on a Relational database
management system server that functions as the central repository for
informational data.
• In the data warehouse architecture, operational data and processing are
separate from data warehouse processing.
• Data warehouse architecture can vary depending upon the specifications
of any organization.
• The basic data warehouse is fed from one or more source systems and
end users directly access the data warehouse.
• A data warehouse architecture is a method of defining the overall
architecture of data communication, processing, and presentation that
exists for end-client computing within the enterprise.
• Each data warehouse is different, but all are characterized by standard
vital components.
Fig. Architecture of Data Warehouse
Sources :
• Operational System
An operational system is a term used in data warehousing
to refer to a system that processes the day-to-day
transactions of an organization.
• Flat Files
A Flat file system is a system of files in which transactional
data is stored, and every file in the system must have a
different name.
Staging Area :
• We must clean and process our operational information
before putting it into the warehouse.
• We can do this programmatically, although most data warehouses
use a staging area (a place where data is processed before
entering the warehouse).
• A staging area simplifies data cleansing and consolidation for
operational data coming from multiple source systems,
especially for enterprise data warehouses where all relevant
data of an enterprise is consolidated.
• The data warehouse staging area is a temporary location to which
records from source systems are copied.
Data Warehouse :
Meta Data :
• A set of data that defines and gives information about
other data.
• Metadata is used in a data warehouse for a variety of
purposes, including:
Metadata summarizes necessary information
about data, which can make finding and working with
particular instances of data easier.
For example, author, date created, date modified,
and file size are examples of very basic document
metadata.
• Metadata is used to direct a query to the most
appropriate data source.
Lightly and highly summarized data :
• This area of the data warehouse stores all the predefined
lightly and highly summarized (aggregated) data
generated by the warehouse manager.
• The goals of the summarized information are to speed up
query performance.
• The summarized record is updated continuously as new
information is loaded into the warehouse.
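As a rough illustration of the two summary levels, the sketch below
aggregates the same detail rows at a fine grain (month and region) and
at a coarse grain (region only). The sample rows and the summarize
helper are hypothetical; a warehouse manager would normally maintain
such summaries as tables.

```python
from collections import defaultdict

# Detail rows: (sale date, region, amount).
detail = [
    ("2024-01-05", "north", 100),
    ("2024-01-20", "north", 50),
    ("2024-02-03", "north", 70),
    ("2024-01-11", "south", 30),
]

def summarize(rows, key):
    """Total the amount column for each group produced by the key function."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Lightly summarized: totals per month and region.
light = summarize(detail, lambda r: (r[0][:7], r[1]))

# Highly summarized: totals per region over the whole period.
high = summarize(detail, lambda r: r[1])

print(light)
print(high)
```

A query that only needs regional totals can read the small, highly
summarized result instead of scanning every detail row, which is how
summaries speed up query performance.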
Data Marts :
• We may want to customize our warehouse's architecture for
multiple groups within our organization. We can do this by
adding data marts.
• A data mart is a segment of a data warehouse that can provide
information for reporting and analysis on a section, unit,
department or operation in the company.
• e.g., sales, payroll, production, etc.
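A data mart can be pictured as a department-specific slice of the
warehouse. The following sketch assumes a hypothetical list of
warehouse rows tagged with a department name; the build_data_mart
helper is illustrative only.

```python
# Warehouse rows covering several departments.
warehouse_rows = [
    {"department": "sales", "metric": "orders", "value": 120},
    {"department": "payroll", "metric": "salaries_paid", "value": 45},
    {"department": "sales", "metric": "returns", "value": 7},
]

def build_data_mart(rows, department):
    """Return only the warehouse rows that belong to one department."""
    return [row for row in rows if row["department"] == department]

sales_mart = build_data_mart(warehouse_rows, "sales")
print(sales_mart)
```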
End-User access Tools :
• The principal purpose of a data warehouse is to provide information to the
business managers for strategic decision-making.
• These users interact with the warehouse using end-user access tools.
• The examples of some of the end-user access tools can be:
• Reporting and Query Tools
• Application Development Tools
• Executive Information Systems Tools
• Online Analytical Processing Tools
• Data Mining Tools
Three-Tier Architecture :
• Three-tier architecture is the most widely used architecture.
• It has three layers: the top layer, the middle layer, and the bottom layer.
Bottom tier:
• The bottom tier of the architecture is the data warehouse
database server. It is the relational database system.
• We use the back end tools and utilities to feed data into the
bottom tier.
• These back end tools and utilities perform the Extract, Clean,
Load, and refresh functions.
Middle tier :
• The middle tier in a data warehouse is an OLAP server, which is
implemented using either the ROLAP or the MOLAP model.
• OLAP is a software category that allows users to analyze
information from multiple database systems at the same time.
• OLAP - online analytical processing
• ROLAP - relational online analytical processing
• MOLAP - Multidimensional OLAP

Top tier:
• The top tier is the client layer.
• This tier holds the tools used for high-level data analysis,
querying, reporting, and data mining.
Data Warehouse Models :
From the perspective of data warehouse architecture, we
have the following data warehouse models −
• Virtual Warehouse
• Data mart
• Enterprise Warehouse

Virtual Warehouse :
• A set of views over operational databases is known as
a virtual warehouse.
• It is easy to build a virtual warehouse. Building a virtual
warehouse requires excess capacity on operational
database servers.
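A minimal sketch of the idea, using Python's built-in sqlite3 module:
the view below is defined over operational tables and copies no data.
The table, column, and view names are assumptions made for this
example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 25.0), (3, 2, 40.0);

-- The virtual warehouse: an analytical view over the operational tables.
CREATE VIEW sales_by_customer AS
SELECT c.name AS customer, SUM(o.amount) AS total
FROM customers c JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
""")

query = "SELECT customer, total FROM sales_by_customer ORDER BY customer"
before = conn.execute(query).fetchall()

# A new operational transaction is visible through the view immediately,
# because the view is recomputed from the operational tables on each query.
conn.execute("INSERT INTO orders VALUES (4, 2, 10.0)")
after = conn.execute(query).fetchall()

print(before, after)
```

Because every analytical query runs against the operational tables
themselves, the operational servers need spare capacity to serve it.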
Data Mart :
o Data mart contains a subset of organization-wide data.
o This subset of data is valuable to specific groups of an organization.
o In other words, we can claim that data marts contain data specific to
a particular group. For example, the marketing data mart may
contain data related to items, customers, and sales. Data marts are
confined to subjects.
o Points to remember about data marts −
• Windows-based or Unix/Linux-based servers are used to
implement data marts. They are implemented on low-cost
servers.
• The implementation cycle of a data mart is measured in short
periods of time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may be complex in long run, if its
planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is departmentally structured data
warehouse.
• Data marts are flexible.

Enterprise Warehouse :
• An enterprise warehouse collects all the information and the
subjects spanning an entire organization.
• It provides enterprise-wide data integration.
• The data is integrated from operational systems and external
information providers.
• This information can vary from a few gigabytes to hundreds of
gigabytes, terabytes or beyond.
ETL Process in Data Warehouse :
• ETL stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from
various data source systems, transforms it in the staging area
and then finally, loads it into the Data Warehouse system.
1. Extraction:
• Data is extracted from various source systems, which can be in
various formats such as relational databases, NoSQL, XML, and flat
files, into the staging area.
• It is important to extract the data from the various source systems
and store it in the staging area first, not directly in the data
warehouse, because the extracted data is in various formats and may
also be corrupted. Loading it directly into the data warehouse may
therefore damage the warehouse, and rollback will be much more
difficult. This makes extraction one of the most important steps of
the ETL process.
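The reason for staging first can be sketched as follows. The row layout,
the validation rule in is_valid, and the promote helper are all
assumptions for illustration, not part of any particular ETL tool.

```python
staging_area = []   # temporary holding place for extracted rows
warehouse = []      # rows that have passed validation

def extract_to_staging(source_rows):
    """Copy raw rows from a source system into the staging area."""
    staging_area.extend(source_rows)

def is_valid(row):
    """A deliberately simple check: id must be an int, amount must be numeric."""
    return isinstance(row.get("id"), int) and isinstance(row.get("amount"), (int, float))

def promote():
    """Move valid rows into the warehouse, return the rejected rows, empty staging."""
    rejected = [row for row in staging_area if not is_valid(row)]
    warehouse.extend(row for row in staging_area if is_valid(row))
    staging_area.clear()
    return rejected

extract_to_staging([
    {"id": 1, "amount": 9.5},
    {"id": None, "amount": "??"},   # a corrupted source row
    {"id": 2, "amount": 4},
])
rejected = promote()
print(warehouse, rejected)
```

The corrupted row is caught in staging and never reaches the warehouse,
so nothing has to be rolled back there.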
2. Transformation:
• In this step, a set of rules or functions is applied to the extracted
data to convert it into a single standard format. It may involve the
following processes/tasks:
o Filtering – loading only certain attributes into the data
warehouse.
o Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States and America into USA, etc.
o Joining – joining multiple attributes into one.
o Splitting – splitting a single attribute into multiple attributes.
o Sorting – sorting tuples on the basis of some attribute (generally
the key attribute).
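The five transformation tasks can be sketched on a few sample rows. The
field names, the COUNTRY_ALIASES mapping, and the "UNKNOWN" default are
hypothetical choices made for this example.

```python
raw_rows = [
    {"id": 3, "name": "Ada Lovelace", "country": "U.S.A", "city": "Austin", "state": "TX", "note": "x"},
    {"id": 1, "name": "Alan Turing", "country": "United States", "city": None, "state": "NY", "note": "y"},
    {"id": 2, "name": "Grace Hopper", "country": "America", "city": "Boston", "state": "MA", "note": "z"},
]

# Cleaning rule: map the different spellings to one standard value.
COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(rows):
    out = []
    for row in rows:
        # Filtering: keep only the attributes the warehouse needs ("note" is dropped).
        kept = {k: row[k] for k in ("id", "name", "country", "city", "state")}
        # Cleaning: replace NULL values with a default and standardize the country.
        city = kept["city"] if kept["city"] is not None else "UNKNOWN"
        country = COUNTRY_ALIASES.get(kept["country"], kept["country"])
        # Splitting: one "name" attribute becomes first and last name.
        first_name, last_name = kept["name"].split(" ", 1)
        # Joining: city and state are combined into one "location" attribute.
        location = f"{city}, {kept['state']}"
        out.append({"id": kept["id"], "first_name": first_name,
                    "last_name": last_name, "country": country, "location": location})
    # Sorting: order the tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])

clean_rows = transform(raw_rows)
for row in clean_rows:
    print(row)
```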
3. Loading:
• In this step, the transformed data is finally loaded into the
data warehouse.
• Sometimes the data is updated by loading into the data
warehouse very frequently and sometimes it is done after
longer but regular intervals.
• The rate and period of loading solely depends on the
requirements and varies from system to system.
• The ETL process can also use the pipelining concept, i.e. as soon as
some data is extracted, it can be transformed, and during that period
some new data can be extracted.
• While the transformed data is being loaded into the data
warehouse, the already extracted data can be transformed.
Fig. Pipelining of the ETL process
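One way to sketch the pipelining idea is with Python generators, where
each stage hands a record to the next stage as soon as it is ready. The
stage functions and the events log are hypothetical, and generators only
interleave the stages on a single thread; a real ETL tool may run the
stages truly in parallel.

```python
events = []  # records the order in which the stages handle each record

def extract(source):
    for record in source:
        events.append(("extract", record))
        yield record

def transform(records):
    for record in records:
        events.append(("transform", record))
        yield record.strip().upper()

def load(records, target):
    for record in records:
        events.append(("load", record))
        target.append(record)

warehouse = []
load(transform(extract([" a ", " b "])), warehouse)

for event in events:
    print(event)
```

The log shows the first record being transformed and loaded before the
second record is extracted, rather than each stage waiting for the
previous stage to finish the whole data set.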
ETL Tools:
The most commonly used ETL tools are :
• Sybase
• Oracle Warehouse Builder
• CloverETL
• MarkLogic
What is Metadata?
• Metadata is simply defined as data about data.
• The data that is used to represent other data is known as metadata.
• For example, the index of a book serves as metadata for the contents
of the book.
• In other words, we can say that metadata is the summarized data that
leads us to detailed data.
• In terms of data warehouse, we can define metadata as follows :
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
Categories of Metadata :
Metadata can be broadly categorized into three categories −
• Business Metadata − It has the data ownership information,
business definition, and changing policies.
• Technical Metadata − It includes database system names,
table and column names and sizes, data types and allowed
values. Technical metadata also includes structural
information such as primary and foreign key attributes and
indices.
• Operational Metadata − It includes currency of data and
data lineage. Currency of data means whether the data is
active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
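The three categories can be pictured as one catalog entry per warehouse
object. This is only a sketch: the class names, fields, and the
sales_fact entry are hypothetical and do not follow any metadata
standard.

```python
from dataclasses import dataclass, field

@dataclass
class BusinessMetadata:
    owner: str         # data ownership information
    definition: str    # business definition of the object

@dataclass
class TechnicalMetadata:
    table: str
    columns: dict      # column name -> data type
    primary_key: str

@dataclass
class OperationalMetadata:
    currency: str                                 # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)   # transformations applied so far

@dataclass
class MetadataEntry:
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata

catalog = {
    "sales_fact": MetadataEntry(
        BusinessMetadata(owner="Sales department", definition="One row per completed sale"),
        TechnicalMetadata(table="sales_fact",
                          columns={"sale_id": "INTEGER", "amount": "REAL"},
                          primary_key="sale_id"),
        OperationalMetadata(currency="active",
                            lineage=["extracted from orders", "amounts converted to USD"]),
    )
}

def active_tables(entries):
    """Use the catalog as a directory: list the objects whose data is still active."""
    return [name for name, entry in entries.items() if entry.operational.currency == "active"]

print(active_tables(catalog))
```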
Role of Metadata :
• Metadata has a very important role in a data warehouse.
• The role of metadata in a warehouse is different from the
warehouse data, yet it plays an important role.
The various roles of metadata are explained below :
• Metadata acts as a directory.
• This directory helps the decision support system to locate
the contents of the data warehouse.
• Metadata helps in decision support system for mapping of
data when data is transformed from operational
environment to data warehouse environment.
• Metadata helps in summarization between current
detailed data and highly summarized data.
• Metadata also helps in summarization between lightly
detailed data and highly summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
Metadata Repository :
Metadata repository is an integral part of a data warehouse system.
It has the following metadata −
Definition of data warehouse − It includes the description of structure
of data warehouse. The description is defined by schema, view,
hierarchies, derived data definitions, and data mart locations and
contents.
Business metadata − It contains the data ownership information,
business definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of data migrated and transformation
applied on it.
Data for mapping from operational environment to data
warehouse − It includes the source databases and their contents,
data extraction, data partition, cleaning, transformation rules,
data refresh and purging rules.
Algorithms for summarization − It includes dimension algorithms,
data on granularity, aggregation, summarizing, etc.
Challenges for Metadata Management :
• Metadata helps in driving the accuracy of reports, validates data
transformation, and ensures the accuracy of calculations.
• Metadata also enforces the definition of business terms to business
end-users.
• Despite all these uses, metadata also has its challenges:
• Metadata in a big organization is scattered across the
organization. This metadata is spread in spreadsheets,
databases, and applications.
• Metadata could be present in text files or multimedia files. To
use this data for information management solutions, it has to be
correctly defined.
• There are no industry-wide accepted standards. Data
management solution vendors have narrow focus.
• There are no easy and accepted methods of passing metadata.
