03 Data Warehouse

PRAVEEN KUMAR SRIVASTAVA


UNIT 2

Data Warehouse: Definition and Characteristics, Essential Components of a Data Warehouse, 3-Layered Architecture of a Data Warehouse, Implementation Issues Related to DW, H/W and S/W Requirements for a Data Warehouse, Enterprise Data Warehouse, Data Mart, C/S Computing Model and Data Warehouse, Data Warehouse Schema
Data Warehouse

A data warehouse is a centralized repository for storing and managing large amounts of data from
various sources for analysis and reporting. It is optimized for fast querying and analysis, enabling
organizations to make informed decisions by providing a single source of truth for data. Data
warehousing typically involves transforming and integrating data from multiple sources into a unified,
organized, and consistent format.
Characteristics of a Data Warehouse

Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data are stored to provide information from an historic perspective (e.g., the past 5–10
years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse does not
require transaction processing, recovery, and concurrency control mechanisms. It usually requires only
two operations in data accessing: initial loading of data and access of data.
Functions of a Data Warehouse

Data Consolidation: The process of combining multiple data sources into a single data repository in a data
warehouse. This ensures a consistent and accurate view of the data.
Data Cleaning: The process of identifying and removing errors, inconsistencies, and irrelevant data from the data
sources before they are integrated into the data warehouse. This helps ensure the data is accurate and trustworthy.
Data Integration: The process of combining data from multiple sources into a single, unified data repository in a
data warehouse. This involves transforming the data into a consistent format and resolving any conflicts or
discrepancies between the data sources. Data integration is an essential step in the data warehousing process to
ensure that the data is accurate and usable for analysis. Data from multiple sources can be integrated into a single
data repository for analysis.
Data Storage: A data warehouse can store large amounts of historical data and make it easily accessible for analysis.
Data Transformation: Data can be transformed and cleaned to remove inconsistencies, duplicate data, or irrelevant
information.
Data Analysis: Data can be analyzed and visualized in various ways to gain insights and make informed decisions.
Data Reporting: A data warehouse can provide various reports and dashboards for different departments and
stakeholders.
Data Mining: Data can be mined for patterns and trends to support decision-making and strategic planning.
Performance Optimization: Data warehouse systems are optimized for fast querying and analysis, providing quick
access to data.
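
The consolidation, cleaning, and transformation steps above can be illustrated with a short sketch; pandas is assumed to be available, and the source tables and column names are invented for the example.

```python
# A minimal sketch of data consolidation and cleaning using pandas
# (the source DataFrames and column names are hypothetical).
import pandas as pd

# Two heterogeneous sources describing the same customers
crm = pd.DataFrame({"cust_id": [1, 2, 2],
                    "name": ["Asha", "Ravi", "Ravi"],
                    "city": ["Delhi", "Pune", "Pune"]})
erp = pd.DataFrame({"customer": [3, 4],
                    "name": ["Meena", None],
                    "city": ["Agra", "Pune"]})

# Consolidation: bring both sources into one repository with a consistent schema
erp = erp.rename(columns={"customer": "cust_id"})
combined = pd.concat([crm, erp], ignore_index=True)

# Cleaning: remove duplicate rows and rows with missing key information
cleaned = combined.drop_duplicates().dropna(subset=["name"])
print(cleaned)
```
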
Data Warehousing: A Multitiered Architecture

Data warehouses often adopt a three-tier architecture.
Tier 1
The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g.,
customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to
update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
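
As a rough illustration of the gateway idea, the sketch below uses Python's built-in sqlite3 module (a DB-API driver) as a stand-in for an ODBC/JDBC-style connection; the table and data are hypothetical.

```python
# Sketch of a client program generating SQL to be executed at the warehouse
# server through a gateway-style API. sqlite3 stands in here for an
# ODBC/JDBC connection; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")   # in place of an ODBC/JDBC connection string
cur = conn.cursor()

cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 120.0), ("South", 80.0), ("North", 45.5)])

# The client sends SQL; the server executes it and returns the result set.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(cur.fetchall())
conn.close()
```
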
Tier 2 The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or
(2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements
multidimensional data and operations).

Tier 3 The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:

From the architecture point of view, there are three data warehouse models:
1. the enterprise warehouse,
2. the data mart, and
3. the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It typically
contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on
traditional mainframes, computer super servers, or parallel architecture platforms. It requires extensive
business modelling and may take years to design and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data contained in data marts tend to be
summarized. Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based. The implementation cycle of a data mart is more likely to be measured in
weeks rather than months or years. However, it may involve complex integration in the long run if its
design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent.
Independent data marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or
geographic area. Dependent data marts are sourced directly from enterprise data warehouses.

Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on operational database servers.
Extraction, Transformation, and Loading

Data warehouse systems use back-end tools and utilities to populate and refresh their Data. These
tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
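
A minimal sketch of these back-end functions, using only the Python standard library; the source records, field names, and warehouse table are assumptions for illustration, not a real tool.

```python
# Extract -> clean -> transform -> load -> refresh, in miniature.
import sqlite3

# Extract: gather records from heterogeneous sources (here, plain dicts)
source_a = [{"id": "1", "amount": "100.0"}, {"id": "2", "amount": "bad"}]
source_b = [{"id": "3", "amount": "250.5"}]

def clean(record):
    """Data cleaning: drop records whose amount is not numeric."""
    try:
        return {"id": int(record["id"]), "amount": float(record["amount"])}
    except ValueError:
        return None

# Transform: convert the host format (strings) to the warehouse format (typed)
rows = [r for r in map(clean, source_a + source_b) if r is not None]

# Load: build the warehouse table, insert the rows, and build an index
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")
wh.executemany("INSERT INTO fact_sales VALUES (:id, :amount)", rows)
wh.execute("CREATE INDEX idx_amount ON fact_sales(amount)")

# Refresh: propagate a later update from the sources into the warehouse
wh.execute("INSERT INTO fact_sales VALUES (4, 75.0)")
print(wh.execute("SELECT COUNT(*), SUM(amount) FROM fact_sales").fetchone())
```
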
Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects.

Metadata are created for the data names and definitions of the given warehouse. Additional metadata
are created and captured for time stamping any extracted data, the source of the extracted data, and
missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
A description of the data warehouse structure, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension definition algorithms,
data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and
reports.
Mapping from the operational environment to the data warehouse, which includes source databases
and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules
and defaults, data refresh and purging rules, and security (user authorization and access control).
Data related to system performance, which include indices and profiles that improve data access and
retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication
cycles.
Business metadata, which include business terms and definitions, data ownership information, and
charging policies.
A data warehouse contains different levels of summarization, of which metadata is one. Other levels include current detailed data (which are almost always on disk), older detailed data (which are usually on tertiary storage), lightly summarized data, and highly summarized data (which may or may not be physically housed).
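
Purely as an illustration of the kinds of entries listed above, the snippet below models a tiny metadata repository as a Python dictionary; every name and value in it is hypothetical.

```python
# Toy metadata repository: schema description, operational metadata
# (lineage, currency), mappings, and business metadata.
warehouse_metadata = {
    "schema": {
        "fact_sales": ["time_key", "item_key", "location_key", "dollars_sold"],
        "dim_item": ["item_key", "item_name", "brand", "type"],
    },
    "operational": {
        "fact_sales": {
            "lineage": ["extracted from orders_db.orders",
                        "currency converted to USD"],
            "currency": "active",          # active, archived, or purged
            "last_refresh": "2024-01-31",
        },
    },
    "mappings": {"orders_db.orders.total": "fact_sales.dollars_sold"},
    "business": {"dollars_sold": "Gross revenue before returns"},
}

print(warehouse_metadata["operational"]["fact_sales"]["lineage"])
```
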
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models

The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model is
appropriate for online transaction processing.
A data warehouse, however, requires a concise, subject-oriented schema that facilitates online data
analysis.
The most popular data model for a data warehouse is a multidimensional model, which can exist in the
form of a star schema, a snowflake schema, or a fact constellation schema.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
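
A sketch of such a star schema for a sales subject, expressed as SQLite DDL run from Python; the table and column names follow the usual textbook sales example and are assumptions here.

```python
# Star schema: one central fact table plus one table per dimension.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measures
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
```
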
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
Here, the sales fact table is identical to that of the star schema; the main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information.

Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension table. When desirable, further normalization can be performed on province or state and country in the snowflake schema.
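
Continuing the same example, a sketch of the snowflaked item and location dimensions, with supplier and city split into their own normalized tables (names are assumptions):

```python
# Snowflake variant: the supplier and city attributes move into their own tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
                           supplier_key INTEGER REFERENCES dim_supplier(supplier_key));

CREATE TABLE dim_city     (city_key INTEGER PRIMARY KEY, city TEXT, province_or_state TEXT, country TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES dim_city(city_key));
""")
```
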
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has five dimensions, or keys (item key, time key, shipper key, from location, and to location) and two measures (dollars cost and units shipped).
A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.
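
A compact sketch of the constellation, with the sales and shipping fact tables sharing dimension tables defined once (column names are assumptions):

```python
# Fact constellation: two fact tables referencing the same dimension tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY);
CREATE TABLE dim_shipper  (shipper_key INTEGER PRIMARY KEY, shipper_name TEXT);

CREATE TABLE fact_sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    dollars_sold REAL, units_sold INTEGER
);
CREATE TABLE fact_shipping (
    item_key INTEGER, time_key INTEGER, shipper_key INTEGER,
    from_location INTEGER, to_location INTEGER,
    dollars_cost REAL, units_shipped INTEGER
);
""")
```
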
In data warehousing, there is a distinction between a data warehouse and a data mart.
A data warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.

For data warehouses, the fact constellation schema is commonly used, since it can model multiple,
interrelated subjects.
A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected
subjects, and thus its scope is department wide.
For data marts, the star or snowflake schema is commonly used, since both are geared toward
modelling single subjects, although the star schema is more popular and efficient.
OLAP (Online Analytical Processing):

• OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.


• OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining.
• OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.
• OLAP consists of three basic analytical operations:

• Consolidation (Roll-Up)
• Drill-Down
• Slicing And Dicing
Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions.
For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.
The drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales
by individual products that make up a region’s sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view
(dicing) the slices from different viewpoints.
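
These three operations can be mimicked on a toy table with pandas group-bys and pivots (pandas assumed available; the data are made up):

```python
# Roll-up, drill-down, and slice/dice over a tiny, made-up sales table.
import pandas as pd

sales = pd.DataFrame({
    "office":  ["Delhi-1", "Delhi-2", "Pune-1", "Pune-1"],
    "region":  ["North",   "North",   "West",   "West"],
    "quarter": ["Q1",      "Q1",      "Q1",     "Q2"],
    "product": ["Pen",     "Book",    "Pen",    "Book"],
    "amount":  [100,        80,        60,       90],
})

# Consolidation (roll-up): individual offices aggregated up to regions
print(sales.groupby("region")["amount"].sum())

# Drill-down: navigate back to product-level detail within each region
print(sales.groupby(["region", "product"])["amount"].sum())

# Slice: fix one dimension value (quarter = Q1) ...
q1 = sales[sales["quarter"] == "Q1"]
# ... and dice: view the slice from another perspective (region x product)
print(q1.pivot_table(index="region", columns="product", values="amount", aggfunc="sum"))
```
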
Types of OLAP:

1. Relational OLAP (ROLAP):

• ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational
tables and new tables are created to hold the aggregated information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database to give the appearance of
traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement.
• ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question.
• ROLAP tools feature the ability to ask any question because the methodology does not limit itself to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
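
A toy illustration of that equivalence using sqlite3, where slicing on a quarter simply appends a WHERE clause to the aggregate query (schema and data are hypothetical):

```python
# Slicing in ROLAP amounts to adding a WHERE clause to the generated SQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [("North", "Q1", 100), ("North", "Q2", 80), ("West", "Q1", 60)])

# Aggregate query the tool would issue for the full view
full = "SELECT region, SUM(amount) FROM sales GROUP BY region"
# Slicing on quarter = 'Q1' simply appends a WHERE clause
sliced = "SELECT region, SUM(amount) FROM sales WHERE quarter = 'Q1' GROUP BY region"

print(db.execute(full).fetchall(), db.execute(sliced).fetchall())
```
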
2. Multidimensional OLAP (MOLAP):

• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores this data in an optimized multi-dimensional array storage, rather than in a relational database.
Therefore it requires the pre-computation and storage of information in the cube - the operation known as
processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the
possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back data into the data set.
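
As a rough sketch of the idea, a pre-computed region-by-quarter cube can be held in a multidimensional array; NumPy is assumed to be available and the figures are invented:

```python
# MOLAP stores pre-computed measures in a multidimensional array (a 2 x 3
# region-by-quarter "cube" here); answers come from array lookups, not SQL.
import numpy as np

regions  = ["North", "West"]
quarters = ["Q1", "Q2", "Q3"]

# Pre-computed cube: cube[i, j] = total sales for regions[i] in quarters[j]
cube = np.array([[100, 80, 95],
                 [ 60, 90, 70]])

print(cube[regions.index("North"), quarters.index("Q2")])  # single cell
print(cube.sum(axis=1))   # roll up quarters within each region
print(cube.sum())         # grand total
```
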
3. Hybrid OLAP (HOLAP):

• There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except that a database will
divide data between relational and specialized storage.
• For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed
data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed
data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources
Outlier

An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be termed noise or an error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types:
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers
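
For example, a simple z-score rule can flag global (point) outliers in a numeric attribute; the data and the threshold below are illustrative only:

```python
# Flag values whose z-score exceeds a rule-of-thumb threshold (here 2,
# chosen only for this small illustrative sample).
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 11, 95]   # 95 deviates from the rest

mu, sigma = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) / sigma > 2]
print(outliers)
```
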
Issues to Consider During Data Integration:
1. Schema Integration:
•Integrate metadata from different sources.
•Matching equivalent real-world entities from multiple sources is referred to as the entity identification problem.
2. Redundancy Detection:
•An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
•Inconsistencies in attributes can also cause redundancies in the resulting data set.
•Some redundancies can be detected by correlation analysis (see the sketch after this list).
3. Resolution of data value conflicts:
•This is the third critical issue in data integration.
•Attribute values from different sources may differ for the same real-world entity.
•An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in
another.
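
As an example of the correlation analysis mentioned in item 2 above, two numeric attributes that are (nearly) perfectly correlated are candidates for redundancy; the data are invented and statistics.correlation requires Python 3.10 or later:

```python
# Redundancy detection via correlation: a derived attribute tracks its
# source almost exactly, so the Pearson coefficient is close to 1.
from statistics import correlation   # Python 3.10+

price_inr = [100, 250, 400, 80, 520]
price_usd = [1.2, 3.0, 4.8, 0.96, 6.24]   # derived from price_inr by a fixed rate

r = correlation(price_inr, price_usd)
print(f"Pearson r = {r:.3f}")  # close to 1.0, so one attribute is likely redundant
```
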
