03 Data Warehouse

PRAVEEN KUMAR SRIVASTAVA


UNIT 2

Data Warehouse: Definition and Characteristics, Essential Components of a Data Warehouse, 3-Layered Architecture of a Data Warehouse, Implementation Issues Related to DW, H/W and S/W Requirements for a Data Warehouse, Enterprise Data Warehouse, Data Mart, C/S Computing Model and Data Warehouse, Data Warehouse Schema
Data Warehouse

A data warehouse is a centralized repository for storing and managing large amounts of data from
various sources for analysis and reporting. It is optimized for fast querying and analysis, enabling
organizations to make informed decisions by providing a single source of truth for data. Data
warehousing typically involves transforming and integrating data from multiple sources into a unified,
organized, and consistent format.
Characteristics of a Data Warehouse

Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data are stored to provide information from an historic perspective (e.g., the past 5–10
years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse does not
require transaction processing, recovery, and concurrency control mechanisms. It usually requires only
two operations in data accessing: initial loading of data and access of data.
Functions of a Data Warehouse

Data Consolidation: The process of combining multiple data sources into a single data repository in a data
warehouse. This ensures a consistent and accurate view of the data.
Data Cleaning: The process of identifying and removing errors, inconsistencies, and irrelevant data from the data
sources before they are integrated into the data warehouse. This helps ensure the data is accurate and trustworthy.
Data Integration: The process of combining data from multiple sources into a single, unified data repository in a
data warehouse. This involves transforming the data into a consistent format and resolving any conflicts or
discrepancies between the data sources. Data integration is an essential step in the data warehousing process to
ensure that the data is accurate and usable for analysis. Data from multiple sources can be integrated into a single
data repository for analysis.
Data Storage: A data warehouse can store large amounts of historical data and make it easily accessible for analysis.
Data Transformation: Data can be transformed and cleaned to remove inconsistencies, duplicate data, or irrelevant
information.
Data Analysis: Data can be analyzed and visualized in various ways to gain insights and make informed decisions.
Data Reporting: A data warehouse can provide various reports and dashboards for different departments and
stakeholders.
Data Mining: Data can be mined for patterns and trends to support decision-making and strategic planning.
Performance Optimization: Data warehouse systems are optimized for fast querying and analysis, providing quick
access to data.
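
The consolidation, cleaning, and transformation steps above can be illustrated with a short sketch; pandas is assumed to be available, and the source tables and column names are invented for the example.

```python
# A minimal sketch of data consolidation and cleaning using pandas
# (the source DataFrames and column names are hypothetical).
import pandas as pd

# Two heterogeneous sources describing the same customers
crm = pd.DataFrame({"cust_id": [1, 2, 2],
                    "name": ["Asha", "Ravi", "Ravi"],
                    "city": ["Delhi", "Pune", "Pune"]})
erp = pd.DataFrame({"customer": [3, 4],
                    "name": ["Meena", None],
                    "city": ["Agra", "Pune"]})

# Consolidation: bring both sources into one repository with a consistent schema
erp = erp.rename(columns={"customer": "cust_id"})
combined = pd.concat([crm, erp], ignore_index=True)

# Cleaning: remove duplicate rows and rows with missing key information
cleaned = combined.drop_duplicates().dropna(subset=["name"])
print(cleaned)
```
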
Data Warehousing: A Multitiered Architecture

Data warehouses often adopt a three-tier architecture.
Tier 1
The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g.,
customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to
update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
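
As a rough illustration of the gateway idea, the sketch below uses Python's built-in sqlite3 module (a DB-API driver) as a stand-in for an ODBC/JDBC-style connection; the table and data are hypothetical.

```python
# Sketch of a client program generating SQL to be executed at the warehouse
# server through a gateway-style API. sqlite3 stands in here for an
# ODBC/JDBC connection; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")   # in place of an ODBC/JDBC connection string
cur = conn.cursor()

cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 120.0), ("South", 80.0), ("North", 45.5)])

# The client sends SQL; the server executes it and returns the result set.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(cur.fetchall())
conn.close()
```
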
Tier 2 The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or
(2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements
multidimensional data and operations).

Tier 3 The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:

From the architecture point of view, there are three data warehouse models:
1. the enterprise warehouse,
2. the data mart, and
3. the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It typically
contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on
traditional mainframes, computer super servers, or parallel architecture platforms. It requires extensive
business modelling and may take years to design and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data contained in data marts tend to be
summarized. Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based. The implementation cycle of a data mart is more likely to be measured in
weeks rather than months or years. However, it may involve complex integration in the long run if its
design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent.
Independent data marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or
geographic area. Dependent data marts are sourced directly from enterprise data warehouses.

Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on operational database servers.
Extraction, Transformation, and Loading

Data warehouse systems use back-end tools and utilities to populate and refresh their Data. These
tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
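
A minimal sketch of these back-end functions, using only the Python standard library; the source records, field names, and warehouse table are assumptions for illustration, not a real tool.

```python
# Extract -> clean -> transform -> load -> refresh, in miniature.
import sqlite3

# Extract: gather records from heterogeneous sources (here, plain dicts)
source_a = [{"id": "1", "amount": "100.0"}, {"id": "2", "amount": "bad"}]
source_b = [{"id": "3", "amount": "250.5"}]

def clean(record):
    """Data cleaning: drop records whose amount is not numeric."""
    try:
        return {"id": int(record["id"]), "amount": float(record["amount"])}
    except ValueError:
        return None

# Transform: convert the host format (strings) to the warehouse format (typed)
rows = [r for r in map(clean, source_a + source_b) if r is not None]

# Load: build the warehouse table, insert the rows, and build an index
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")
wh.executemany("INSERT INTO fact_sales VALUES (:id, :amount)", rows)
wh.execute("CREATE INDEX idx_amount ON fact_sales(amount)")

# Refresh: propagate a later update from the sources into the warehouse
wh.execute("INSERT INTO fact_sales VALUES (4, 75.0)")
print(wh.execute("SELECT COUNT(*), SUM(amount) FROM fact_sales").fetchone())
```
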
Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects.

Metadata are created for the data names and definitions of the given warehouse. Additional metadata
are created and captured for time stamping any extracted data, the source of the extracted data, and
missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
A description of the data warehouse structure, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension definition algorithms,
data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and
reports.
Mapping from the operational environment to the data warehouse, which includes source databases
and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules
and defaults, data refresh and purging rules, and security (user authorization and access control).
Data related to system performance, which include indices and profiles that improve data access and
retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication
cycles.
Business metadata, which include business terms and definitions, data ownership information, and
charging policies.
A data warehouse contains different levels of summarization, of which metadata is one. Other levels include current detailed data (which are almost always on disk), older detailed data (which are usually on tertiary storage), lightly summarized data, and highly summarized data (which may or may not be physically housed).
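
Purely as an illustration of the kinds of entries listed above, the snippet below models a tiny metadata repository as a Python dictionary; every name and value in it is hypothetical.

```python
# Toy metadata repository: schema description, operational metadata
# (lineage, currency), mappings, and business metadata.
warehouse_metadata = {
    "schema": {
        "fact_sales": ["time_key", "item_key", "location_key", "dollars_sold"],
        "dim_item": ["item_key", "item_name", "brand", "type"],
    },
    "operational": {
        "fact_sales": {
            "lineage": ["extracted from orders_db.orders",
                        "currency converted to USD"],
            "currency": "active",          # active, archived, or purged
            "last_refresh": "2024-01-31",
        },
    },
    "mappings": {"orders_db.orders.total": "fact_sales.dollars_sold"},
    "business": {"dollars_sold": "Gross revenue before returns"},
}

print(warehouse_metadata["operational"]["fact_sales"]["lineage"])
```
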
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models

The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model is
appropriate for online transaction processing.
A data warehouse, however, requires a concise, subject-oriented schema that facilitates online data
analysis.
The most popular data model for a data warehouse is a multidimensional model, which can exist in the
form of a star schema, a snowflake schema, or a fact constellation schema.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
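
A sketch of such a star schema for a sales subject, expressed as SQLite DDL run from Python; the table and column names follow the usual textbook sales example and are assumptions here.

```python
# Star schema: one central fact table plus one table per dimension.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measures
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
```
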
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
Here, the sales fact table is identical to that of the star schema; the main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information.

Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension table. When desirable, further normalization can be performed on province or state and country in the snowflake schema.
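
Continuing the same example, a sketch of the snowflaked item and location dimensions, with supplier and city split into their own normalized tables (names are assumptions):

```python
# Snowflake variant: the supplier and city attributes move into their own tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
                           supplier_key INTEGER REFERENCES dim_supplier(supplier_key));

CREATE TABLE dim_city     (city_key INTEGER PRIMARY KEY, city TEXT, province_or_state TEXT, country TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES dim_city(city_key));
""")
```
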
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has five dimensions, or keys (item key, time key, shipper key, from location, and to location) and two measures (dollars cost and units shipped).
A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.
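
A compact sketch of the constellation, with the sales and shipping fact tables sharing dimension tables defined once (column names are assumptions):

```python
# Fact constellation: two fact tables referencing the same dimension tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY);
CREATE TABLE dim_shipper  (shipper_key INTEGER PRIMARY KEY, shipper_name TEXT);

CREATE TABLE fact_sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    dollars_sold REAL, units_sold INTEGER
);
CREATE TABLE fact_shipping (
    item_key INTEGER, time_key INTEGER, shipper_key INTEGER,
    from_location INTEGER, to_location INTEGER,
    dollars_cost REAL, units_shipped INTEGER
);
""")
```
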
In data warehousing, there is a distinction between a data warehouse and a data mart.
A data warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.

For data warehouses, the fact constellation schema is commonly used, since it can model multiple,
interrelated subjects.
A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected
subjects, and thus its scope is department wide.
For data marts, the star or snowflake schema is commonly used, since both are geared toward
modelling single subjects, although the star schema is more popular and efficient.
OLAP (Online Analytical Processing):

• OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.


• OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining.
• OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.
• OLAP consists of three basic analytical operations:

• Consolidation (Roll-Up)
• Drill-Down
• Slicing And Dicing
Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions.
For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.
The drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales
by individual products that make up a region’s sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view
(dicing) the slices from different viewpoints.
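
These three operations can be mimicked on a toy table with pandas group-bys and pivots (pandas assumed available; the data are made up):

```python
# Roll-up, drill-down, and slice/dice over a tiny, made-up sales table.
import pandas as pd

sales = pd.DataFrame({
    "office":  ["Delhi-1", "Delhi-2", "Pune-1", "Pune-1"],
    "region":  ["North",   "North",   "West",   "West"],
    "quarter": ["Q1",      "Q1",      "Q1",     "Q2"],
    "product": ["Pen",     "Book",    "Pen",    "Book"],
    "amount":  [100,        80,        60,       90],
})

# Consolidation (roll-up): individual offices aggregated up to regions
print(sales.groupby("region")["amount"].sum())

# Drill-down: navigate back to product-level detail within each region
print(sales.groupby(["region", "product"])["amount"].sum())

# Slice: fix one dimension value (quarter = Q1) ...
q1 = sales[sales["quarter"] == "Q1"]
# ... and dice: view the slice from another perspective (region x product)
print(q1.pivot_table(index="region", columns="product", values="amount", aggfunc="sum"))
```
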
Types of OLAP:

1. Relational OLAP (ROLAP):

• ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational
tables and new tables are created to hold the aggregated information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database to give the appearance of
traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement.
• ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question.
• ROLAP tools feature the ability to ask any question because the methodology does not limit itself to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
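
A toy illustration of that equivalence using sqlite3, where slicing on a quarter simply appends a WHERE clause to the aggregate query (schema and data are hypothetical):

```python
# Slicing in ROLAP amounts to adding a WHERE clause to the generated SQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [("North", "Q1", 100), ("North", "Q2", 80), ("West", "Q1", 60)])

# Aggregate query the tool would issue for the full view
full = "SELECT region, SUM(amount) FROM sales GROUP BY region"
# Slicing on quarter = 'Q1' simply appends a WHERE clause
sliced = "SELECT region, SUM(amount) FROM sales WHERE quarter = 'Q1' GROUP BY region"

print(db.execute(full).fetchall(), db.execute(sliced).fetchall())
```
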
2. Multidimensional OLAP (MOLAP):

• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores this data in an optimized multi-dimensional array storage, rather than in a relational database.
Therefore it requires the pre-computation and storage of information in the cube - the operation known as
processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the
possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back data into the data set.
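
As a rough sketch of the idea, a pre-computed region-by-quarter cube can be held in a multidimensional array; NumPy is assumed to be available and the figures are invented:

```python
# MOLAP stores pre-computed measures in a multidimensional array (a 2 x 3
# region-by-quarter "cube" here); answers come from array lookups, not SQL.
import numpy as np

regions  = ["North", "West"]
quarters = ["Q1", "Q2", "Q3"]

# Pre-computed cube: cube[i, j] = total sales for regions[i] in quarters[j]
cube = np.array([[100, 80, 95],
                 [ 60, 90, 70]])

print(cube[regions.index("North"), quarters.index("Q2")])  # single cell
print(cube.sum(axis=1))   # roll up quarters within each region
print(cube.sum())         # grand total
```
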
3. Hybrid OLAP (HOLAP):

• There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except that a database will
divide data between relational and specialized storage.
• For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed
data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed
data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources
Outlier

An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be termed noise or an error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types:
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers
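
For example, a simple z-score rule can flag global (point) outliers in a numeric attribute; the data and the threshold below are illustrative only:

```python
# Flag values whose z-score exceeds a rule-of-thumb threshold (here 2,
# chosen only for this small illustrative sample).
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 11, 95]   # 95 deviates from the rest

mu, sigma = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) / sigma > 2]
print(outliers)
```
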
Issues to Consider During Data Integration:
1. Schema Integration:
•Integrate metadata from different sources.
•Matching equivalent real-world entities from multiple sources is referred to as the entity identification problem.
2. Redundancy Detection:
•An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
•Inconsistencies in attributes can also cause redundancies in the resulting data set.
•Some redundancies can be detected by correlation analysis (see the sketch after this list).
3. Resolution of data value conflicts:
•This is the third critical issue in data integration.
•Attribute values from different sources may differ for the same real-world entity.
•An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in
another.
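
As an example of the correlation analysis mentioned in item 2 above, two numeric attributes that are (nearly) perfectly correlated are candidates for redundancy; the data are invented and statistics.correlation requires Python 3.10 or later:

```python
# Redundancy detection via correlation: a derived attribute tracks its
# source almost exactly, so the Pearson coefficient is close to 1.
from statistics import correlation   # Python 3.10+

price_inr = [100, 250, 400, 80, 520]
price_usd = [1.2, 3.0, 4.8, 0.96, 6.24]   # derived from price_inr by a fixed rate

r = correlation(price_inr, price_usd)
print(f"Pearson r = {r:.3f}")  # close to 1.0, so one attribute is likely redundant
```
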
