DWM Unit1 Solved QB


DWM Unit-1 Question Bank

1) Define data warehouse

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from single or multiple sources.

Reference book definition – A Data Warehouse is a collection of corporate
information, derived directly from operational systems and some external data
sources.

(both definitions have to be written)

2) Write and explain characteristics of data warehousing

1. Subject Oriented –

A data warehouse is subject oriented because it provides information around
a subject rather than the organization's ongoing operations. These
subjects include products, customers, suppliers, sales, etc.

2. Integrated –

A data warehouse is built by integrating data from different sources such
as relational databases, flat files, etc. Integration makes analysis of the
data more effective.

3. Time Variant –
The data collected in a data warehouse is identified with a
particular time period. A data warehouse provides data from a historical point
of view.

4. Non-volatile –
Non-volatile means that when new data is added, old data is not deleted.
A data warehouse is kept separate from the operational database,
and hence changes made in the operational database are not reflected in the
data warehouse.
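
The time-variant and non-volatile characteristics can be illustrated with a small sketch. This is only an illustration under assumptions: the `price_history` table, the `load_snapshot` helper and the sample values are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical table and values, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_history (product TEXT, price REAL, load_date TEXT)")


def load_snapshot(product, price, load_date):
    # Non-volatile: a load only appends a row; old rows are never updated or deleted.
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?)", (product, price, load_date)
    )


load_snapshot("pen", 10.0, "2023-01-01")
load_snapshot("pen", 12.0, "2023-02-01")  # the price changed in the source system

# Time-variant: every row carries its time period, so the history can be queried.
history = conn.execute(
    "SELECT load_date, price FROM price_history WHERE product = ? ORDER BY load_date",
    ("pen",),
).fetchall()
print(history)
```

Both prices remain available after the second load, which is what allows analysis from a historical point of view.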

3) Difference between operational DBMS and Data Warehouse

Operational Database Systems | Data Warehouse
-----------------------------|---------------
Operational database systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational database systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
Data within operational systems is updated regularly according to need. | Non-volatile: new data may be added regularly, but once added, data is rarely changed.
It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
It is optimized for validation of incoming information during transactions and uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation.
It supports thousands of concurrent clients. | It supports few concurrent clients relative to OLTP.
Operational database systems are widely functional or process oriented. | Data warehousing systems are widely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
An operational database system focuses on data in. | A data warehousing system focuses on data out.
A small number of records is accessed. | A large number of records is accessed.
Relational databases are created for OLTP. | Data warehouses are designed for OLAP.
Data integration in an operational database is application based. | Data integration in a data warehouse is subject based.
It provides a detailed and flat relational view of data. | It provides a summarized and multidimensional view of data.
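
The contrast between the two access patterns in the table can be sketched as follows. The `sales` table and its rows are hypothetical, and a single in-memory SQLite database is used for both queries only to keep the example short; in practice the two workloads would run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "North", 100.0), (2, "South", 250.0), (3, "North", 50.0)],
)

# OLTP-style access: retrieve a single row by its key.
one_order = conn.execute(
    "SELECT region, amount FROM sales WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: scan many rows and summarize them by subject (region).
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(one_order)
print(totals)
```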

4) Write needs of data warehousing

a. Business User: Business users require a data warehouse to view
summarized data from the past. Since these people are non-technical, the
data may be presented to them in an elementary form.
b. Store historical data: A data warehouse is required to store
time-variant data from the past. This data is used for various
purposes.
c. Make strategic decisions: Some strategies may depend on the
data in the data warehouse. So, the data warehouse contributes to making
strategic decisions.
d. For data consistency and quality: By bringing data from different
sources to a common place, the user can effectively bring
uniformity and consistency to the data.
e. Quick response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree of
flexibility and quick response time.

5) Explain ETL with diagram

ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse. The ETL process can also use pipelining: as soon as
some data is extracted, it can be transformed, and during that period new
data can be extracted. While the transformed data is being loaded into the
data warehouse, the already extracted data can be transformed.
The process of ETL can be broken down into the following three stages:

1. Extraction:
The first step of the ETL process is extraction. In this step, data is
extracted from various source systems, which can be in various formats such as
relational databases, NoSQL, XML, and flat files, into the staging area. It is
important to extract the data from the source systems into the staging area
first and not directly into the data warehouse, because the extracted data is
in various formats and can also be corrupted. Loading it directly into the
data warehouse may damage the warehouse, and rollback will be much more
difficult. Therefore, this is one of the most important steps of the ETL
process.

2. Transformation:
The second step of the ETL process is transformation. In this step, a set of
rules or functions is applied to the extracted data to convert it into a single
standard format. It may involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States, and America into USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally key-
attribute).
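
The five transformation tasks listed above can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the attribute names (`first`, `last`, `city_state`, `notes`), the `"UNKNOWN"` default and the sample rows are all assumptions made for the example.

```python
# Mapping used for cleaning; the aliases come from the example in the notes.
COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}


def transform(rows):
    out = []
    for row in rows:
        # Cleaning: replace a missing country with a default, then map aliases.
        country = row.get("country") or "UNKNOWN"
        country = COUNTRY_ALIASES.get(country, country)
        # Splitting: one "city, state" attribute becomes two attributes.
        city, state = [part.strip() for part in row["city_state"].split(",", 1)]
        # Filtering: attributes not copied here (e.g. "notes") are dropped.
        out.append(
            {
                "id": row["id"],
                # Joining: two name attributes become one.
                "name": f'{row["first"]} {row["last"]}',
                "country": country,
                "city": city,
                "state": state,
            }
        )
    # Sorting: order the tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])


raw = [
    {"id": 2, "first": "Ann", "last": "Lee", "country": "America",
     "city_state": "Austin, TX", "notes": "x"},
    {"id": 1, "first": "Bob", "last": "Ray", "country": None,
     "city_state": "Pune, MH", "notes": "y"},
]
result = transform(raw)
print(result)
```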

3. Loading:
The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse. Sometimes the
data is updated by loading into the data warehouse very frequently and
sometimes it is done after longer but regular intervals. The rate and period of
loading depend solely on the requirements and vary from system to system.
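
The three stages and the pipelining idea can be sketched with Python generators. This is a simplified illustration under assumptions: the CSV text, the `customer` table and the function names are hypothetical, an in-memory SQLite database stands in for the warehouse, and generators only give row-by-row (lazy) pipelining, not true parallel execution.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a flat file.
SOURCE_CSV = "id,country\n1,U.S.A\n2,India\n3,\n"


def extract(text):
    # Yields rows one at a time, so later stages can start before extraction ends.
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    aliases = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    for row in rows:
        country = row["country"] or "UNKNOWN"  # fill missing values with a default
        yield int(row["id"]), aliases.get(country, country)


def load(records, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS customer (id INTEGER, country TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
loaded = warehouse.execute("SELECT id, country FROM customer ORDER BY id").fetchall()
print(loaded)
```

Because each stage consumes the previous one lazily, a row can be transformed and handed to the loader while later rows are still being read.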

6) Advantages, Disadvantages, Applications of Data Warehouse

Advantages:
1. A data warehouse lets business users quickly access critical data from
several sources, all in one place.
2. A data warehouse gives consistent data on various cross-functional activities.
3. It helps integrate many sources of data, reducing the time needed for
analysis and reporting.
4. A data warehouse helps reduce total turnaround time for analysis and reporting.
5. Restructuring and integration make the data easier to use for reporting and
analysis.
6. It saves the user's time in retrieving data from multiple sources, since
users can access critical data from a number of sources in a single place.

Disadvantages:

1. It is not an ideal option for unstructured data.
2. Creating and implementing a data warehouse is a time-consuming
matter.
3. A data warehouse can become outdated relatively quickly.
4. The data warehouse may appear easy, but it is too difficult for the users.
5. Data warehouse users will sometimes develop different business rules.
6. It is difficult to make changes to data types and ranges, data source
schemas, indexes, and queries.

Applications:

• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing

7) Explain types of DW architecture

➢ Single-Tier Architecture

Single-tier architecture is not frequently used in practice. Its purpose is
to minimize the amount of data stored; to reach this goal, it removes data
redundancies. The figure shows that the only layer physically available is the
source layer. In this method, data warehouses are virtual. This means that
the data warehouse is implemented as a multidimensional view of
operational data created by specific middleware, or an intermediate
processing layer.

The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional
processing. Analysis queries are submitted to operational data after the
middleware interprets them. In this way, queries affect transactional
workloads.

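
The idea of a virtual warehouse can be sketched with a database view. This is only an analogy under assumptions: the `orders` table and the `product_totals` view are hypothetical, and an SQLite view stands in for the middleware layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The only physical layer: an operational (source) table.
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "pen", 3), (2, "ink", 1), (3, "pen", 2)],
)

# The "warehouse" is virtual: a view that stores no data of its own.
conn.execute(
    "CREATE VIEW product_totals AS "
    "SELECT product, SUM(qty) AS total_qty FROM orders GROUP BY product"
)

before = conn.execute(
    "SELECT total_qty FROM product_totals WHERE product = 'pen'"
).fetchone()[0]

# A new operational transaction is immediately visible to analysis queries,
# because every analysis query runs against the operational table itself.
conn.execute("INSERT INTO orders VALUES (4, 'pen', 5)")
after = conn.execute(
    "SELECT total_qty FROM product_totals WHERE product = 'pen'"
).fetchone()[0]
print(before, after)
```

No data is duplicated, but each analytical query is executed on the operational data, which is why this design does not separate analytical from transactional processing.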
➢ Two-Tier Architecture

The requirement for separation plays an essential role in defining the two-
tier architecture for a data warehouse system, as shown in fig:

Although it is typically called a two-layer architecture to highlight the
separation between physically available sources and data warehouses, it
in fact consists of four consecutive data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of
data. That data is initially stored in corporate relational databases or legacy
databases, or it may come from an information system outside the
corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed
to remove inconsistencies and fill gaps, and integrated to merge
heterogeneous sources into one standard schema. The so-called
Extraction, Transformation, and Loading (ETL) tools can combine
heterogeneous schemata, extract, transform, cleanse, validate, filter, and
load source data into a data warehouse.
3. Data Warehouse layer: Information is saved to one logically centralized
individual repository: a data warehouse. The data warehouse can be
accessed directly, but it can also be used as a source for creating data marts,
which partially replicate data warehouse contents and are designed for
specific enterprise departments. Meta-data repositories store
information on sources, access procedures, data staging, users, data mart
schema, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed
to issue reports, dynamically analyze information, and simulate
hypothetical business scenarios. It should feature aggregate information
navigators, complex query optimizers, and customer-friendly GUIs.

➢ Three-Tier Architecture

The three-tier architecture consists of the source layer (containing multiple
source systems), the reconciled layer, and the data warehouse layer
(containing both data warehouses and data marts). The reconciled layer
sits between the source data and the data warehouse. The main advantage
of the reconciled layer is that it creates a standard reference data model
for the whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse
population. In some cases, the reconciled layer is also used directly to
better accomplish some operational tasks, such as producing daily reports
that cannot be satisfactorily prepared using the corporate applications, or
generating data flows that periodically feed external processes so that they
benefit from cleaning and integration. This architecture is especially useful
for extensive, enterprise-wide systems. A disadvantage of this structure is the
extra file storage space used by the redundant reconciled
layer. It also moves the analytical tools a little further away from being
real-time.

8) Difference between data warehouse and data mart

S.NO | Data Warehouse | Data Mart
-----|----------------|----------
1. | A data warehouse is a centralised system. | A data mart is a decentralised system.
2. | In a data warehouse, data is lightly denormalized. | In a data mart, data is highly denormalized.
3. | A data warehouse is a top-down model. | A data mart is a bottom-up model.
4. | Building a warehouse is difficult. | Building a mart is easy.
5. | In a data warehouse, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used.
6. | A data warehouse is flexible. | A data mart is not flexible.
7. | A data warehouse is data-oriented in nature. | A data mart is project-oriented in nature.
8. | A data warehouse has a long life. | A data mart has a shorter life than a warehouse.
9. | In a data warehouse, data is contained in detailed form. | In a data mart, data is contained in summarized form.
10. | A data warehouse is vast in size. | A data mart is smaller than a warehouse.
11. | It collects data from various data sources. | It generally stores data from a data warehouse.
12. | It takes a long time to process the data because of the large volume. | It takes less time to process the data because it handles only a small amount of data.
13. | The design process of creating schemas and views is complicated. | The design process of creating schemas and views is easy.

9) Explain DW models - Enterprise DW, Data Marts, Virtual Warehouse

Enterprise Data Warehouse – An EDW is a form of centralized corporate repository
that stores and manages all historic business data of an enterprise.

Virtual Data Warehouse – A VDW provides a collective view of the complete data. It
has no historic data. It can be considered a logical data model of the given
metadata.

Data Marts – A data mart includes a subset of corporate-wide data that is of value
to a specific collection of users. The scope is confined to particular selected
subjects.
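
The relationship between a warehouse and a data mart can be sketched as deriving a small, summarized table from a larger one. The `warehouse_sales` and `sales_mart` tables, the department names and the values are hypothetical, and SQLite is used only as a stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enterprise warehouse table: corporate-wide, detailed data.
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("sales", "pen", 10.0), ("sales", "pen", 5.0), ("hr", "badge", 2.0)],
)

# Data mart: a summarized subset for one group of users (the sales department).
conn.execute(
    "CREATE TABLE sales_mart AS "
    "SELECT product, SUM(amount) AS total "
    "FROM warehouse_sales WHERE dept = 'sales' GROUP BY product"
)

mart_rows = conn.execute("SELECT product, total FROM sales_mart").fetchall()
print(mart_rows)
```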

10) Difference ETL vs ELT

S.NO | ETL | ELT
-----|-----|----
1. | ETL first extracts data from a pool of data sources, which are typically transactional databases. | In ELT, data is loaded immediately after being extracted from the source data pools.
2. | Data is held in a temporary staging database. Transformation operations are then performed to structure and convert the data into a suitable form for the target data warehouse system. | There is no staging database, meaning the data is immediately loaded into a single centralized repository.
3. | Structured data is loaded into the warehouse, ready for analysis. | Data is transformed inside the data warehouse system for use with business intelligence tools and analytics.
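
The ELT ordering can be sketched as loading raw data first and transforming it afterwards with SQL inside the target system. This is an illustration under assumptions: the `raw_customer` and `customer` tables and the sample values are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw source data goes straight into the warehouse, untransformed.
warehouse.execute("CREATE TABLE raw_customer (id TEXT, country TEXT)")
warehouse.executemany(
    "INSERT INTO raw_customer VALUES (?, ?)",
    [("1", "U.S.A"), ("2", "America"), ("3", None)],
)

# Transform: performed afterwards, inside the warehouse, using SQL.
warehouse.execute(
    """
    CREATE TABLE customer AS
    SELECT CAST(id AS INTEGER) AS id,
           CASE
               WHEN country IN ('U.S.A', 'United States', 'America') THEN 'USA'
               WHEN country IS NULL THEN 'UNKNOWN'
               ELSE country
           END AS country
    FROM raw_customer
    """
)

rows = warehouse.execute("SELECT id, country FROM customer ORDER BY id").fetchall()
print(rows)
```

In an ETL design the same cleaning would happen in a staging area before anything reaches the warehouse; here the raw table remains in the warehouse alongside the transformed one.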

11) Define metadata repository

Metadata is simply defined as data about data. Data that is used to represent
other data is known as metadata. For example, the index of a book serves as
metadata for the contents of the book. In other words, we can say that metadata is
the summarized data that leads us to the detailed data.
Metadata in a data warehouse is similar to the data dictionary or the data
catalogue in a database management system.
The metadata can be broadly categorized into following three categories:
1. Business Metadata: This metadata has the data ownership information,
business definition and changing policies.
2. Technical Metadata: Technical metadata includes database system names,
table and column names and sizes, data types and allowed values. Technical
metadata also includes structural information such as primary and foreign
key attributes and indices.
3. Operational Metadata: This metadata includes currency of data and data
lineage. Currency of data means whether data is active, archived or purged.
Lineage of data means the history of the data's migration and the
transformations applied to it.
The generation and management of metadata serves two purposes:
A. To Minimize the Efforts for Development and Administration of a Data
Warehouse
B. To Improve the Extraction of Information
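
The three metadata categories can be sketched as one record per warehouse table kept in a small repository. This is a hypothetical structure for illustration only: the `TableMetadata` class, its field names, the `register` helper and the sample entry are assumptions, not a standard metadata format.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    # Business metadata
    owner: str
    business_definition: str
    # Technical metadata
    table_name: str
    columns: dict  # column name -> data type
    primary_key: str
    # Operational metadata
    currency: str = "active"  # active, archived or purged
    lineage: list = field(default_factory=list)  # migrations and transformations


# The repository: metadata records looked up by table name.
repository = {}


def register(meta):
    repository[meta.table_name] = meta


register(
    TableMetadata(
        owner="Sales department",
        business_definition="One row per customer",
        table_name="customer",
        columns={"id": "INTEGER", "country": "TEXT"},
        primary_key="id",
        lineage=["extracted from CRM", "country names standardized"],
    )
)

entry = repository["customer"]
print(entry.currency, entry.lineage)
```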