
Introduction To Data Warehouse

1. A data warehouse is a centralized repository for an organization's data that supports decision making. It integrates data from multiple sources and stores historical data in a non-volatile manner for analysis.
2. Data warehouses are needed to view summarized past data, store historical data, support strategic decision making, ensure consistent data quality, and provide fast response times for queries.
3. Online analytical processing (OLAP) servers allow multi-dimensional analysis of data from multiple databases simultaneously through the use of cubes and dimensions that can be drilled down, rolled up, diced, sliced and pivoted.

UNIT 1

Data Warehouse And Data Mining


Introduction
A data warehouse is a powerful tool that allows organizations to store,
manage, and analyse large amounts of data. It is designed to support the
decision-making process by providing a centralized location for all of an
organization's data.
Characteristics of a Data Warehouse

1. A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision-making
process.
2. Subject-Oriented: A data warehouse can be used to analyse a particular
subject area. For example, "sales" can be a particular subject.
3. Integrated: A data warehouse integrates data from multiple data sources. For
example, source A and source B may have different ways of identifying a
product, but in a data warehouse, there will be only a single way of identifying
a product.
4. Time-Variant: Historical data is kept in a data warehouse. For example, one
can retrieve data from 3 months, 6 months, 12 months, or even older data
from a data warehouse. This contrasts with a transaction system, where often
only the most recent data is kept. For example, a transaction system may hold
only the most recent address of a customer, whereas a data warehouse can hold
all addresses associated with a customer.
5. Non-volatile: Once data is in the data warehouse, it will not change. So,
historical data in a data warehouse should never be altered.
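
The time-variant and non-volatile characteristics can be illustrated with a minimal Python sketch. The class names `TransactionSystem` and `WarehouseHistory`, the customer ID and the addresses are hypothetical and exist only for this illustration; a real warehouse would use database tables rather than in-memory lists.

```python
from dataclasses import dataclass
from datetime import date


class TransactionSystem:
    """Keeps only the most recent address per customer (operational style)."""

    def __init__(self):
        self._current = {}

    def set_address(self, customer_id, address):
        # Overwrites the previous value, so the history is lost.
        self._current[customer_id] = address

    def address(self, customer_id):
        return self._current[customer_id]


@dataclass(frozen=True)  # frozen: a loaded record cannot be modified
class AddressRecord:
    customer_id: str
    address: str
    valid_from: date


class WarehouseHistory:
    """Append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, record):
        self._rows.append(record)

    def addresses(self, customer_id):
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return [r.address for r in sorted(rows, key=lambda r: r.valid_from)]


oltp = TransactionSystem()
dw = WarehouseHistory()
moves = [("12 Park St", date(2022, 1, 5)), ("7 Lake Rd", date(2023, 6, 1))]
for addr, since in moves:
    oltp.set_address("C1", addr)
    dw.load(AddressRecord("C1", addr, since))

print(oltp.address("C1"))  # 7 Lake Rd
print(dw.addresses("C1"))  # ['12 Park St', '7 Lake Rd']
```

The transaction system answers only "where does the customer live now?", while the append-only history can also answer "where did the customer live before?".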

Need for Data Warehouse

Data Warehouse is needed for the following reasons:


1. Business users: Business users require a data warehouse to view
summarized data from the past. Since these people are non-technical,
the data may be presented to them in a simple form.
2. Store historical data: A data warehouse is required to store the
time-variant data from the past. This data is then used for various
purposes.
3. Make strategic decisions: Some strategies may depend on the data in
the data warehouse. So, the data warehouse contributes to making
strategic decisions.
4. Data consistency and quality: By bringing data from different sources
to a common place, the organization can enforce uniformity and
consistency in the data.
5. Quick response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant
degree of flexibility and a quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of
data.
3. The structure of data warehouses is more accessible for end-users to
navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be
easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6. Data warehousing provides the capability to analyze a large amount of
historical data.

Difference between Operational Systems and Informational Systems:

| S.No | Operational Systems | Informational Systems |
|---|---|---|
| 1. | Operational systems are designed to deal with the running values of data. | Informational systems deal with the collection, compilation and deriving of information from data. |
| 2. | In operational systems, optimization of data structure is done for transactions. | In informational systems, optimization of data structure is done for complex queries. |
| 3. | Operational systems have a response time of sub-seconds. | Informational systems have a response time of a few seconds to minutes. |
| 4. | Operational systems are generally suited for small volumes of data. | Informational systems are mainly designed for large volumes of data and hence convenient to use. |
| 5. | Operational systems are process oriented. | Informational systems are subject oriented. |
| 6. | Operational systems support various data access operations such as read, update and delete. | Informational systems only support the read operation for data access. |

Data Warehouse Design Process


A data warehouse can be built using a top-down approach, a bottom-up
approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful
in cases where the business problems that must be solved are clear and well
understood.
The bottom-up approach starts with experiments and prototypes. This is
useful in the early stage of business modeling and technology development. It
allows an organization to move forward at considerably less expense and to
evaluate the benefits of the technology before making significant
commitments.
In the combined approach, an organization can exploit the planned and
strategic nature of the top-down approach while retaining the rapid
implementation and opportunistic application of the bottom-up approach.
The warehouse design process consists of the following steps:
A Three-Tier Data Warehouse Architecture
Tier-1:
The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data
into the bottom tier from operational databases or other external sources.
These tools and utilities perform data extraction, cleaning, and transformation
(e.g., to merge similar data from different sources into a unified format), as well
as load and refresh functions to update the data warehouse. The data are
extracted using application program interfaces known as gateways.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model. A
ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations. A MOLAP model is a
special-purpose server that directly implements multidimensional data and
operations.
Tier-3: The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).

OLAP stands for Online Analytical Processing. It is a software technology
that allows users to analyze information from multiple database systems at the
same time. It is based on a multidimensional data model and allows the user to
query multi-dimensional data (e.g. Delhi -> 2018 -> Sales data). OLAP
databases are divided into one or more cubes, and these cubes are known
as hyper-cubes.

OLAP operations:
There are five basic analytical operations that can
be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted
into highly detailed data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is
performed by moving down in the concept hierarchy of the Time dimension
(Quarter -> Month).

2. Roll up: It is just the opposite of the drill-down operation. It performs
aggregation on the OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
In the cube given in the overview section, the roll-up operation is
performed by climbing up in the concept hierarchy of the Location dimension
(City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or
more dimensions. In the cube given in the overview section, a sub-cube
is selected by selecting the following dimensions with criteria:
• Location = "Delhi" or "Kolkata"
• Time = "Q1" or "Q2"
• Item = "Car" or "Bus"

4. Slice: It fixes a single value on one dimension of the OLAP cube, which
results in a new sub-cube with one dimension fewer. In the cube given in
the overview section, slice is performed on the Time dimension with the
criterion Time = "Q1".

5. Pivot: It is also known as the rotation operation, as it rotates the
current view to get a new view of the representation. In the sub-cube
obtained after the slice operation, performing the pivot operation gives
a new view of it.
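
The five operations above can be sketched in plain Python on a tiny in-memory cube. This is only an illustration under stated assumptions: the sales figures, the city/country pairs and the helper `aggregate` are made up for the example and do not come from the cube in the original figure.

```python
from collections import defaultdict

# Each fact: (city, country, quarter, month, item, units). Values are made up.
FACTS = [
    ("Delhi",   "India",  "Q1", "Jan", "Car", 10),
    ("Delhi",   "India",  "Q1", "Feb", "Car", 20),
    ("Delhi",   "India",  "Q2", "Apr", "Bus", 5),
    ("Kolkata", "India",  "Q1", "Jan", "Bus", 7),
    ("Toronto", "Canada", "Q1", "Mar", "Car", 8),
    ("Toronto", "Canada", "Q3", "Jul", "Bus", 4),
]
FIELDS = ("city", "country", "quarter", "month", "item", "units")
ROWS = [dict(zip(FIELDS, f)) for f in FACTS]


def aggregate(rows, dims):
    """Sum units grouped by the given dimension names."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["units"]
    return dict(totals)


# Base view: City x Quarter x Item.
base = aggregate(ROWS, ("city", "quarter", "item"))

# Roll up: climb the Location hierarchy (City -> Country).
rolled_up = aggregate(ROWS, ("country", "quarter", "item"))

# Drill down: descend the Time hierarchy (Quarter -> Month).
drilled_down = aggregate(ROWS, ("city", "month", "item"))

# Slice: fix one dimension to a single value (Time = "Q1").
q1_rows = [r for r in ROWS if r["quarter"] == "Q1"]
sliced = aggregate(q1_rows, ("city", "item"))

# Dice: restrict two or more dimensions to sets of values.
diced_rows = [
    r for r in ROWS
    if r["city"] in {"Delhi", "Kolkata"}
    and r["quarter"] in {"Q1", "Q2"}
    and r["item"] in {"Car", "Bus"}
]
diced = aggregate(diced_rows, ("city", "quarter", "item"))

# Pivot: rotate the sliced view by swapping its axes.
pivoted = {(item, city): units for (city, item), units in sliced.items()}

print(rolled_up[("India", "Q1", "Car")])  # 30
print(sliced[("Delhi", "Car")])           # 30
print(pivoted[("Car", "Delhi")])          # 30
print(sum(diced.values()))                # 42
```

Roll-up and drill-down differ only in which level of a hierarchy is used for grouping, while slice and dice differ only in how many dimensions are filtered.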

Types of OLAP Servers


We have four types of OLAP servers:

• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers


Relational OLAP
ROLAP servers are placed between relational back-end server and client front-
end tools. To store and manage warehouse data, ROLAP uses relational or
extended-relational DBMS.
ROLAP includes the following:
• Implementation of aggregation navigation logic.
• Optimization for each DBMS back end.
• Additional tools and services.
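
As a rough illustration of how a ROLAP server can map multidimensional operations onto standard relational operations, the sketch below uses Python's built-in `sqlite3` module. The `sales` table, its columns and its rows are assumptions made for this example; a real ROLAP server generates and optimizes such SQL for its own back end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (city TEXT, country TEXT, quarter TEXT, units INTEGER);
    INSERT INTO sales VALUES
        ('Delhi',   'India',  'Q1', 30),
        ('Kolkata', 'India',  'Q1', 7),
        ('Delhi',   'India',  'Q2', 5),
        ('Toronto', 'Canada', 'Q1', 8);
""")

# Roll-up on Location (City -> Country) becomes GROUP BY on the coarser column.
roll_up = conn.execute(
    "SELECT country, quarter, SUM(units) FROM sales "
    "GROUP BY country, quarter ORDER BY country, quarter"
).fetchall()

# Slice on Time = 'Q1' becomes a WHERE clause.
slice_q1 = conn.execute(
    "SELECT city, SUM(units) FROM sales WHERE quarter = ? "
    "GROUP BY city ORDER BY city",
    ("Q1",),
).fetchall()

print(roll_up)   # [('Canada', 'Q1', 8), ('India', 'Q1', 37), ('India', 'Q2', 5)]
print(slice_q1)  # [('Delhi', 30), ('Kolkata', 7), ('Toronto', 8)]
conn.close()
```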


Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for
multidimensional views of data. With multidimensional data stores, the
storage utilization may be low if the data set is sparse. Therefore, many
MOLAP servers use two levels of data storage representation to handle dense
and sparse data sets.
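
The storage problem with sparse data can be seen in a small Python sketch that compares a dense array with a sparse map for the same cube. The dimension values and the helper `store` are hypothetical, and real MOLAP engines use more elaborate schemes than this; the sketch only shows why a separate representation for sparse data saves space.

```python
CITIES = ["Delhi", "Kolkata", "Toronto"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]

# Dense representation: one array cell per (city, quarter) combination.
dense = [[0] * len(QUARTERS) for _ in CITIES]

# Sparse representation: only non-empty cells are stored.
sparse = {}


def store(city, quarter, units):
    """Write one measure value into both representations."""
    i, j = CITIES.index(city), QUARTERS.index(quarter)
    dense[i][j] = units
    sparse[(i, j)] = units


store("Delhi", "Q1", 30)
store("Toronto", "Q3", 4)

total_cells = len(CITIES) * len(QUARTERS)
used_cells = len(sparse)
print(total_cells, used_cells)          # 12 2
print(dense[0][0], sparse.get((0, 0)))  # 30 30
print(sparse.get((1, 1), 0))            # 0
```

Here the dense array allocates 12 cells to hold 2 values, so its utilization is low; the sparse map stores only the 2 occupied cells.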
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow
large volumes of detailed information to be stored. The aggregations are
stored separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query
processing support for SQL queries over star and snowflake schemas in a read-
only environment.

OLAP vs OLTP

| Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
|---|---|---|
| 1 | Involves historical processing of information. | Involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | Useful in analyzing the business. | Useful in running the business. |
| 4 | It focuses on information out. | It focuses on data in. |
| 5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model. |
| 6 | Contains historical data. | Contains current data. |
| 7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data. |
| 8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data. |
| 9 | Number of users is in hundreds. | Number of users is in thousands. |
| 10 | Number of records accessed is in millions. | Number of records accessed is in tens. |
| 11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB. |
| 12 | Highly flexible. | Provides high performance. |

Data Warehouse vs Data Mart

| Data Warehouse | Data Mart |
|---|---|
| A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is a subset of a data warehouse. It is designed to meet the requirements of a specific user group. |
| It may hold multiple subject areas. | It holds only one subject area. For example, Finance or Sales. |
| It holds very detailed information. | It may hold more summarized data. |
| Works to integrate all data sources. | It concentrates on integrating data from a given subject area or set of source systems. |
| In data warehousing, Fact Constellation is used. | In a data mart, Star Schema and Snowflake Schema are used. |
| It is a centralized system. | It is a decentralized system. |
| Data warehousing is data-oriented. | A data mart is project-oriented. |

Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of the elements. We build a data
warehouse with software and hardware components. To suit the requirements
of our organization, we arrange these building blocks, and we may want to
boost one part or another with extra tools and services. All of this depends
on our circumstances.

The figure shows the essential elements of a typical warehouse. We see the
Source Data component shown on the left. The Data Staging element serves as
the next building block. In the middle, we see the Data Storage component
that handles the data warehouse's data. This element not only stores and
manages the data; it also keeps track of the data using the metadata
repository. The Information Delivery component, shown on the right, consists
of all the different ways of making the information from the data warehouse
available to the users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad
categories:

Production Data: This type of data comes from the different operational
systems of the enterprise. Based on the data requirements in the data
warehouse, we choose segments of the data from the various operational
systems.

Internal Data: In each organization, users keep their "private"
spreadsheets, reports, customer profiles, and sometimes even departmental
databases. This is the internal data, part of which could be useful in a data
warehouse.

Archived Data: Operational systems are mainly intended to run the current
business. In every operational system, we periodically take the old data and
store it in archived files.

External Data: Most executives depend on information from external sources
for a large percentage of the information they use. They use statistics
relating to their industry that are produced by external agencies.

Data Staging Component

After we have extracted data from various operational systems and
external sources, we have to prepare the files for storing in the data
warehouse. The extracted data coming from several different sources needs to
be changed, converted, and made ready in a format that is suitable to be
saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging
area.
1) Data Extraction: This method has to deal with numerous data sources. We
have to employ the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from
many different sources. If data extraction for a data warehouse poses big
challenges, data transformation presents even greater challenges. We
perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the
correction of misspellings or may deal with providing default values for missing
data elements, or elimination of duplicates when we bring in the same data
from various source systems.

Standardization of data components forms a large part of data transformation.
Data transformation involves many forms of combining pieces of data from
different sources. We combine data from a single source record or related data
parts from many source records.

On the other hand, data transformation also involves purging source data that
is not useful and separating out source records into new combinations. Sorting
and merging of data take place on a large scale in the data staging area. When
the data transformation function ends, we have a collection of integrated data
that is cleaned, standardized, and summarized.

3) Data Loading: Two distinct categories of tasks form data loading functions.
When we complete the structure and construction of the data warehouse and
go live for the first time, we do the initial loading of the information into the
data warehouse storage. The initial load moves high volumes of data using up
a substantial amount of time.
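
The three staging functions can be sketched as a small Python pipeline. Everything in it is a hypothetical illustration: the two source extracts, their field names, the default price and the in-memory list standing in for warehouse storage are assumptions, not part of any particular tool.

```python
# Hypothetical extracts from two source systems with different conventions.
SOURCE_A = [
    {"prod_id": "P-1", "name": "Car ", "price": "1000"},
    {"prod_id": "P-2", "name": "bus", "price": None},
]
SOURCE_B = [
    {"sku": "p-1", "title": "car", "cost": "1000"},
    {"sku": "p-3", "title": "Truck", "cost": "2500"},
]

DEFAULT_PRICE = 0.0  # assumed default for missing values


def extract():
    """Extraction: read each source and map its fields to common names."""
    for row in SOURCE_A:
        yield {"id": row["prod_id"], "name": row["name"], "price": row["price"]}
    for row in SOURCE_B:
        yield {"id": row["sku"], "name": row["title"], "price": row["cost"]}


def transform(rows):
    """Transformation: clean, standardize, and remove duplicates."""
    seen = set()
    for row in rows:
        product_id = row["id"].strip().upper()  # single way to identify a product
        if product_id in seen:                  # same product from another source
            continue
        seen.add(product_id)
        price = DEFAULT_PRICE if row["price"] is None else float(row["price"])
        yield {"id": product_id, "name": row["name"].strip().title(), "price": price}


def load(rows, warehouse):
    """Loading: append the prepared rows to warehouse storage."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(extract()), [])
for row in warehouse:
    print(row)
```

The product that appears in both sources under the IDs "P-1" and "p-1" ends up as one warehouse row, and the missing price is replaced by the assumed default.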

Data Storage Components

Data storage for the data warehouse is a separate repository. The data
repositories for the operational systems generally include only the current
data. Also, these data repositories hold the data in highly normalized
structures for fast and efficient processing.

Information Delivery Component

The information delivery element is used to enable the process of subscribing
to data warehouse files and having them transferred to one or more
destinations according to some customer-specified scheduling algorithm.

Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data
catalog in a database management system. In the data dictionary, we keep the
data about the logical data structures, the data about the records and
addresses, the information about the indexes, and so on.

Data Marts
A data mart includes a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to particular selected
subjects. Data in a data warehouse should be fairly current, but not
necessarily up to the minute, although developments in the data warehouse
industry have made standard and incremental data dumps more achievable. Data
marts are smaller than data warehouses and usually cover a single part of the
organization. The current trend in data warehousing is to develop a data
warehouse with several smaller related data marts for particular kinds of
queries and reports.

Management and Control Component

The management and control component coordinates the services and functions
within the data warehouse. It controls the data transformation and the data
transfer into the data warehouse storage, and it moderates the data delivery
to the clients. It works with the database management systems and enables
data to be correctly saved in the repositories. It monitors the movement of
information into the staging area and from there into the data warehouse
storage itself.

What is Meta Data?

Metadata is data about the data or documentation about the information
which is required by the users. In data warehousing, metadata is one of the
essential aspects.

Metadata includes the following:

1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of data warehouse and
end-user views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate data.

Types of Metadata

Metadata in a data warehouse falls into three major categories:

o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata

Operational Metadata

As we know, data for the data warehouse comes from various operational
systems of the enterprise. These source systems include different data
structures. The data elements selected for the data warehouse have various
field lengths and data types.

In selecting information from the source systems for the data warehouse, we
divide records, combine parts of records from different source files, and
deal with multiple coding schemes and field lengths. When we deliver
information to the end-users, we must be able to tie that back to the source
data sets. Operational metadata contains all of this information about the
operational data sources.

Extraction and Transformation Metadata

Extraction and transformation metadata include data about the extraction of
data from the source systems, namely, the extraction frequencies, extraction
methods, and business rules for the data extraction. Also, this category of
metadata contains information about all the data transformation that takes
place in the data staging area.

End-User Metadata

The end-user metadata is the navigational map of the data warehouse. It
enables the end-users to find data in the data warehouse. The end-user
metadata allows the end-users to use their business terminology and look for
the information in those ways in which they usually think of the business.
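
To make the three metadata types more concrete, here is a minimal Python sketch of a metadata catalog entry for one warehouse column. All names in it (the classes, the `billing` source system, the `CUST_NM` field, the `customer_dim.full_name` column and the two helper functions) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OperationalMetadata:
    source_system: str
    source_field: str
    data_type: str
    field_length: int


@dataclass
class TransformationMetadata:
    extraction_frequency: str
    extraction_method: str
    rule: str


@dataclass
class EndUserMetadata:
    business_term: str
    warehouse_column: str


# Hypothetical catalog entry for one warehouse column.
catalog = {
    "customer_dim.full_name": {
        "operational": OperationalMetadata("billing", "CUST_NM", "CHAR", 40),
        "transformation": TransformationMetadata(
            "daily", "incremental", "trim and title-case"
        ),
        "end_user": EndUserMetadata("Customer Name", "customer_dim.full_name"),
    }
}


def find_column(business_term):
    """Let an end user look up a column by its business name."""
    for column, meta in catalog.items():
        if meta["end_user"].business_term.lower() == business_term.lower():
            return column
    return None


def lineage(column):
    """Tie a warehouse column back to its source field."""
    op = catalog[column]["operational"]
    return f"{op.source_system}.{op.source_field}"


column = find_column("customer name")
print(column)           # customer_dim.full_name
print(lineage(column))  # billing.CUST_NM
```

The end-user entry acts as the navigational map from a business term to a column, while the operational entry lets delivered information be tied back to its source.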
