Data Warehouse & Data Marts

Uploaded by

k3333dua
What is a Data Warehouse

According to W. H. Inmon, a data warehouse is a
• subject-oriented
• integrated
• time-variant
• non-volatile
collection of data that is used primarily in organizational decision making.

MCA 204 12/01/2024
Features of a Data Warehouse
• Subject-oriented: A data warehouse is organized around major subjects, such
as customer, vendor, product, and sales. Rather than concentrating on the day-
to-day operations and transaction processing of an organization, a data
warehouse focuses on the modeling and analysis of data for decision makers.
Hence, data warehouses typically provide a simple and concise view around
particular subject issues by excluding data that are not useful in the decision
support process.
• Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and on-line
transaction records. Data cleaning and data integration techniques are applied
to ensure consistency in naming conventions, encoding structures, attribute
measures, and so on.
• Time-variant: Data are stored to provide information from a historical
perspective (e.g., the past 5-10 years). Every key structure in the data
warehouse contains, either implicitly or explicitly, an element of time.
Features of a Data Warehouse (Contd…)
• Non-volatile: A data warehouse is always a physically separate store of data
transformed from the application data found in the operational environment.
Due to this separation, a data warehouse does not require transaction
processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.
• In sum, a data warehouse is a semantically consistent data store that serves as
a physical implementation of a decision support data model and stores the
information on which an enterprise needs to make strategic decisions.
• A data warehouse is also often viewed as an architecture, constructed by
integrating data from multiple heterogeneous sources to support structured
and/or ad hoc queries, analytical reporting, and decision making.

How are organizations using the
information from data warehouses?
• Many organizations are using the information to support business decision-making
activities, including
(1) increasing customer focus, which includes the analysis of customer buying
patterns (such as buying preference, buying time, budget cycles, and
appetites for spending);
(2) repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic region, in order
to fine-tune production strategies;
(3) analyzing operations and looking for sources of profit; and
(4) managing customer relationships, making environmental corrections,
and managing the cost of corporate assets.

Subject-Orientation
• It means data is organized around major subjects of the enterprise.
• For example, to learn more about your company's sales data, you can build a warehouse that
concentrates on sales.
• Using this warehouse, you can answer questions like "Who was our best customer for this item
last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the
data warehouse subject oriented.
• E.g., claims data are organized around the subject of claims and not by individual applications of
Auto Insurance and Workers’ Comp.
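The "best customer" question above can be sketched as a single query over a sales-subject table. This is only an illustration: the table name `sales`, its columns, and the sample rows are all assumptions made up for the example, and SQLite stands in for a real warehouse engine.

```python
import sqlite3

# Hypothetical subject-oriented "sales" table; names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, item TEXT, sale_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Asha", "laptop", 2023, 1200.0),
        ("Asha", "laptop", 2023, 900.0),
        ("Ravi", "laptop", 2023, 1500.0),
        ("Ravi", "phone", 2023, 700.0),
        ("Meena", "laptop", 2022, 5000.0),
    ],
)


def best_customer(item, year):
    """Return (customer, total) with the highest total spend on `item` in `year`."""
    return conn.execute(
        "SELECT customer, SUM(amount) AS total FROM sales "
        "WHERE item = ? AND sale_year = ? "
        "GROUP BY customer ORDER BY total DESC LIMIT 1",
        (item, year),
    ).fetchone()


result = best_customer("laptop", 2023)
```

Because all sales facts sit together under one subject, the question needs no knowledge of which operational application recorded each sale.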

Subject-Oriented Data (Contd…)
• The figure below distinguishes between how data is stored in operational systems
and in the data warehouse.

• This is reflected in the need to store decision-support data rather than
application-oriented data.
• In a DW, data is linked and stored by real-world business subjects, which differ
from enterprise to enterprise.
• Focuses on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.

Integrated Data

• This is reflected in the need to store decision-support data rather than
application-oriented data.
• Integration is closely related to subject orientation.
• The integrated data source must be made consistent to present a unified
view of the data to the users.

Integrated Data (Contd…)

• Constructed by integrating multiple, heterogeneous data sources
− relational databases, flat files, on-line transaction records
− data may come from internal operational systems or from external sources
• Data cleaning and data integration techniques are applied.
− Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
− E.g., hotel price: currency, tax, breakfast covered, etc.
• Before moving the data into the DW, a process of transformation,
consolidation and integration of the source data has to be followed.
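The hotel price example can be sketched as a small transformation step that brings prices from different sources onto one convention before loading. Everything here is an assumption for illustration: the `HotelPrice` fields, the fixed exchange rates, the flat breakfast charge, and the chosen warehouse convention (USD, tax included, breakfast excluded).

```python
from dataclasses import dataclass

# Assumed fixed conversion rates and breakfast charge, for illustration only.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10}
BREAKFAST_USD = 15.0  # assumed to be quoted with tax already included


@dataclass
class HotelPrice:
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float
    breakfast_included: bool


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to USD, tax included, breakfast excluded."""
    usd = p.amount * USD_PER_UNIT[p.currency]
    if not p.tax_included:
        usd *= 1 + p.tax_rate      # add tax where the source left it out
    if p.breakfast_included:
        usd -= BREAKFAST_USD       # strip the breakfast component
    return round(usd, 2)


# Two sources describing the same room under different conventions.
source_a = HotelPrice(100.0, "USD", tax_included=False, tax_rate=0.10,
                      breakfast_included=False)
source_b = HotelPrice(125.0, "USD", tax_included=True, tax_rate=0.10,
                      breakfast_included=True)
```

After the transformation both sources yield the same figure, which is the kind of consistency the integration step is meant to ensure.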

Time Variant Data

• In order to discover trends in business, analysts need large amounts of data.
• A data warehouse's focus on change over time is what is meant by the term
time variant.
• Data in the warehouse is only accurate and valid at some point in time or over
some time interval.
• Every key structure in the data warehouse contains an element of time,
explicitly or implicitly.
• The key structure of operational data, by contrast, may or may not contain a
time element.

Time Variant Data (Contd…)

• The time horizon for the data warehouse is significantly longer than that of
operational systems
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years). Data is stored as snapshots over past and current
periods.
• The time-variant nature of the data in a DW
• Allows for analysis of the past
• Relates information to the present
• Enables forecasts for the future
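A minimal sketch of the difference between current-value data and time-variant snapshots, assuming a made-up account balance that is recorded once per month. The dictionaries and the `record_balance` helper are hypothetical; the point is only that the warehouse key carries a time element while the operational key does not.

```python
from datetime import date

# Operational store: one current value per key (overwritten on change).
operational = {}
# Warehouse store: the key includes a time element, so history is kept.
warehouse = {}


def record_balance(account, as_of, balance):
    operational[account] = balance          # current value only
    warehouse[(account, as_of)] = balance   # one snapshot per period


record_balance("A-1", date(2023, 1, 31), 500)
record_balance("A-1", date(2023, 2, 28), 650)
record_balance("A-1", date(2023, 3, 31), 600)

# Analysis of the past: month-over-month change, possible only with snapshots.
history = sorted((d, b) for (acct, d), b in warehouse.items() if acct == "A-1")
changes = [later[1] - earlier[1] for earlier, later in zip(history, history[1:])]
```

The operational store ends up holding only the latest balance, whereas the snapshot store retains all three periods for trend analysis.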

Nonvolatile Data
• The data in the DW is not intended to run the day-to-day business.
• Data in the warehouse is not updated in real-time but is refreshed from
operational systems on a regular basis.
• Non-volatile means that, once entered into the warehouse, data are not
changed/updated.
• This is logical because the purpose of a warehouse is to enable you to
analyze what has occurred.
• New data is always added as a supplement to the database, rather than a
replacement.
• A physically separate store of data transformed from the operational
environment.

Nonvolatile Data (Contd…)
• As shown in the figure below, business transactions do not directly update
the data in the data warehouse.
• The business transactions update the operational system databases in real time.
• We add, change, or delete data from an operational system as each transaction
happens but do not usually update the data in the data warehouse.
• The data in a DW is not as volatile as the data in an operational database.
• The data in a DW is primarily for query and analysis.
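Non-volatility can be illustrated with a store that offers only the two operations mentioned earlier, loading and reading. This is a sketch under assumptions: the `WarehouseTable` class, the order record, and the `load_date` field are invented for the example and do not describe any particular product.

```python
class WarehouseTable:
    """Append-only table: supports loading and reading, not update or delete."""

    def __init__(self):
        self._rows = []

    def load(self, batch):
        # Periodic refresh: new rows supplement the existing ones.
        self._rows.extend(dict(row) for row in batch)

    def rows(self):
        # Return copies so callers cannot modify stored history.
        return [dict(row) for row in self._rows]


# Operational system: the order record is changed in place.
orders = {"O-1": {"order_id": "O-1", "status": "placed"}}
dw = WarehouseTable()
dw.load([{**orders["O-1"], "load_date": "2024-01-01"}])

orders["O-1"]["status"] = "shipped"  # real-time operational update
dw.load([{**orders["O-1"], "load_date": "2024-01-02"}])

statuses = [r["status"] for r in dw.rows()]
```

The operational record now shows only "shipped", while the warehouse keeps both states in load order.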

Differences between operational
database systems and data warehouses
• The major task of on-line operational database systems is to perform on-line
transaction and query processing. These systems are called on-line
transaction processing (OLTP) systems. They cover most of the day-to-day
operations of an organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting.
[Operational = OLTP]
• Data warehouse systems, on the other hand, serve users or “knowledge
workers” in the role of data analysis and decision making. Such systems can
organize and present data in various formats in order to accommodate the
diverse needs of the different users. These systems are known as on-line
analytical processing (OLAP) systems.
[Data Warehouse = Informational = OLAP = DSS]

Differences between operational
database systems and data warehouses (Contd…)
• The major distinguishing features between OLTP and OLAP are summarized as
follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used
for transaction and query processing by clerks, clients, and information
technology professionals. An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too
detailed to be easily used for decision making. An OLAP system manages large
amounts of historical data, provides facilities for summarization and aggregation,
and stores and manages information at different levels of granularity. These
features make the data easier for use in informed decision making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design. An OLAP system typically
adopts either a star or snowflake model (to be discussed in later sessions), and a
subject-oriented database design.

Differences between operational
database systems and data warehouses (Contd…)
4. View: An OLTP system focuses mainly on the current data within an
enterprise or department, without referring to historical data or data in
different organizations. In contrast, an OLAP system often spans multiple
versions of a database schema, due to the evolutionary process of an
organization. OLAP systems also deal with information that originates from
different organizations, integrating information from many data stores.
Because of their huge volume, OLAP data are stored on multiple storage
media.
5. Access patterns: The access patterns of an OLTP system consist mainly of
short, atomic transactions. Such a system requires concurrency control
and recovery mechanisms. However, accesses to OLAP systems are mostly
read-only operations (since most data warehouses store historical rather
than up-to-date information), although many could be complex queries.
• Other features which distinguish between OLTP and OLAP systems include
database size, frequency of operations, and performance metrics. These are
summarized in the following table.
Differences between operational
database systems and data warehouses (Contd…)

| Feature | OLTP | OLAP (Data Warehouse) |
|---|---|---|
| Characteristic | operational processing | informational processing |
| Orientation | transaction | analysis |
| User | clerk, DBA, database professional | knowledge worker (e.g., manager, executive, analyst) |
| Function | day-to-day operations | long-term informational requirements, decision support |
| DB design | E-R based, application-oriented | star/snowflake, subject-oriented |
| Data | current; guaranteed up-to-date | historical; accuracy maintained over time |
| Summarization | primitive, highly detailed | summarized, consolidated |
| View | detailed, flat relational | summarized, multidimensional |
| Unit of work | short, simple transaction | complex query |
| Access | read/write | mostly read |
| Focus | data in | information out |
| Operations | index/hash on primary key | lots of scans |
| No. of records accessed | tens | millions |
| No. of users | thousands | hundreds |
| DB size | 100 MB to GB | 100 GB to TB |
| Priority | high performance, high availability | high flexibility, end-user autonomy |
| Metric | transaction throughput | query throughput, response time |
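The "unit of work" and "access" rows of the comparison can be made concrete with two statements against the same small table. This is a sketch only: the `orders` table and its rows are assumptions, and SQLite is used purely because it ships with Python, not because it is a typical OLTP or OLAP engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(i, "north" if i % 2 else "south", 10.0 * i) for i in range(1, 7)],
)

# OLTP-style unit of work: a short read/write transaction on one row via its key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 5 WHERE order_id = ?", (3,))
one_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (3,)
).fetchone()

# OLAP-style unit of work: a read-only query that scans and summarizes all rows.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

The first statement touches a single record through the primary key; the second reads every record to produce a consolidated view, which is why the two workloads are tuned so differently.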

Why have a separate data warehouse?
“Since operational databases store huge amounts of data, then why not perform
on-line analytical processing directly on such databases instead of spending
additional time and resources to construct a separate data warehouse?”
• A major reason for such a separation is to help promote the high performance of both
systems.
• An operational database is designed and tuned from known tasks and workloads, such as
indexing and hashing using primary keys, searching for particular records, and
optimizing queries.
• On the other hand, data warehouse queries are often complex. They involve the
computation of large groups of data at summarized levels, and may require the use of
special data organization, access, and implementation methods based on
multidimensional views.
• Processing OLAP queries in operational databases would substantially degrade the
performance of operational tasks.

Why have a separate data warehouse? (Contd…)
• Moreover, an operational database supports the concurrent processing of several transactions.
Concurrency control and recovery mechanisms, such as locking and logging, are required to ensure
the consistency and robustness of transactions. An OLAP query often needs read-only access of
data records for summarization and aggregation. Concurrency control and recovery mechanisms, if
applied for such OLAP operations, may risk the execution of concurrent transactions and thus
substantially reduce the throughput of an OLTP system.
• Finally, the separation of operational databases from data warehouses is based on the different
structures, contents, and uses of the data in these two systems. Decision support requires
historical data, whereas operational databases do not typically maintain historical data. In this
context, the data in operational databases, though abundant, is usually far from complete for
decision making. Decision support requires consolidation (such as aggregation and summarization)
of data from heterogeneous sources, resulting in high quality, cleansed and integrated data. In
contrast, operational databases contain only detailed raw data, such as transactions, which need to
be consolidated before analysis.
• Since the two systems provide quite different functionalities and require different kinds of data, it is
necessary to maintain separate databases.
Data Granularity
• In a data warehouse, data granularity refers to the level of detail.
• Operational data is usually kept at the lowest level of detail.
− A grocery store captures the units of sale for each transaction.
− If the number of units of a product ordered in a month is required, all the
orders entered in that month are added up.
• A data warehouse keeps data summarized at different levels.
− Depending on the query, the user can access a particular level.
• The decision on granularity level is based on the data types and the expected
system performance for queries.
• The lower the level of detail (i.e., the finer the granularity), the more data is
stored in the data warehouse.
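The grocery store case can be sketched as a roll-up from transaction-level rows to daily and monthly summaries. The sample transactions and the `roll_up` helper are hypothetical; the sketch assumes dates are stored as `YYYY-MM-DD` strings so that a prefix identifies the day or the month.

```python
from collections import defaultdict

# Lowest level of detail: one row per sale (date, product, units). Illustrative data.
transactions = [
    ("2024-01-03", "milk", 2),
    ("2024-01-03", "bread", 1),
    ("2024-01-17", "milk", 3),
    ("2024-02-05", "milk", 4),
]


def roll_up(rows, level):
    """Sum units per (period, product); `level` is 'day' or 'month'."""
    key_len = {"day": 10, "month": 7}[level]  # prefix length of YYYY-MM-DD
    totals = defaultdict(int)
    for day, product, units in rows:
        totals[(day[:key_len], product)] += units
    return dict(totals)


daily = roll_up(transactions, "day")
monthly = roll_up(transactions, "month")
```

The monthly summary has fewer rows than the daily one, which shows the trade-off: coarser granularity stores less data but can no longer answer day-level questions.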

Data Granularity (Contd…)

Data Mart – From Data Granularity
• Data Mart: A scaled-down version of the data warehouse
• A subset of a data warehouse that supports the requirements of a
particular department or business function.

• Characteristics include
− Focuses on only the requirements of one department or business
function.
− Does not normally contain detailed operational data, unlike data
warehouses.
− More easily understood and navigated.
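One way to picture a data mart as a scaled-down subset is to filter the warehouse to a single business function and keep only summarized values. The row layout, the `subject` field, and the `build_sales_mart` helper below are assumptions made for this sketch.

```python
from collections import Counter

# Illustrative warehouse rows covering several business functions.
warehouse = [
    {"subject": "sales", "region": "north", "order_id": 1, "amount": 100},
    {"subject": "sales", "region": "north", "order_id": 2, "amount": 50},
    {"subject": "sales", "region": "south", "order_id": 3, "amount": 70},
    {"subject": "claims", "region": "north", "order_id": None, "amount": 30},
]


def build_sales_mart(rows):
    """Keep only sales rows and summarize amount by region."""
    totals = Counter()
    for row in rows:
        if row["subject"] == "sales":
            totals[row["region"]] += row["amount"]
    return dict(totals)


sales_mart = build_sales_mart(warehouse)
```

The resulting mart is smaller than the warehouse and carries no order-level detail, which is what makes it easier for one department to understand and navigate.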

Data Warehouse and Data Mart

Reasons for Creating Data Mart
1. To give users access to the data they need to analyze most often.
2. To provide data in a form that matches the collective view of the data by a
group of users in a department or business function area.
3. To improve end-user response time due to the reduction in the volume of
data to be accessed.
4. To provide appropriately structured data as dictated by the requirements of
the end-user access tools.
5. Building a data mart is simpler compared with establishing a corporate
data warehouse.
6. The cost of implementing data marts is normally less than that required to
establish a data warehouse.
7. The potential users of a data mart are more clearly defined and can be
more easily targeted to obtain support for a data mart project than for a
corporate data warehouse project.

Approaches for building a data warehouse

Top-Down Approach
• Extract data from operational systems; then transform, clean,
integrate, and keep the data in the DW.
• A big-picture approach in which the overall, enterprise-wide
DW is built.
• There is no collection of fragmented islands of information.
• The DW is large and integrated.

Top-Down Approach (Contd…)
• Advantages:
− A truly corporate effort and enterprise view of data
− Inherently architected, not a union of disparate data marts
− Single, central storage of data about content
− Centralized rules and control
− May see quick results if implemented with iterations
• Disadvantages:
− Takes longer to build even with an iterative method
− High exposure to risk of failure
− Needs high level of cross-functional skills
− High outlay without proofs of concept
Bottom-Up Approach
• Ralph Kimball, an expert practitioner in DW, is a proponent of the bottom-
up approach.
• He envisions the corporate DW as a collection of conformed data marts.
• In this approach, data marts are created first to provide analytical and
reporting capabilities for specific business subjects based on the
dimensional data model.
• Data marts contain data at the lowest level of granularity and also as
summaries depending on the needs for analysis.
• Further, these data marts are joined together by conforming the
dimensions.
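Joining data marts through a conformed dimension can be sketched as follows. The product dimension, the two marts, and the `drill_across` helper are all hypothetical; the sketch assumes both marts use the same `product_key` values with the same meaning, which is what conforming the dimension provides.

```python
# Conformed dimension shared by both marts: same keys, same meaning.
product_dim = {1: "milk", 2: "bread"}

# Two data marts built separately, each with facts keyed by product_key.
sales_mart = [(1, 120.0), (2, 80.0), (1, 30.0)]   # (product_key, revenue)
inventory_mart = [(1, 40), (2, 15)]               # (product_key, units_on_hand)


def drill_across(dim, sales, inventory):
    """Combine facts from both marts per product via the conformed dimension."""
    report = {name: {"revenue": 0.0, "on_hand": 0} for name in dim.values()}
    for key, revenue in sales:
        report[dim[key]]["revenue"] += revenue
    for key, units in inventory:
        report[dim[key]]["on_hand"] += units
    return report


report = drill_across(product_dim, sales_mart, inventory_mart)
```

If the two marts had used different product keys or definitions, this combined report could not be produced without a separate reconciliation step.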

Bottom-Up Approach (Contd…)
• Advantages:
− Faster and easier implementation of manageable pieces
− Favorable return on investment and proof of concept
− Less risk of failure
− Inherently incremental; can schedule important data marts first
− Allows project team to learn and grow
• Disadvantages:
− Each data mart has its own narrow view of data
− Permeates redundant data in every data mart
− Perpetuates inconsistent and irreconcilable data
− Proliferates unmanageable interfaces

A Practical Approach

• Although the top-down and the bottom-up approaches each have their pros and cons, a
compromise approach accommodating both views appears to be practical.

• In this approach we do not lose sight of the overall big picture for the entire enterprise
(based on top-down approach) then build the conformed data marts based on a priority
scheme (based on bottom-up approach).

• One should go back to the basics and determine what exactly the organization wants in the
long term.

A Practical Approach (Contd…)
• The key to this approach is that you first plan at the enterprise level, gather requirements at the
overall level and establish the architecture for the complete warehouse.
• Then determine the data content for each supermart.
• Supermarts are carefully architected data marts.
• Make sure that the data content among the various supermarts is conformed in terms of data
types, field lengths, precision, and semantics.
• In this approach, a data mart is a logical subset of the complete DW. Therefore, the DW is a conformed
union of all data marts.
• Individual data marts are targeted to particular business groups in the enterprise, but the collection
of all the data marts forms an integrated whole, called the enterprise data warehouse.
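A conformance check across supermarts might look like the sketch below, which compares shared columns on data type, field length, precision, and meaning. The column specifications and the `nonconformed_columns` helper are invented for illustration; a real project would read these from its metadata repository.

```python
# Hypothetical column specs: (data type, field length, precision, meaning).
sales_mart = {
    "customer_id": ("CHAR", 10, 0, "customer key"),
    "amount": ("DECIMAL", 12, 2, "sale value in USD"),
}
billing_mart = {
    "customer_id": ("CHAR", 8, 0, "customer key"),
    "amount": ("DECIMAL", 12, 2, "sale value in USD"),
}


def nonconformed_columns(*marts):
    """Return shared column names whose specs differ between any two marts."""
    issues = set()
    for i, left in enumerate(marts):
        for right in marts[i + 1:]:
            for column in left.keys() & right.keys():
                if left[column] != right[column]:
                    issues.add(column)
    return sorted(issues)


issues = nonconformed_columns(sales_mart, billing_mart)
```

Here `customer_id` is flagged because its field length differs, so the two supermarts would not yet form a conformed union.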

A Practical Approach (Contd…)

• The steps in this practical approach are as follows:
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of supermarts, one at a time

Thank You
