UNIT-1 (RIT-062) : Data Warehousing


UNIT-1 (RIT-062)

DATA WAREHOUSING
Syllabus Unit-1
• Data Warehousing:
Overview, Definition, Data Warehousing Components,
Building a Data Warehouse, Warehouse Database, Mapping
the Data Warehouse to a Multiprocessor Architecture,
Difference between Database System and Data Warehouse,
Multi Dimensional Data Model, Data Cubes, Stars, Snow
Flakes, Fact Constellations, Concept
OVERVIEW:--

• The term "Data Warehouse" was first coined by Bill Inmon in 1990.
According to Inmon, a data warehouse is a subject-oriented,
integrated, time-variant, and non-volatile collection of data.
This data helps analysts make informed decisions in an organization.
DEFINITION OF A DATA WAREHOUSE
• A data warehouse is an information system that contains
historical and cumulative data from a single source or from
multiple sources. It simplifies the reporting and analysis
processes of the organization.
• It also serves as a single version of truth for a company's
decision making and forecasting.
FEATURES OF DATA WAREHOUSING
• Subject Oriented − A data warehouse is subject oriented
because it provides information around a subject rather than
around the organization's ongoing operations. These subjects can
be products, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on ongoing operations; rather, it
focuses on modeling and analysis of data for decision making.

• Integrated − A data warehouse is constructed by integrating
data from heterogeneous sources such as relational databases,
flat files, etc. This integration enhances the effective analysis
of data.
• Time Variant − The data collected in a data
warehouse is identified with a particular time period.
The data in a data warehouse provides information
from the historical point of view.
• Non-volatile − Non-volatile means that previous data
is not erased when new data is added. A data
warehouse is kept separate from the operational
database, and therefore frequent changes in the
operational database are not reflected in the data
warehouse.
COMPONENTS OF DATA WAREHOUSING
The data warehouse architecture is based on a relational
database management system server that functions as
the central repository for informational data.
Operational data and processing are completely
separated from data warehouse processing.
This central information repository is surrounded by a
number of key components designed to make the entire
environment functional, manageable and accessible, both to
the operational systems that source data into the
warehouse and to end-user query and analysis tools.
COMPONENTS
• Data Warehouse Database
• Sourcing, Acquisition, Cleanup and Transformation
Tools
• Meta data operational , extraction & Transforamtion,
end users
• Access Tools
• Data Marts
• Data Warehouse Administration and Management
• Information Delivery System
• Data warehouse database: The central data warehouse database is the cornerstone of
the data warehousing environment. This database is almost always implemented on
relational database management system (RDBMS) technology. However, a warehouse
implementation based on traditional RDBMS technology is constrained by the fact that
traditional RDBMS implementations are optimized for transactional database processing.
• Sourcing, acquisition, cleanup and transformation tools: A significant portion of the
data warehouse implementation effort is spent extracting data from operational systems and
putting it into a format suitable for the informational applications that will run off the
data warehouse.
• The data sourcing, cleanup, transformation and migration tools perform all of the
conversions, summarizations, key changes, structural changes and condensations needed to
transform disparate data into information that can be used by the decision support tools.
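
As a rough illustration of what such tools do, the sketch below extracts rows from two sources with different formats, cleans and converts them to a common layout, and summarizes them. The source names (crm_rows, billing_rows), the field names and the sample values are assumptions made for this example only; real tools are far more elaborate.

```python
from collections import defaultdict

# Hypothetical operational sources with inconsistent formats (made-up data).
crm_rows = [
    {"cust": " Alice ", "amount": "120.50", "date": "2010-01-15"},
    {"cust": "BOB", "amount": "80.00", "date": "2010-01-20"},
]
billing_rows = [
    ("alice", 30.0, "15/02/2010"),
    ("bob", None, "17/02/2010"),  # missing amount: dropped during cleanup
]


def transform_crm(row):
    """Convert a CRM record to the common warehouse layout."""
    return {
        "customer": row["cust"].strip().lower(),  # key change: normalized name
        "amount": float(row["amount"]),           # conversion: text to number
        "month": row["date"][:7],                 # condensation: keep YYYY-MM
    }


def transform_billing(row):
    """Convert a billing tuple; return None if the record is unusable."""
    name, amount, date = row
    if amount is None:
        return None
    day, month, year = date.split("/")
    return {
        "customer": name.strip().lower(),
        "amount": float(amount),
        "month": f"{year}-{month}",
    }


def load(sources):
    """Clean, merge and summarize the amount per (customer, month)."""
    summary = defaultdict(float)
    for rows, transform in sources:
        for row in rows:
            record = transform(row)
            if record is not None:
                summary[(record["customer"], record["month"])] += record["amount"]
    return dict(summary)


warehouse = load([(crm_rows, transform_crm), (billing_rows, transform_billing)])
```

After the load, both sources share one customer key and one date format, so the summarized rows can be queried together.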

Metadata: metadata is data about data that describes the data warehouse. It is used for
building managing , maintaining and using the data warehouse.
Data Warehouse Architecture
• There are mainly three types of data warehouse architecture:

• Single-tier architecture: The objective of a single layer
is to minimize the amount of data stored, with the goal of
removing data redundancy. This architecture is not
frequently used in practice.
• Two-tier architecture: A two-layer architecture
separates the physically available sources from the data
warehouse. This architecture is not expandable and
does not support a large number of end users. It
also has connectivity problems because of network
limitations.
• Three-tier architecture: This is the most widely used architecture. It
consists of the top, middle and bottom tiers.

1. Bottom Tier: The database of the data warehouse serves as the
bottom tier. It is usually a relational database system. Data is
cleansed, transformed, and loaded into this layer using back-end tools.

2. Middle Tier: The middle tier in a data warehouse is an OLAP server,
which is implemented using either the ROLAP or the MOLAP model. For a
user, this application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end user and the
database.

3. Top Tier: The top tier is a front-end client layer. It comprises the
tools and APIs used to connect to the data warehouse and get data out of
it. These can be query tools, reporting tools, managed query tools,
analysis tools and data mining tools.
Difference between DBMS and Data Warehouse
Database | Data Warehouse
Is designed to record | Is designed to analyze
Is an application-oriented collection of data | Is a subject-oriented collection of data
Normally limited to a single application | Stores data from any number of applications
Data is available real-time | Data is refreshed from source systems when needed
Is efficient in processing and storage | Is efficient in analytics
Multidimensional Data Model:
1. The Logical Multidimensional Data Model
2. The Relational Multidimensional Data Model
3. The Analytic Workspace Implementation of the Model

The Logical Multidimensional Data Model:--


The multidimensional data model is an integral part of On-Line Analytical
Processing, or OLAP. Because OLAP is on-line, it must provide answers
quickly; analysts pose iterative queries during interactive sessions, not in
batch jobs that run overnight.
Because OLAP is also analytic, the queries are complex. The
multidimensional data model is designed to solve complex queries in real
time.

• "The central attraction of the dimensional model of a business is its
simplicity.... that simplicity is the fundamental key that allows users
to understand databases, and allows software to navigate databases
efficiently."
• The multidimensional data model is composed of logical cubes, measures,
dimensions, hierarchies, levels, and attributes.

• The simplicity of the model is inherent because it defines objects that
represent real-world business entities.

• Analysts know which business measures they are interested in examining,
which dimensions and attributes make the data meaningful, and how the
dimensions of their business are organized into levels and hierarchies.
Cube:
• A cube is a data structure that can be imagined as a
multidimensional spreadsheet. To picture it, take a
spreadsheet and put years on the columns and departments on
the rows: that is a two-dimensional cube. Now create multiple
sheets with data of the same structure, say one sheet per
country, and you have a three-dimensional cube.
• Facts and Measures: A fact is the most detailed piece of
information that can be measured. Examples of facts are a
contract, a spending, a phone call or a visit.
We can measure:
• contract: financial amount, discount, planned amount
• spending: financial amount, quantity
• phone call: duration, cost
• visit: duration
Those measurable properties, such as amount, discount or
duration, are called measures.
We are mostly interested in a summarized view: “what was the
overall spending?”, “what is the average call duration?” or
“how many contracts are there?” Those computed values are
called aggregates or aggregated measures.
Facts might have multiple measures, or they might have
none. Even if there are no measures, we can still answer
questions of the type “how many?”.
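
The relationship between facts, measures and aggregates can be sketched in a few lines. The call records and their values below are made up for illustration; each record is one fact, and "duration" and "cost" are its measures.

```python
# Each fact is one phone call; "duration" (seconds) and "cost" are measures.
# The values are hypothetical sample data.
calls = [
    {"duration": 120, "cost": 0.50},
    {"duration": 300, "cost": 1.25},
    {"duration": 60, "cost": 0.25},
]

# Aggregates are values computed over many facts.
total_cost = sum(call["cost"] for call in calls)
average_duration = sum(call["duration"] for call in calls) / len(calls)

# "How many?" can be answered even when a fact has no measures at all.
call_count = len(calls)
```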
• Dimensions:OLAP is suitable mostly for data which can be
categorized – grouped by categories. The categorical view of
data should be also the main interest of the data analysis.
Example of categories might be: color, department, location
or even a date.
• The categories are called dimensions.
Dimensions provide context for facts:
• Where did that happen?
• When was the contract signed?
• What kind of goods or services was in the contract?
Dimensions are used to filter queries:
• What was the spending last year?
• How many contracts signed by the department of Health?
They are used to control scope of aggregation of facts:
• What was the number of contracts by department?
• What was the average visit duration per month?
• What are the sales of each product?
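
The two uses of dimensions listed above, filtering and controlling the scope of aggregation, can be sketched as follows. The contract records, the dimension names ("year", "department") and the amounts are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical contract facts: two dimensions (year, department), one measure.
contracts = [
    {"year": 2009, "department": "Health", "amount": 100},
    {"year": 2010, "department": "Health", "amount": 250},
    {"year": 2010, "department": "Education", "amount": 400},
    {"year": 2010, "department": "Health", "amount": 50},
]

# Dimensions as filters: restrict which facts take part in the aggregate.
spending_2010 = sum(c["amount"] for c in contracts if c["year"] == 2010)
health_contracts = sum(1 for c in contracts if c["department"] == "Health")

# Dimensions as aggregation scope: one aggregate per dimension member.
contracts_by_department = defaultdict(int)
for c in contracts:
    contracts_by_department[c["department"]] += 1
```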
Hierarchies and Levels:
Concept Hierarchies
• We might be interested in the amount per year, then per month for
a particular year; products can be grouped by categories and
subcategories; location might be defined by country, and a country
might have multiple cities… Those are concept hierarchies of
dimensions.
• A hierarchy has multiple levels, and there might be various
hierarchical views of any dimension. For example, the date
might be split by year, month and day. Or it might be split by
year, quarter and month with no day (because we have no daily
data), or by year and week (for weekly data).
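
A minimal sketch of rolling facts up along different hierarchies of the same date dimension is shown below. The hierarchy names ("ymd", "yqm"), the helper roll_up and the sample amounts are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical facts: (date, amount).
facts = [
    (date(2010, 1, 15), 100),
    (date(2010, 2, 3), 200),
    (date(2010, 5, 20), 50),
    (date(2011, 1, 9), 70),
]

# Two hierarchical views of the same date dimension.
HIERARCHIES = {
    "ymd": ["year", "month", "day"],
    "yqm": ["year", "quarter", "month"],
}

# How to derive each level's value from a date.
LEVEL_KEYS = {
    "year": lambda d: d.year,
    "quarter": lambda d: (d.month - 1) // 3 + 1,
    "month": lambda d: d.month,
    "day": lambda d: d.day,
}


def roll_up(facts, hierarchy, depth):
    """Sum amounts grouped by the first `depth` levels of a hierarchy."""
    levels = HIERARCHIES[hierarchy][:depth]
    totals = defaultdict(int)
    for when, amount in facts:
        key = tuple(LEVEL_KEYS[level](when) for level in levels)
        totals[key] += amount
    return dict(totals)


per_year = roll_up(facts, "ymd", 1)
per_quarter = roll_up(facts, "yqm", 2)
```

Increasing depth drills down to a finer level; decreasing it rolls the same facts up to a coarser one.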
Slicing and Dicing
• We have a data cube full of facts. How can we
explore the data? We slice the cube. What
does that mean?
• Say we have a data cube of contracts with the
dimensions time, country and type (of
procured subject).
• We might be interested only in spending in 2010; that is
a slice of the cube along the time dimension.
• Slicing and dicing is an operation that filters the data cells of a
cube and narrows our focus from a broader view.
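
One possible way to sketch this filtering is shown below, using the contracts cube with the dimensions time, country and type. The cell values, country codes and the helper name cut are assumptions for this example. By common convention, a slice fixes a single member of one dimension, while a dice restricts several dimensions at once.

```python
# Cube cells keyed by (year, country, type) -> contracted amount.
# All values are hypothetical.
DIMENSIONS = ("year", "country", "type")
cube = {
    (2009, "SK", "goods"): 10,
    (2010, "SK", "goods"): 20,
    (2010, "SK", "services"): 30,
    (2010, "CZ", "goods"): 40,
    (2011, "CZ", "services"): 50,
}


def cut(cube, **allowed):
    """Keep only the cells whose coordinates fall in the allowed sets."""
    result = {}
    for coords, value in cube.items():
        cell = dict(zip(DIMENSIONS, coords))
        if all(cell[name] in values for name, values in allowed.items()):
            result[coords] = value
    return result


# Slice: fix one member of the time dimension.
slice_2010 = cut(cube, year={2010})
spending_2010 = sum(slice_2010.values())

# Dice: restrict more than one dimension at the same time.
dice = cut(cube, year={2010, 2011}, country={"CZ"})
```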
The Relational Implementation of the Model

• The relational implementation of the multidimensional data model is
typically a star schema, as shown in Figure , or a snowflake schema. A star
schema is a convention for organizing the data into dimension tables, fact
tables, and materialized views. Ultimately, all of the data is stored in
columns, and metadata is required to identify the columns that function
as multidimensional objects.
Dimension Tables:A star schema stores all of the information about
a dimension in a single table. Each level of a hierarchy is
represented by a column or column set in the dimension table. A
dimension object can be used to define the hierarchical relationship
between two columns (or column sets) that represent two levels of a
hierarchy; without a dimension object, the hierarchical relationships
are defined only in metadata. Attributes are stored in columns of the
dimension tables.

Fact Tables:Measures are stored in fact tables. Fact tables contain a


composite primary key, which is composed of several foreign keys
(one for each dimension table) and a column for each measure that
uses these dimensions.
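
A small star schema can be sketched with Python's built-in sqlite3 module. The table names (dim_time, dim_product, fact_sales), the columns and the rows are all assumptions for this example; they only illustrate the layout described above, with one table per dimension, one column per hierarchy level, and a fact table whose composite primary key is built from the foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension; each hierarchy level is a column.
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, category TEXT, name TEXT
);
-- Fact table: composite key of foreign keys plus one column per measure.
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL,
    quantity INTEGER,
    PRIMARY KEY (time_id, product_id)
);
INSERT INTO dim_time VALUES (1, 2010, 1, 1), (2, 2010, 1, 2), (3, 2010, 2, 4);
INSERT INTO dim_product VALUES
    (10, 'Drinks', 'Tea'), (11, 'Drinks', 'Coffee'), (12, 'Food', 'Bread');
INSERT INTO fact_sales VALUES
    (1, 10, 5.0, 2), (1, 12, 3.0, 1), (2, 11, 8.0, 4), (3, 10, 2.5, 1);
""")

# Aggregate the "amount" measure at the quarter and category levels by
# joining the fact table to its dimension tables.
rows = conn.execute("""
    SELECT t.quarter, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY t.quarter, p.category
    ORDER BY t.quarter, p.category
""").fetchall()
```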
• Materialized Views:
Aggregate data is calculated on the basis of the hierarchical
relationships defined in the dimension tables. These aggregates
are stored in separate tables, called summary tables or
materialized views. Oracle provides extensive support for
materialized views, including automatic refresh and query
rewrite.
• Queries can be written either against a fact table or against a
materialized view. If a query is written against the fact table
that requires aggregate data for its result set, the query is either
redirected by query rewrite to an existing materialized view, or
the data is aggregated on the fly.

• Each materialized view is specific to a particular combination
of levels; in Figure , only two materialized views are shown of
a possible 27 (3 dimensions with 3 levels have 3**3 possible
level combinations).
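
The count of 27 can be checked by enumerating the level combinations directly. The dimension and level names below are assumptions chosen for illustration; only the shape (3 dimensions with 3 levels each) comes from the text.

```python
from itertools import product

# Hypothetical dimensions, each with three hierarchy levels.
levels = {
    "time": ["year", "quarter", "month"],
    "product": ["category", "subcategory", "item"],
    "geography": ["country", "region", "city"],
}

# Every materialized view corresponds to one choice of level per dimension.
combinations = list(product(*levels.values()))
view_count = len(combinations)
```

Here view_count equals 3**3, and each tuple such as ("year", "category", "country") names the level combination that one materialized view would aggregate to.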
The Analytic Workspace Implementation of the Model
• Analytic workspaces have several different types of data containers,
such as dimensions, variables, and relations. Each type of container
can be used in a variety of ways to store different types of
information.
• Multidimensional Data Storage in Analytic Workspaces
– In the logical multidimensional model, a cube represents all measures
with the same shape, that is, the exact same dimensions. In a cube
shape, each edge represents a dimension. The dimension members are
aligned on the edges and divide the cube shape into cells in which data
values are stored.
Build/Design a Data Warehouse
• There are several reasons why organizations consider data warehousing a
critical need. These drivers for data warehousing can be found in the
business climate of a global marketplace.
• From a business perspective, to survive and succeed in today’s highly
competitive global environment, business users demand business answers,
mainly because:
• Decisions need to be made quickly and correctly, using all available
data.
• Users are business domain experts, not computer professionals.
• The amount of data is doubling every 18 months, which affects
response time and the sheer ability to comprehend its content.
• Competition is heating up in the areas of business intelligence and
added information value.
Nine-step method for the design of a Data Warehouse
• Choosing the subject matter
• Deciding what a fact table represents
• Identifying and conforming the dimensions
• Choosing the facts
• Storing precalculations in the fact table
• Rounding out the dimension tables
• Choosing the duration of the database
• Tracking slowly changing dimensions
• Deciding the query priorities and query modes
Choosing the subject matter of a particular data mart
• The first data mart you build should be the one with the most bang for the
buck. It should simultaneously answer the most important business questions
and be the most accessible in terms of data extraction.
• According to Kimball, a great place to start in most enterprises is to build a
data mart that consists of customer invoices or monthly statements. This
data is probably fairly accessible and of fairly high quality.
• The best data source in any enterprise is the record of how much money
customers owe us.
Deciding exactly what a fact table record represents
• According to Kimball, this step may seem like a technical detail at this
early point. The fact table is the large central table in the dimensional
design that has a multipart key.
• Each component of the multipart key is a foreign key to an individual
dimension table.
• In the example of customer invoices, the “grain” of the fact table is the
individual line item on the invoice.
• Once the fact table representation is decided, a coherent discussion of what
the dimensions of the data mart’s fact table are can take place.
