Data Warehousing-Notes(Module -I & II) (1) (1)

The document provides an overview of data warehousing, including its definition, characteristics, architecture, and types such as Enterprise Data Warehouse, Operational Data Store, and Data Mart. It emphasizes the need for data warehousing in enhancing business intelligence, standardizing data, and providing historical data for analysis. Additionally, it discusses the components of data warehousing, including sourcing, metadata, access tools, and the implementation steps for creating data marts.


Data Warehousing & its application

Module-I:
• Topic: The Need For data warehousing, Paradigm shift, Data Warehouse Definition
and Characteristics, Data warehouse Architecture, Sourcing, Acquisition, Cleanup and
Transformation, Metadata, Access tools, Data marts, Data Warehouse administration
and Management, Building a data warehouse: business consideration, technical
consideration, design consideration, implementation consideration, integrated
solutions, Benefits of data warehousing. Data Warehouse Architecture: Two and Three
tier Data Warehouse architecture.

Introduction to Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyse a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a
data warehouse should never be altered.
Types of Data Warehouses (DWH)
Typically, enterprise systems use three main types of data warehouses (DWH):

1. Enterprise Data Warehouse (EDW): As a centralized data warehouse, an EDW provides a holistic approach to organizing and presenting data.
2. Operational Data Store (ODS): An ODS is a type of data store that is suitable when neither the OLTP system nor a DWH can support a business's reporting requirements.
3. Data Mart: A data mart is designed for departmental data, such as sales, finance, and supply chain.

2. Need For data warehousing:

 The Data Warehouse is an environment, not a product.
 Provides architectural decision support, giving users access to current and historical decision-support data.
 A data warehouse is usually used for linking and analyzing heterogeneous sources of
business data.
 Data warehouses gather data from multiple sources (including databases), with an
emphasis on storing, filtering, retrieving and in particular, analyzing huge quantities of
organized data.
 Enhancing the turnaround time for analysis and reporting: Data warehouse allows
business users to access critical data from a single source enabling them to take quick
decisions. They need not waste time retrieving data from multiple sources. The business
executives can query the data themselves with minimal or no support from IT which in
turn saves money and time.
 Improved Business Intelligence: Data warehouse helps in achieving the vision for the
managers and business executives. Outcomes that affect the strategy and procedures of
an organization will be based on reliable facts and supported with evidence and
organizational data.
 Benefit of historical data: Transactional data stores data on a day to day basis or for a
very short period of duration without the inclusion of historical data. In comparison, a
data warehouse stores large amounts of historical data which enables the business to
include time-period analysis, trend analysis, and trend forecasts.
 Standardization of data: The data from heterogeneous sources are available in a single
format in a data warehouse. This simplifies the readability and accessibility of data. For
example, gender is denoted as Male/ Female in Source 1 and m/f in Source 2 but in a
data warehouse the gender is stored in a format which is common across all the
businesses i.e. M/F.
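As a sketch of the standardization described above, the following snippet maps both source encodings to the warehouse-standard M/F format; the source value lists are invented for illustration:

```python
# Hypothetical example: normalizing gender codes from two source systems
# into the warehouse-standard "M"/"F" format during the ETL step.

def standardize_gender(value):
    """Map heterogeneous source encodings to the warehouse standard."""
    mapping = {"male": "M", "m": "M", "female": "F", "f": "F"}
    return mapping[value.strip().lower()]

source1 = ["Male", "Female", "Male"]   # Source 1 uses Male/Female
source2 = ["m", "f", "f"]              # Source 2 uses m/f

standardized = [standardize_gender(v) for v in source1 + source2]
print(standardized)  # ['M', 'F', 'M', 'M', 'F', 'F']
```

A real ETL tool would apply such mappings per column, driven by metadata rather than hard-coded dictionaries.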
 Immense ROI (Return On Investment): Return on investment refers to the additional revenue or reduced expenses a business will be able to realize from a project.

3. Characteristics of data warehousing:

Subject-oriented: A DW is always subject-oriented, as it provides information about a specific theme rather than about an organization's ongoing operations. That is, the data warehousing process is organized around a well-defined theme (subject).

Figure 3 shows Sales, Products, Customers and Account are the different themes. A
data warehouse never emphasizes only existing activities. Instead, it focuses on data
demonstration and analysis to make different decisions. It also provides an easy and
accurate demonstration of specific themes by eliminating information that is not needed
to make decisions.
Integrated: Integration involves setting up a common system of measure for all similar data from multiple systems. Data shared across several database repositories must be stored in a secure manner so that the data warehouse can access it. A data warehouse integrates data from various sources and combines it in a relational database. The data must be consistent, readable, and consistently coded.

Time-Variant: Information may be held at various intervals, such as weekly, monthly, and yearly, as shown in Figure 5. The data warehouse covers a broader time range of data than the operational systems. Because the data stored in the warehouse is associated with a certain period of time, it supports prediction and provides history; an aspect of time is embedded within it. One other facet of the data warehouse is that the data cannot be changed, modified or updated once it is stored.
Non-Volatile: The data residing in the data warehouse is permanent, as the name non-volatile suggests. When new data is added, existing data is not erased or removed. The warehouse holds a mammoth amount of data and analyses it using warehouse technologies. Figure 6 shows the non-volatile data warehouse versus the operational database. A data warehouse is kept separate from the operational database, and thus the data warehouse does not reflect the regular changes made in the operational database.

5. Data warehouse Architecture:


 Data Warehouse architecture is based on a relational database management
server that functions as the central repository for informational data.
 Operational data and Processing is completely separate from data warehousing
processing.
 Data Warehouse architecture identifies seven Components.
1. Sourcing, Acquisition, Cleanup and Transformation & tools.
2. Metadata repository.
3. Warehouse database technology.
4. Data marts.
5. Data query, reporting, analysis, and mining tools.
6. Data Warehouse administration and Management.
7. Information delivery system
 Types of Data Warehouse Architectures:
> When designing a data warehouse, there are three types of models to consider, based on the number of tiers the architecture has:
(i) Single-tier data warehouse architecture.
(ii) Two-tier data warehouse architecture.
(iii) Three-tier data warehouse architecture.

(i) Single-tier data warehouse architecture: The single-tier architecture is not a frequently practiced approach. The main goal of such an architecture is to remove redundancy by minimizing the amount of data stored. Its primary disadvantage is that it has no component separating analytical and transactional processing.

(ii) Two-tier data warehouse architecture: The two-tier architecture includes a staging area for all data sources, before the data warehouse layer. By adding a staging area between the sources and the storage repository, you ensure all data loaded into the warehouse is cleansed and in the appropriate format.

(iii) Three-tier data warehouse architecture: The three-tier approach is the most
widely used architecture for data warehouse systems. Essentially, it consists of three
tiers:

1. The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded.

2. The middle tier is the application layer giving an abstracted view of the
database. It arranges the data to make it more suitable for analysis. This is done with
an OLAP server, implemented using the ROLAP or MOLAP model.

3. The top-tier is where the user accesses and interacts with the data. It represents
the front-end client layer. You can use reporting tools, query, analysis or data
mining tools.

6. Sourcing, Acquisition, Cleanup and Transformation:

 The data sourcing, cleanup, transformation and migration tools perform all of the conversions, summarization, key changes, structural changes and condensations needed to transform disparate data into information that can be used by decision support tools. The functionality includes:
 Removing unwanted data from operational databases.
 Converting to common data names and definitions.
 Establishing defaults for missing data.
 Accommodating source data definition changes.
The data sourcing, cleanup, transformation and migration tools also have to deal with some significant issues, including:
 Database heterogeneity: DBMSs differ widely in data models, data access languages, data navigation operations, concurrency, integrity, recovery, etc.
 Data heterogeneity: differences in the way data is defined and used in different models.
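The cleanup functions listed above can be sketched as follows; the field names (cust_nm, session_id, country) and mapping tables are purely illustrative assumptions, not part of any real tool:

```python
# A minimal sketch of a cleanup/transformation step: convert to common
# names, remove unwanted operational fields, and supply defaults.

COMMON_NAMES = {"cust_nm": "customer_name", "CustName": "customer_name"}
DEFAULTS = {"country": "UNKNOWN"}

def transform(record):
    # Convert source-specific field names to common warehouse names.
    clean = {COMMON_NAMES.get(k, k): v for k, v in record.items()}
    # Remove unwanted operational data not needed for analysis.
    clean.pop("session_id", None)
    # Establish defaults for missing data.
    for field, default in DEFAULTS.items():
        clean.setdefault(field, default)
    return clean

rec = {"cust_nm": "Asha", "session_id": "x91"}
print(transform(rec))  # {'customer_name': 'Asha', 'country': 'UNKNOWN'}
```

In practice the name mappings and defaults would themselves be stored as metadata, so they can accommodate source data definition changes without code edits.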
Principles of Data Warehousing:
• Load Performance:

Data warehouses require increased loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour, and must not artificially constrain the volume of data a business can load.

• Load Processing:

Many steps must be taken to load new or update data into the data warehouse including data
conversion, filtering, reformatting, indexing and metadata update.

• Data Quality Management: Fact-based management demands the highest data quality. The
warehouse must ensure local consistency, global consistency, and referential integrity despite
“dirty” sources and massive database size.

• Query Performance:

Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must complete in seconds, not days.

• Terabyte Scalability: Data warehouse sizes are growing at astonishing rates. Today these range from a few gigabytes to hundreds of gigabytes, with terabyte-sized data warehouses emerging.

Data Warehouse Seven Major Components:


1. Sourcing, Acquisition, Cleanup and Transformation tools:

The data sourcing, cleanup and transformation tools perform all of the conversions, summarization, key changes, structural changes, and condensations needed to transform disparate data into information that can be used by the decision support tool. Their functions include:
 Maintaining metadata.
 Removing unwanted data from operational databases.
 Converting to common data names and definitions.
 Calculating summaries and derived data.
 Establishing defaults for missing data.
 Accommodating source data definition changes.
2. Metadata:
Metadata is data about data that describes the data warehouse.
It is used for building, maintaining, managing, and using the data warehouse.
Metadata is divided into two main categories:
(i) Technical metadata.
(ii) Business metadata.
(i) Technical metadata: It contains information about warehouse data for
use by warehouse designers and administrators when carrying out
warehouse development and management tasks.
 Technical metadata documents include:
1. Information about data sources
2. Transformation definition.
3. Warehouse object and data structure definition for data target.
4. Data mapping operations.
5. Access authorization, backup history, data acquisition history
etc.
(ii) Business metadata: It contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.
 Business metadata documents include:
1. Subject areas and information object type, including queries,
reports, images, video/audio clips.
2. Internet home pages.
3. Other information to support all data warehousing
components.
4. Operational information e.g., data history, ownership, usage
data etc.
(iii) Operational metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data's migration and the transformations applied to it.
Metadata Repository:

A metadata repository is a database or other storage mechanism that is used to store metadata about data.

A metadata repository can be used to manage, organize, and maintain metadata in a consistent and structured manner, and can facilitate the discovery, access, and use of data.
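As a toy illustration, the sketch below models a metadata repository as an in-memory Python class; real repositories are database-backed, and the entry fields used here (source, last_load, subject_area, owner) are invented for the example:

```python
# A toy metadata repository holding technical and business metadata
# entries about warehouse objects. Field names are illustrative only.

class MetadataRepository:
    def __init__(self):
        self._entries = {}

    def register(self, object_name, kind, **attrs):
        """Store technical or business metadata about a warehouse object."""
        self._entries[object_name] = {"kind": kind, **attrs}

    def lookup(self, object_name):
        return self._entries.get(object_name)

repo = MetadataRepository()
repo.register("sales_fact", kind="technical",
              source="orders_db", last_load="2024-01-31")
repo.register("Quarterly Revenue Report", kind="business",
              subject_area="Sales", owner="Finance")

print(repo.lookup("sales_fact")["source"])  # orders_db
```

The point is the structured, consistent storage of descriptions: designers query technical entries, while end users browse business entries.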

3. Warehouse database technology:

 The central data warehouse database is a cornerstone of the data warehousing environment, and is always implemented on relational database management system (RDBMS) technology.
 Different technological approaches include parallel relational database design, innovative index structures, and multidimensional databases.
4. Access tools: Access tools allow users to interact with the data in your data warehouse.
Examples of access tools include: query and reporting tools, application development
tools, data mining tools, and OLAP tools.

5. Data Marts:
 A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data subject, which may be distributed to support business needs. Data marts are analytical data stores designed to focus on particular business functions for a specific community within an organization. Data marts are derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created from the union of organizational data marts.
 The fundamental use of a data mart is Business Intelligence (BI) applications. BI is used to gather, store, access, and analyse records. Data marts can be used by smaller businesses to utilize the data they have accumulated, since a data mart is less expensive to implement than a data warehouse.

Reasons for creating a data mart:

 Creates collective data for a group of users
 Easy access to frequently needed data
 Ease of creation
 Improves end-user response time
 Lower cost than implementing a complete data warehouse.
 Potential clients are more clearly defined than in a comprehensive data
warehouse
 It contains only essential business data and is less cluttered

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts
o Independent Data Marts
Dependent Data Marts

A dependent data mart is a logical or physical subset of a larger data warehouse. In this technique, the data marts are treated as subsets of a data warehouse: first a data warehouse is created, from which various data marts can then be created. These data marts depend on the data warehouse and extract the essential records from it. Because the data warehouse creates the data mart, there is no need for data mart integration. This is also known as a top-down approach.

Independent Data Marts

The second approach is independent data marts (IDM). Here, independent data marts are created first, and then a data warehouse is designed using these multiple independent data marts. In this approach, since all the data marts are designed independently, integration of the data marts is required. It is also termed a bottom-up approach, as the data marts are integrated to develop a data warehouse.
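The top-down (dependent) approach can be sketched with SQLite: a data mart is carved out of an existing warehouse table. The table and column names, and the figures, are hypothetical:

```python
import sqlite3

# Sketch of a dependent data mart: the sales mart is extracted as a
# subset of a (toy) warehouse table, so no mart integration is needed.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE warehouse_sales (dept TEXT, amount REAL)")
cur.executemany("INSERT INTO warehouse_sales VALUES (?, ?)",
                [("sales", 100.0), ("finance", 250.0), ("sales", 75.0)])

# The sales data mart extracts only its department's records.
cur.execute("CREATE TABLE sales_mart AS "
            "SELECT * FROM warehouse_sales WHERE dept = 'sales'")

cur.execute("SELECT COUNT(*), SUM(amount) FROM sales_mart")
print(cur.fetchone())  # (2, 175.0)
```

In the independent (bottom-up) approach the direction reverses: each department builds its own table first, and the warehouse is later assembled by integrating them.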

Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts

It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.

Steps in Implementing a Data Mart

The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make
informed decisions and manage it over time. So, the steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from
initiating the request for a data mart through gathering data about the requirements and
developing the logical and physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.

Constructing

This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures, such as tablespaces, associated with the data mart.
2. Creating the schema objects, such as tables and indexes, described in the design step.
3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to the getting data from the source, cleaning it up,
modifying it to the right format and level of detail, and moving it into the data mart.

It involves the following tasks:


1. Mapping data sources to target data sources
2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports,
charts and graphs and publishing them.

It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database operations and object names into business terms, so that end users can interact with the data mart using words that relate to the business functions.
2. Set up and manage database structures, like summarized tables, that help queries submitted through the front-end tools execute rapidly and efficiently.
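A minimal sketch of such a meta layer is a mapping from business terms to physical column names; the names below are illustrative assumptions, not a real front-end tool's vocabulary:

```python
# Toy "meta layer": translates business terms into physical schema
# names so end users never see raw table.column identifiers.

META_LAYER = {
    "Customer Name": "dim_customer.cust_name",
    "Units Sold": "fact_sales.units_sold",
    "Region": "dim_location.region",
}

def to_sql_columns(business_terms):
    """Resolve user-facing terms to the columns a query generator needs."""
    return [META_LAYER[t] for t in business_terms]

cols = to_sql_columns(["Region", "Units Sold"])
print(cols)  # ['dim_location.region', 'fact_sales.units_sold']
```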

Managing

This step contains managing the data mart over its lifetime. In this step, management functions
are performed as:

1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.

6. Data query, reporting, analysis, and mining tools:


This category can be further divided into two groups:
1. Reporting tools: These are further divided into two types:
(i) Production reporting tools: generate regular operational reports.
(ii) Desktop report writers: inexpensive desktop tools designed for end users.
2. Managed query tools: These tools shield end users from the complexities of SQL and database structures by inserting a meta layer between users and the database.
 The meta layer is the software that provides subject-oriented views of a
database and supports point-and-click creation of SQL.
7. Data Warehouse administration and Management:
To summarize, managing data warehouse includes-
 Security and priority management
 Monitoring updates from multiple sources
 Data quality checks
 Managing and updating metadata
 Auditing and reporting
 Purging data
 Replicating, sub setting and distributing data
 Backup and recovery
 Data warehouse storage management.
8. Information delivery system:
It is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to a user-specified scheduling algorithm.

Benefits of data warehousing

Key points:
1. Saves Time

2. Improves Data Quality

3. Improves Business Intelligence

4. Leads to Data Consistency

5. Enhances Return on Investment (ROI)

6. Stores Historical Data

7. Increases Data Security


Module-2(Data Warehouse Modelling)
Data cube: A multidimensional data model: stars, snowflakes, and fact constellations:
schemas for multidimensional data models, dimensions: the role of concept hierarchies,
measures: their categorization and computation, typical OLAP operations, efficient data cube
computation, the compute cube operator and the curse of dimensionality, partial
materialization: selected computation of cuboids, indexing OLAP data: bitmap index and join index

Data warehouse modelling is the process of designing the schemas of the detailed and summarized information of the data warehouse. The goal of data warehouse modelling is to develop a schema describing the reality, or at least a part of it, which the data warehouse is needed to support.
Data warehouse modelling is an essential stage of building a data warehouse for two
main reasons. Firstly, through the schema, data warehouse clients can visualize the
relationships among the warehouse data, to use them with greater ease. Secondly, a
well-designed schema allows an effective data warehouse structure to emerge, to help
decrease the cost of implementing the warehouse and improve the efficiency of using
it.
The data within the specific warehouse itself has a particular architecture with
the emphasis on various levels of summarization, as shown in figure:
Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at a level of detail consistent with current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current, detailed level and is usually stored on disk storage. When building the data warehouse, one has to consider what unit of time the summarization is done over, and also what components or attributes the summarized data will contain.
Highly summarized data is compact and directly available and can even be found
outside the warehouse.
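The summarization levels above can be illustrated with a small sketch over invented daily sales figures: current detail rolls up to lightly summarized (monthly) and then to highly summarized (yearly) data:

```python
from collections import defaultdict

# Current detail (daily sales) summarized to two coarser levels.
daily_sales = [("2024-01-05", 120), ("2024-01-20", 80),
               ("2024-02-11", 200), ("2024-02-28", 50)]

monthly = defaultdict(int)          # lightly summarized data
for date, amount in daily_sales:
    monthly[date[:7]] += amount     # group by YYYY-MM

yearly = defaultdict(int)           # highly summarized data
for month, amount in monthly.items():
    yearly[month[:4]] += amount     # group by YYYY

print(dict(monthly))  # {'2024-01': 200, '2024-02': 250}
print(dict(yearly))   # {'2024': 450}
```

Here the unit of time (month, year) and the retained attribute (total amount) are exactly the design decisions the text says must be made when summarizing.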
A multidimensional data model:

 Data warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.
 Various multidimensional models are shown: star schema, snowflake schema,
and fact constellation.
 And how they can be used in basic OLAP operations to allow interactive mining
at multiple levels of abstraction.
 “What is a data cube”? Ans- A data cube allows data to be modelled and
viewed in multiple dimensions. It is defined by dimensions and facts.
 Dimensions are the perspectives or entities with respect to which an
organization wants to keep records. For example, All Electronics may create a
sales data warehouse in order to keep records of the store’s sales with respect to
the dimensions time, item, branch, and location. These dimensions allow the
store to keep track of things like monthly sales of items and the branches and
locations at which the items were sold. Each dimension may have a table
associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes
item name, brand, and type. Dimension tables can be specified by users or
experts, or automatically generated and adjusted based on data distributions.
 A multidimensional data model is typically organized around a central theme,
such as sales. This theme is represented by a fact table.
 Facts are numeric measures.
 Think of them as the quantities by which we want to analyse relationships
between dimensions. Examples of facts for a sales data warehouse include
dollars sold (sales amount in dollars), units sold (number of units sold), and
amount budgeted. The fact table contains the names of the facts, or measures,
as well as keys to each of the related dimension tables. You will soon get a
clearer picture of how this works when we look at multidimensional schemas.
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional
Data Models:

 Star schema: The most common modelling paradigm is the star schema, in which the
data warehouse contains (1) a large central table (fact table) containing the bulk of
the data, with no redundancy, and (2) a set of smaller attendant tables (dimension
tables), one for each dimension.
 The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
 Example 4.1 Star schema. In a star schema for All Electronics, sales are considered along four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g., time key and item key) are system-generated identifiers.
 Notice that in the star schema, each dimension is represented by only one table, and
each table contains a set of attributes. For example, the location dimension table
contains the attribute set {location key, street, city, province or state, country}. This
constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are
both cities in the state of Illinois, USA. Entries for such cities in the location dimension
table will create redundancy among the attributes province or state and country; that is,
(..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes within a
dimension table may form either a hierarchy (total order) or a lattice (partial order).
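A minimal star schema along these lines can be sketched in SQLite, with a central fact table keyed into dimension tables; for brevity only two of the four dimensions are used, and all data values are invented:

```python
import sqlite3

# Toy star schema: fact_sales in the center, dim_time and dim_item
# around it, joined via system-generated keys.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER, item_key INTEGER,
    dollars_sold REAL, units_sold INTEGER);
INSERT INTO dim_time VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_item VALUES (10, 'Acme'), (11, 'Zeta');
INSERT INTO fact_sales VALUES (1, 10, 500.0, 5), (1, 11, 300.0, 3),
                              (2, 10, 700.0, 7);
""")

# Join the fact table to a dimension to get dollars sold per quarter.
cur.execute("""
SELECT t.quarter, SUM(f.dollars_sold)
FROM fact_sales f JOIN dim_time t ON f.time_key = t.time_key
GROUP BY t.quarter ORDER BY t.quarter""")
print(cur.fetchall())  # [('Q1', 800.0), ('Q2', 700.0)]
```

Note how the fact table stores only keys and measures, while descriptive attributes (quarter, brand) live once in the dimension tables.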

 Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables.
 The resulting schema graph forms a shape similar to a snowflake. The major difference
between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form to reduce redundancies. Such a table
is easy to maintain and saves storage space.
 However, this space savings is negligible in comparison to the typical magnitude of
the fact table. Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query. Consequently, the system
performance may be adversely impacted. Hence, although the snowflake schema
reduces redundancy, it is not as popular as the star schema in data warehouse design.
 Example - Snowflake schema: A snowflake schema for All Electronics sales is given
in Figure 4.7. Here, the sales fact table is identical to that of the star schema in Figure
4.6. The main difference between the two schemas is in the definition of dimension
tables. The single dimension table for item in the star schema is normalized in the
snowflake schema, resulting in new item and supplier tables. For example, the item
dimension table now contains the attributes item key, item name, brand, type, and
supplier key, where supplier key is linked to the supplier dimension table, containing
supplier key and supplier type information. Similarly, the single dimension table for
location in the star schema can be normalized into two new tables: location and city.
The city key in the new location table links to the city dimension. Notice that, when
desirable, further normalization can be performed on province or state and country in
the snowflake schema.

 Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
 Example-Fact constellation: A fact constellation schema is shown in Figure 4.8. This
schema specifies two fact tables, sales and shipping. The sales table definition is
identical to that of the star schema (Figure 4.6). The shipping table has five dimensions,
or keys—item key, time key, shipper key, from location, and to location—and two
measures—dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.
Dimensions: The Role of Concept Hierarchies:
 A concept hierarchy defines a sequence of mappings from a set of low-level concepts
to higher-level, more general concepts.
 Consider a concept hierarchy for the dimension location. City values for location
include Vancouver, Toronto, New York, and Chicago. Each city, however, can be
mapped to the province or state to which it belongs.
 Example: Vancouver can be mapped to British Columbia, and Chicago to Illinois. The
provinces and states can in turn be mapped to the country (e.g., Canada or the United
States) to which they belong. These mappings form a concept hierarchy for the
dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level,
more general concepts (i.e., countries).
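The location hierarchy in this example can be sketched as simple Python mappings (city to province/state to country), using only the cities named above:

```python
# The location concept hierarchy: city -> province/state -> country.

CITY_TO_STATE = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York", "Chicago": "Illinois"}
STATE_TO_COUNTRY = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York": "USA", "Illinois": "USA"}

def roll_up_city(city, level):
    """Map a low-level concept (city) to a higher level of the hierarchy."""
    state = CITY_TO_STATE[city]
    return state if level == "state" else STATE_TO_COUNTRY[state]

print(roll_up_city("Chicago", "state"))      # Illinois
print(roll_up_city("Vancouver", "country"))  # Canada
```

OLAP roll-up along the location dimension amounts to aggregating after applying exactly this kind of mapping.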
Measures: Their Categorization and Computation:
 A multidimensional point in the data cube space can be defined by a set of dimension–value pairs; for example, time = “Q1”, location = “Vancouver”, item = “computer”.
 A data cube measure is a numeric function that can be evaluated at each point in the
data cube space.
 A measure value is computed for a given point by aggregating the data corresponding
to the respective dimension–value pairs defining the given point.
 Measures can be organized into three categories—distributive, algebraic, and
holistic—based on the kind of aggregate functions used.
 Distributive: An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set, the function can be computed in a distributed manner. For example, sum(), count(), min(), and max() are distributive aggregate functions.
 Algebraic: An aggregate function is algebraic if it can be computed by an algebraic
function with M arguments (where M is a bounded positive integer), each of which is
obtained by applying a distributive aggregate function. For example, avg() (average)
can be computed by sum()/count(), where both sum() and count() are distributive
aggregate functions.
 Holistic: An aggregate function is holistic if there is no constant bound on the storage
size needed to describe a subaggregate. Common examples of holistic functions
include median(), mode(), and rank(). A measure is holistic if it is obtained by
applying a holistic aggregate function.
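The three categories can be illustrated with a small, hypothetical partitioned data set (all values invented): sum() and max() combine cleanly from per-partition results, avg() is recoverable from the distributive pair (sum, count), while per-partition medians generally cannot reconstruct the true median. A minimal Python sketch:

```python
import statistics

# Hypothetical partitions of a measure column (values are made up).
partitions = [[4, 8, 15], [16, 23], [42, 7, 1]]
all_data = [v for part in partitions for v in part]

# Distributive: apply the function per partition, then to the partial results.
assert sum(sum(p) for p in partitions) == sum(all_data)
assert max(max(p) for p in partitions) == max(all_data)

# Algebraic: avg() is not distributive, but it is derivable from a bounded
# number (M = 2) of distributive subaggregates: sum() and count().
partial = [(sum(p), len(p)) for p in partitions]      # per-partition (sum, count)
total_sum = sum(s for s, _ in partial)
total_cnt = sum(c for _, c in partial)
avg_from_partials = total_sum / total_cnt
assert avg_from_partials == sum(all_data) / len(all_data)

# Holistic: the median of per-partition medians is, in general, not the
# true median; no constant-size subaggregate suffices.
per_partition_medians = [statistics.median(p) for p in partitions]
true_median = statistics.median(all_data)
assert statistics.median(per_partition_medians) != true_median
```

The last assertion shows why holistic measures resist distributed computation: combining the partition medians loses information about the underlying values.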
Typical OLAP Operations:
 “How are concept hierarchies useful in OLAP?”
 In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies.
 This organization provides users with the flexibility to view data from different
perspectives. A number of OLAP data cube operations exist to materialize these different
views, allowing interactive querying and analysis of the data at hand.
 Hence, OLAP provides a user-friendly environment for interactive data analysis.
 Some typical OLAP operations are:
 Roll-up: The roll-up operation performs aggregation on a data cube by climbing up a
concept hierarchy for a dimension, e.g., ascending the location hierarchy from the level
of city to the level of country.
 Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data, and can be realized by either stepping down a concept hierarchy for
a dimension or introducing additional dimensions.
 Slice and dice: The slice operation performs a selection on one dimension of the given
cube, resulting in a sub cube; the dice operation defines a sub cube by performing a
selection on two or more dimensions.
 Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data
axes in view to provide an alternative data presentation.
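These operations can be sketched over a tiny, hypothetical cube held as a Python dictionary (all dimension values and sales figures are invented for illustration):

```python
from collections import defaultdict

# Hypothetical base cells: (time, city, item) -> sales.
cells = {
    ("Q1", "Vancouver", "computer"): 100,
    ("Q1", "Toronto",   "computer"):  80,
    ("Q1", "Chicago",   "phone"):     60,
    ("Q2", "Vancouver", "phone"):     40,
    ("Q2", "New York",  "computer"):  90,
}
# Concept hierarchy for location: city -> country.
country_of = {"Vancouver": "Canada", "Toronto": "Canada",
              "Chicago": "USA", "New York": "USA"}

def roll_up_location(cells):
    """Roll-up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (time, city, item), sales in cells.items():
        out[(time, country_of[city], item)] += sales
    return dict(out)

def slice_time(cells, quarter):
    """Slice: select on a single dimension (time), yielding a sub cube."""
    return {k: v for k, v in cells.items() if k[0] == quarter}

def dice(cells, quarters, items):
    """Dice: select on two or more dimensions."""
    return {k: v for k, v in cells.items()
            if k[0] in quarters and k[2] in items}

print(roll_up_location(cells)[("Q1", "Canada", "computer")])  # 100 + 80 = 180
print(slice_time(cells, "Q1"))
print(dice(cells, {"Q1", "Q2"}, {"computer"}))
```

Drill-down would be the inverse of roll_up_location, which is only possible if the more detailed city-level cells are still available; aggregation discards detail.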
 Efficient data cube computation: The compute cube operator and the curse
of dimensionality
 At the core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions.
 One approach to cube computation extends SQL so as to include a compute cube
operator. The Compute cube operator computes aggregates over all subsets of the
dimensions specified in the operation. This can require excessive storage space,
especially for large numbers of dimensions.
 A major challenge related to this precomputation, however, is that the required storage
space may explode if all the cuboids in a data cube are precomputed, especially when
the cube has many dimensions. The storage requirements are even more excessive when
many of the dimensions have associated concept hierarchies, each with multiple levels.
This problem is referred to as the Curse of dimensionality.
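The size of the explosion is easy to quantify: with n dimensions, where dimension i has L_i hierarchy levels (excluding the virtual top level "all"), the total number of cuboids is the product of (L_i + 1), since each dimension can appear at any of its levels or be rolled up away entirely. A small sketch:

```python
from math import prod

def num_cuboids(levels_per_dim):
    """Total cuboids in a data cube: each dimension contributes one of its
    L hierarchy levels, or is omitted ("all"), giving prod(L_i + 1)."""
    return prod(l + 1 for l in levels_per_dim)

# With no hierarchies (1 level per dimension), an n-D cube has 2^n cuboids.
assert num_cuboids([1] * 10) == 2 ** 10            # 1,024 cuboids

# With concept hierarchies the count explodes: 10 dimensions, 4 levels each.
print(num_cuboids([4] * 10))                       # 5^10 = 9,765,625 cuboids
```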

Partial materialization: selected computation of cuboids; indexing OLAP data:
bitmap index and join index
 There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “non base” cuboids. This leads
to computing expensive multidimensional aggregates on-the-fly, which can be
extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of
computed cuboids is referred to as the full cube. This choice typically requires huge
amounts of memory space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of
possible cuboids. Alternatively, we may compute a subset of the cube, which contains
only those cells that satisfy some user-specified criterion, such as where the tuple count
of each cell is above some threshold. We will use the term sub cube to refer to the latter
case, where only some of the cells may be precomputed for various cuboids. Partial
materialization represents an interesting trade-off between storage space and response
time. The partial materialization of cuboids or sub cubes should consider three factors:
(1) identify the subset of cuboids or sub cubes to materialize;
(2) exploit the materialized cuboids or sub cubes during query processing; and
(3) efficiently update the materialized cuboids or sub cubes during load and refresh.
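The user-specified tuple-count criterion mentioned above (an iceberg-style condition) can be sketched as follows; the fact rows and dimension names are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical fact rows over the dimensions (item, city).
rows = [("computer", "Vancouver"), ("computer", "Vancouver"),
        ("phone", "Chicago"), ("phone", "Vancouver"),
        ("computer", "Chicago"), ("computer", "Vancouver")]
dims = ("item", "city")

def materialize(rows, keep_dims, min_count=1):
    """Aggregate rows onto the cuboid over keep_dims, keeping only cells
    whose tuple count meets min_count (the iceberg criterion)."""
    idx = [dims.index(d) for d in keep_dims]
    counts = Counter(tuple(r[i] for i in idx) for r in rows)
    return {cell: c for cell, c in counts.items() if c >= min_count}

# Partial materialization: visit every cuboid in the lattice, but store
# only cells whose tuple count is at least 2.
for k in range(len(dims) + 1):
    for keep in combinations(dims, k):
        print(keep, materialize(rows, keep, min_count=2))
```

With the threshold at 2, the most detailed cuboid keeps only the (computer, Vancouver) cell, trading completeness for storage, which is exactly the space/response-time trade-off described above.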

Note: Do not confuse the OLAP sense of these terms with geometry. In a data cube, a cuboid
is a single group-by aggregation over some subset of the dimensions, and the lattice of all
such cuboids forms the data cube. The cuboid holding the least generalized (most detailed)
data is the base cuboid; the one holding the single, fully aggregated total is the apex cuboid.

Indexing OLAP Data: Bitmap Index and Join Index
 To facilitate efficient data accessing, most data warehouse systems support index
structures and materialized views (using cuboids).
 The bitmap indexing method is popular in OLAP products because it allows quick
searching in data cubes. The bitmap index is an alternative representation of the record
ID (RID) list.
 In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each
value v in the attribute’s domain. If a given attribute’s domain consists of n values, then
n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the
attribute has the value v for a given row in the data table, then the bit representing that
value is set to 1 in the corresponding row of the bitmap index. All other bits for that
row are set to 0.
 Example: Bitmap indexing. In the All-Electronics data warehouse, suppose the
dimension item at the top level has four values (representing item types): “home
entertainment,” “computers,” “phone,” and “security.” Each value (e.g., “computer”) is
represented by a bit vector in the item bitmap index table. Suppose that the cube is
stored as a relation table with 100,000 rows. Because the domain of item consists of
four values, the bitmap index table requires four bit vectors (or lists), each with 100,000
bits. Figure 4.15 shows a base (data) table containing the dimensions item and city, and
its mapping to bitmap index tables for each of the dimensions.
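A bitmap index over the item attribute can be sketched in a few lines of Python (the table contents are hypothetical, loosely following the All-Electronics example):

```python
# Hypothetical base table: one row per RID, holding the item dimension value.
items = ["home entertainment", "computer", "phone", "security",
         "computer", "phone", "computer"]

# Build the bitmap index: one bit vector per distinct attribute value.
bitmap = {v: [1 if x == v else 0 for x in items] for v in set(items)}

# Each row sets exactly one bit across the item bit vectors.
assert all(sum(bits[r] for bits in bitmap.values()) == 1
           for r in range(len(items)))

# A selection such as item = "computer" is a simple bit-vector scan;
# multi-attribute predicates become cheap bitwise AND/OR of vectors.
computer_rids = [r for r, bit in enumerate(bitmap["computer"]) if bit]
print(computer_rids)  # rows holding "computer": [1, 4, 6]
```

In a real OLAP engine the bit vectors would be packed into machine words (and usually compressed), so the AND/OR operations process 64 rows per instruction.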

Join Index:
 The join indexing method gained popularity from its use in relational database query
processing.
 Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a
relational database.
 For example, if two relations R (RID, A) and S (B, SID) join on the attributes A and B,
then the join index record contains the pair (RID, SID), where RID and SID are record
identifiers from the R and S relations, respectively.
 Join indexing is especially useful for maintaining the relationship between a foreign
key and its matching primary key from the joinable relation.
 Example: Join indexing. In Example 3.4, we defined a star schema for All Electronics
of the form “sales star [time, item, branch, location]: dollars sold = sum (sales in
dollars).” An example of a join index relationship between the sales fact table and the
location and item dimension tables is shown in Figure 4.16.
 For example, the “Main Street” value in the location dimension table joins with tuples
T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item
dimension table joins with tuples T57 and T459 of the sales fact table.
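A join index amounts to storing the joinable (dimension RID, fact RID) pairs once, so the foreign-key join need not be recomputed per query. A sketch using the Main Street tuples from the example (the location keys such as L1 are invented for illustration):

```python
# Hypothetical dimension and fact rows echoing the Figure 4.16 example.
location_dim = {"L1": "Main Street", "L2": "Lake Avenue"}   # loc_key -> street
sales_fact = [("T57", "L1"), ("T238", "L1"), ("T459", "L2"),
              ("T884", "L1")]                               # (fact RID, loc_key)

# The join index registers each joinable (dimension key, fact RID) pair.
join_index = [(loc_key, rid) for rid, loc_key in sales_fact]

# Probing the index: which fact tuples join with "Main Street"?
main_street = [rid for loc_key, rid in join_index
               if location_dim[loc_key] == "Main Street"]
print(main_street)  # ['T57', 'T238', 'T884']
```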
Module-3
Data Warehouse design principles:
Building a data warehouse: Introduction, Critical Success Factors, Requirement Analysis,
planning for the data Warehouse-The data Warehouse design stage, Building and implementing
data marts. Building data warehouses, Backup and Recovery, Establish the data quality
framework, Operating the Warehouse, Recipe for a successful warehouse, Data warehouse
pitfalls.

Data Warehouse design:

 A data warehouse is a single data repository in which records from multiple data sources
are integrated for online analytical processing (OLAP).
 Data warehouse design in industry takes an approach different from view materialization:
it treats the data warehouse as a database system with particular needs, such as
answering management-related queries. The target of the design is how records from
multiple data sources should be extracted, transformed, and loaded (ETL) and organized
in a database as the data warehouse.

There are two approaches

 "Top-down" approach
 "Bottom-up" approach

"Top-down" approach:

 In the "Top-Down" design approach, a data warehouse is described as a subject-oriented,
time-variant, non-volatile and integrated data repository for the entire enterprise. Data
from different sources are validated, reformatted and saved in a normalized (up to 3NF)
database as the data warehouse.
 The data warehouse stores "atomic" information, the data at the lowest level of
granularity, from which dimensional data marts can be built by selecting the data
required for specific business subjects or particular departments.
 The advantage of this method is that it supports a single integrated data source.

Advantages of top-down design

Data Marts are loaded from the data warehouses.

Developing a new data mart from the data warehouse is very easy.
Disadvantages of top-down design

This technique is inflexible to changing departmental needs.

The cost of implementing the project is high.

"Bottom-up" approach:

 In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction
data specifically structured for query and analysis," termed the star schema. In this
approach, a data mart is created first to provide the necessary reporting and analytical
capabilities for particular business processes (or subjects). It is thus a business-driven
approach, in contrast to Inmon's data-driven approach.
 The advantage of the "bottom-up" design approach is its quick ROI: developing a data
mart, a data warehouse for a single subject, takes far less time and effort than developing
an enterprise-wide data warehouse.
 The risk of failure is also lower. The method is inherently incremental and allows the
project team to learn and grow.

Top-Down Design Approach | Bottom-Up Design Approach
Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems and integrates them into a higher one.
Inherently architected; not a union of several data marts. | Inherently incremental; can schedule essential data marts first.
Single, central storage of information about the content. | Departmental information stored.
Centralized rules and control. | Departmental rules and control.
It includes redundant information. | Redundancy can be removed.
It may see quick results if implemented with iterations. | Less risk of failure, favourable return on investment, and proof of techniques.

Building a data warehouse