Data Warehousing Notes (Module I & II)
Module-I:
• Topic: The Need For data warehousing, Paradigm shift, Data Warehouse Definition
and Characteristics, Data warehouse Architecture, Sourcing, Acquisition, Cleanup and
Transformation, Metadata, Access tools, Data marts, Data Warehouse administration
and Management, Building a data warehouse: business consideration, technical
consideration, design consideration, implementation consideration, integrated
solutions, Benefits of data warehousing. Data Warehouse Architecture: Two and Three
tier Data Warehouse architecture.
Subject-Oriented: A data warehouse can be used to analyse a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
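The single-identifier idea above can be sketched in a few lines of Python. All source names, product codes, and the conformed mapping are illustrative assumptions, not from any real system.

```python
from collections import defaultdict

# Two hypothetical sources code the same products differently.
source_a = [{"prod": "TV01", "qty": 10}, {"prod": "RAD2", "qty": 4}]
source_b = [{"prod": "tv_sony", "qty": 7}, {"prod": "radio_philips", "qty": 2}]

# Conformed mapping agreed during integration: one warehouse-wide code.
CONFORM = {"TV01": "TV", "tv_sony": "TV",
           "RAD2": "RADIO", "radio_philips": "RADIO"}

def integrate(rows):
    """Rewrite each row's product code to the single warehouse code."""
    return [{"prod": CONFORM[r["prod"]], "qty": r["qty"]} for r in rows]

# After integration, both sources identify a product in the same way,
# so measures can be combined safely.
totals = defaultdict(int)
for row in integrate(source_a) + integrate(source_b):
    totals[row["prod"]] += row["qty"]
```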
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a
data warehouse should never be altered.
Types of Data Warehouses (DWH)
Typically, enterprise systems use three main types of data warehouses (DWH): the Enterprise Data Warehouse (EDW), the Operational Data Store (ODS), and the Data Mart.
Figure 3 shows that Sales, Products, Customers and Accounts are the different themes. A data warehouse never emphasizes only current operations. Instead, it focuses on presenting and analysing data to support decision making. It also provides a simple and accurate view of a specific theme by excluding information that is not needed for decisions.
Integrated: Integration means establishing a common way of measuring all similar
data from multiple systems. The data, which may be shared across several database
repositories, must be stored in a secure manner so that the data warehouse can access it.
A data warehouse integrates data from various sources and combines it in a relational
database. The data must be consistent, readable, and uniformly coded.
(iii) Three-tier data warehouse architecture: The three-tier approach is the most
widely used architecture for data warehouse systems. Essentially, it consists of three
tiers:
1. The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded.
2. The middle tier is the application layer giving an abstracted view of the
database. It arranges the data to make it more suitable for analysis. This is done with
an OLAP server, implemented using the ROLAP or MOLAP model.
3. The top tier is where the user accesses and interacts with the data. It represents
the front-end client layer. Reporting, query, analysis, or data mining tools can be
used here.
The data sourcing, clean up, transformation and migration tools perform all of the
conversions, summarization, key changes, structural changes and condensations needed
to transform disparate data into information that can be used by decision support tools.
The functionality includes:
Removing unwanted data from operational database.
Converting to common data names and definitions.
Establishing defaults for missing data.
Accommodating source data definition changes.
The data sourcing, clean up, transformation and migration tools have to deal with some
significant issues, including:
Database heterogeneity: DBMSs differ widely in data models, data access
language, data navigation operations, concurrency, integrity, recovery, etc.
Data heterogeneity: differences in the way data is defined and used in
different models.
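The clean-up functions listed above (removing unwanted data, converting to common names, establishing defaults) can be sketched as a small transformation step. Column names, defaults, and the test-record flag below are assumptions made for illustration.

```python
# Map source-specific column names onto common warehouse definitions.
COMMON_NAMES = {"cust_nm": "customer_name", "custname": "customer_name"}
# Defaults established for missing data.
DEFAULTS = {"region": "UNKNOWN"}

def cleanse(row):
    # Convert source-specific column names to the common definitions.
    row = {COMMON_NAMES.get(col, col): val for col, val in row.items()}
    # Establish defaults for missing data.
    for col, default in DEFAULTS.items():
        row.setdefault(col, default)
    return row

def transform(rows):
    # Remove unwanted operational records (here: rows flagged as test data).
    return [cleanse(r) for r in rows if not r.get("is_test")]

clean = transform([{"cust_nm": "Asha", "is_test": True},
                   {"custname": "Ravi"}])
```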
Principles of Data Warehousing:
• Load Performance:
Data warehouses require incremental loading of new data on a periodic basis within narrow time
windows; performance on the load process should be measured in hundreds of millions of rows and
gigabytes per hour and must not artificially constrain the volume of data of a business.
• Load Processing:
Many steps must be taken to load new or update data into the data warehouse including data
conversion, filtering, reformatting, indexing and metadata update.
• Data Quality Management: Fact-based management demands the highest data quality. The
warehouse must ensure local consistency, global consistency, and referential integrity despite
“dirty” sources and massive database size.
• Query Performance:
Fact-based management must not be slowed by the performance of the data warehouse RDBMS;
large, complex queries must complete in seconds, not days.
• Terabyte Scalability: Data warehouse sizes are growing at astonishing rates. Today these range
from a few gigabytes to hundreds of gigabytes, with terabyte-sized data warehouses becoming common.
The data sourcing, cleanup and transformation tools perform all of the
conversions, summarization, key changes, structural changes, and condensations
needed to transform disparate data into information that can be used by the decision
support tools. The functionality includes:
Maintain meta data.
Removing unwanted data from operational databases.
Converting to common data names and definitions.
Calculating summaries and derived data.
Establishing defaults for missing data.
Accommodating source data definition changes.
2. Metadata:
Metadata is data about data that describes the data warehouse.
It is used for building, maintaining, managing, and using the data warehouse.
Metadata is divided into the following categories:
(i) Technical metadata.
(ii) Business metadata.
(i) Technical metadata: It contains information about warehouse data for
use by warehouse designers and administrators when carrying out
warehouse development and management tasks.
Technical metadata documents include:
1. Information about data sources
2. Transformation definition.
3. Warehouse object and data structure definition for data target.
4. Data mapping operations.
5. Access authorization, backup history, data acquisition history
etc.
(ii) Business metadata: It contains information that gives users an easy-to-
understand perspective of the information stored in the data warehouse.
Business metadata documents include:
1. Subject areas and information object type, including queries,
reports, images, video/audio clips.
2. Internet home pages.
3. Other information to support all data warehousing
components.
4. Operational information e.g., data history, ownership, usage
data etc.
(iii) Operational metadata: It includes the currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged. Lineage of data means
the history of the migrated data and the transformations applied to it.
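The three metadata categories above can be illustrated as entries in a small repository. The field names and values below are assumptions for illustration, not a standard metadata format.

```python
# Illustrative metadata repository holding technical, business, and
# operational metadata for one warehouse object.
metadata_repository = {
    "technical": {
        "source": "orders_db.orders",              # data source
        "transformation": "sum(amount) by month",  # transformation definition
        "target": "warehouse.sales_fact",          # warehouse data structure
    },
    "business": {
        "subject_area": "Sales",
        "description": "Monthly sales totals per product and branch",
    },
    "operational": {
        "currency": "active",  # active, archived, or purged
        "lineage": ["extracted from orders_db", "aggregated to monthly"],
    },
}
```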
Metadata Repository:
5. Data Marts:
A Data Mart is a subset of an organizational data store, generally oriented
to a specific purpose or primary data subject, which may be distributed to
support business needs. Data Marts are analytical data stores designed to
focus on particular business functions for a specific community within an
organization. Data marts are derived from subsets of data in a data warehouse,
though in the bottom-up data warehouse design methodology the data
warehouse is created from the union of organizational data marts.
The fundamental use of a data mart is for Business Intelligence
(BI) applications. BI is used to gather, store, access, and analyse records. Smaller
businesses can use a data mart to utilize the data they have accumulated, since it
is less expensive than implementing a full data warehouse.
There are mainly two approaches to designing data marts:
A dependent data mart is a logical subset or a physical subset of a larger data warehouse.
In this technique, the data marts are treated as subsets of a data warehouse: first a data
warehouse is created, from which various data marts can then be built. These data marts
are dependent on the data warehouse and extract the essential records from it. Since the
data warehouse creates the data marts, there is no need for separate data mart integration.
This is also known as the top-down approach.
The second approach uses independent data marts (IDM). Here, independent data marts are
created first, and then a data warehouse is designed using these multiple data marts.
Because all the data marts are designed independently, integration of the data marts is
required. This is also termed the bottom-up approach, as the data marts are integrated to
develop a data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can be helpful
in many situations, especially when ad hoc integrations are needed, such as after a new group
or product is added to the organization.
The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make
informed decisions and manage it over time. So, the steps are:
Designing
The design step is the first in the data mart process. This phase covers all of the functions from
initiating the request for a data mart through gathering data about the requirements and
developing the logical and physical design of the data mart.
Constructing
This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.
1. Creating the physical database and logical structures such as tablespaces associated
with the data mart.
2. Creating the schema objects, such as tables and indexes, described in the design step.
3. Determining how best to set up the tables and access structures.
Populating
This step includes all of the tasks related to the getting data from the source, cleaning it up,
modifying it to the right format and level of detail, and moving it into the data mart.
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports,
charts and graphs and publishing them.
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer
translates database operations and object names into business terms so that
end users can interact with the data mart using words that relate to the business
functions.
2. Set up and manage database structures, such as summarized tables, which help queries
submitted through the front-end tools to execute rapidly and efficiently.
Managing
This step involves managing the data mart over its lifetime. Typical management functions
include providing secure access, managing data growth, optimizing the system for better
performance, and ensuring the availability of data in the event of system failures.
Data warehouse modelling is the process of designing the schemas of the detailed
and summarized information of the data warehouse. The goal of data warehouse
modelling is to develop a schema describing reality, or at least a part of it,
which the data warehouse is needed to support.
Data warehouse modelling is an essential stage of building a data warehouse for two
main reasons. Firstly, through the schema, data warehouse clients can visualize the
relationships among the warehouse data, to use them with greater ease. Secondly, a
well-designed schema allows an effective data warehouse structure to emerge, to help
decrease the cost of implementing the warehouse and improve the efficiency of using
it.
The data within the specific warehouse itself has a particular architecture with
the emphasis on various levels of summarization, as shown in figure:
Older detail data is stored in some form of mass storage; it is infrequently
accessed and kept at a level of detail consistent with the current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the
current detailed level, and is usually stored on disk storage. When building the data
warehouse, one must decide what unit of time the summarization is done over, and also
which components or attributes the summarized data will contain.
Highly summarized data is compact and directly available and can even be found
outside the warehouse.
A multidimensional data model:
Data warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.
Various multidimensional models are shown: star schema, snowflake schema,
and fact constellation. We also see how they can be used in basic OLAP operations
to allow interactive mining at multiple levels of abstraction.
“What is a data cube”? Ans- A data cube allows data to be modelled and
viewed in multiple dimensions. It is defined by dimensions and facts.
Dimensions are the perspectives or entities with respect to which an
organization wants to keep records. For example, All Electronics may create a
sales data warehouse in order to keep records of the store’s sales with respect to
the dimensions time, item, branch, and location. These dimensions allow the
store to keep track of things like monthly sales of items and the branches and
locations at which the items were sold. Each dimension may have a table
associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes
item name, brand, and type. Dimension tables can be specified by users or
experts, or automatically generated and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme,
such as sales. This theme is represented by a fact table.
Facts are numeric measures.
Think of them as the quantities by which we want to analyse relationships
between dimensions. Examples of facts for a sales data warehouse include
dollars sold (sales amount in dollars), units sold (number of units sold), and
amount budgeted. The fact table contains the names of the facts, or measures,
as well as keys to each of the related dimension tables. You will soon get a
clearer picture of how this works when we look at multidimensional schemas.
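The relationship between fact and dimension tables described above can be sketched directly. The dimension keys, attribute names, and values below are illustrative assumptions in the spirit of the All Electronics example.

```python
from collections import defaultdict

# Dimension tables: each key describes one member of the dimension.
item_dim = {1: {"item_name": "TV", "brand": "Sony"},
            2: {"item_name": "Radio", "brand": "Philips"}}
time_dim = {10: {"month": "Jan"}, 11: {"month": "Feb"}}

# Fact table: each row holds dimension keys plus numeric measures.
sales_fact = [
    {"time_key": 10, "item_key": 1, "dollars_sold": 500, "units_sold": 2},
    {"time_key": 10, "item_key": 2, "dollars_sold": 100, "units_sold": 5},
    {"time_key": 11, "item_key": 1, "dollars_sold": 750, "units_sold": 3},
]

# Monthly sales of each item: resolve the keys through the dimension
# tables and aggregate the dollars_sold measure.
totals = defaultdict(int)
for row in sales_fact:
    key = (time_dim[row["time_key"]]["month"],
           item_dim[row["item_key"]]["item_name"])
    totals[key] += row["dollars_sold"]
```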
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional
Data Models:
Star schema: The most common modelling paradigm is the star schema, in which the
data warehouse contains (1) a large central table (fact table) containing the bulk of
the data, with no redundancy, and (2) a set of smaller attendant tables (dimension
tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
Example 4.1 Star schema. In a star schema for All Electronics, sales are considered
along four dimensions: time, item, branch, and location. The schema contains a
central fact table for sales that contains keys to each of the four dimensions, along with
two measures: dollars sold and units sold. To minimize the size of the fact table,
dimension identifiers (e.g., time key and item key) are system-generated identifiers.
Notice that in the star schema, each dimension is represented by only one table, and
each table contains a set of attributes. For example, the location dimension table
contains the attribute set {location key, street, city, province or state, country}. This
constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are
both cities in the state of Illinois, USA. Entries for such cities in the location dimension
table will create redundancy among the attributes province or state and country; that is,
(..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes within a
dimension table may form either a hierarchy (total order) or a lattice (partial order).
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables.
The resulting schema graph forms a shape similar to a snowflake. The major difference
between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form to reduce redundancies. Such a table
is easy to maintain and saves storage space.
However, this space savings is negligible in comparison to the typical magnitude of
the fact table. Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query. Consequently, the system
performance may be adversely impacted. Hence, although the snowflake schema
reduces redundancy, it is not as popular as the star schema in data warehouse design.
Example - Snowflake schema: A snowflake schema for All Electronics sales is given
in Figure 4.7. Here, the sales fact table is identical to that of the star schema in Figure
4.6. The main difference between the two schemas is in the definition of dimension
tables. The single dimension table for item in the star schema is normalized in the
snowflake schema, resulting in new item and supplier tables. For example, the item
dimension table now contains the attributes item key, item name, brand, type, and
supplier key, where supplier key is linked to the supplier dimension table, containing
supplier key and supplier type information. Similarly, the single dimension table for
location in the star schema can be normalized into two new tables: location and city.
The city key in the new location table links to the city dimension. Notice that, when
desirable, further normalization can be performed on province or state and country in
the snowflake schema.
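The normalization step just described can be sketched as follows: the star-schema item dimension carries the supplier attributes on every row, while the snowflake version stores them once and reaches them through an extra lookup (one more join). All keys and values are illustrative.

```python
# Star schema: supplier attributes are repeated on every item row.
item_star = {
    1: {"item_name": "TV", "brand": "Sony", "supplier_type": "wholesale"},
    2: {"item_name": "DVD", "brand": "Sony", "supplier_type": "wholesale"},
}

# Snowflake schema: the item dimension is normalized; supplier
# attributes live in their own table, referenced by supplier_key.
supplier = {100: {"supplier_type": "wholesale"}}
item_snow = {
    1: {"item_name": "TV", "brand": "Sony", "supplier_key": 100},
    2: {"item_name": "DVD", "brand": "Sony", "supplier_key": 100},
}

def supplier_type(item_key):
    # Resolving an item's supplier type now requires an extra lookup,
    # mirroring the extra join a snowflake query performs.
    return supplier[item_snow[item_key]["supplier_key"]]["supplier_type"]
```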
Efficient data cube computation: The compute cube operator and the curse
of dimensionality
At the core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions.
One approach to cube computation extends SQL so as to include a compute cube
operator. The Compute cube operator computes aggregates over all subsets of the
dimensions specified in the operation. This can require excessive storage space,
especially for large numbers of dimensions.
A major challenge related to this precomputation, however, is that the required storage
space may explode if all the cuboids in a data cube are precomputed, especially when
the cube has many dimensions. The storage requirements are even more excessive when
many of the dimensions have associated concept hierarchies, each with multiple levels.
This problem is referred to as the Curse of dimensionality.
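The storage explosion is easy to see by enumerating the cuboids: a compute cube over n dimensions (with no concept hierarchies) materializes one group-by per subset of the dimensions, i.e. 2^n cuboids. A short sketch using the four example dimensions:

```python
from itertools import combinations

# One cuboid per subset of the dimensions: 2**n cuboids for n dimensions
# (even more when concept hierarchies add levels within each dimension).
dimensions = ["time", "item", "branch", "location"]

cuboids = [combo
           for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

# 4 dimensions -> 2**4 = 16 cuboids, from the apex cuboid () with the
# grand total down to the base cuboid (time, item, branch, location).
print(len(cuboids))  # 16
```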
Note: In a data cube, each group-by over a subset of the dimensions is called a cuboid. The
full set of cuboids, one per subset of the dimensions, forms a lattice that defines the data
cube: the cuboid holding the lowest level of summarization (all dimensions present) is the
base cuboid, while the cuboid with the highest level of summarization (the grand total over
all dimensions) is the apex cuboid.
Join Index:
The join indexing method gained popularity from its use in relational database query
processing.
Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a
relational database.
For example, if two relations R (RID, A) and S (B, SID) join on the attributes A and B,
then the join index record contains the pair (RID, SID), where RID and SID are record
identifiers from the R and S relations, respectively.
Join indexing is especially useful for maintaining the relationship between a foreign
key and its matching primary keys from the joinable relation.
Example: Join indexing. In Example 3.4, we defined a star schema for All Electronics
of the form “sales star [time, item, branch, location]: dollars sold = sum (sales in
dollars).” An example of a join index relationship between the sales fact table and the
location and item dimension tables is shown in Figure 4.16.
For example, the “Main Street” value in the location dimension table joins with tuples
T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item
dimension table joins with tuples T57 and T459 of the sales fact table.
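The join index in this example can be sketched as a mapping from each dimension value to the tuple identifiers of the joinable fact rows. The tuple IDs follow the example above; the table contents are otherwise illustrative.

```python
# Dimension table: location key -> street value.
location_dim = {"L1": "Main Street", "L2": "Lake View"}

# Fact table: tuple identifier (RID/SID) -> its location foreign key.
sales_fact = {"T57": "L1", "T238": "L1", "T459": "L2", "T884": "L1"}

def build_join_index(dim, fact):
    """Map each dimension value to the fact tuples that join with it."""
    index = {}
    for tid, loc_key in fact.items():
        index.setdefault(dim[loc_key], []).append(tid)
    return index

join_index = build_join_index(location_dim, sales_fact)
# "Main Street" joins with fact tuples T57, T238, and T884.
```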
Module-3
Data Warehouse design principles:
Building a data warehouse: Introduction, Critical Success Factors, Requirement Analysis,
planning for the data Warehouse-The data Warehouse design stage, Building and implementing
data marts. Building data warehouses, Backup and Recovery, Establish the data quality
framework, Operating the Warehouse, Recipe for a successful warehouse, Data warehouse
pitfalls.
"Top-down" approach
"Bottom-up" approach
"Top-down" approach:
Developing a new data mart from the data warehouse is very easy.
Disadvantages of top-down design
"Bottom-up" approach:
1. Top-down: breaks the vast problem into smaller subproblems. Bottom-up: solves the
essential low-level problems and integrates them into a higher one.
2. Top-down: inherently architected; not a union of several data marts. Bottom-up:
inherently incremental; essential data marts can be scheduled first.
3. Top-down: may see quick results if implemented with iterations. Bottom-up: less risk
of failure, favourable return on investment, and proof of techniques.