Unit-II Data Warehousing
UNIT-II
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
1 UNIT-III 7/8/2024
Data Warehouse
➢ A data warehouse is a collection of data marts representing
historical data from different operations in the company.
➢ It collects data from multiple heterogeneous sources (databases,
flat files, text files, etc.).
➢ It stores a huge amount of data, typically covering 5 to 10 years.
This data is stored in a structure optimized for querying and data
analysis.
Informational Data:
➢Focuses on providing answers to problems posed by decision
makers
➢Summarized
➢Non-updateable
Data Warehouse Characteristics
➢It is a database designed for analytical tasks
➢Its content is periodically updated
➢It contains current and historical data to provide a historical
perspective of information.
DATA WAREHOUSE COMPONENTS
Key components
➢Data sourcing, cleanup, transformation, and migration tools
➢Metadata repository
➢Warehouse/database technology
➢Data marts
➢Data query, reporting, analysis, and mining tools
➢Data warehouse administration and management
➢ Information delivery system
Metadata repository
It is data about data. It is used for maintaining, managing and using
the data warehouse.
Technical Meta data: It contains information about data
warehouse data used by warehouse designers and administrators to
carry out development and management tasks. It includes,
➢Information about data stores.
➢Transformation descriptions, that is, mapping methods from the
operational database to the warehouse database.
➢Warehouse object and data structure definitions for target data.
Business Meta data: It contains information that gives users a view
of the information stored in the data warehouse. It includes,
➢Subject areas and information object types, including queries,
reports, images, video, audio clips, etc.
➢ Internet home pages
➢ Information related to the information delivery system
➢Data warehouse operational information such as ownerships, audit
trails, etc.
Meta data helps users understand the content and find the data.
Meta data is stored in a separate data store, known as the
informational directory or Meta data repository, which helps to
integrate, maintain and view the contents of the data warehouse.
BUILDING A DATA WAREHOUSE
Business factors:
➢Business users want to make decisions quickly and correctly
using all available data.
Technological factors:
➢To address the incompatibility of operational data stores
➢IT infrastructure is changing rapidly. Its capacity is increasing
and its cost is decreasing, so building a data warehouse is becoming easier
Business factors
Top-Down Approach: Enterprise-wide business requirements are
collected and an enterprise data warehouse is built, with data marts
as its subsets.
Bottom-Up Approach: Data marts are developed as and when the
requirements are clear, and are then integrated or combined
to form a data warehouse.
The advantage of the Bottom-Up approach is that it does not
require high initial costs and has a faster implementation time.
Design considerations:
➢In general, a data warehouse consolidates data from multiple
heterogeneous sources into a query database. This is also one of
the reasons why a data warehouse is difficult to build.
Data content
➢The content and structure of the data warehouse are reflected in
its data model.
➢The data model is the template that describes how information
will be organized within the integrated warehouse framework.
➢The data in the warehouse must be detailed data. It must be
formatted, cleaned up and transformed to fit the warehouse data
model.
Meta data
➢It defines the location and contents of data in the warehouse.
➢Meta data is searchable by users to find definitions or subject
areas.
Data distribution
➢Data volumes continue to grow. Therefore, it becomes
necessary to decide how the data should be divided across multiple
servers.
➢The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year)
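The distribution criteria above can be sketched as a simple partitioning step. This is only an illustration in Python; the row fields (region, year, amount) and the partition function are assumptions made for the example, not part of any particular warehouse product.

```python
from collections import defaultdict

def partition(rows, key):
    """Group rows into partitions according to the value of `key`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)

# Hypothetical warehouse rows.
rows = [
    {"region": "North", "year": 2023, "amount": 10},
    {"region": "South", "year": 2023, "amount": 20},
    {"region": "North", "year": 2024, "amount": 30},
]

by_region = partition(rows, "region")  # distribution by location
by_year = partition(rows, "year")      # distribution by time
```

Each partition could then be placed on a different server.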
Technological factors
Technical considerations:
Hardware platforms
➢An important consideration when choosing a data warehouse
server is its capacity for handling high volumes of data.
➢It must provide large data capacity and high throughput.
➢Modern servers can also support large data volumes and a large
number of flexible GUIs.
Data warehouse and DBMS specialization
➢Databases are very large, and complex ad hoc queries need to be
processed in a short time.
➢The most important requirements for the data warehouse DBMS
are performance, throughput and scalability.
Communication infrastructure
➢The data warehouse user requires a relatively large bandwidth
to interact with the data warehouse and retrieve a significant
amount of data for analysis.
➢This may mean that communication networks have to be
expanded and new hardware and software may have to be purchased.
Access tools:
Data warehouse implementation relies on selecting suitable data
access tools. The following are the types of data that can be
accessed:
1. Simple tabular form data
2. Ranking data
3. Multivariable data
4. Time series data
5. Graphing, charting and pivoting data
➢The tool should have the ability to read data from the data
dictionary.
➢The code generated by the tool should be completely
maintainable.
➢The data warehouse database system must be able to load data
directly from these tools.
Data replication
➢Data replication, or data movement, places the data needed by a
particular workgroup in a localized database.
➢Most companies use data replication servers to copy their most
needed data to a separate database.
Metadata
➢Metadata is a road map to the information stored in the
warehouse. It defines all data elements and their attributes.
User levels
Casual users: are most comfortable retrieving information from the
warehouse in predefined formats and running pre-existing
queries and reports.
Power users: can use predefined as well as user-defined queries
to create simple and ad hoc reports. These users can engage in
drill-down operations. These users may have experience of
using reporting and query tools.
Expert users: These users tend to create their own complex
queries and perform standard analysis on the information they retrieve.
Multi Dimensional Data Model
For example:
1. Let us take the example of a firm. The revenue of a firm can
be analyzed on the basis of different factors such as the geographical
location of the firm's workplace, the products of the firm, the
advertisements done, the time taken for a product to flourish, etc.
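A minimal sketch of this idea in Python follows. The dimension values and revenue figures are made-up assumptions for illustration; the point is only that a measure (revenue) is indexed by several dimensions at once (here location, product and quarter).

```python
# Hypothetical revenue cube: each cell is identified by one value per dimension,
# in the order (location, product, quarter).
revenue = {
    ("Delhi", "Car", "Q1"): 120,
    ("Delhi", "Bus", "Q1"): 80,
    ("Kolkata", "Car", "Q1"): 95,
    ("Kolkata", "Car", "Q2"): 110,
}

def total_by(cube, axis):
    """Sum the measure over all dimensions except the one at position `axis`."""
    totals = {}
    for key, value in cube.items():
        totals[key[axis]] = totals.get(key[axis], 0) + value
    return totals

by_location = total_by(revenue, 0)  # revenue viewed by location
by_product = total_by(revenue, 1)   # the same data viewed by product
```

The same stored cells answer both questions; only the dimension used for grouping changes.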
OLAP operations:
Drill down: In the drill-down operation, less detailed data is
converted into highly detailed data. It can be done by:
✓ Moving down in the concept hierarchy
✓ Adding a new dimension
In the cube given in the overview section, the drill-down operation is
performed by moving down in the concept hierarchy
of the Time dimension (Quarter -> Month).
Dice:
It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube
is selected using the following dimensions and criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
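The dice criteria above can be sketched as a filter over cube cells. This is an illustrative Python sketch; the cell list and the units values are assumptions, while the three criteria are the ones listed above.

```python
# Hypothetical cube cells; only the criteria below come from the notes.
cells = [
    {"location": "Delhi", "time": "Q1", "item": "Car", "units": 5},
    {"location": "Delhi", "time": "Q3", "item": "Car", "units": 7},
    {"location": "Mumbai", "time": "Q1", "item": "Bus", "units": 3},
    {"location": "Kolkata", "time": "Q2", "item": "Bus", "units": 4},
]

criteria = {
    "location": {"Delhi", "Kolkata"},
    "time": {"Q1", "Q2"},
    "item": {"Car", "Bus"},
}

def dice(cells, criteria):
    """Keep only the cells whose value on every listed dimension is allowed."""
    return [
        cell for cell in cells
        if all(cell[dim] in allowed for dim, allowed in criteria.items())
    ]

sub_cube = dice(cells, criteria)
```

Here the Q3 cell and the Mumbai cell fall outside the criteria, so the sub-cube keeps the other two cells.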
Three-Tier Data Warehouse Architecture
A middle-tier which consists of an OLAP server for fast querying
of the data warehouse. The OLAP server is implemented using either:
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS
that maps functions on multidimensional data to standard
relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose
server that directly implements multidimensional information and
operations.
A top-tier that contains front-end tools for displaying results
provided by OLAP, as well as additional tools for data mining of
the OLAP-generated data.
Load Performance
Data warehouses require incremental loading of new data on a periodic
basis within narrow time windows; performance of the load
process should be measured in hundreds of millions of rows and
gigabytes per hour and must not artificially constrain the volume of
data required by the business.
Load Processing
Many steps must be taken to load new or updated data into the
data warehouse, including data conversion, filtering, reformatting,
indexing, and metadata update.
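The load steps listed above can be sketched as one small function. This is a simplified illustration in Python, not a real loader: the field names, the validity rule (negative amounts are rejected) and the in-memory warehouse, index and metadata structures are all assumptions.

```python
from datetime import date

def load(raw_rows, warehouse, index, metadata):
    """Load raw source rows into an in-memory 'warehouse' list."""
    for raw in raw_rows:
        # Conversion: source fields arrive as strings.
        row = {
            "id": int(raw["id"]),
            "city": raw["city"],
            "amount": float(raw["amount"]),
        }
        # Filtering: skip rows that fail a simple validity check.
        if row["amount"] < 0:
            continue
        # Reformatting: normalise the city name to one convention.
        row["city"] = row["city"].strip().title()
        warehouse.append(row)
        # Indexing: map each city to the positions of its rows.
        index.setdefault(row["city"], []).append(len(warehouse) - 1)
    # Metadata update: record what the warehouse now contains.
    metadata["row_count"] = len(warehouse)
    metadata["last_load"] = date.today().isoformat()

raw_rows = [
    {"id": "1", "city": " delhi ", "amount": "10.5"},
    {"id": "2", "city": "Kolkata", "amount": "-3"},
    {"id": "3", "city": "DELHI", "amount": "4"},
]
warehouse, index, metadata = [], {}, {}
load(raw_rows, warehouse, index, metadata)
```

A production load would run these steps in bulk and in parallel to meet the load window, but the order of the steps is the same.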
SCHEMAS FOR MULTI-DIMENSIONAL DATA MODEL
Star Schema
➢ Each dimension in a star schema is represented with only one
dimension table.
➢ This dimension table contains the set of attributes.
➢ The following diagram shows the sales data of a company with
respect to the four dimensions, namely time, item, branch, and
location.
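A star schema for this sales example can be sketched in Python as one fact table surrounded by four dimension tables. The key names, attributes and figures below are assumptions chosen for illustration.

```python
# One dimension table per dimension, keyed by a surrogate key (hypothetical data).
time_dim = {1: {"quarter": "Q1", "year": 2024}, 2: {"quarter": "Q2", "year": 2024}}
item_dim = {10: {"item_name": "Car"}, 11: {"item_name": "Bus"}}
branch_dim = {100: {"branch_name": "B1"}, 101: {"branch_name": "B2"}}
location_dim = {1000: {"city": "Delhi"}, 1001: {"city": "Kolkata"}}

# The central fact table holds one foreign key per dimension plus the measures.
sales_fact = [
    {"time_key": 1, "item_key": 10, "branch_key": 100, "location_key": 1000,
     "dollars_sold": 500.0, "units_sold": 5},
    {"time_key": 1, "item_key": 11, "branch_key": 100, "location_key": 1000,
     "dollars_sold": 200.0, "units_sold": 2},
    {"time_key": 2, "item_key": 10, "branch_key": 101, "location_key": 1001,
     "dollars_sold": 300.0, "units_sold": 3},
]

def dollars_by_city(fact, locations):
    """Join the fact table to the location dimension and total dollars per city."""
    totals = {}
    for row in fact:
        city = locations[row["location_key"]]["city"]
        totals[city] = totals.get(city, 0.0) + row["dollars_sold"]
    return totals

city_totals = dollars_by_city(sales_fact, location_dim)
```

Every query needs only one lookup per dimension, because each dimension lives in a single table.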
Snowflake Schema
➢Some dimension tables in the Snowflake schema are
normalized.
➢The normalization splits up the data into additional tables.
➢Unlike the star schema, the dimension tables in a snowflake schema
are normalized. For example, the item dimension table in the star
schema is normalized and split into two dimension tables, namely
the item and supplier tables.
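The item and supplier split can be sketched as follows. This is an illustrative Python sketch; the attribute names (item_name, supplier_key, supplier_type) and the values are assumptions.

```python
# Normalized dimension tables (hypothetical data): the item table no longer
# stores supplier attributes, only a key into the supplier table.
supplier_dim = {
    1: {"supplier_type": "Wholesale"},
    2: {"supplier_type": "Retail"},
}
item_dim = {
    10: {"item_name": "Car", "supplier_key": 1},
    11: {"item_name": "Bus", "supplier_key": 2},
}

def item_details(item_key):
    """Resolve an item and its supplier; this takes two lookups instead of one."""
    item = item_dim[item_key]
    supplier = supplier_dim[item["supplier_key"]]
    return {
        "item_name": item["item_name"],
        "supplier_type": supplier["supplier_type"],
    }

details = item_details(11)
```

Normalization removes repeated supplier attributes from the item table, at the cost of an extra join when querying.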
Fact Constellation Schema
➢ A fact constellation has multiple fact tables. It is also known as
galaxy schema.
➢ The following diagram shows two fact tables, namely sales and
shipping.
➢ The sales fact table is the same as that in the star schema.
➢ The shipping fact table has five dimensions, namely
item_key, time_key, shipper_key, from_location, to_location.
➢ The shipping fact table also contains two measures, namely
dollars cost and units shipped.
OLAP
Online Analytical Processing (OLAP) is a category of software
that allows users to analyze information from multiple database
systems at the same time. It is a technology that enables analysts to
extract and view business data from different points of view.
How does it work?
A data warehouse extracts information from multiple data
sources and formats, such as text files, Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. The data is then loaded into
an OLAP server, where information is pre-calculated for
further analysis.
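The pre-calculation step can be sketched as computing totals for every combination of dimensions ahead of time. This is a simplified Python illustration; the dimension names, rows and the precompute function are assumptions, and real OLAP servers use far more compact storage.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("location", "item", "quarter")

def precompute(rows, measure="sales"):
    """Return {group_by_dims: {dim_values: total}} for every subset of DIMENSIONS."""
    aggregates = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            aggregates[dims] = dict(totals)
    return aggregates

# Hypothetical cleaned and transformed rows.
rows = [
    {"location": "Delhi", "item": "Car", "quarter": "Q1", "sales": 5},
    {"location": "Delhi", "item": "Bus", "quarter": "Q1", "sales": 3},
    {"location": "Kolkata", "item": "Car", "quarter": "Q2", "sales": 4},
]

aggregates = precompute(rows)
sales_by_location = aggregates[("location",)]  # answered by lookup, no rescan
```

With three dimensions there are eight group-by combinations, including the grand total stored under the empty tuple.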
1) Roll-up:
Roll-up is also known as “consolidation” or “aggregation.” The roll-up
operation can be performed in two ways:
➢Reducing dimensions
➢Climbing up the concept hierarchy. A concept hierarchy is a system of
grouping things based on their order or level.
➢In this example, the cities New Jersey and Los Angeles are rolled up
into the country USA.
➢The sales figures of New Jersey and Los Angeles are 440 and
1560 respectively. They become 2000 after roll-up.
➢In this aggregation process, the location hierarchy moves up
from city to country.
➢In the roll-up process at least one dimension needs to
be removed. In this example, the Cities dimension is removed.
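The roll-up in this example can be sketched in Python. The sales figures are the ones given above; the dictionary layout and the roll_up function are assumptions for illustration.

```python
# Concept hierarchy for the location dimension: city -> country.
city_to_country = {"New Jersey": "USA", "Los Angeles": "USA"}

# Sales figures from the example above.
sales_by_city = {"New Jersey": 440, "Los Angeles": 1560}

def roll_up(sales, hierarchy):
    """Aggregate each member's value into its parent level in the hierarchy."""
    rolled = {}
    for member, value in sales.items():
        parent = hierarchy[member]
        rolled[parent] = rolled.get(parent, 0) + value
    return rolled

sales_by_country = roll_up(sales_by_city, city_to_country)
```

After the roll-up, the city level no longer appears in the result; only the country total remains.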
2) Drill-down:
In drill-down, data is fragmented into smaller parts. It is the
opposite of the roll-up process. It can be done via
➢Moving down the concept hierarchy
➢Increasing a dimension
Consider the diagram below
Quarter Q1 is drilled down to the months January, February, and March.
The corresponding sales are also registered.
In this example, the months dimension is added.
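Drill-down from quarter to month can be sketched as follows. The monthly figures are made-up assumptions (the diagram's values are not reproduced here); note that drill-down is only possible because the detailed monthly data is stored.

```python
# Concept hierarchy for the time dimension: month -> quarter.
month_to_quarter = {
    "January": "Q1", "February": "Q1", "March": "Q1", "April": "Q2",
}

# Hypothetical detailed sales stored at month level.
sales_by_month = {"January": 100, "February": 150, "March": 200, "April": 90}

def drill_down(quarter):
    """Return the month-level figures that make up the given quarter."""
    return {
        month: value
        for month, value in sales_by_month.items()
        if month_to_quarter[month] == quarter
    }

q1_detail = drill_down("Q1")
q1_total = sum(q1_detail.values())  # the summarized figure seen before drilling down
```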
3) Slice:
Here, one dimension is selected, and a new sub-cube is created.
The following diagram explains how the slice operation is performed:
➢ The Time dimension is sliced with Q1 as the filter.
➢ A new cube is created altogether.
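A slice on Time = Q1 can be sketched in Python. The cube contents are assumptions for illustration; the cell keys are ordered as (location, time, item).

```python
# Hypothetical cube cells keyed by (location, time, item).
cube = {
    ("Delhi", "Q1", "Car"): 5,
    ("Delhi", "Q2", "Car"): 7,
    ("Kolkata", "Q1", "Bus"): 3,
}

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the cell keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }

q1_slice = slice_cube(cube, 1, "Q1")
```

The result has one dimension fewer than the original cube, because Time is fixed to a single value.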
Dice:
This operation is similar to a slice. The difference is that in dice you
select two or more dimensions, which results in the creation of a
sub-cube.
4) Pivot
In Pivot, you rotate the data axes to provide an alternative
presentation of the data.
In the following example, the pivot is based on item types.
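A pivot can be sketched as swapping the row and column axes of a two-dimensional view. The table contents below are assumptions for illustration.

```python
# Hypothetical 2-D view: rows are locations, columns are item types.
table = {
    "Delhi": {"Car": 5, "Bus": 2},
    "Kolkata": {"Car": 4, "Bus": 3},
}

def pivot(table):
    """Rotate the axes so that columns become rows and rows become columns."""
    rotated = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

by_item = pivot(table)
```

The values are unchanged; only the presentation is rotated, and pivoting twice returns the original view.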
Types of OLAP
Advantages of OLAP
➢ OLAP is a platform for all types of business activities, including
planning, budgeting, reporting, and analysis.
➢ Information and calculations are consistent in an OLAP cube.
This is a crucial benefit.
➢ Quickly create and analyze “What if” scenarios
➢ Easily search the OLAP database for broad or specific terms.
➢ OLAP provides the building blocks for business modeling tools,
data mining tools, and performance reporting tools.
Disadvantages of OLAP
➢OLAP requires organizing data into a star or snowflake schema.
These schemas are complicated to implement and administer.
➢You cannot have a large number of dimensions in a single OLAP
cube.
➢Transactional data cannot be accessed with an OLAP system.
➢Any modification in an OLAP cube needs a full update of the
cube. This is a time-consuming process.
Comparisons of OLAP vs OLTP