Module 2: ADBMS

Data Warehousing
▪ Introduction to DW
▪ DW Architecture
▪ ETL Process
▪ Top-Down and Bottom-Up Approaches
▪ Characteristics & Benefits of Data Mart
▪ Differences between OLAP & OLTP
▪ Dimensional Analysis
▪ Drill down and Roll up
▪ OLAP Models
▪ Schemas – Star, Snowflake and Fact Constellation

1. Introduction to Data Warehousing

A Data Warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of the management's decisions.

Subject Oriented Data
▪ In every industry, data sets are organized
around individual applications to support
those particular operational systems.
▪ In a DW, data is stored by subjects, not by
applications.
▪ Business subjects differ from enterprise to
enterprise. Eg: in a manufacturing company
– sales, shipments, inventory are critical
business subjects
▪ In the operational systems shown, data for each
application is organized separately by application: Order
processing, customer loans, billing, accounts receivable,
claims processing and savings account.
▪ CLAIMS is a critical business subject for an insurance
company.
▪ Claims under automobile insurance policies are
processed in the Auto Insurance Application.
▪ But in the DW for the insurance company, claims data
are organized around the subject of CLAIMS (and not by
application).
▪ NOTE: Data in a DW cuts across applications.
Integrated Data
▪ For proper decision making, one needs
relevant data from various applications.
▪ The data in the DW comes from several
operational systems.
▪ Source data are in different databases and
files – these are disparate applications; so
the operational platforms and OS could be
different.
Integrated Data
▪ In addition to data from internal
operational systems, for many enterprises,
data from outside sources is likely to be
very important.
▪ The DW may need data from such
sources.

[Figure: Data from three operational applications feeding the ACCOUNT subject area in the DW]
▪ Here the data fed into the subject area of ACCOUNT in
the DW comes from 3 different operational applications
▪ There could be several variations:
▪ Naming conventions could be different
▪ Attributes for data items may be different
▪ Account number in the savings bank application may
be 8 bytes long, but only 6 bytes in the checking
application
▪ Before moving the data to the DW, it must go through the process of transformation, consolidation and integration of the source data.
Non-volatile Data
▪ Data extracted from the various operational systems
and pertinent data obtained from outside sources are
transformed, integrated and stored in the DW.
▪ Data in the DW is not intended to run the day-to-day
business.
▪ Data in the operational systems are moved to the DW
at specific intervals.
▪ Data movements are scheduled based on the requirements of the users.
▪ Business transactions update the operational
systems databases in real time.
▪ We add, change, or delete data from an
operational database as each transaction
takes place.
▪ But we do not usually update or delete the
data in a DW.
▪ Data in a DW is not as volatile as the data in an operational database.
▪ It is primarily used for query and analysis.

Time-Variant Data
▪ For an operational system, the stored data contains
CURRENT values.
▪ Eg: In an accounts receivable system, the balance is the
current outstanding balance in the customer’s account.
▪ We store past transactions, but essentially operational
systems reflect current information because these
systems support day-to-day current operations

Time-Variant Data
▪ On the other hand, the data in the DW is meant for
analysis and decision making.
▪ Eg: If the user is looking at the buying pattern of a
customer, he requires data about current and past
purchases.
▪ A DW, because of the very nature of its purpose, has to
contain historical data (and not just current values).

Time-Variant Data
▪ Every data structure in a DW contains the time element.
▪ Eg: In a DW containing units of sale, quantity stored in
each record relates to a specific time element.
Depending on the level of details in the DW, sales
quantity may relate to a specific date, week, month,
quarter or even year.

COMPONENTS OF A DATA WAREHOUSE

Building blocks of a DW
▪ Include
▪ 1. Source Data Component (Production,
External, Internal, Archived)
▪ 2. Data Staging Component
▪ 3. Data Storage Component
▪ 4. Information Delivery Component
▪ 5. Metadata
▪ 6. Management & Control Component
1. Source Data Component
▪ Source data in the DW may be grouped into 4
broad categories

[Figure: The four categories of source data: Production Data, Internal Data, Archived Data, External Data]
A) Production Data

▪ This category of data comes from the various operational systems of the enterprise.
▪ There may be variations in data formats: the data may reside on different hardware platforms and may be supported by different databases and operating systems.
▪ The challenge is to standardize and transform the disparate data from the different production systems, convert the data, and integrate the pieces into useful data for storage in the DW.
B) Internal Data

▪ In every organization, users keep their "private" spreadsheets, documents, customer profiles and sometimes departmental databases.
▪ This internal data may be useful in the DW.
▪ Internal data adds complexity to the process of transforming and integrating the data before it can be stored in the DW.
C) Archived Data

▪ In every operational system, old data is periodically taken out and stored in archived files.
▪ Sometimes data is archived after a year, sometimes only after 5 years.
▪ A DW keeps historical snapshots of data, which are required for analysis over a period of time.
▪ To obtain historical information, archived data must be accessed.
▪ This type of data is useful for discerning patterns and analysing trends.
D) External Data

▪ Most executives depend on data from external sources for a high percentage of the information they use.
▪ E.g., they may require the market share of competitors, or may use standard values of financial indicators for their business to check their performance.
▪ Usually, data from outside sources does not conform to the organization's formats. Conversions must take place to make the external data adhere to the internal requirements.
2. Data Staging Component
▪ Now we need to prepare the data for storage in the DW.
▪ The three major functions involved, all of which take place in the staging area, are:
▪ A) Extraction
▪ B) Transformation
▪ C) Loading
▪ Data staging provides an area with a set of functions to clean, change, combine, convert, deduplicate and prepare the source data for storage in the DW.
3. Data Storage Component
▪ The DW must store large volumes of historical data for analysis.
▪ Further, the data in the DW must be kept in structures
suitable for analysis.
▪ Hence, the data storage component for the DW is a
separate repository.

3. Data Storage Component
▪ When analysts use the data in the DW for analysis, they
need to know that the data is stable and that it
represents snapshots at specific periods.
▪ As they are working with the data, storage must not be
in a state of updating.
▪ For this reason, the DW is a "read-only" repository.

4. Information Delivery Component
Novice users need prefabricated reports and preset
queries.
Casual users also need prepackaged information once in a
while, not regularly.
Business Analysts look for the ability to perform complex
queries.
Power users want to be able to navigate through the DW,
pick up interesting data, format their own queries, drill
through the data layers and create custom reports and
queries.

[Figure: Information delivery matched to user classes: Novice/Casual Users, Business Analysts/Power Users, Senior/Executive-Level Managers, KDD]
5. Metadata
Metadata is the data about the data in the DW

6. Management & Control Component
This component coordinates the services and activities
within the DW.
It controls the data transformation and the data transfer to
the DW storage.
It also monitors the movement of data into the staging
area.
It interacts with the metadata component to perform its necessary functions.

DATA MARTS

Definition
A data mart is focused on a single functional area of an organization and contains a subset of the data stored in the DW.
A data mart is a condensed version of a DW and is designed for use by a specific department, unit, or a set of users in an organization.
Definition
It is often controlled by a
single department in an
organization.
It usually draws data from
only a few sources compared
to a DW.
They are small in size and
more flexible than a DW.

TOP-DOWN & BOTTOM-UP
APPROACH

Bottom-Up (Characteristics & Benefits of Data Mart)
TOP-DOWN APPROACH
This is the big-picture approach in which the
overall, big, enterprise-wide DW is built.
There is no collection of fragmented islands of
information.
The DW is large and integrated.
This approach would take longer to build and
has a high risk of failure.

ADVANTAGES
1) Truly corporate-effort, enterprise-view of data
2) Inherently architected – not a union of disparate
data marts.
3) Single, central repository of data about the
content.
4) Centralized rules and control.
5) May see quicker results if implemented with iterations.
DISADVANTAGES

1) Takes longer to build even with an iterative method.

2) High risk of failure.

3) High-level of cross-functional skills required.

4) High outlay without proof of concept.

BOTTOM-UP APPROACH (Characteristics
& Benefits of Data Mart)
Here departmental data marts are built one by
one.
A priority scheme is required to determine which
data marts must be built first.
Most severe drawback of this approach is data
fragmentation.
Each independent data mart will be blind to the overall requirements of the entire organization.
ADVANTAGES

1) Faster and easier implementation of manageable pieces

2) Favourable return on investment and proof of concept.

3) Less risk of failure.

4) Inherently incremental; the important data marts can be scheduled first.

5) Allows the project team to learn and grow.


DISADVANTAGES

1) Each data mart has its own narrow view of data.

2) Spreads redundant data across every data mart.

3) Risk of inconsistent and irreconcilable data.

4) Proliferates unmanageable interfaces.

| Data Warehouse | Data Marts |
| --- | --- |
| 1) Corporate/enterprise-wide | 1) Departmental |
| 2) Union of all data marts | 2) Single business process |
| 3) Data received from the staging area | 3) Star join |
| 4) Structure for a corporate view of data (strategic decision making) | 4) Structure for a departmental view of data (tactical decision making) |
| 5) Data comes from many sources | 5) Data comes from a few sources |
Extraction, Transformation & Loading (ETL)
ETL functions reshape the relevant data from
the source systems into useful information to be
stored in the DW.

ETL is a 3-step process.
1. Extraction
In this step of the ETL architecture, data is
extracted from the source data into the staging
area.
Transformations (if any) are done in the staging
area.
The staging area gives an opportunity to validate
extracted data before it moves into the DW.

Three data extraction methods
1. Full Extraction – Involves extracting all the data from the sources.

2. Partial Extraction (with update notification) – Easier and faster in comparison to full extraction. The source system notifies which records have changed, and only the modified data is extracted.

3. Partial Extraction (without update notification) – Involves extracting the data based on certain key features. E.g., if data has already been extracted up to yesterday, it is possible to extract only today's data and identify the changes in it (see the sketch below).
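As a rough sketch of the third method in Python, using an illustrative orders table with a last_modified timestamp column (the table, columns and values are made up for the example), only the rows changed since the previous run are pulled into the staging area:

```python
import sqlite3

# Hypothetical source table; in a real system this lives in the live OLTP database.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL, last_modified TEXT);
    INSERT INTO orders VALUES (1, 120.0, '2024-01-01 09:00:00');
    INSERT INTO orders VALUES (2,  75.5, '2024-03-15 14:30:00');
""")

def extract_changed_rows(conn, last_run_ts):
    # Partial extraction without update notification: pull only the rows whose
    # timestamp is later than the previous run's high-water mark.
    cur = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ?", (last_run_ts,))
    return cur.fetchall()

# Only order 2 has changed since the last extraction run.
print(extract_changed_rows(src, "2024-02-01 00:00:00"))
```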
Regardless of the method used, extraction
should not affect the performance and response
time of the source systems.
These source systems are live production
databases; any slowdown could affect the
organization.

Some validations done during Extraction:
1. Reconcile records with source data.
2. Ensure no unwanted data is loaded.
3. Data type check.
4. Remove duplicated data.
5. Check whether keys are in place.

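A small illustrative sketch of a few of these validations in Python/pandas (the order_id and amount columns are hypothetical; the numbered comments refer to the list above):

```python
import pandas as pd

# Hypothetical extracted records; order_id is the key field.
extracted = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount":   ["50.0", "75.0", "75.0", "60.0"],
})

# 4. Remove duplicated data pulled from the source.
extracted = extracted.drop_duplicates()

# 5. Check whether keys are in place (no missing order_id).
assert extracted["order_id"].notna().all()

# 3. Data type check: the measure should be numeric before it moves on.
extracted["amount"] = pd.to_numeric(extracted["amount"])
print(extracted)
```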
2. Transformation
Data that does not require any transformation is called a DIRECT MOVE.
Data extracted from the source is raw and generally not usable in its original form.
This is an important ETL step, where important functions are applied on the extracted data.
These operations can be customised as per the users' needs:
2. Transformation (contd.)
E.g., if the user wants a sum-of-sales figure that is not in the original databases, it can be derived here.
If the first name, middle name and last name are in different fields, it is possible to concatenate them before loading.
Problems that arise during transformation
1. Different spellings of the name of the same
person (Jon, John).
2. Multiple ways in which we denote a company
name (Google, Google Inc.).
3. Different names used for the same place (Cleaveland, Cleveland).
4. Different account numbers generated for the
same person.
5. Required data field left blank.
6. Invalid manual entry.
Validations done during this stage
1. Filtering – select only certain columns to load.
2. Using rules for data standardization.
3. Character set conversion.
4. Conversion of units of measurement.
5. Data threshold validation check (e.g., age cannot be more than 2 digits).
6. Required columns are not left blank.
7. Cleaning (e.g., map NULL to 0, MALE to M, FEMALE to F).
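A minimal transformation sketch in Python/pandas, with made-up column names (first_name, last_name, gender, sales, age), showing a few of the validations listed above:

```python
import pandas as pd

# Hypothetical records sitting in the staging area after extraction.
staging = pd.DataFrame({
    "first_name": ["Jon", "Mary"],
    "last_name":  ["Smith", "Jones"],
    "gender":     ["MALE", "FEMALE"],
    "sales":      [1200.0, None],
    "age":        [34, 41],
})

# Cleaning: map NULL to 0 and standardize coded values (MALE -> M, FEMALE -> F).
staging["sales"] = staging["sales"].fillna(0)
staging["gender"] = staging["gender"].map({"MALE": "M", "FEMALE": "F"})

# Derived field: concatenate the name parts before loading.
staging["full_name"] = staging["first_name"] + " " + staging["last_name"]

# Threshold validation: age should not be more than 2 digits.
assert (staging["age"] < 100).all()

# Filtering: select only the columns to be loaded into the DW.
to_load = staging[["full_name", "gender", "sales"]]
print(to_load)
```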
3. Loading
In a DW, huge volumes of data need to be
loaded in a relatively short period.
In case of a load failure, recovery mechanisms
should be configured to restart from the point of
failure.
DW administrators need to monitor, resume, or
cancel loads as per prevailing server
performance.

Types of Loading

Initial Load
Incremental Load
Full Refresh
1. Initial Load
Populating all the DW tables.

2. Incremental Load
Applying ongoing changes as and when
required.

3. Full Refresh
Erasing all the contents of one or more tables and reloading with fresh data.
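A rough Python/sqlite3 sketch of the three load types against a simplified, hypothetical sales_fact table (not a prescribed design):

```python
import sqlite3

dw = sqlite3.connect(":memory:")   # stands in for the warehouse database
dw.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    # Initial load: populate the empty warehouse table for the first time.
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)

def incremental_load(rows):
    # Incremental load: apply ongoing changes; INSERT OR REPLACE keeps reruns idempotent.
    dw.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?)", rows)

def full_refresh(rows):
    # Full refresh: erase the table contents and reload with fresh data.
    dw.execute("DELETE FROM sales_fact")
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)

initial_load([(1, 100.0), (2, 250.0)])
incremental_load([(3, 80.0)])
print(dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone())  # (3,)
```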
Load Verification
1. Ensure that the key field data is neither missing nor NULL.
2. Test modelling views based on the target tables.
3. Check combined values and calculated measures.
4. Data checks in dimension tables and history tables.
5. Check the BI reports generated.
OLTP (Online Transaction Processing Systems)
Definition
Operational systems are OLTP systems.
These are systems that are used to run the
day-to-day core business of the company.
They support the basic business processes of
the company.
These systems typically get data IN the system.

Decision Support Systems (DSS)
On the other hand, specially designed DSS are
not meant to run the core business processes.
They are used to watch how the business runs,
and then to make strategic decisions to improve
the business.
DSS are developed to get strategic information
OUT of the database.

GET THE DATA IN (Operational Systems):
Take an order
Process a claim
Make a shipment
Generate an invoice
Receive cash
Reserve an airline seat

GET THE INFORMATION OUT (DSS):
Show me the top selling products
Show me the problem regions
Tell me why (Drill Down)
Let me see other data (Drill Across)
Show me the highest margins
Alert me when sales in a district go below a certain level
|  | Operational | Informational |
| --- | --- | --- |
| Data Content | Current values | Archived, derived, summarized |
| Data Structure | Optimized for transactions | Optimized for complex queries |
| Access Frequency | High | Medium to low |
| Access Type | Read, update, delete | Read |
| Usage | Predictable, repetitive | Ad hoc, random, heuristic |
| Response Time | Sub-seconds | Several seconds to minutes |
| Users | Large number | Relatively smaller number of users |
OLAP (Online Analytical Processing)
OLAP is a category of software technology
that enables analysts, managers and executives
to gain insight into data
Through fast, consistent and interactive access
With a wide variety of possible views of
information
(that has been transformed from raw data)
To reflect the real dimensionality of the
enterprise.

1. Lets users have a multidimensional and logical view of the data in the DW.
2. Facilitates interactive queries and complex analysis by the users.
3. Allows users to drill down for greater detail or roll up for aggregations along a single business dimension or across multiple dimensions.
4. Provides the ability to perform intricate calculations and comparisons.
5. Presents results in meaningful ways, including charts and graphs.
Data Cube

Definition
A data cube in a DW is a multidimensional
structure used to store data.
Data cubes represent the data in terms of
dimensions and facts.
In a DW, we can implement an n-dimensional
data cube.

Dimensions are the attributes with respect to
which an organization wants to keep records.
For example, AllElectronics may create a sales
data warehouse in order to keep records with
respect to the dimensions time, item, branch,
and location.
These dimensions allow the store to keep track
of things like monthly sales of items, and the
branches and locations at which the items were
sold.
Each dimension may have a table associated with it, called a dimension table.
A multidimensional data model is typically organized
around a central theme, like sales, for instance.
This theme is represented by a fact table.
Facts are numerical measures.
Facts are the quantities used to analyze relationships
between dimensions.
Examples of facts for a sales data warehouse include
sales amount in dollars, number of units sold and
amount budgeted.
The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
2-D View of Sales Data
2-D view of sales details for the city Vancouver
with respect to the dimensions time and item.
Key:
Home Entertainment – HE
Computers – C
Phone – P
Security - S

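As an illustrative sketch (Python/pandas, with invented figures), a small fact table can be pivoted into a 2-D time-by-item view like the one just described:

```python
import pandas as pd

# Hypothetical fact records: dimension values plus a numeric measure.
fact = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2"],
    "item":       ["HE", "C",  "HE", "C"],
    "location":   ["V"] * 4,          # Vancouver only, as in the 2-D view
    "units_sold": [605, 825, 680, 952],
})

# 2-D view for one city: time down the rows, item across the columns.
view_2d = fact.pivot_table(index="time", columns="item",
                           values="units_sold", aggfunc="sum")
print(view_2d)
```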
3-D View of Sales Data
Suppose we would like to view the data with a
3rd dimension – cities.
The 3D view is as shown below:
Key:
Chicago – Ch
New York – NY
Toronto – T
Vancouver - V
3-D View of Sales Data
Conceptually the same data may be represented
in the form of 3D data cubes.

4-D Cuboid
Suppose we now want to view the sales data with an additional fourth dimension, such as supplier.
In the example given below, the 4D cuboid is a
representation of sales data according to the dimensions
– time, item, location, supplier.

The topmost 0-D cuboid, which holds the highest level of summarization, is known as the APEX CUBOID.
Here, the total sales figure is summarized over all 4 dimensions.

Operations on Data Cubes
Operations: Roll Up, Drill Down, Slice & Dice, Pivot/Rotation
Roll Up

Drill Down

▪ When the drill down operation is performed on any
dimension, the data (on that dimension) is
fragmented into granular form.
▪ Figure above shows the drill down operation on the
time dimension where each quarter is fragmented
into months.

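A rough sketch of roll-up and drill-down on the time dimension in Python/pandas, assuming a hypothetical month-to-quarter concept hierarchy and invented figures:

```python
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1",  "Q1",  "Q1",  "Q2",  "Q2",  "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "units":   [150, 130, 170, 160, 140, 155],
})

# Roll up: aggregate the finer month level up to quarters.
by_quarter = sales.groupby("quarter")["units"].sum()

# Drill down: fragment each quarter back into its months.
by_month = sales.groupby(["quarter", "month"])["units"].sum()

print(by_quarter)
print(by_month)
```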
Slice

▪ The Slice operation picks up one dimension of the
data cube and forms a subcube out of it.
▪ In the figure above, the slice operation has been
performed on the data cube on the basis of the time
dimension.

Dice

▪ The Dice operation selects more than one
dimension to form a subcube.
▪ The figure above shows the subcube formed by
selecting the dimensions – location, item and time.

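A minimal slice-and-dice sketch in Python/pandas on an invented cube of time, item and location:

```python
import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["HE", "P",  "HE", "P"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "units":    [605, 825, 680, 952],
})

# Slice: fix a single dimension (time = Q1) to obtain a subcube.
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on several dimensions at once.
diced = cube[(cube["time"].isin(["Q1", "Q2"])) &
             (cube["item"] == "HE") &
             (cube["location"] == "Toronto")]

print(slice_q1)
print(diced)
```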
Pivot/Rotation

▪ The Pivot operation rotates the data cube in order
to view it from a different dimension.

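A small pivot/rotation sketch in Python/pandas (invented data): the row and column dimensions of a 2-D view are simply swapped:

```python
import pandas as pd

cube = pd.DataFrame({
    "item":     ["HE", "HE", "P", "P"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "units":    [605, 825, 680, 952],
})

# Original orientation: item on the rows, location on the columns.
view = cube.pivot_table(index="item", columns="location", values="units")

# Pivot/rotate: transpose to view the same data from the other dimension.
rotated = view.T
print(rotated)
```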
Summary
A data cube is a multidimensional data structure model for storing data in the data warehouse.
A data cube can be 2-D, 3-D or n-dimensional in structure.
Data cubes represent data in terms of dimensions and facts.
A dimension in a data cube represents attributes in the data set.
Each cell of a data cube holds aggregated data.
Data cubes provide fast computation and easy access to data, and thereby increase the efficiency of the DW.
Data cubes use indexing to access their dimensions.
Summary
Data cubes can be categorized into two main types: the multidimensional data cube and the relational data cube.
Multidimensional data cube has fast computation.
The relational data cube is scalable and is efficient for
growing data.
Roll-up, drill-down, slice and dice, pivoting are the
operations that can be performed on a data cube.

MOLAP & ROLAP
MOLAP (Multidimensional OLAP)
Multidimensional arrays are used to store data, which ensures a multidimensional view of the data.
Multidimensional data cube helps in storing a
large amount of data.
It implements indexing to represent each
dimension of a data cube.
This improves the accessing, retrieving and
storing of data in a data cube.

ROLAP (Relational OLAP)
The relational data cube is an extended version of
the relational DBMS (RDBMS).
Relational tables are used to store data and each
relational table represents a dimension of the data
cube.
When it comes to performance, the relational data
cube is slower than the multidimensional data
cube.
But the relational data cube is scalable for steadily increasing data.
HOLAP (Hybrid OLAP)
It is possible to get a combination of both the relational data cube and the multidimensional data cube, which is called the hybrid data cube (HOLAP).
This has the scalability of the relational data
cube and the performance of the
multidimensional data cube.

OLAP vs. OLTP
|  | OLTP | OLAP |
| --- | --- | --- |
| Functionality | Manages transactions that modify data in the databases | Used for analytical and reporting purposes |
| Source | Real-time transactions of the organization | Data consolidated from various OLTP databases |
| Storage Format | Tabular form in an RDBMS | Multidimensional form in OLAP cubes |
| Operation | Read and write | Read only |
| Response Time | Fast processing, since queries are simple | Slower than OLTP |
| Users | Programmers, professionals | Executives |
SCHEMAS
Schemas in a DW
Like a database, a DW also requires schemas.
The three types of schemas are:
1. Star Schema
2. Snowflake Schema
3. Fact Constellation Schema (Galaxy Schema)

Star Schema
Each dimension in a star schema is represented by only a single dimension table.
This dimension table contains the set of attributes for that dimension.
The following diagram shows the sales data of a company with respect to 4 dimensions: time, item, branch and location.
There is a fact table at the centre containing keys to each of the 4 dimensions.
Further, it contains the attributes dollars_sold and units_sold.
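A minimal star-schema sketch using Python's sqlite3 module; the dimension and fact tables follow the example above, but the key columns and descriptive attributes are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one denormalized table per dimension.
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Fact table at the centre: keys to every dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );
""")
print("star schema created")
```

In a snowflake version of the same sketch, item_dim would be normalized further, e.g. by moving supplier details into a separate SUPPLIER table referenced from item_dim.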
Snowflake Schema
Unlike the star schema, dimension tables in the snowflake schema are normalized.
The normalization splits the data into additional tables.
E.g., the ITEM dimension is normalized and split into 2 dimension tables: ITEM and SUPPLIER.
The LOCATION dimension is split into LOCATION and CITY tables.
Due to normalization, redundancy is reduced and the data becomes easier to maintain and store.
Fact Constellation Schema (Galaxy Schema)
A fact constellation schema has multiple fact
tables.
It is also possible to share dimension tables
between fact tables.
Following example shows two fact tables – sales
and shipping.
Here, TIME and ITEM are shared between the fact
tables.

Role of Concept Hierarchies in defining
dimensions of a data cube
A concept hierarchy is a set of variables which represent different levels of aggregation of the same dimension and are linked by a mapping.
For example:
City -> State -> Region -> Country

In a multidimensional database, different cubes are stored, each of which is defined with different dimensions.
The roll-up operator decreases the detail of the measure, aggregating it along the concept hierarchy.
In the example, it allows the level to be changed from City to State, recomputing the values of the measure.
The drill-down operator increases the detail of the measure by moving to the lower level of the concept hierarchy.
In the example, one can move from State to City, retrieving the values of the measure that were previously stored in the cube.
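A small sketch of rolling up along such a concept hierarchy in Python/pandas, with an invented City -> State mapping:

```python
import pandas as pd

# Hypothetical concept hierarchy mapping for the location dimension.
city_to_state = {"San Jose": "California", "Los Angeles": "California",
                 "Austin": "Texas", "Dallas": "Texas"}

sales = pd.DataFrame({
    "city":  ["San Jose", "Los Angeles", "Austin", "Dallas"],
    "units": [120, 300, 90, 210],
})

# Roll up from City to State along the concept hierarchy,
# recomputing the measure at the coarser level.
sales["state"] = sales["city"].map(city_to_state)
by_state = sales.groupby("state")["units"].sum()
print(by_state)
```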
