Demystifying The IBM EDW Model


Demystifying the

IBM Enterprise Data Warehouse (EDW)


Author: Dhruva Sen Gupta
Contents

Introduction
ER Diagram
Fundamental entity
Anchor entity
Types & Subtypes
Nature
Data load mechanism
Data updates
Pros and Cons

Introduction
The IBM EDW is an off-the-shelf, industry-standard data model, available for a variety of business verticals.
Organisations can purchase the pre-built model, saving hundreds of man-hours of analysis and modelling
time, and then customise or extend it to suit their specific data requirements.

To understand, extend and harness the power of the model, a data modeller needs to be aware of some
terminology and methodology that are specific to the IBM EDW. This article is not meant to explain the
basics of data modelling, nor is it a comprehensive overview of the EDW. It is instead intended for data
modellers who get the opportunity to work on an IBM EDW, so that they can make a head start without
getting lost in the jargon they come across.

ER Diagram
This is a representative subset of the EDW that describes a Customer entity and an Address entity and how
they are related to each other. The elements in the diagram and their functionality are described in the
following pages.

Fundamental entity
A fundamental entity in the EDW is a basic entity in conventional modelling terms, e.g. Customer, Address,
etc. A fundamental entity such as Customer will comprise attributes like Id (a surrogate key), External
Reference (a business-facing id, different from the surrogate Id), Type Id (described later), First Name, Last
Name, Effective From Date & Effective To Date. There are a few other attributes in the physical model
which need not appear in the logical model:
Valid From Date & Valid To Date: These attributes are used for versioning and designate the technical
validity of a record in the warehouse. Note that these are different from the Effective From/To Dates,
which specify the business validity of the record as per the source application.
Source System Id: Tells us from which source application the data was sourced. So if a billing address is
captured in a billing system, it will have a different source system id than, for example, the goods
delivery system. The list of source systems is maintained as a separate entity and is referenced by the
different entities.
Source System Unique Id: Contains the identification of the row in the source database, to allow for
tracing an EDW row to its source row. Note that in many source applications, versioning may create
new Ids in the source database itself, and when the updated row is brought into EDW, the Source
System Unique Id should reflect this new Id in the latest valid record in order to have a one-to-one
association with the source record.
ETL Id: This is a designator for each ETL process that has been executed, so that for each row in the
table, we are able to identify which ETL job loaded it.
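
The attributes above can be pictured as a physical table. The following is a minimal sketch using Python's built-in sqlite3 module, not the actual IBM EDW DDL: the table name, the column names, the ISO timestamp format and the effective dates are all assumptions made for illustration. The anchor reference is left out because it is only introduced in the next section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical physical table for the Customer fundamental entity.
# Column names and the ISO timestamp format are assumptions of this sketch.
conn.execute("""
    CREATE TABLE customers (
        id                      INTEGER PRIMARY KEY,  -- surrogate key
        external_reference      TEXT,                 -- business-facing id
        type_id                 INTEGER,
        first_name              TEXT,
        last_name               TEXT,
        effective_from_date     TEXT,     -- business validity, as per the source application
        effective_to_date       TEXT,
        valid_from_date         TEXT,     -- technical validity of the row in the warehouse
        valid_to_date           TEXT,
        source_system_id        INTEGER,  -- which source application supplied the row
        source_system_unique_id TEXT,     -- id of the row in the source database
        etl_id                  INTEGER   -- ETL run that loaded the row
    )
""")

# The effective dates below are made up for illustration.
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "JD0001", 101, "John", "Doe",
     "2016-11-15 00:00:00", "9999-12-31 23:59:59",
     "2016-12-01 23:05:30", "9999-12-31 23:59:59",
     101, "12001", 1001),
)

# Trace a warehouse row back to its source row and to the ETL run that loaded it.
lineage = conn.execute(
    "SELECT source_system_id, source_system_unique_id, etl_id FROM customers WHERE id = ?",
    (1,),
).fetchone()
print(lineage)  # (101, '12001', 1001)
```

Keeping the effective dates and the valid dates in separate columns lets the warehouse record when a fact was true for the business independently of when the row was the current version in the warehouse.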

Anchor entity
There is one anchor entity per fundamental entity in the EDW. The purpose of an anchor entity is to create
a persistent identity of a fundamental entity. So a Customer fundamental entity will also have a Customer
Anchor entity. Very minimal information will be stored in this entity: an Id and Create Timestamp. This
anchor entity's primary key will appear as a foreign key in all the rows created in the Customer fundamental
entity for a given customer, as we will see later. A row in the anchor entity is created when (just before) its
corresponding row in the fundamental entity is created. Any updates to the fundamental entity have no
effect on the anchor entity.
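
The anchor pattern can be sketched as follows, assuming hypothetical table names, column names and helper functions. It only illustrates the idea that the anchor row is created once and every version of the customer points back to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_anchor (
        id               INTEGER PRIMARY KEY,
        create_timestamp TEXT NOT NULL
    );
    CREATE TABLE customers (
        id          INTEGER PRIMARY KEY,                          -- one id per row version
        customer_id INTEGER NOT NULL REFERENCES customer_anchor(id),  -- anchor id
        first_name  TEXT,
        last_name   TEXT
    );
""")

def create_customer(first_name, last_name, created_at):
    """Create the anchor row first, then the first fundamental row that points at it."""
    anchor_id = conn.execute(
        "INSERT INTO customer_anchor (create_timestamp) VALUES (?)", (created_at,)
    ).lastrowid
    conn.execute(
        "INSERT INTO customers (customer_id, first_name, last_name) VALUES (?, ?, ?)",
        (anchor_id, first_name, last_name),
    )
    return anchor_id

def add_version(anchor_id, first_name, last_name):
    """A later version of the same customer reuses the anchor id; the anchor is not touched."""
    conn.execute(
        "INSERT INTO customers (customer_id, first_name, last_name) VALUES (?, ?, ?)",
        (anchor_id, first_name, last_name),
    )

anchor_id = create_customer("John", "Doe", "2016-12-01 23:05:00")
add_version(anchor_id, "Jon", "Doe")  # hypothetical correction to the first name

anchor_count = conn.execute("SELECT COUNT(*) FROM customer_anchor").fetchone()[0]
version_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE customer_id = ?", (anchor_id,)
).fetchone()[0]
print(anchor_count, version_count)  # 1 2
```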

Types & Subtypes
A Type is a list of reference data that allows for the classification of different fundamental entities. E.g. a
Customer fundamental entity would have an attribute Type Id which might identify values such as Retail
Customer, Corporate Customer, Sovereign Customer, etc. to differentiate between various types of
customers. An Address fundamental entity might have Billing Address, Delivery Address, etc. The list of
valid types is maintained as a separate entity and is referenced by the fundamental entities.

Types could also be divided into subtypes, which are types in their own right.

So, in the database, the Types table will have the following rows:
Id   Description         Parent Type Id  Valid From Date      Valid To Date
100  Customer                            01/12/2016 00:00:00  31/12/9999 23:59:59
101  Retail Customer     100             01/12/2016 00:00:00  31/12/9999 23:59:59
102  Corporate Customer  100             01/12/2016 00:00:00  31/12/9999 23:59:59
103  Sovereign Customer  100             01/12/2016 00:00:00  31/12/9999 23:59:59
200  Address                             01/12/2016 00:00:00  31/12/9999 23:59:59
201  Billing Address     200             01/12/2016 00:00:00  31/12/9999 23:59:59
202  Delivery Address    200             01/12/2016 00:00:00  31/12/9999 23:59:59

Subtypes can be implemented in the same table or in different tables in the database. So we could either
have the Customers table implement the Customer entity with a Type Id differentiating between the
different types of customers, or we could have three tables, Retail Customers, Corporate Customers and
Sovereign Customers, depending on the number of common attributes and other factors.
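
The Types table shown earlier is a self-referencing hierarchy, which can be sketched as below. The table and column names and the subtypes_of helper are assumptions for illustration, and the validity dates are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE types (
        id             INTEGER PRIMARY KEY,
        description    TEXT NOT NULL,
        parent_type_id INTEGER REFERENCES types(id)  -- NULL for a top-level type
    )
""")
conn.executemany(
    "INSERT INTO types VALUES (?, ?, ?)",
    [
        (100, "Customer", None),
        (101, "Retail Customer", 100),
        (102, "Corporate Customer", 100),
        (103, "Sovereign Customer", 100),
        (200, "Address", None),
        (201, "Billing Address", 200),
        (202, "Delivery Address", 200),
    ],
)

def subtypes_of(parent_id):
    """Return the descriptions of the direct subtypes of a type, ordered by id."""
    rows = conn.execute(
        "SELECT description FROM types WHERE parent_type_id = ? ORDER BY id",
        (parent_id,),
    ).fetchall()
    return [description for (description,) in rows]

customer_subtypes = subtypes_of(100)
address_subtypes = subtypes_of(200)
print(customer_subtypes)  # ['Retail Customer', 'Corporate Customer', 'Sovereign Customer']
print(address_subtypes)   # ['Billing Address', 'Delivery Address']
```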

Relationship (or Associative) entity

The IBM EDW model does not have direct FK relationships between fundamental entities. Instead, there is
an additional type of entity that holds the relationship information between two anchor entities
(note: not the fundamental entities). E.g., there is a Customer Address Relationship entity which will have
two attributes with foreign key references to the Customer Anchor and the Address Anchor entities, and will
also have an additional attribute identifying the type of the relationship, i.e. a Nature Id (explained below),
and other attributes Valid From Date and Valid To Date.
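
A relationship entity of this kind could be sketched as below. The table and column names are assumptions, and SQLite is used only because it ships with Python. The point is that both foreign keys reference the anchor tables, so a relationship row can only be created between identities that already exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE customer_anchor (id INTEGER PRIMARY KEY);
    CREATE TABLE address_anchor  (id INTEGER PRIMARY KEY);
    CREATE TABLE customer_address_relationship (
        id              INTEGER PRIMARY KEY,
        customer_id     INTEGER NOT NULL REFERENCES customer_anchor(id),
        address_id      INTEGER NOT NULL REFERENCES address_anchor(id),
        nature_id       INTEGER NOT NULL,
        valid_from_date TEXT NOT NULL,
        valid_to_date   TEXT NOT NULL
    );
    INSERT INTO customer_anchor VALUES (1);
    INSERT INTO address_anchor  VALUES (1);
""")

insert_sql = "INSERT INTO customer_address_relationship VALUES (?, ?, ?, ?, ?, ?)"

# Both anchors exist, so this relationship row is accepted.
conn.execute(insert_sql, (1, 1, 1, 1001, "2016-12-01 23:15:00", "9999-12-31 23:59:59"))

# Address anchor 99 does not exist, so this row is rejected.
try:
    conn.execute(insert_sql, (2, 1, 99, 1001, "2016-12-01 23:15:00", "9999-12-31 23:59:59"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

relationship_count = conn.execute(
    "SELECT COUNT(*) FROM customer_address_relationship"
).fetchone()[0]
print(orphan_rejected, relationship_count)  # True 1
```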

Nature
A Nature Id attribute in a relationship entity classifies the relationship. If a Retail Customer has a Billing
Address, then the Nature Id in the corresponding row of the relationship entity will identify the value Retail
Customer Billing Address. Similarly, there will be other possible values such as Retail Customer Delivery
Address, Corporate Customer Billing Address and so on. In simple terms, the possible values of a nature
attribute make up the list of the possible combinations of relationships that can exist between the two
fundamental entities. Whilst this may seem superfluous at first, because the nature of a relationship can be
inferred from the Types of the fundamental entities it relates, it is very useful for gauging the metrics of
the data once it is loaded and also for deletes and reloads, as we will see later. The list of valid natures is
maintained as a separate entity and is referenced by the relationship entities. Each row in the Nature table
includes the combination of Types it relates.

The Nature table will have the following rows:


Id    Description                          Left Type Id  Right Type Id  Valid From Date      Valid To Date
1001  Retail Customer Billing Address      101           201            01/12/2016 00:00:00  31/12/9999 23:59:59
1002  Retail Customer Delivery Address     101           202            01/12/2016 00:00:00  31/12/9999 23:59:59
1003  Corporate Customer Billing Address   102           201            01/12/2016 00:00:00  31/12/9999 23:59:59
1004  Corporate Customer Delivery Address  102           202            01/12/2016 00:00:00  31/12/9999 23:59:59
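
One way an ETL job might use the Nature table is to look up the Nature Id from the Types of the two entities being related. The sketch below assumes hypothetical table and column names and a hypothetical nature_for helper. It is not taken from the IBM EDW itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE natures (
        id            INTEGER PRIMARY KEY,
        description   TEXT NOT NULL,
        left_type_id  INTEGER NOT NULL,  -- Type of the customer side
        right_type_id INTEGER NOT NULL   -- Type of the address side
    )
""")
conn.executemany(
    "INSERT INTO natures VALUES (?, ?, ?, ?)",
    [
        (1001, "Retail Customer Billing Address", 101, 201),
        (1002, "Retail Customer Delivery Address", 101, 202),
        (1003, "Corporate Customer Billing Address", 102, 201),
        (1004, "Corporate Customer Delivery Address", 102, 202),
    ],
)

def nature_for(left_type_id, right_type_id):
    """Return the Nature Id for a combination of Types, or None if it is not a valid one."""
    row = conn.execute(
        "SELECT id FROM natures WHERE left_type_id = ? AND right_type_id = ?",
        (left_type_id, right_type_id),
    ).fetchone()
    return row[0] if row else None

retail_billing = nature_for(101, 201)
sovereign_billing = nature_for(103, 201)  # no such row in the sample Nature table
print(retail_billing, sovereign_billing)  # 1001 None
```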

Data load mechanism
Let's use an example where we are loading Retail Customers and their addresses from the Billing system.
Once the ETL design is complete, there will be three ETL jobs in all:

One for loading the Customer anchor and fundamental entity
One for loading the Address anchor and fundamental entity
One for loading the Customer Address Relationship entity

An example of how the data would look:

Customer Anchor
Id  Valid From Date      Valid To Date        ETL Id
1   01/12/2016 23:05:00  31/12/9999 23:59:59  1001

Customer
Id  Customer Id  Type Id                External   First  Last  Date of Birth  Valid From Date      Valid To Date        Source System  Source     ETL Id
    (Anchor Id)                         Reference  Name   Name                                                           Unique Id      System Id
1   1            101 (Retail Customer)  JD0001     John   Doe   01/01/1900     01/12/2016 23:05:30  31/12/9999 23:59:59  12001          101        1001

Address Anchor
Id  Valid From Date      Valid To Date        ETL Id
1   01/12/2016 23:10:00  31/12/9999 23:59:59  1002

Address
Id  Address Id   Type Id                Address            Address  Valid From Date      Valid To Date        Source System  Source     ETL Id
    (Anchor Id)                         Line 1             Line 2                                             Unique Id      System Id
1   1            201 (Billing Address)  101 Mont Pleasant  London   01/12/2016 23:10:30  31/12/9999 23:59:59  10001          101        1002

Customer Address Relationship
Id  Customer Id  Address Id   Nature Id                               Valid From Date      Valid To Date        Source System  Source     ETL Id
    (Anchor Id)  (Anchor Id)                                                                                    Unique Id      System Id
1   1            1            1001 (Retail Customer Billing Address)  01/12/2016 23:15:00  31/12/9999 23:59:59  10001          101        1003

At this point in time, the SQL to list the RETAIL customers along with their BILLING address would be:

SELECT col1, col2, ...
FROM Customers, Addresses, Customer_Address_Relationship
WHERE Customers.Type_Id = 101  -- Retail Customer
AND Addresses.Type_Id = 201  -- Billing Address
AND Customer_Address_Relationship.Nature_Id = 1001  -- Retail Customer/Billing Address
AND Customers.Valid_To_Date = '31/12/9999 23:59:59'  -- current version only
AND Addresses.Valid_To_Date = '31/12/9999 23:59:59'
AND Customer_Address_Relationship.Valid_To_Date = '31/12/9999 23:59:59'
AND Customers.Customer_Id = Customer_Address_Relationship.Customer_Id  -- join on anchor Ids
AND Addresses.Address_Id = Customer_Address_Relationship.Address_Id

Note that now there is no need to refer to the anchor tables.
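
To try the query end to end, the sample rows can be loaded into an in-memory SQLite database. This is only a sketch: the snake_case table and column names, the explicit JOIN syntax, the selected columns and the ISO timestamp format are assumptions. Only the columns needed by the query are created.

```python
import sqlite3

CURRENT = "9999-12-31 23:59:59"  # open-ended Valid To Date marking the active version

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, customer_id INTEGER, type_id INTEGER,
        first_name TEXT, last_name TEXT, valid_to_date TEXT
    );
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY, address_id INTEGER, type_id INTEGER,
        address_line_1 TEXT, address_line_2 TEXT, valid_to_date TEXT
    );
    CREATE TABLE customer_address_relationship (
        id INTEGER PRIMARY KEY, customer_id INTEGER, address_id INTEGER,
        nature_id INTEGER, valid_to_date TEXT
    );
    INSERT INTO customers VALUES (1, 1, 101, 'John', 'Doe', '9999-12-31 23:59:59');
    INSERT INTO addresses VALUES (1, 1, 201, '101 Mont Pleasant', 'London', '9999-12-31 23:59:59');
    INSERT INTO customer_address_relationship VALUES (1, 1, 1, 1001, '9999-12-31 23:59:59');
""")

# Retail customers with their billing address, current versions only.
# The joins go through the anchor ids (customer_id, address_id), not the row ids.
rows = conn.execute("""
    SELECT c.first_name, c.last_name, a.address_line_1, a.address_line_2
    FROM customers c
    JOIN customer_address_relationship r ON r.customer_id = c.customer_id
    JOIN addresses a ON a.address_id = r.address_id
    WHERE c.type_id = 101
      AND a.type_id = 201
      AND r.nature_id = 1001
      AND c.valid_to_date = :current
      AND a.valid_to_date = :current
      AND r.valid_to_date = :current
""", {"current": CURRENT}).fetchall()
print(rows)  # [('John', 'Doe', '101 Mont Pleasant', 'London')]
```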

Data updates

In a scenario where the customer data is updated in the source system, a new identifier would be created
in the source system. Once the incremental job for loading the updated customers has been run, a new Id
will be generated for the new active record in EDW, but the anchor Id will stay the same as the expired
record. The Customer fundamental entity will now look like:

Id  Customer Id  Type Id  External   First  Last  Date of Birth  Valid From Date      Valid To Date        Source System  Source     ETL Id
    (Anchor Id)           Reference  Name   Name                                                           Unique Id      System Id
1   1            101      JD0001     John   Doe   01/01/1900     01/12/2016 23:05:30  10/12/2016 23:29:59  12001          101        1001
2   1            101      JD0001     John   Doe   04/07/1975     10/12/2016 23:30:00  31/12/9999 23:59:59  12002          101        1101

After this has happened, our SQL will still work and fetch the latest customer record which is still connected
to the current address. Note that nothing has changed in the relationship table.

In the scenario where the address data is updated, again a new Id will be generated (in the source and
EDW), but the anchor Id will stay the same and the relationship entity would be untouched as before. The
Address fundamental entity will look like:
Id  Address Id   Type Id  Address             Address  Valid From Date      Valid To Date        Source System  Source     ETL Id
    (Anchor Id)           Line 1              Line 2                                             Unique Id      System Id
1   1            201      101 Mont Pleasant   London   01/12/2016 23:10:30  11/12/2016 23:44:29  10001          101        1002
2   1            201      101 Mount Pleasant  London   11/12/2016 23:45:30  31/12/9999 23:59:59  10002          101        1102

Our existing SQL will continue to work fine, fetching the latest customer details along with the latest
address details.
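
The customer update above can be sketched as an "expire and insert" step. The load_new_version helper, the column names and the rule of expiring the old row one second before the new one starts (as in the customer example) are assumptions for illustration. A real ETL tool would implement this differently. The same steps would apply to an address update.

```python
import sqlite3
from datetime import datetime, timedelta

OPEN_END = "9999-12-31 23:59:59"  # Valid To Date of the active version
FMT = "%Y-%m-%d %H:%M:%S"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id                      INTEGER PRIMARY KEY,  -- new id for every version
        customer_id             INTEGER NOT NULL,     -- anchor id, stable across versions
        date_of_birth           TEXT,
        valid_from_date         TEXT NOT NULL,
        valid_to_date           TEXT NOT NULL,
        source_system_unique_id TEXT,
        etl_id                  INTEGER
    );
    INSERT INTO customers VALUES
        (1, 1, '1900-01-01', '2016-12-01 23:05:30', '9999-12-31 23:59:59', '12001', 1001);
""")

def load_new_version(anchor_id, date_of_birth, source_uid, etl_id, loaded_at):
    """Expire the active version one second before `loaded_at`, then insert the new one."""
    expired_at = (datetime.strptime(loaded_at, FMT) - timedelta(seconds=1)).strftime(FMT)
    conn.execute(
        "UPDATE customers SET valid_to_date = ? WHERE customer_id = ? AND valid_to_date = ?",
        (expired_at, anchor_id, OPEN_END),
    )
    conn.execute(
        "INSERT INTO customers (customer_id, date_of_birth, valid_from_date, valid_to_date,"
        " source_system_unique_id, etl_id) VALUES (?, ?, ?, ?, ?, ?)",
        (anchor_id, date_of_birth, loaded_at, OPEN_END, source_uid, etl_id),
    )

load_new_version(1, "1975-07-04", "12002", 1101, "2016-12-10 23:30:00")

history = conn.execute(
    "SELECT id, customer_id, date_of_birth, valid_to_date FROM customers ORDER BY id"
).fetchall()
current = conn.execute(
    "SELECT id, date_of_birth FROM customers WHERE customer_id = ? AND valid_to_date = ?",
    (1, OPEN_END),
).fetchall()
print(history)
print(current)  # [(2, '1975-07-04')]
```

Because the relationship table stores the anchor id rather than the row id, nothing in it needs to change when the new version is inserted.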

Pros and Cons
Data updates

Normally, a modeller would be compelled to create a direct PK/FK relationship between the Customer and
the Address entities. If a customer record is updated, a new version of the Customer record would be
created with a new Id and that new Id would need to be re-referenced in the Address table, which would
require an additional process.

Clearly, the use of anchor Ids eliminates this issue, albeit at the cost of an extra relationship table, which
in turn makes the queries more expensive to run.

ETL granularity

In case of a direct PK/FK relationship between the Customer and Address entities, only two ETL jobs would
be required, one for loading Customer and one for loading Address (including the FK from Customer). For
the IBM EDW, three jobs are required as described above, resulting in additional load on the source
systems (i.e. reading the source address table twice, once for loading the Address anchor/fundamental
tables and once for loading the Customer Address Relationship table), the ETL applications and the EDW
database itself. But that provides an advantage in the following ways:
All the fundamental entities can be loaded in one go without having to worry about how they relate to
each other. In fact, this modular approach can also be used for designing and building the ETL processes
for loading the fundamental entities. This allows for greater flexibility in design and development and
also in ETL job control.
Useful metrics can be drawn from the data once it has been loaded. So 100 Customers should
correspond to 100 Addresses and finally result in 100 rows in the Customer Address Relationship table.
More often than not, we will find fewer rows in the relationship table which might indicate data or logic
discrepancies in the source data or in the ETL logic because of which an address could not be matched
to its corresponding customer record. Once that has been fixed, we can simply re-run the relationship
job which will load the remaining missing relationship rows, and re-generate the metrics to be certain.
In case there was an error in the logic of loading the customer FK correctly into the Address table for
Billing Addresses for Retail Customers in conventional modelling, we'd have to delete and load the
Address table again, this time with the correct Customer foreign keys but with new Address Ids. This
would then mean that any other tables that are referencing the Address Id would need to be reloaded
or updated with the new Address Ids. On the other hand, in the IBM EDW, we can simply delete the
data from the Customer Address Relationship table with a Nature Id=1001 (Retail Customer/Billing
Address) and re-run only the Customer Address relationship job to load the data into this relationship
table, without having any effect on the Customer or the Address fundamental tables, or any of the
other tables in the warehouse.
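
The metric check and the targeted reload described above might look like the sketch below. The table layout, the sample rows and the reload_nature helper are assumptions for illustration. In practice the corrected rows would come from re-running the relationship ETL job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_anchor (id INTEGER PRIMARY KEY);
    CREATE TABLE customer_address_relationship (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        address_id  INTEGER NOT NULL,
        nature_id   INTEGER NOT NULL
    );
    INSERT INTO customer_anchor VALUES (1), (2), (3);
    INSERT INTO customer_address_relationship (customer_id, address_id, nature_id) VALUES
        (1, 1, 1001),
        (2, 2, 1001),
        (1, 3, 1002);
""")

# Metric: customers that have no relationship row at all.
unmatched = [
    row[0]
    for row in conn.execute("""
        SELECT a.id
        FROM customer_anchor a
        LEFT JOIN customer_address_relationship r ON r.customer_id = a.id
        WHERE r.id IS NULL
        ORDER BY a.id
    """)
]
print(unmatched)  # [3]

def reload_nature(nature_id, pairs):
    """Delete all relationship rows of one nature and load the corrected pairs in one transaction."""
    with conn:
        conn.execute(
            "DELETE FROM customer_address_relationship WHERE nature_id = ?", (nature_id,)
        )
        conn.executemany(
            "INSERT INTO customer_address_relationship (customer_id, address_id, nature_id)"
            " VALUES (?, ?, ?)",
            [(customer_id, address_id, nature_id) for customer_id, address_id in pairs],
        )

# Reload only the Retail Customer Billing Address rows; nature 1002 rows are untouched.
reload_nature(1001, [(1, 1), (2, 2), (3, 4)])

counts = dict(conn.execute(
    "SELECT nature_id, COUNT(*) FROM customer_address_relationship GROUP BY nature_id"
))
print(counts)  # {1001: 3, 1002: 1}
```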

As we can see from these examples, the IBM EDW allows a lot of flexibility in design and ETL job
control, and is very valuable in scenarios that comprise hundreds of ETL jobs. Of course, it comes with
the overhead of additional table reads and writes, which could be reduced by using caching mechanisms,
etc. We hope this has been a good read and will help you get started with EDW modelling.
