Data Warehouse Principles
INTRODUCTION
OVERVIEW
WHAT IS DW ARCHITECTURE?
WHAT IS IT ARCHITECTURE?
Introduction
Licence
As these are generic software documentation standards, they are covered by the 'Creative Commons Zero v1.0 Universal' (CC0) licence.
Warranty
The author does not make any warranty, express or implied, that any statements in this document are free of error, are consistent with any particular standard of merchantability, or will meet the requirements for any particular application or environment. They should not be relied on for solving a problem whose incorrect solution could result in injury or loss of property. If you do use this material in such a manner, it is at your own risk. The author disclaims all liability for direct or consequential damage resulting from its use.
Any set of principles like these is fundamentally incomplete. The whole problem space cannot be
fully defined, so the underlying principles cannot be complete. They are, as well, only a summary of
my own reading and experience, which is also incomplete. According to Socrates, ‘True knowledge
exists in knowing that you know nothing’. That certainly applies to this endeavour. So, these
principles need to be taken with a grain of salt. After all, Moses started with 10 laws, and ended up
with 613. Anyway, hopefully they may be useful as guidelines, or as a comparison with other sets of
similar principles.
Purpose
This document defines a series of DW Architecture principles. These principles can be used to
determine if a solution complies with the DW Architecture.
Audience
Management or staff who want to understand how the DW Architecture principles were derived.
Business Intelligence or technical staff who need to assess whether a proposed solution is compliant
with the DW Architecture.
Assumptions
It is assumed that readers have a general knowledge of Economics and some exposure to IT systems. All specialised terms should be defined in the Definitions section below.
Approach
This is my own work, based on an integration of principles from books I have read. It uses a top-down, first-principles approach. As far as possible, the following text deliberately avoids technical
language. Instead, the arguments are built using the language of economics, in order to create a
framework which will allow the principles to make sense to the business user. The following
sections are a summary of the main references used. There is no attempt to repeat their arguments.
Instead, their conclusions will be used to support the principles at the end of the document. If
further detail is required, then these books are well worth study.
Related Documents
The main references are shown below.
Definitions
What does Metadata mean?
In Greek, Meta means "among" or "together with". This suggests the idea of a "fellow traveller". So, Metadata is often defined as "data about data". Metadata can be split into two main types: guide metadata and structural metadata.
Guide metadata helps decision makers find concrete, immediate data. It is often derived directly
from the data instance. For example, the search string entered into Google, the summary data in a
hierarchy, or the actual library classification scheme value for a book.
Structural metadata helps define data. For example, DDL defines tables, columns and indices.
Structural metadata often exists independently of the data, and can be used to classify it. For
example, the library classification scheme or process.
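To make the distinction concrete, the sketch below (in Python, with hypothetical table and field names) shows the two types side by side: the table definition is structural metadata that exists independently of any rows, while the search string, hierarchy summary and classification value are guide metadata that point a decision maker at concrete data.

    # A minimal sketch contrasting the two types of metadata.
    # All table and field names are hypothetical examples.

    # Structural metadata: defines the shape of the data, independently of any rows.
    structural_metadata = {
        "table": "loan_fact",
        "columns": {
            "loan_id": "INTEGER",
            "balance": "DECIMAL(18,2)",
            "as_at_date": "DATE",
        },
        "indexes": ["loan_id", "as_at_date"],
    }

    # Guide metadata: helps a decision maker find concrete, immediate data,
    # and is often derived from the data instances themselves.
    guide_metadata = {
        "search_string": "loans in arrears by state",        # what the user typed
        "hierarchy_summary": {"Australia": 1520, "NSW": 610}, # summary-level values
        "classification": "332.7",                            # library-style scheme value
    }

    print(list(structural_metadata["columns"]))
    print(guide_metadata["search_string"])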
Tags
Business Intelligence ; Data Mapping ; Metadata ; Metadata Dictionary ; Standards ; Data Architect
; Data Architecture ; Data Integration ; Data Lineage ; Data Principles ; Data Traceability ; Data
Transformation ; Data Warehouse ; Database ; Fact / Dimension ; Master Data Management ;
Subject Orientation ; Time Variant ;
Overview
This document derives a set of DW principles based on information economics. These principles can
be used to assess a DW solution.
DW Architecture can be defined as the art and science of designing decisionable Information. In
economics, Perfect Information means that all consumers (decision makers) know all things, about
all products, at all times, and are therefore always able to make the best decision regarding
purchase or sale. Clearly, this is an ideal state.
Therefore, DW Architecture needs to enable the firm to produce information that is as complete as
possible, as timely as possible, as ubiquitous as possible and as inexpensive as possible for all
internal decision makers.
Value (Decision Cycle): Decision making is based on "good enough" information. Decision makers often avoid using systems, and make their decisions anyway. The Decision Cycle or OODA (Observe, Orient, Decide and Act) loop can help to define optimal decisions. That is, optimal information is that information which enables the decision maker to move inside their competitor's decision cycle.
From these information characteristics, we can derive the high level DW Architecture Principles set out at the end of this document.
What is DW Architecture?
DW Architecture can be defined as the art and science of designing decisionable Information.
So before we can discuss Information Architecture, we need to ask what gives Information value?
In economics, Perfect Information means that all consumers know all things, about all products, at
all times, and are therefore always able to make the best decision regarding purchase or sale.
Clearly, this is an ideal state and is unobtainable. But it does list the key characteristics of
Information.
This definition will help to define DW Architecture, and help derive a set of DW Architecture principles. DW Architecture therefore needs to enable the firm to produce information that is:
- as complete as possible
- as timely as possible
- as ubiquitous as possible
- as inexpensive as possible
The first three characteristics – completeness, timeliness and ubiquity – represent the value that information provides. The fourth characteristic, expense, represents the cost; for perfect information it would be zero. Gathering complete information can be very costly, especially if there is endless repetition, so most of the time we have to rely on incomplete data, or just hunches. The fifth element, the decision maker, draws everything together, giving the information actual value and providing purpose.
Purpose means that the requirements and decisions are defined for the data. Meaning means the
data should describe all, and only, currently true propositions in the real world. This data about data
is called metadata. Integration is also critical, as any decision maker needs to be able to join like with like.
Completeness means that the decision maker will be able to base decision making on this
information. Consequently, information that lacks context (i.e. adequate metadata) can be
considered incomplete.
Can you trust what you are told? What is the cost and value of checking the information?
Integration often uses subject orientation as a solution pattern. An accounting transaction is the
same thing across different business units, but may look different when executed at the operational
level. The differences need to be removed so that a top down view can be created. Management
need some way to change the operational perspective to a subject area perspective in order to
properly make overarching decisions.
The bottom line of Completeness is: can the decision maker rely on the information to make a
decision, in terms of sufficiency and trustworthiness?
All information has either an implicit or explicit time aspect. In most cases, for operational systems,
the only time value is the current time. However, for reporting systems, time needs to be explicitly
defined for all data. This is required as the reporting system often has to provide historical reports, so it must keep track of valid past values.
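As a minimal sketch of what keeping valid past values can look like, the hypothetical rows below carry explicit valid-from and valid-to dates, so a historical report can reconstruct what was true at any point in time.

    from datetime import date

    # Hypothetical time-variant customer rows: each change closes the old row
    # and opens a new one, so past values remain queryable.
    customer_history = [
        {"customer_id": 42, "segment": "Retail",
         "valid_from": date(2019, 1, 1), "valid_to": date(2021, 6, 30)},
        {"customer_id": 42, "segment": "Business",
         "valid_from": date(2021, 7, 1), "valid_to": date(9999, 12, 31)},
    ]

    def segment_as_at(rows, customer_id, as_at):
        """Return the segment that was valid for the customer on a given date."""
        for row in rows:
            if row["customer_id"] == customer_id and row["valid_from"] <= as_at <= row["valid_to"]:
                return row["segment"]
        return None

    print(segment_as_at(customer_history, 42, date(2020, 3, 1)))  # Retail
    print(segment_as_at(customer_history, 42, date(2022, 3, 1)))  # Business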
The bottom line of Timeliness is: is the information available before the decision maker must make a decision? By default, for management decisions, information should be available on a daily basis, as most management decisions do not require a finer time period. Naturally, for operational decisions, the information should be available on a per-minute or per-second basis.
Risk is often used as an excuse to unduly restrict information. Security restrictions are often applied in a stove-piped manner which reflects the organisational unit structure, rather than value adding
processes. This leads to data duplication, which leads to data inconsistencies, a reduction in the
potential information value, and ultimately to poor decisions.
You know the ones in the other departments. Why do they need to know? It’s our data.
One observation is that it is theoretically impossible for any single person to properly define all the possible mappings between all users and all data, in order that the value of the information is maximised. This is especially so, as the value of the information cannot be fully defined in advance. On the other hand, it should be possible to map between defined risks, defined data subsets and defined user groups. This then leads to an exclusion principle. So, by default, the information should be available as widely as possible. However, wherever there are specific risks, these should restrict some information access in order to reduce the specific risk.
Our information is secure, but it has little value, because we won’t share it. A classic double-bind or
no-win situation.
This exclusion principle resolves the natural tension between the maximisation of information value
and the need to reduce information risk. Security will be applied on an exception basis, rather than
the reverse. That is, the default for all information access should be open, and only when there are
specific well defined risks should security be imposed on specific well defined sets of data.
So, for example, information such as passwords, pay rates, etc. needs to be controlled, based on risk requirements such as data privacy, audit and other controls. But these rules should be applied to specific attributes and/or rows, rather than to all attributes or all rows. Another example is embargoing, so that information is restricted until publication, and open afterwards.
The implication of non-excludability is that the security profile of data decays over time until it is all
public knowledge. This decay could be explicitly recognised by a security grandfathering process.
This would make it simpler to share older and no longer controversial data. For example, sometime
after a project is completed, all project documentation could be made available. This would help in
passing on potentially valuable historical lessons.
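A minimal sketch of how embargo and grandfathering rules might be expressed, assuming hypothetical dates: access is embargoed before publication, restricted while the data is current, and opened once the grandfathering date has passed.

    from datetime import date

    def effective_classification(publish_date, grandfather_date, today):
        """Security decays over time: embargoed before publication,
        restricted while current, open after the grandfathering date."""
        if today < publish_date:
            return "EMBARGOED"
        if today < grandfather_date:
            return "RESTRICTED"
        return "OPEN"

    # Hypothetical project document: published mid-project, opened some years later.
    print(effective_classification(date(2020, 3, 1), date(2024, 1, 1), date(2021, 6, 1)))  # RESTRICTED
    print(effective_classification(date(2020, 3, 1), date(2024, 1, 1), date(2025, 6, 1)))  # OPEN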
Cost of Usage
An ironic characteristic of information is that it does not exhibit a high degree of transparency. That is, to evaluate the information you have to know it, so you have to invest in learning in order to evaluate it. To evaluate a piece of software you have to learn to use it; to evaluate a movie you have to watch it. In other words, information is an experience good.
Usability is the ease with which people employ a particular tool to achieve a particular goal. This is
often resolved through change management and ongoing education support. It is critical to support
new users through training and assistance. This lowers an important barrier to entry for the proper
exploitation of the information for decision making.
Cost of Discovery
Findability is the quality of being locatable or navigable. Structuring the data and using well understood, common terms (aka "old words") can significantly reduce this cost. Conversely, egregious renaming of data elements can increase the cost significantly. Similarly, egregious replication of data can easily make the usage cost rise, due to the confusion engendered. A key support for findability is high quality “data about data”, or metadata.
Cost of Production
Information can be expensive to produce initially. Therefore, using standardisation, simplification,
consistency, common components, etc, will drive down the cost of information production. Clearly,
if the information is expensive to collect, this will reduce Completeness.
However, business information is never static. Some change examples include: decentralisation,
new services, new competitors, supply changes, outsourcing, in sourcing, etc. These need to be
supported by new KPIs. And how can the impact of these changes be properly measured? The
reference data that decisions are based upon is constantly shifting, making comparisons to previous
periods very difficult or impossible. This increases uncertainty, as nobody is ever sure why something happened, because the basic parameters shift from period to period. This makes the cost of producing
sensible multi-period comparison data very high.
Another cost of production is the need to retain history. This means ensuring that the data is time-variant and non-volatile, so that research can be done across time.
Cost of Reproduction
A related characteristic that alters information markets is that information has almost zero marginal
cost. This means that once the first copy exists, it costs nothing or almost nothing to make a second
copy. But there is a dark side to reproduction in that it can create the basis for confusion by
reducing trust. This applies especially when the same information is copied at different times, and
they become out of sync, even though they were initially identical.
For the reproduced information to be credible, the information must be of adequate quality and
verifiable. This means the who, what, where, when and how all need to be known as well. That is,
who was the source, who reviewed it, etc. There also needs to be confidence that the data is
reconciled back to source. If these are missing, then the data will lack credibility, and be considered
untrustworthy.
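As an illustration of the who, what, where, when and how that makes a reproduced copy credible, a provenance record such as the hypothetical one below can travel with the data and be checked before the data is trusted.

    # Hypothetical provenance record attached to a reproduced data set,
    # capturing the who, what, where, when and how needed for credibility.
    provenance = {
        "what": "monthly_loan_balances",
        "who_sourced": "core_banking_extract",
        "who_reviewed": "finance_data_steward",
        "where": "dw.finance.loan_balance_fact",
        "when_extracted": "2024-06-30T21:00:00",
        "how": "full extract, then aggregated to month end",
        "reconciled_to_source": True,   # totals agreed back to the source system
    }

    # Any missing element undermines trust in the copy.
    missing = [key for key, value in provenance.items() if value in (None, "", False)]
    print("credible" if not missing else f"untrustworthy: missing {missing}")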
[Figure: manuscript streams by language (Greek, Latin, English) and date – Antioch (30 AD), Alexandria (100 AD), Vaticanus (300 AD), Waldensian (1000 AD), Erasmus (1500 AD), Stephens (1600 AD), King James, Greisbach (1800 AD), Revised Version.]
Multiple versions of the Bible have been a constant source of confusion, with significant historical
impacts.
There is always difficulty in assessing the value of a new information system. The standard approach is as follows:
Value = Average Decision Value (with new system) - Average Decision Value (without new system).
Even when there is a clear understanding of the information’s potential value, each decision maker
does not know what the value is until they use it. This is because Information is an Experience good.
Using information is akin to driving a car. It is better to offer regular users the capability to drive
their own car, rather than to provide a taxi service. That way, they can take the vehicle in any
direction they choose. But it does depend on knowing how to drive. There is still some value in the
taxi service, for new users, or users who have a specific well-defined request.
Information is needed to support all decision making from the creation of the mission statement to
the support needed for a help desk. How can a firm determine the optimal information set, given a
set of decisions?
Information Optimality
Decision makers are satisficers (Herbert Simon). That is, decision making will be based on "good enough" information, rather than optimal or complete information. They can always walk away
from the Technology and make the decision based on available information, which could be just a
hunch.
"An information retrieval system will tend not to be used whenever it is more painful and
troublesome for a customer to have information than for him not to have it."
It is possible to use the OODA loop to define a form of optimality. That is, optimal information is
that information set that enables the decision maker to move inside their competitor’s decision
cycle. Clearly, this provides a means to organise and rank the set of defined decisions for
subsequent support. This leads into the Decision Cycle or OODA loop.
[Figure: the business decision maker's OODA loop – Observe (data collection), Orient (synthesis), Decide (choice), Act – with feedback flowing back into the next observation.]
1. Observe: This is the collection of data by means of the senses. In Dewey terms, this would
be the formulation of the question.
2. Orient: This is the analysis and synthesis of data to form one's current mental perspective.
In Dewey terms, this would mean determining the options, the criteria, and collecting
enough data to distinguish them.
3. Decide: This is the determination of a course of action based on one's current mental
perspective. In Dewey terms, this means selecting the least worst of the options, based on
the criteria and data collected.
4. Act: This is the physical playing-out of decisions. The action provides feedback to the next
observation.
According to this idea, the key to victory is to be able to create situations wherein one can make
appropriate decisions more quickly than one's opponent. The construct was originally a theory for
achieving success in air-to-air combat, developed out of Boyd's Energy-Manoeuvrability theory and
his observations on air combat between MiGs and F-86s in Korea.
Note that decision making can create the OO-OO-OO cycle. What this means is that the decision
maker has to constantly re-observe and re-orient. For example, this could be looking at some event
in Australia at a summary level. Then they look to see that it has occurred in Sydney. They drill
down further to see that it has occurred in George St. At each step, they are drilling down a
hierarchy or summary view, using data at one level that becomes the metadata of the next level.
This can continue without limit until the decision maker has found sufficient detail to decide what to
do.
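The drill-down can be sketched as walking a summary hierarchy, where the value selected at one level becomes the filter (the metadata) for the next. The geography and counts below are hypothetical.

    # Hypothetical event counts, summarised by a geographic hierarchy.
    events = [
        {"country": "Australia", "city": "Sydney",    "street": "George St",  "count": 40},
        {"country": "Australia", "city": "Sydney",    "street": "Pitt St",    "count": 5},
        {"country": "Australia", "city": "Melbourne", "street": "Collins St", "count": 10},
    ]

    def drill(rows, level, **filters):
        """Summarise rows at the given level, restricted by the levels chosen so far."""
        totals = {}
        for row in rows:
            if all(row[key] == value for key, value in filters.items()):
                totals[row[level]] = totals.get(row[level], 0) + row["count"]
        return totals

    print(drill(events, "country"))                                     # observe: Australia 55
    print(drill(events, "city", country="Australia"))                   # re-orient: Sydney 45, Melbourne 10
    print(drill(events, "street", country="Australia", city="Sydney"))  # George St 40, Pitt St 5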
Part of the OO-OO-OO cycle is the ability of the decision maker to discover new things about the
data, as they do research. This implies that they will need to be able to create new metadata as new
discoveries emerge. This means that they are using the current metadata, and also adding to it as
they go down the learning curve. Any decision making tool must be able to support both the display
and the collection of new metadata.
Another key idea is that the OODA cycle starts with Observation. What this implies is that the
situation could be either new or previously defined. If it is previously defined, then the decision is
known and all the decision maker needs is to drill down to the detail until the decision becomes
obvious. However, if the situation is new, then the decision maker is in research mode. By
definition, they cannot define what they need. Instead, they need a research tool to be able to
explore all available data until the reasons for the current state are known. This is why reporting systems often cannot be fully specified, unlike operational systems. The requirements are
fundamentally undefinable, as most people lack the gift of prophecy.
Decision Types
How can we classify decisions?
Almost as a corollary, scarce information is still only a potential advantage. The advantage can only
be gained if it is acted on.
Syndrome, from The Incredibles: “In a world where everybody is special, nobody is special.”
Network Effect
In economics and business, a network effect is the effect that one user of a good or service has on
the value of that product to other users. The classic example is the telephone. The more people
own telephones, the more valuable the telephone is to each owner. This creates a positive
externality because a user may purchase their phone without intending to create value for other
users, but does so in any case. The expression "network effect" is applied most commonly to
positive network externalities as in the case of the telephone. Negative network externalities can
also occur, where more users make a product less valuable, but are more commonly referred to as
"congestion" (as in traffic congestion or network congestion). Note that this is an economics idea,
not a technology concept. It often is enabled by the introduction of a new technology. Metcalfe's
law states that the value of a telecommunications network is proportional to the square of the
number of users of the system. The real value appears to be n log(n), but the idea is still valid.
Wouldn’t it be great to be a member of the “Pudding Owners Collective”, so that you could own a pudding that never runs out?
[Figure: a messaging hub – reports flow between spokes via the central hub rather than directly between spokes.]
The choice between the Point to Point (P2P) and Hub and Spoke patterns occurs often.
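One reason the choice matters can be shown with a simple count: wiring n systems point to point needs n(n-1)/2 links, while a hub needs only n, so the integration burden grows very differently as systems are added. A small sketch:

    def p2p_links(n):
        """Every system connected directly to every other system."""
        return n * (n - 1) // 2

    def hub_links(n):
        """Every system connected once, to the hub."""
        return n

    for n in (5, 10, 50):
        print(f"{n} systems: point-to-point {p2p_links(n)} links, hub-and-spoke {hub_links(n)} links")
    # 5 systems: 10 vs 5; 10 systems: 45 vs 10; 50 systems: 1225 vs 50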
What should be clear is that success is not easy. There are many examples where the innovating
firm was trying to exploit the Network Effect, but they only achieved partial success. Many issues
prevented full success, including inadequate standards, vendor lock in, set up issues, etc. See
Shapiro and Varian for more details.
What is IT Architecture?
TO BE
The “TO BE” IT Architecture state is that ideal IT that enables all decision makers to make better,
faster, cheaper decisions than their competitors. A corollary of the Carr thesis is that the “TO BE” IT
Architecture will see the uncoupling of the Technology Architecture (commodity) from the
Information Architecture (value add), in order to deliver lowest overall cost. So, we have a very high
level picture of this end state. IT is separated into 'big I, little t' inside the firm and 'little i, big T' outside the firm.
AS IS
The “AS IS” is like a Big Ball of Mud. See the Foote and Yoder article which gives a good explanation
for how the current mess comes about. In general, this is the accumulated outcome of a series of
unintegrated implicit and explicit decisions.
Oftentimes, Vendors are not ready for the future either. Product offerings are poorly thought out,
and often mix Information and Technology together. That is, they sometimes attempt to sell
services to a firm that cover both what is generic and what is particular to a firm. By not separating
the two parts, the Vendor becomes too involved in what a firm is doing.
But what is the first thing that actors in a disaster movie like the Titanic reach for? The blueprints, of
course. Even though it is a sprawl, the current state Architecture does need to be documented.
Role of IT Architect
This is where IA rubber hits the technology road. Architects should facilitate the movement from
current (AS IS) to end (TO BE) state. They do this by:
1. Documenting the current (AS IS) state Architecture.
2. Helping the firm separate the Information component from the Technology. This means identifying and promoting the outsourcing of Infrastructural Technology, while retaining proprietary Information.
This will help the business to retain what is unique (scarce) from what is generic (commodity).
Restated, this is the separation of the what (in source) from the how (outsource).
How soon will the firm reach the end state? There are clearly a series of intermediate states, as the Technology industry is still evolving.
Also note that the other types of IT Architecture have this dual Information and Technology
Architecture split.
The following principles set out the characteristics of an ideal state. It is not expected that each new
system implemented will comply fully with all principles. However, if they do not, then the
exceptions should be documented and rationalised. Any new system must appreciably reduce the
level of architectural non-compliance, even if the new system does not tick all boxes.
Note that the goal of this DW Architecture is not necessarily to collect up all the variety of databases
into one large system. It is more to provide a utility/environment whereby all key decisions are
based on information that is available to decision makers, in accordance with the principles. An
inventory of all current databases/systems will help in defining the current information needed to
support this, and is useful as a kind of requirements catalogue.
All information must be available to all decision makers except for defined
risk exceptions.
The fundamental principle for almost all information is openness. This is based on the fundamental
assumption of trust that is made between all decision makers within a firm. All decision makers
must be able to view all non-confidential data. This ensures the maximisation of potential information value. All access restrictions need to be justified on specific risk cases. Valid examples include passwords, financial data, customer privacy, salary, etc. This implies that there is risk traceability. All tables, columns and row sets should have a risk allocation, where the default is OPEN. Similarly, risk sets can be grouped into roles. To support this, all decision makers must be authenticable, meaning that the decision maker is identified before access to data is granted. All roles must be authorisable, so that the decision maker's role has been checked before access to data is granted.
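A minimal sketch of this principle, using hypothetical column, risk and role names: every column defaults to OPEN, and access is denied only where a specific risk has been allocated and the authenticated decision maker's role does not carry that clearance.

    # Hypothetical risk allocation: anything not listed defaults to OPEN.
    column_risk = {
        "employee.salary": "PRIVACY",
        "customer.tax_file_number": "PRIVACY",
        "user.password_hash": "CREDENTIAL",
    }

    # Hypothetical role authorisations, granted on specific, well-defined risks.
    role_clearances = {
        "payroll_officer": {"PRIVACY"},
        "analyst": set(),
    }

    def can_view(role, column, authenticated=True):
        """Default is open; deny only for a specific allocated risk the role lacks."""
        if not authenticated:
            return False
        risk = column_risk.get(column)          # None means the default, OPEN
        return risk is None or risk in role_clearances.get(role, set())

    print(can_view("analyst", "loan.balance"))             # True  (default OPEN)
    print(can_view("analyst", "employee.salary"))          # False (specific risk, no clearance)
    print(can_view("payroll_officer", "employee.salary"))  # True  (authorised role)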
All target information must reconcile to source and show its lineage.
The fundamental issue is that every data replication increases data cost, and reduces data value.
There are two ways to negate this: either avoid copying (one source), or provide a means to ensure the target data is identical to the source data. Negating the value reduction is costly, and that cost must be borne by the target data store. The target store should therefore reconcile to source, so that when the source changes, the change automatically appears in the target. The target store should also show its lineage from source, including any transformations. Views are fine, as these automatically satisfy this principle.
In other words, mappings and traceability are crucial.
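As a sketch of what reconciling to source and showing lineage can mean in practice, the check below compares simple control totals between source and target and records the mapping that produced the target column. The names and figures are hypothetical.

    # Hypothetical control totals taken from the source system and the target store.
    source_totals = {"row_count": 10_240, "sum_balance": 184_250_000.00}
    target_totals = {"row_count": 10_240, "sum_balance": 184_250_000.00}

    # Lineage: where the target column came from and how it was transformed.
    lineage = {
        "target": "dw.loan_fact.balance_aud",
        "source": "core_banking.loan.balance",
        "transformation": "converted to AUD at month-end rate, rounded to 2 dp",
    }

    def reconciles(source, target, tolerance=0.01):
        """Target matches source when counts agree and sums agree within tolerance."""
        return (source["row_count"] == target["row_count"]
                and abs(source["sum_balance"] - target["sum_balance"]) <= tolerance)

    status = "reconciled" if reconciles(source_totals, target_totals) else "investigate"
    print(status, lineage["target"], "<-", lineage["source"])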