Data Warehouse Principles

This document defines a series of Data Warehouse Architecture principles. These principles were derived from my own experience and from the writings of Carl Shapiro, Hal Varian and Nicholas Carr, and are based on Economics and Computer Science. Any set of principles like these is fundamentally incomplete. According to Socrates, ‘True knowledge exists in knowing that you know nothing’. That certainly applies to this document. So, please take these principles with a grain of salt.


Datawarehouse Principles

Matthew Lawler lawlermj1@gmail.com

D:\D\Documents\DW Me\0 Publish\DW Principles.docx  February 13, 2018

INTRODUCTION

OVERVIEW

WHAT IS DW ARCHITECTURE?

WHAT IS TECHNOLOGY (IT)?

WHAT IS IT ARCHITECTURE?

WHAT ARE THE HIGH LEVEL PRINCIPLES?


Introduction
Licence
As these are generic software documentation standards, they will be covered by the 'Creative
Commons Zero v1.0 Universal' CC0 licence.

Warranty
The author does not make any warranty, express or implied, that any statements in this document
are free of error, are consistent with a particular standard of merchantability, or will meet the
requirements for any particular application or environment. They should not be relied on for solving
a problem whose incorrect solution could result in injury or loss of property. If you do use this
material in such a manner, it is at your own risk. The author disclaims all liability for direct or
consequential damage resulting from its use.

Any set of principles like these is fundamentally incomplete. The whole problem space cannot be
fully defined, so the underlying principles cannot be complete. They are, as well, only a summary of
my own reading and experience, which is also incomplete. According to Socrates, ‘True knowledge
exists in knowing that you know nothing’. That certainly applies to this endeavour. So, these
principles need to be taken with a grain of salt. After all, Moses started with 10 laws, and ended up
with 613. Anyway, hopefully they may be useful as guidelines, or as a comparison with other sets of
similar principles.

Purpose
This document defines a series of DW Architecture principles. These principles can be used to
determine if a solution complies with the DW Architecture.

Audience
Management or staff who want to understand how the DW Architecture principles were derived.

Business Intelligence or technical staff who need to assess whether a proposed solution is compliant
with the DW Architecture.

Assumptions
It is assumed that the readers have general knowledge of Economics and exposure to IT systems. All
terms should be defined in the Definitions below.

Approach
This is my own work, based on an integration of principles from books I have read. It uses a top-down,
first-principles approach. As far as possible, the following text deliberately avoids technical
language. Instead, the arguments are built using the language of economics, in order to create a
framework which will allow the principles to make sense to the business user. The following
sections are a summary of the main references used. There is no attempt to repeat their arguments.
Instead, their conclusions will be used to support the principles at the end of the document. If
further detail is required, then these books are well worth studying.

Related Documents
The main references are shown below.


Author                         Reference                       Publisher      Year
Brian Foote and Joseph Yoder   Big Ball of Mud                 U. Illinois    1999
                               (www.laputan.org/mud/mud.html)
C Shapiro and H Varian         Information Rules               Harvard Press  1999
Nicholas Carr                  Does IT Matter?                 Harvard Press  2004

Definitions
What does Metadata mean?
In Greek, meta means "among" or "together with". This suggests the idea of a "fellow traveller". So,
Metadata is often defined as "data about data". Metadata can be split into two main types:
guide metadata and structural metadata.

Guide metadata helps decision makers find concrete, immediate data. It is often derived directly
from the data instance. For example, the search string entered into Google, the summary data in a
hierarchy, or the actual library classification scheme value for a book.

Structural metadata helps define data. For example, DDL defines tables, columns and indices.
Structural metadata often exists independently of the data, and can be used to classify it. For
example, the library classification scheme or process.
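The distinction can be sketched in a few lines of Python (all names and data here are invented for illustration): structural metadata describes the possible shape of the data, while guide metadata is derived from the actual values and is used for searching.

```python
# Structural metadata: defines the data independently of any rows,
# much like DDL defines tables, columns and indices.
structural = {
    "table": "employee",
    "columns": {"name": "text", "birthday": "date", "job": "text"},
    "indices": ["name"],
}

# The data instances themselves.
rows = [
    {"name": "John Smith", "birthday": "1980-01-01", "job": "Clerk"},
    {"name": "Joan Smith", "birthday": "1975-06-15", "job": "Manager"},
]

# Guide metadata: derived from the actual value set, used to help a
# decision maker find concrete data (like summary data in a hierarchy).
guide = {"jobs_present": sorted({r["job"] for r in rows})}

print(guide["jobs_present"])  # ['Clerk', 'Manager']
```

Note that the guide metadata changes whenever the rows change, while the structural metadata can exist before any rows do.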

Accessibility: A general term used to describe the degree to which a product (e.g., a device, service
or environment) is accessible by as many people as possible.

Attribute: An informal term for Attribute value: the smallest element of data. It generally
represents a single data value and its related definitional metadata. One or more Attributes make
up a tuple. For example, a measure such as a formula, a column, an XML tag, a field, a blob, a PICK
attribute, etc.

Characterisation: A set that represents something in the real world. For example, the attributes of
a given table structure can represent something real: an employee table can have attributes such as
name, birthday and job, and is then the characterisation of a real employee.

Closed World Assumption: Values held in a Database describe all, and only, currently true
propositions in the real world. For example, a tuple represents a proposition in the real world: at a
point in time, the employee table has two employees, John and Joan Smith, and in the real world, at
the same point in time, these are the only two employees of the firm.

Common Data: Any subset of Table Universes in different databases that have the same meaning or
definition within an Information Space. That is, the common data has the same semantic metadata.
For example, organisation structure or location data.
Confidential: Information which, if viewed by unauthorized persons, would increase a defined risk.

Cost, Fixed: Fixed costs are business expenses that are not dependent on the level of production or
sales. They tend to be time-related, such as salaries or rents paid per month. This is in contrast
to Variable costs, which are volume-related.

Cost, Marginal: In economics and finance, marginal cost is the change in total cost that arises
when the quantity produced changes by one unit.

Database: Any persistent data store that has information needed for decision making. The term
implies a technology platform, hardware, data access layer, etc. (An informal term for a Database
Value set.) For example, RDBMS systems, unstructured data, directories, a PICK Account or
Directory, etc.

Database, Primary Source: Any Database that has an Attribute value set which is master data. For
example, Customer data.

Database, Source: Any Database that has a value set that is replicated to a target database. This
can be either primary or secondary.

Database, Target: Any Database that has a value set that is replicated from a source database.

DW: Data Warehouse.

Data Warehouse: See Database above.

Decision: The outcome of a set of Observation and Orientation pairs that will then lead to an
Action.

Decision Cycle: The timeframe in which a decision must be made. For example, Board decisions have a
monthly decision cycle, whereas operational decisions have a minute decision cycle.

Decision Maker: Any agent of the firm who needs information to make decisions. For example, the CEO
or a call centre operator.

Dimension: Another term for Common data in a data warehouse context, used to group and join with
other data.

Event Based: Data that is true at a point in time, rather than over a period of time. An example is
a transaction such as a posting.

Excludability: In economics, a good or service is said to be excludable when it is possible to
prevent people who have not paid for it from enjoying its benefits, and non-excludable when it is
not possible to do so.

Experience good: In economics, an experience good is a product or service where characteristics
such as quality or price are difficult to observe in advance, but can be ascertained upon
consumption.

Hierarchy: Any data that has a parent and child relationship.

Information Completeness: The information is fit for the purpose of making a decision. That is, is
the purpose known, and does it describe true information about the world?

Information Integration: Semantically identical information has been transformed into syntactically
identical information.


Information Locatability: The quality of being navigable. In this context, it refers to the ability
to use Structural metadata to find the location.

Information Searchability: The quality of being able to find something. In this context, it means
being able to find strings or values across all databases. For example, a grep or Google type
search.

Information Timeliness: The information is available before the decision maker must make a
decision.

Information Ubiquity: The information is everywhere at the same time, that is, omnipresent.

Instance: A statement about the real world, based on some principle or idea.

Joinable: Two data functions are joinable if and only if, for every first coordinate that the two
functions have in common, the corresponding second coordinates match. Only if two functions are
compatible will the union of those two functions be compatible. Also known as a relation.
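The joinability test above is mechanical enough to sketch in code. A minimal illustration in Python, treating each data function as a key-to-value mapping (the data is invented):

```python
def joinable(f, g):
    """Two data functions are joinable if and only if they agree on
    every first coordinate (key) they have in common."""
    return all(f[k] == g[k] for k in f.keys() & g.keys())

# Employee -> department mappings from two different databases.
hr  = {"john": "sales", "joan": "finance"}
crm = {"joan": "finance", "jim": "it"}
bad = {"joan": "marketing"}  # disagrees with hr on 'joan'

print(joinable(hr, crm))  # True: their union is still a function
print(joinable(hr, bad))  # False: conflicting second coordinates
```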
Lineage Path: The definition of all transactions from all source attributes to a particular target
attribute, or vice versa. This depends on mappings. For example, what columns are used to create a
sum?

Master Data: Common data that is in the Primary Source Database. All other common data instances
will be called secondary data.

Metadata: Data about data.

Metadata, Guide: Metadata created from actual value sets, rather than possible value sets. Guide
metadata is used for searching. See Database Value Set and Searchability. For example, the search
string entered into Google, the library classification for a book, or the summary data in a
hierarchy.

Metadata, Structural: Metadata used to define a database. See Database Universe and Locatability.
For example, DBMS catalog information such as tables, columns and indices.

Non-confidential: Information which, if viewed by unauthorized persons, no longer changes a defined
risk.

Non-persistent Data: Data that does not outlive the execution of the program or function that
created it.

Persistent Data: Data that outlives the execution of the program or function that created it.

Predicate: Something that has the form of a declarative sentence, but holds embedded variables
whose values are unknown. The truth of the predicate cannot be determined unless the values are
known. For example, x has the value 4.

Proposition: A declarative sentence that is either true or false. For example, "All Cretans are
liars."

Reconciliation: The transaction that is needed to ensure consistency between a primary source and a
replicated target.

Replication: Any copy of the data between a source database and a persistent target database. It
represents a set of transactions that maps data from a source database to a persistent target
database.

Report: A view over the data that supports a specific decision.

Requirement: A requirement is a singular documented need of what a particular product or service
should be or do. It is most commonly used in a formal sense in systems engineering or software
engineering. It is a statement that identifies a necessary attribute, capability, characteristic,
or quality of a system in order for it to have value and utility to a user. Alternately, a
proposition that, if satisfied, will increase value for a firm. That is, there is no sense of
uncertainty about a requirement. Requirements are either true or false, as they are satisfied or
not satisfied; therefore, a requirement's value is fixed at a point in time.

Requirements Traceability: The ability to link requirements back to stakeholders' rationales and
forward to corresponding design artefacts, code, and test cases. Traceability supports numerous
software engineering activities such as change impact analysis, compliance verification of code,
regression test selection, and requirements validation. This depends on mappings.

Risk: A concept that denotes the precise probability of specific eventualities. Technically, the
notion of risk is independent of the notion of value and, as such, eventualities may have both
beneficial and adverse consequences. However, in general usage the convention is to focus only on
potential negative impact to some characteristic of value that may arise from a future event. For
example, Risk = Event * Probability * Event Value.

Role: A mapping between decision makers and resources such as data functions (tables, attributes,
tuples, etc.).

Slowly Changing: Data that is true over a period of time, rather than just at one point in time.
For example, an account balance is stable between postings.

Structured: Structured means that the data or process has been subject to formal analysis, and is
defined in terms of some principle such as normalisation or optimisation.

Subject Orientation: Data that is organised into a common information area, rather than around a
process like a product system.

Supported: Supported means data that is managed or created by staff as part of their normal
responsibilities. This recognises that there is also informal data within the organisation, which
is not supported. This implies that the value set has some defined attributes that allow it to be
included in views; that is, found, joined, used, etc.

Table: A table is often a characterisation of a real world concept, such as Employee. It is an
intermediate grouping of data. (An informal term for Table value set.) For example, an RDBMS table,
a file, a PICK File or subfile.

Third Normal Form: A simple way of defining this is that every non-key attribute must provide a
fact about the key, the whole key, and nothing but the key.
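The rule can be illustrated with a small Python sketch (invented data). Here 'dept_location' is a fact about 'dept', not about the key 'emp_id', so the first structure violates Third Normal Form; the decomposition repairs it:

```python
# Violates 3NF: dept_location depends on dept, a non-key attribute.
unnormalised = [
    {"emp_id": 1, "name": "John", "dept": "sales", "dept_location": "Sydney"},
    {"emp_id": 2, "name": "Joan", "dept": "sales", "dept_location": "Sydney"},
]

# Decomposition: every non-key attribute is now a fact about its key.
employee = [{"emp_id": r["emp_id"], "name": r["name"], "dept": r["dept"]}
            for r in unnormalised]
department = {r["dept"]: r["dept_location"] for r in unnormalised}

# Each location fact is now stored exactly once.
print(department)  # {'sales': 'Sydney'}
```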
Time Variant: Data that covers historical and current data, rather than just current data.



Traceability: The ability to chronologically interrelate uniquely identifiable entities in a way
that is verifiable; that is, the ability to verify the history, location, or application of an item
by means of documented recorded identification. This depends on mappings. For example, traceability
implies use of a unique piece of data (e.g., an order date/time or a serialized sequence number)
which can be traced through the entire software flow of all relevant application programs. Messages
and files at any point in the system can then be audited for correctness and completeness, using
the traceability key to find the particular transaction.

Transaction: A function over one or more databases. It changes the database value set. For example,
Replications, Reconciliations, etc.

Tuple: The elements of a table are called Tuples. The ordered pairs of a tuple are called attribute
value pairs. Other examples of Tuples are rows, records, a PICK record, etc.

Unstructured: Unstructured means that the data or process has not been subject to formal analysis,
but is needed as part of decision making. Unstructured data can have a defined attribute set that
is generically defined, rather than formally analysed.

Update Anomaly: If the same information is expressed on multiple rows, and the data changes in the
real world, but the system only changes it on one row, there is a logical inconsistency called the
update anomaly.
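A minimal Python sketch of the anomaly (invented data): the same real-world fact is stored on two rows, but only one row is updated.

```python
# The fact "sales is located in Sydney" is expressed on two rows.
rows = [
    {"emp": "John", "dept": "sales", "dept_location": "Sydney"},
    {"emp": "Joan", "dept": "sales", "dept_location": "Sydney"},
]

# The department moves, but the system updates only one row...
rows[0]["dept_location"] = "Melbourne"

# ...so one real-world fact now has two conflicting stored values.
locations = {r["dept_location"] for r in rows if r["dept"] == "sales"}
print(locations)  # two conflicting values -- the update anomaly
```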
Usability: Usability denotes the ease with which people can employ a particular tool or other
human-made object in order to achieve a particular goal. In human-computer interaction and computer
science, usability usually refers to the elegance and clarity with which the interaction with a
computer program or a web site is designed.

View: Any function over a set of tables that produces a non-persistent table. Generally, this can
be represented as a set of attributes with aggregation and filtering.

Tags
Business Intelligence ; Data Mapping ; Metadata ; Metadata Dictionary ; Standards ; Data Architect
; Data Architecture ; Data Integration ; Data Lineage ; Data Principles ; Data Traceability ; Data
Transformation ; Data Warehouse ; Database ; Fact / Dimension ; Master Data Management ;
Subject Orientation ; Time Variant ;


Overview
This document derives a set of DW principles based on information economics. These principles can
be used to assess a DW solution.

DW Architecture can be defined as the art and science of designing decisionable Information. In
economics, Perfect Information means that all consumers (decision makers) know all things, about
all products, at all times, and are therefore always able to make the best decision regarding
purchase or sale. Clearly, this is an ideal state.

Therefore, DW Architecture needs to enable the firm to produce information that is as complete as
possible, as timely as possible, as ubiquitous as possible and as inexpensive as possible for all
internal decision makers.

Completeness (All things): This implies that the information is fit for purpose (decision making),
which implies that the purpose is defined, the data has meaning, and the data can be linked to
related data. Purpose means that the requirements and decisions are defined for the data. Meaning
means the data should describe all, and only, currently true propositions in the real world. This
data about data is called metadata. Integration is also critical, as any decision maker needs to be
able to join like with like.

Timeliness (All times): This implies that the information is accessible when it is needed. It also
implies that time is an integral part of all information.

Ubiquity (All decision makers): This means being everywhere at the same time, that is, omnipresent.
This implies that the information is accessible wherever it is needed. Risk is often used as an
excuse to unduly restrict information. Security restrictions are often applied in a stove-piped
manner which reflects organisational unit structure, rather than value-adding processes. It is
theoretically impossible for any single person to properly define all the possible mappings between
all users and all data, in order that the value of the information is maximised. Therefore, to
avoid this, we can employ an Openness principle: by default, the information should be available as
widely as possible. However, wherever there are specific risks, these should restrict some
information access in order to reduce the specific risk. This principle resolves the natural
tension between the maximisation of information value and the need to reduce information risk.

Expense (Zero cost): As an experience good, information imposes a cost of usage. The decision maker
must either rely on an interpreter, or access the information directly. Training or support is
often needed. Information also imposes search costs, which include verifying information quality.
Good quality information is expensive to produce initially, and business rules need to be enforced.
Unfortunately, information is cheap to reproduce, but each copy reduces value and increases hidden
costs. To retain credibility, the source who, what, where, when and how are needed.

Optimal Potential Value (Decision Cycle): Value is created by the decision maker when they make the
decision. This implies that information only has potential value. Decision makers are satisficers;
that is, decision making is based on "good enough" information. Decision makers often avoid using
systems, and make their decisions anyway. The Decision Cycle or OODA (Observe, Orient, Decide and
Act) loop can help to define optimal decisions. That is, optimal information is that information
which enables the decision maker to move inside their competitor's decision cycle.


From these information characteristics, we can derive the following high level DW Architecture
Principles:

1. Optimality: All information must be optimised for the decision maker. (Supports Value)
2. Independence: All information must be available to decision makers directly. (Supports Expense)
3. Openness: All information must be available to decision makers, except for defined risk
exceptions. (Supports Ubiquity)
4. Purpose: All information must exist to support a decision or other requirement. (Supports
Completeness)
5. Characterisation: All information must represent something in the real world. (Supports
Completeness)
6. Time Variance: All information must represent a true state at a point in time. (Supports
Timeliness)
7. Locatable: All information must be locatable. (Supports Expense)
8. Joinable: Common information must be joinable across all information. (Supports Completeness)
9. Source Quality: All source information must have high quality. (Supports Expense)
10. Reconciliation: All target information must reconcile to source and show its lineage. (Supports
Expense)
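Principle 10 can be illustrated with a minimal reconciliation check in Python (the control totals and data are invented): a replicated target must agree with its source on row counts and sums before it can be trusted.

```python
def reconcile(source, target, tolerance=1e-9):
    """True when the target value set agrees with the source on
    row count and control total (a minimal reconciliation)."""
    return (len(source) == len(target)
            and abs(sum(source) - sum(target)) <= tolerance)

source_postings = [100.0, 250.5, -40.0]   # source database
target_postings = [100.0, 250.5, -40.0]   # replicated target

balanced = reconcile(source_postings, target_postings)
broken = reconcile(source_postings, target_postings[:-1])  # a lost row

print(balanced)  # True: counts and totals agree
print(broken)    # False: the replication dropped a transaction
```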


What is DW Architecture?
DW Architecture can be defined as the art and science of designing decisionable Information.

The discussion below will focus on the economic aspects of information.

So before we can discuss Information Architecture, we need to ask what gives Information value?

What is the Economics of Information?


Information has two fundamental Economic impacts:

- it reduces uncertainty (of decision making about prices) and

- it can be sold as a commodity.

In this document, we will focus on its ability to improve decision making.

In economics, Perfect Information means that all consumers know all things, about all products, at
all times, and are therefore always able to make the best decision regarding purchase or sale.
Clearly, this is an ideal state and is unobtainable. But it does list the key characteristics of
Information.

all consumers -> ubiquity

all things -> completeness

all times -> timeliness

The fourth characteristic is expense which for perfect information would be zero or free.

This definition will help to define DW Architecture, and help derive a set of DW Architecture
principles.


This can be very costly, especially if there is endless repetition. Most of the time, we have to rely on
incomplete data, or just hunches.

What is the primary goal of DW Architecture?


Based on the Economics of Information, DW Architecture needs to enable the firm to produce
information that is

- as complete as possible

- as timely as possible

- as ubiquitous as possible and

- as inexpensive as possible

for all internal decision makers.

Similarly, DW Architecture enables the firm to produce information that is

appropriate, timed correctly, delivered correctly and priced correctly

for all external decision makers.

The first three characteristics – completeness, timeliness and ubiquity – represent the value that
information provides. Obviously, the fourth represents the cost. The fifth element, the decision
maker, draws everything together, giving it actual value and providing purpose.

Each of these characteristics will be examined in turn.

What is Information Completeness?


The information is fit for the purpose of making a decision. That is, is the purpose known, and
does it describe true information about the world?

Purpose means that the requirements and decisions are defined for the data. Meaning means the
data should describe all, and only, currently true propositions in the real world. This data about data
is called metadata. Integration is also critical, as any decision maker needs to be able to join like
with like.

Completeness means that the decision maker will be able to base decision making on this
information. Consequently, information that lacks context (i.e. adequate metadata) can be
considered incomplete.


Can you trust what you are told? What is the cost and value of checking the information?

Another quality of completeness is integration. It is important for a decision maker to be able to
compare like with like, so that the true cause of an event can be determined. This is only possible if
there is no inconsistency in how things are defined and measured. That is, the decision maker
should be able to separate out the set of causes and effects from the background without difficulty.
On the other hand, if the frame of reference is changing at the same time as everything else, then
the cause and effect will be lost in noise. For example, when dealing with competing requests from
different business units, are the KPIs measured in the same way in both business units? Or more
simply, do both business units measure headcount in the same way?

Integration often uses subject orientation as a solution pattern. An accounting transaction is the
same thing across different business units, but may look different when executed at the operational
level. The differences need to be removed so that a top down view can be created. Management
need some way to change the operational perspective to a subject area perspective in order to
properly make overarching decisions.

The bottom line of Completeness is: can the decision maker rely on the information to make a
decision, in terms of sufficiency and trustworthiness?

What does Information Timeliness mean?


This implies that the information is accessible when it is needed. If the decision cycle is monthly,
then the information provision does not need to be available every second. However for minute by
minute decisions, current information needs to be constantly updated. The value of information is
only realised if it is available within the decision cycle/window.


Why am I the last to know? Who knew before me?

All information has either an implicit or explicit time aspect. In most cases, for operational systems,
the only time value is the current time. However, for reporting systems, time needs to be explicitly
defined for all data. This is required as the reporting system often has to provide historical reports,
so they must keep track of valid past values.

The bottom line of Timeliness is: is the information available before the decision maker must make a
decision? By default, for management decisions, information should be available on a daily basis, as
most management decisions do not require a finer time period. Naturally, for operational decisions,
the information should be available on a per-minute or per-second basis.
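One common way reporting systems keep track of valid past values is to store each value with an explicit validity period (often called a slowly changing dimension). A minimal Python sketch with invented data:

```python
# Each row carries the period over which it was true, so history
# is never overwritten, only closed off.
history = [
    {"emp": "Joan", "job": "Clerk",
     "valid_from": "2016-01-01", "valid_to": "2017-06-30"},
    {"emp": "Joan", "job": "Manager",
     "valid_from": "2017-07-01", "valid_to": "9999-12-31"},
]

def as_at(history, emp, date):
    """Return the value that was true for emp on the given date
    (ISO date strings compare correctly as plain strings)."""
    for row in history:
        if row["emp"] == emp and row["valid_from"] <= date <= row["valid_to"]:
            return row["job"]
    return None

past_job = as_at(history, "Joan", "2016-12-31")     # a historical report
current_job = as_at(history, "Joan", "2018-01-01")  # the current value

print(past_job, current_job)  # Clerk Manager
```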

What does Information Ubiquity mean?


This means being everywhere at the same time, that is, omnipresent. This implies that the
information is accessible wherever it is needed. Information should also generally be available not
just to decision makers, but to all who are affected by the decisions. For example, this also includes
any employee who can verify the information, which helps completeness and trust above. Another
term for Ubiquity is Accessibility. This is the degree to which the information is available to as many
people as possible.

Risk is often used as an excuse to unduly restrict information. Security restrictions are often applied
in a stove piped manner which reflects organisational unit structure, rather than value adding
processes. This leads to data duplication, which leads to data inconsistencies, a reduction in the
potential information value, and ultimately to poor decisions.


You know the ones in the other departments. Why do they need to know? It’s our data.

One observation is that it is theoretically impossible for any single person to properly define all the
possible mappings between all users and all data, in order that the value of the information is
maximised. This is especially so, as the value of the information cannot be fully defined in advance. On
the other hand, it should be possible to map between defined risks, defined data subsets and
defined user groups. This then leads to an exclusion principle. So, by default, the information
should be available as widely as possible. However, wherever there are specific risks, then these
should restrict some information access in order to reduce the specific risk.

Our information is secure, but it has little value, because we won’t share it. A classic double-bind or
no-win situation.

This exclusion principle resolves the natural tension between the maximisation of information value
and the need to reduce information risk. Security will be applied on an exception basis, rather than
the reverse. That is, the default for all information access should be open, and only when there are
specific well defined risks should security be imposed on specific well defined sets of data.

So, for example, information such as passwords, pay rates, etc needs to be controlled, based on risk
requirements such as Data privacy, Audit and other controls. But these rules should be applied to
specific attributes and or rows, rather than to all attributes or all rows. Another example is
embargoing, so that information is restricted until publication, and it is open afterwards.


Excludability is not a natural property of information. In economics, a good or service is said to be
excludable when it is possible to prevent people who have not paid for it from enjoying its benefits,
and non-excludable when it is not possible to do so. In other words, information has a tendency to
'leak', which means that exclusion is not sustainable in the long term. This applies to information
such as Intellectual Property, etc. Eventually, this information escapes, and can be used by other
decision makers.

All information leaks. It is just a matter of time.

The implication of non-excludability is that the security profile of data decays over time until it is all
public knowledge. This process could be explicitly recognised by a security grandfathering process.
This would make it simpler to share older and no longer controversial data. For example, sometime
after a project is completed, all project documentation could be made available. This would help in
passing on potentially valuable historical lessons.
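One way to make this decay explicit is a rule that derives an effective classification from the original classification plus elapsed time. The sketch below is illustrative only: the classification names, retention periods and function are assumptions, not part of any standard.

```python
from datetime import date

# Hypothetical retention periods (in years) after which a classification
# "grandfathers" down to the next level. The values are invented for illustration.
DECAY_YEARS = {"RESTRICTED": 5, "INTERNAL": 2}
DECAY_TO = {"RESTRICTED": "INTERNAL", "INTERNAL": "OPEN"}

def effective_classification(original: str, classified_on: date, today: date) -> str:
    """Walk the classification down one level each time its retention period expires."""
    level = original
    age_years = (today - classified_on).days / 365.25
    while level in DECAY_YEARS and age_years >= DECAY_YEARS[level]:
        age_years -= DECAY_YEARS[level]
        level = DECAY_TO[level]
    return level
```

Under these invented periods, a project document classified RESTRICTED in 2010 reads as OPEN by 2018, so historical lessons become shareable without a manual review step.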

What are the Information cost drivers?


What are the main expenses incurred collecting, reproducing and using the information?

Cost of Usage
An ironic characteristic of information is that it does not exhibit a high degree of transparency. That
is, to evaluate information you have to know it, so you have to invest in learning in order to
evaluate it. To evaluate a bit of software you have to learn to use it; to evaluate a movie you have to
watch it. In other words, it is an experience good.

Usability is the ease with which people employ a particular tool to achieve a particular goal. This is
often resolved through change management and ongoing education support. It is critical to support
new users through training and assistance. This lowers an important barrier to entry for the proper
exploitation of the information for decision making.

Cost of Discovery
Findability is the quality of being locatable or navigable. Structuring data and using well
understood, common terms (aka "old words") can significantly reduce this cost. Conversely,
egregious renaming of data elements can increase the cost significantly. Similarly, egregious
replication of data can easily make the usage cost rise, due to the confusion engendered. A key
support for findability is high quality "data about data", or metadata.

Cost of Production
Information can be expensive to produce initially. Therefore, using standardisation, simplification,
consistency, common components, etc, will drive down the cost of information production. Clearly,
if the information is expensive to collect, this will reduce Completeness.

However, business information is never static. Some examples of change include decentralisation,
new services, new competitors, supply changes, outsourcing, insourcing, etc. These need to be
supported by new KPIs. And how can the impact of these changes be properly measured? The
reference data that decisions are based upon is constantly shifting, making comparisons to previous
periods very difficult or impossible. This increases uncertainty, as nobody is ever sure why something
happened when the basic parameters shift from period to period. This makes the cost of producing
sensible multi-period comparison data very high.

Another cost of production is the need to retain history. This means ensuring that the data is Time
variant and non-volatile, so that research can be done across time.

Cost of Reproduction
A related characteristic that alters information markets is that information has almost zero marginal
cost. This means that once the first copy exists, it costs nothing or almost nothing to make a second
copy. But there is a dark side to reproduction: it can create the basis for confusion by
reducing trust. This applies especially when the same information is copied at different times, and
the copies fall out of sync, even though they were initially identical.

For the reproduced information to be credible, the information must be of adequate quality and
verifiable. This means the who, what, where, when and how all need to be known as well. That is,
who was the source, who reviewed it, etc. There also needs to be confidence that the data is
reconciled back to source. If these are missing, then the data will lack credibility, and be considered
untrustworthy.


[Figure: a timeline of Bible text transmission, tracing Greek and Latin manuscript lines (Antioch 30 AD, Alexandria 100 AD, Vaticanus 300 AD, Waldensian 1000 AD, Erasmus 1500 AD, Stephens 1600 AD, Greisbach 1800 AD) through to English translations such as the King James and the Revised Version.]

Multiple versions of the Bible have been a constant source of confusion, with significant historical
impacts.

How do the cost drivers relate to Information characteristics?


The production and usage of information will generally have a positive impact on the Completeness,
Timeliness, Ubiquity and Value of information. However, it is the reproduction of information that
causes the reduction in the value of information. An increase in Ubiquity (or readership) is most
often used to justify reproduction. Similarly, Timeliness is also used to justify reproduction. But
each time information is reproduced, there is a reduction in quality, as the data is rarely
reconciled back to source. This, by definition, reduces the value of the reproduced information.

Cost\Characteristic   Production   Usage      Discovery   Reproduction

Completeness          Positive     Positive   -           Negative
Timeliness            Positive     Positive   -           Positive
Ubiquity              Positive     Positive   -           Positive
Value                 Positive     Positive   -           Negative


How does Information create Value?


Value is created by the decision maker when they make a decision. This implies that information
only has potential value.

There is always difficulty in assessing the value of some new information system. The standard
approach is as follows:

Value = Average Decision Value (with new system) - Average Decision Value (without new system).
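As a toy illustration of this formula (the dollar figures are invented), the calculation is just a difference of averages over sampled decision outcomes:

```python
# Hypothetical decision values, in dollars; the figures are invented for illustration.
with_system = [130_000, 110_000, 120_000]      # sampled decision outcomes with the new system
without_system = [105_000, 95_000, 100_000]    # sampled decision outcomes without it

def average(xs):
    return sum(xs) / len(xs)

# Value = Average Decision Value (with new system) - Average Decision Value (without new system)
value = average(with_system) - average(without_system)
print(value)  # 20000.0
```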

Even when there is a clear understanding of the information’s potential value, each decision maker
does not know what the value is until they use it. This is because Information is an Experience good.
Using information is akin to driving a car. It is better to offer regular users the capability to drive
their own car, rather than to provide a taxi service. That way, they can take the vehicle in any
direction they choose. But it does depend on knowing how to drive. There is still some value in the
taxi service, for new users, or users who have a specific well-defined request.

It is hard to explain. You have to see the movie. (2001)

Information is needed to support all decision making from the creation of the mission statement to
the support needed for a help desk. How can a firm determine the optimal information set, given a
set of decisions?

Information Optimality
Decision makers are satisficers (Herbert Simon). That is, decision making will be based on "good
enough" information, rather than optimal or complete information. They can always walk away
from the Technology and make the decision based on available information, which could be just a
hunch.

Another way to express this is Mooers Law:

"An information retrieval system will tend not to be used whenever it is more painful and
troublesome for a customer to have information than for him not to have it."

It is possible to use the OODA loop to define a form of optimality. That is, optimal information is
that information set that enables the decision maker to move inside their competitor’s decision
cycle. Clearly, this provides a means to organise and rank the set of defined decisions for
subsequent support. This leads into the Decision Cycle or OODA loop.

Decision cycle (Observe, Orient, Decide, Act OODA Loop)


[Figure: the decision cycle. The business decision maker moves through Observe (data collection), Orient (synthesis), Decide (choice) and Act, with feedback from Act flowing back into Observe.]

1. Observe: This is the collection of data by means of the senses. In Dewey terms, this would
be the formulation of the question.

2. Orient: This is the analysis and synthesis of data to form one's current mental perspective.
In Dewey terms, this would mean determining the options, the criteria, and collecting
enough data to distinguish them.

3. Decide: This is the determination of a course of action based on one's current mental
perspective. In Dewey terms, this means selecting the least worst of the options, based on
the criteria and data collected.

4. Act: This is the physical playing-out of decisions. The action provides feedback to the next
observation.
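The four steps above can be sketched as a simple loop, with each act feeding back into the next observation. This is an illustrative skeleton only; the function signatures, the cycle limit and the stopping rule are assumptions.

```python
def ooda_loop(observe, orient, decide, act, max_cycles=10):
    """Run Observe-Orient-Decide-Act until a decision is acted on or cycles run out."""
    feedback = None
    for _ in range(max_cycles):
        data = observe(feedback)      # 1. Observe: collect data, plus feedback from the last act
        perspective = orient(data)    # 2. Orient: synthesise data into a mental perspective
        choice = decide(perspective)  # 3. Decide: pick a course of action, if one is clear
        if choice is not None:
            return act(choice)        # 4. Act: play out the decision; the result is the feedback
    return None                       # no decision was reached within the cycle budget
```

For example, a decision maker whose `decide` step returns nothing keeps re-observing and re-orienting, which is exactly the OO-OO-OO pattern described below.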

According to this idea, the key to victory is to be able to create situations wherein one can make
appropriate decisions more quickly than one's opponent. The construct was originally a theory for
achieving success in air-to-air combat, developed out of Boyd's Energy-Manoeuvrability theory and
his observations on air combat between MiGs and F-86s in Korea.

Note that decision making can create the OO-OO-OO cycle. What this means is that the decision
maker has to constantly re-observe and re-orient. For example, this could be looking at some event
in Australia at a summary level. Then they look to see that it has occurred in Sydney. They drill
down further to see that it has occurred in George St. At each step, they are drilling down a
hierarchy or summary view, using data at one level that becomes the metadata of the next level.
This can continue without limit until the decision maker has found sufficient detail to decide what to
do.
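The drill-down behaviour can be sketched over a toy hierarchy, where each level's members act as the metadata for the next. The places and event counts below are invented.

```python
# Invented event counts, rolled up by a location hierarchy.
events = {
    "Australia": {
        "Sydney": {"George St": 7, "Pitt St": 2},
        "Melbourne": {"Collins St": 1},
    }
}

def drill_down(node, path):
    """Follow a path of hierarchy members down to a summary level or a detail value."""
    for level in path:
        node = node[level]
    return node

print(drill_down(events, ["Australia"]))                         # summary: Sydney vs Melbourne
print(drill_down(events, ["Australia", "Sydney", "George St"]))  # detail: 7
```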

Part of the OO-OO-OO cycle is the ability of the decision maker to discover new things about the
data, as they do research. This implies that they will need to be able to create new metadata as new
discoveries emerge. This means that they are using the current metadata, and also adding to it as
they go down the learning curve. Any decision making tool must be able to support both the display
and the collection of new metadata.

Another key idea is that the OODA cycle starts with Observation. What this implies is that the
situation could be either new or previously defined. If it is previously defined, then the decision is
known and all the decision maker needs is to drill down to the detail until the decision becomes
obvious. However, if the situation is new, then the decision maker is in research mode. By
definition, they cannot define what they need. Instead, they need a research tool to be able to
explore all available data until the reasons for the current state are known. This is why reporting
systems often cannot be fully specified, unlike operational systems. The requirements are
fundamentally undefinable, as most people lack the gift of prophecy.

Decision Types
How can we classify decisions?

The two OODA dimensions of Observation (cause) and Orientation (effect) give four decision types:

 Defined Observation, Defined Orientation: a known decision type.

 Undefined Observation, Defined Orientation: establish the link between the new cause and a
known effect.

 Defined Observation, Undefined Orientation: establish the link between the known cause and a
new effect.

 Undefined Observation, Undefined Orientation: research cause and effect until they are
understood and a correct response is defined.

This classification can be used to determine DW architecture choices.


What is Technology (IT)?


Nicholas Carr defines IT to mean "all the technology, both software and hardware, used to store,
process and transport information in digital form." Note that this does not encompass the
information itself that uses the technology. Information will always form the basis of business
advantage. But the T part of IT itself is a commodity, which will inevitably be supported outside the
firm as an infrastructural technology.

What is the basis of competitive advantage?


Scarcity, not ubiquity, is the source of competitive advantage. But the core capabilities of IT are
available to all firms. Is it not enough for IT to deliver better service and reduce cost, etc?
Distinctiveness is what determines a company's profitability, and ensures its survival. If all
companies are using the same IT to do this, then they can only compete on price. Any increases in
productivity that commodity resources produce will in time be competed away. This means that IT
has changed from a source of advantage to a cost of doing business. Most companies will need to
focus on mitigating risk rather than pursuing innovation, and on reducing costs rather than making
new investments.

Almost as a corollary, scarce information is still only a potential advantage. The advantage can only
be gained if it is acted on.

Syndrome from Incredibles: “In a world where everybody is special, nobody is special.”

Network Effect
In economics and business, a network effect is the effect that one user of a good or service has on
the value of that product to other users. The classic example is the telephone. The more people
own telephones, the more valuable the telephone is to each owner. This creates a positive
externality because a user may purchase their phone without intending to create value for other
users, but does so in any case. The expression "network effect" is applied most commonly to
positive network externalities as in the case of the telephone. Negative network externalities can
also occur, where more users make a product less valuable, but are more commonly referred to as
"congestion" (as in traffic congestion or network congestion). Note that this is an economics idea,
not a technology concept. It is often enabled by the introduction of a new technology. Metcalfe's
law states that the value of a telecommunications network is proportional to the square of the
number of users of the system. The real value appears to be closer to n log(n), but the idea is still valid.
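The two estimates diverge quickly as the network grows; the sketch below simply evaluates both formulas side by side.

```python
import math

def metcalfe_value(n):
    """Metcalfe's law: network value grows with the square of the number of users."""
    return n * n

def nlogn_value(n):
    """The more conservative n*log(n) estimate mentioned in the text."""
    return n * math.log(n)

# Compare the two growth estimates at a few network sizes.
for n in (10, 1_000, 1_000_000):
    print(n, metcalfe_value(n), round(nlogn_value(n)))
```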

Wouldn’t it be great to be a member of the “Pudding Owners Collective”, so that you can own a
pudding that never runs out?

Network Effect Examples


The IT industry has many examples of the Network Effect. These can be either software or hardware
based, or both. See some examples below.

 Externally published standards. (e.g. This enables reuse of commodity software)

 Data warehousing Hub and spoke design.

 Messaging hub.

 Standard Office Suite.

 Common Operating System.

 Company wide Dictionary/Thesaurus.

 Workflow management hub.

Hub & Spoke Contents

[Figure: Point to Point (P2P) connections, where each source feeds each report directly, contrasted with a Hub and Spoke design, where sources feed a central hub that in turn serves the reports.]

The choice between the Point to Point (P2P) and Hub and Spoke patterns occurs often.

And Hub is not always the answer.


What should be clear is that success is not easy. There are many examples where the innovating
firm was trying to exploit the Network Effect, but they only achieved partial success. Many issues
prevented full success, including inadequate standards, vendor lock in, set up issues, etc. See
Shapiro and Varian for more details.


What is IT Architecture?
TO BE
The “TO BE” IT Architecture state is that ideal IT that enables all decision makers to make better,
faster, cheaper decisions than their competitors. A corollary of the Carr thesis is that the “TO BE” IT
Architecture will see the uncoupling of the Technology Architecture (commodity) from the
Information Architecture (value add), in order to deliver lowest overall cost. So, we have a very high
level picture of this end state. IT is separated into big I little t inside the firm and little I Big T outside
the firm.

AS IS
The “AS IS” is like a Big Ball of Mud. See the Foote and Yoder article which gives a good explanation
for how the current mess comes about. In general, this is the accumulated outcome of a series of
unintegrated implicit and explicit decisions.

Often, Vendors are not ready for the future either. Product offerings are poorly thought out,
and often mix Information and Technology together. That is, they sometimes attempt to sell
services to a firm that cover both what is generic and what is particular to the firm. By not separating
the two parts, the Vendor becomes too involved in what the firm is doing.

Situation Normal: Another Urban Favela

But what is the first thing that actors in a disaster movie like the Titanic reach for? The blueprints, of
course. Even though it is a sprawl, the current state Architecture does need to be documented.

So how does a firm get to the ideal end state?

Role of IT Architect
This is where IA rubber hits the technology road. Architects should facilitate the movement from
current (AS IS) to end (TO BE) state. They do this by:

1. Helping the firm fully exploit Information for decision making.


2. Helping the firm separate the Information component from the Technology. This means
identifying and promoting the outsourcing of Infrastructural Technology, while retaining proprietary
Information.

3. Helping the firm discover and develop network effects.

This will help the business to separate what is unique (scarce) from what is generic (commodity).
Restated, this is the separation of the what (insource) from the how (outsource).

Are you part of my Architecture?

How soon will the firm reach the end state? There are clearly a series of intermediate states,
as the Technology industry is still evolving.

Also note that the other types of IT Architecture have this dual Information and Technology
Architecture split.

How will the DW Architecture principles be used?


The intent of the DW Architecture principles below is to help Data Architects and others to assess
whether the solution is moving towards the desired end state as defined above.

The following principles set out the characteristics of an ideal state. It is not expected that each new
system implemented will comply fully with all principles. However, if they do not, then the
exceptions should be documented and rationalised. Any new system must appreciably reduce the
level of architectural non-compliance, even if the new system does not tick all boxes.

Note that the goal of this DW Architecture is not necessarily to collect up all the variety of databases
into one large system. It is more to provide a utility/environment whereby all key decisions are
based on information that is available to decision makers, in accordance with the principles. An
inventory of all current databases/systems will help in defining the current information needed to
support this, and is useful as a kind of requirements catalogue.


What are the High Level Principles?

All information must be optimised for the decision maker.


Only the decision maker can determine when the information is optimal. The information must be
optimised from end to end for decision making. That is, optimality is defined by the completeness
and timeliness of the information, from the decision maker's point of view. If, for any reason, this
expectation cannot be met, then the reason needs to be addressed. Common issues arise when
there is poor design, or artificial boundaries imposed by vendors. There will be a natural tension
with Service Provision here. For example, Vendor packages must provide a means of publishing their
structure, and providing open interfaces into the data. Similarly, Vendor service providers cannot
slow down information provision for their business reasons. Another consequence of this is that
more complete information is available as events occur. There is much less massaging (delaying) of
data as some situation crosses a threshold, requiring other decision makers to be involved.

All information must be available to the decision maker directly.


The fundamental principle for all decision makers is independence. All decision makers should be
able to access the information without any intermediary. This implies that access to the information
is usable, and that sufficient support is provided. This also implies that sophisticated decision
makers will be able to directly control the data.

All information must be available to all decision makers except for defined
risk exceptions.
The fundamental principle for almost all information is openness. This is based on the fundamental
assumption of trust that is made between all decision makers within a firm. All decision makers
must be able to view all non-confidential data. This ensures the maximisation of potential
information value. All access restrictions need to be justified on specific risk cases. Valid examples
include passwords, financial data, customer privacy, salary, etc. This implies that there is risk
traceability. All tables, columns and row sets should have a risk allocation, where the default is
OPEN. Similarly, risk sets can be grouped into roles. To support this, all decision makers must be
authenticable, meaning that the decision maker has been identified before access to data. All
roles must be authorisable, so that the decision maker's role has been checked before access to data.
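A minimal sketch of this default-OPEN rule with risk-based exceptions; the table names, columns, roles and risk allocations below are invented for illustration.

```python
# Hypothetical risk allocations: any (table, column) pair not listed here defaults to OPEN.
RISK_ALLOCATION = {
    ("staff", "salary"): {"payroll_officer"},
    ("customer", "date_of_birth"): {"privacy_officer"},
}

def can_access(user_roles, table, column):
    """Default is OPEN; access is restricted only where a specific risk is recorded."""
    allowed_roles = RISK_ALLOCATION.get((table, column))
    if allowed_roles is None:
        return True  # no defined risk: open to all authenticated decision makers
    return bool(set(user_roles) & allowed_roles)
```

So an analyst can read staff names, because no risk is recorded against them, but cannot read salaries unless they hold the payroll role.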

All information must exist to support a decision or other requirement.


The fundamental principle for all information is requirements traceability. All data functions should
be linked to either external requirements, such as Legal and Regulatory controls, or to internal
requirements, including the set of decisions supported. Practically, these requirements are linked to
artefacts, rather than to each specific instance or table row. However, they should be linkable to
specific instances if needed.

All information must represent something in the real world.


The Closed World Assumption states "All tuples should describe all, and only, currently true
propositions in the real world." This helps to guide what is definable in a system. It also implies
properties such as persistence, uniqueness, identity, etc. This also assumes that all information is
defined, which means that a dictionary or metadata registry is needed. These definitions form the
common set of shared concepts that the decision makers require as context. A subset of these
definitions would also be the business measures used to determine performance, etc.


All information must represent a true state at a point in time.


The Closed World Assumption Over Time states "All tuples should describe all, and only, true
propositions in the real world over all points in time." This is separate from current view, as it
ensures the time dimension is explicitly satisfied. There are three useful patterns for making data
consistent with this principle. Firstly, Time Variance means that all tuples that represent slowly
changing data will be captured incrementally. That is, the history of updates is captured. For
example, when a customer's address changes, a new row is added, so that the old address is retained,
rather than being overwritten. Secondly, Non-volatility means that all tuples that represent slowly
changing data will be preserved for the period that they were valid. That is, previously valid values
will not be changed, so that past periods can be correctly analysed. So the old address record above
is valid between its start and end dates. Thirdly, Snapshot means that all tuples that represent event
type data will be captured in full, and retained. For example, an event could be a financial posting, a
phone call, etc.
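Together, Time Variance and Non-volatility amount to what dimensional designs often call a Type 2 style update: close the current row and append a new one, rather than overwriting. A minimal sketch, with invented field names and dates:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional "open-ended" end date for the current row

def update_address(rows, customer_id, new_address, effective: date):
    """Time-variant, non-volatile update: close the current row, append a new one."""
    for row in rows:
        if row["customer_id"] == customer_id and row["end_date"] == HIGH_DATE:
            row["end_date"] = effective  # the old value stays valid for its own period
    rows.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": effective,
        "end_date": HIGH_DATE,
    })

history = [{"customer_id": 1, "address": "Pitt St",
            "start_date": date(2015, 1, 1), "end_date": HIGH_DATE}]
update_address(history, 1, "George St", date(2018, 2, 1))
# history now holds both addresses, each with the period for which it was valid
```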

All information must be locatable.


The fundamental principle for all information is findability and searchability. For example, can a
decision maker easily find a list of all databases, tables and columns that they have access to? This
implies some way to view structural metadata (table, column). In addition, can the decision maker
search across all data using a Google-like interface?
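A Google-like search over structural metadata can be as simple as a substring match across a catalogue of database, table and column names. The catalogue below is invented for illustration.

```python
# Invented structural metadata catalogue: (database, table, column) triples.
CATALOG = [
    ("sales_dw", "dim_customer", "customer_name"),
    ("sales_dw", "fact_order", "order_amount"),
    ("hr_dw", "dim_staff", "staff_name"),
]

def search_metadata(term):
    """Return every catalogue entry whose database, table or column mentions the term."""
    term = term.lower()
    return [entry for entry in CATALOG if any(term in part.lower() for part in entry)]

print(search_metadata("customer"))  # [('sales_dw', 'dim_customer', 'customer_name')]
```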

Common information must be joinable across all information.


The fundamental principle for all information is integration. That is, there should not be any
technical, design or other barriers to joining across multiple data stores. This may not be possible in
the medium term, but it should remain a long term goal. One solution pattern for this is called
Subject Orientation, which is the extraction, transformation and loading of data from multiple
source databases into a single target database that can be queried against. The transformations
convert semantically identical data into syntactically identical data. Part of the cleansing is the
allocation of unique identifiers. These common data sets are sometimes called Dimensions. Examples
include Time, Party, Product, Location, Staff, Organisation Unit and Account. Another solution
pattern is Master Data Management where common reference data is placed into a centrally
defined repository, which is then used to ensure that the duplicated data is synchronised.
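At its core, the "semantically identical to syntactically identical" step is a mapping applied during load. The source systems, codes and conformed values below are assumptions for illustration.

```python
# Hypothetical source systems that encode the same gender concept differently.
GENDER_MAP = {
    "crm": {"M": "MALE", "F": "FEMALE"},
    "billing": {"1": "MALE", "2": "FEMALE"},
}

def conform_gender(source_system, raw_value):
    """Transform semantically identical data into a single syntactic form."""
    return GENDER_MAP[source_system].get(raw_value, "UNKNOWN")

rows = [("crm", "M"), ("billing", "2"), ("billing", "9")]
print([conform_gender(s, v) for s, v in rows])  # ['MALE', 'FEMALE', 'UNKNOWN']
```

The same pattern generalises to any conformed dimension: a per-source mapping to shared values and unique identifiers.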

All source information must have high quality.


The fundamental principle is that the primary source is the best place to increase data value and
reduce data cost. There are some useful solutions for making data consistent with this principle.
Firstly, all data must be updatable in the primary source database. Note that this is a tautology.
That is, if there is
no mechanism to update the primary source, then it cannot be the primary source. Secondly, there
should not be any duplicate data in the primary source, in order to avoid the update anomaly. The
best approach to eliminating this is to ensure data is in third normal form (3NF). This also assumes
that 3NF rules are enforced, at database level. Next, as far as possible, there should not be nullable
(that is, blank) data. Next, all data should be cleansed in the primary source, rather than in a copy.
This prevents inconsistent data propagation. The primary source is sometimes called the System of
Record. Finally, the source system must provide a means to expose its data so that all decision
makers can create views over data. This should be provided at zero or near zero marginal cost by
the source system. Part of this is ensuring the data is highly available.
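The update anomaly that motivates the 3NF rule is easy to demonstrate: when the same fact is stored in two rows, updating one leaves the other stale. The rows below are invented.

```python
# Denormalised order rows: the customer's address is repeated on every order.
orders = [
    {"order_id": 1, "customer": "Acme", "customer_address": "Pitt St"},
    {"order_id": 2, "customer": "Acme", "customer_address": "Pitt St"},
]

# A careless update touches only one row...
orders[0]["customer_address"] = "George St"

# ...and the duplicated fact is now inconsistent: the classic update anomaly.
addresses = {row["customer_address"] for row in orders if row["customer"] == "Acme"}
print(addresses)  # two conflicting addresses for one customer

# In third normal form the address lives once, on the customer, so this cannot happen.
customers = {"Acme": {"address": "George St"}}
```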


All target information must reconcile to source and show its lineage.
The fundamental issue is that every data replication increases data cost, and reduces data value.
There are two ways to negate this, either avoid copying (one source) or provide a means to ensure
the target data is identical to the source data. Negating value reduction is costly, and the cost must
be borne by the target data store. The target store should reconcile to source, so that when the
source changes, the change automatically appears in the target. The target store should also show
its lineage from source, including any transformations. Views are fine, as these automatically satisfy
this principle. In other words, mappings and traceability are crucial.
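A minimal reconciliation check compares row counts and a control total between source and target. The row shapes and the column name are assumptions for illustration.

```python
def reconcile(source_rows, target_rows, amount_key="amount"):
    """Return True when row counts and control totals match between source and target."""
    counts_match = len(source_rows) == len(target_rows)
    totals_match = (sum(r[amount_key] for r in source_rows)
                    == sum(r[amount_key] for r in target_rows))
    return counts_match and totals_match

source = [{"amount": 100}, {"amount": 250}]
target = [{"amount": 100}, {"amount": 250}]
print(reconcile(source, target))  # True
```

In practice this check would run after every load, so that any drift between source and target surfaces immediately rather than eroding trust over time.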
