
For anything you might want to do, understanding the problem and using the right tools is essential. The methodologies and best practices that inevitably arise become the catalyst for innovation and superior accomplishments. Database systems, particularly data warehouse systems, are no exception; yet do the best data modeling methodologies of the past offer the best solution today?
Big Data, arguably a very hot topic, will clearly play a significant part in the future of business intelligence solutions. Frankly, relying solely upon Inmon's relational 3NF and Kimball's STAR schema strategies simply no longer works. Knowing which data modeling methodology is best, and how to use it, is a key design priority and has become critical to successful implementations. Persisting with outdated data modeling methodologies is like putting wagon wheels on a Ferrari.

See how Talend helped Domino's integrate data from 85,000 sources.
Today, virtually all businesses make money using the Internet. Harvesting the data they create
in an efficient way and making sense of it has become a considerable IT challenge. One can
easily debate the pros and cons involved in the data modeling methodologies of the past, but
that will not be the focus of this blog. Instead, let's talk about something relatively new that
offers a way to easily craft adaptable, sensible data models that energize your data warehouse:
The Data Vault!

Enterprise Data Warehouse (EDW) systems aim to provide true Business Intelligence (BI) for the
data-driven enterprise. Companies must address the critical metrics ingrained in this vital, vibrant
data. Providing an essential data integration process that ultimately supports a variety of
reporting requirements is a key goal of these Enterprise Data Warehouse systems, and building
them involves significant design, development, administration, and operational effort. When
upstream business systems, structures, or rules change, fail to provide consistent data, or
require new systems integration solutions, the reengineering involved presents us with
problem #1: The one constant is change; so how well can an EDW/BI solution adapt?
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the
one that is the most adaptable to change." Charles Darwin
Consumption and analysis of business data by diverse user communities has become critical
to maintaining a competitive edge, yet today's technological realities often require highly
trained end users. Capturing, processing, transforming, cleansing, and reporting on this data
may be understandable, but in most cases the sheer volume of data can be overwhelming. Yup,
problem #2: Really Big Data; often characterized as Volume, Velocity, Variety, Variability,
Veracity, Visualization, and Value!
Crafting effective and efficient EDW/BI systems, simplified for usability and reporting on this
data, quickly becomes a daunting and often difficult technical ordeal even for veteran
engineering teams. Several integrated technologies are required, from database systems, data
processing (ETL) tools like Talend, various programming languages, administration, reporting,
and interactive graphics software to high-performance networks and powerful computers
with very large storage capacities. The design, creation, delivery, and support of robust,
effortless EDW/BI systems for simplified, intelligent use are, you guessed it, problem #3:
Complexity!
Often we see comprehensive and elegant solutions delivered to business users that fail to
address the true needs of the business. We're told that's just the way it is due to technical
requirements (limitations; wink, wink) and/or design parameters (lack of features; nudge,
nudge). Hence, problem #4: The Business Domain; fit the data to meet the needs of the
business, not the other way around!
Furthermore, as upstream systems change (and they will), as EDW/BI technology plows ahead
(and it must), and as the dynamic complexities involved prevail (relentlessly), every so often new
data sources need to be added to the mix. These are usually unpredicted and unplanned for.
The integration impact can be enormous, often requiring complete regeneration of the
aggregated data; hence, problem #5: Flexibility, or the lack thereof!

So how do we solve these problems? Well …


Bill Inmon, widely regarded as the father of data warehousing, defines a data warehouse as:
“A subject oriented, nonvolatile, time-variant collection of data in support of management’s
decisions”
(http://en.wikipedia.org/wiki/Bill_Inmon)

Ralph Kimball (http://en.wikipedia.org/wiki/Ralph_Kimball), a pioneering data warehousing
architect, developed the "dimensional modeling" methodology now regarded as the de facto
standard in the area of decision support. The Dimensional Model (called a "star schema")
differs from Inmon's "normalized modeling" (sometimes called a "snowflake schema")
methodology. In Kimball's Star Schema, transactional data is partitioned into aggregated
"facts" with referential "dimensions" surrounding and providing descriptors that define the
facts. The Normalized Model (3NF, or "third normal form") stores data in related "tables"
following the relational database design rules established by E. F. Codd and Raymond F.
Boyce in the early 1970s that eliminate data redundancy. While this fosters vigorous debate
amongst EDW/BI architects as to which methodology is best, both have weaknesses when
dealing with the inevitable changes in the systems feeding the data warehouse and with
cleansing data to conform to strict methodology requirements.
Further, the OLAP cube (for "online analytical processing") is a data structure that allows fast
analysis of data from multiple perspectives. The cube structure is created from either a Star or
Snowflake Schema stored as metadata, from which one can view or "pivot" the data in various
ways. Generally, cubes have one time-based dimension that supports a historical representation
of the data. Creating OLAP cubes can be very expensive, and they often contain a significant
amount of data that is of little or no use. The 80/20 rule appears in many cases to hold true
(where only 20% of the OLAP cube data proves useful), which begs the question: built upon a
traditional architecture, does an OLAP cube truly deliver sufficient ROI? Often, the answer is a
resounding NO! Durable EDW/BI systems must deliver real value.

Learn how Talend helped Tipico transform oceans of data into cutting edge
business intelligence.

A Fresh Approach
The Data Vault is a hybrid data modeling methodology that provides historical data
representation from multiple sources and is designed to be resilient to environmental change.
Originally conceived in 1990 and released in 2000 as a public domain modeling methodology,
Dan Linstedt, its creator, describes a resulting Data Vault database as:
“A detail oriented, historical tracking and uniquely linked set of normalized tables that support
one or more functional areas of business. It is a hybrid approach encompassing the best of
breed between 3NF and Star Schemas. The design is flexible, scalable, consistent and adaptable
to the needs of the enterprise.”
(http://en.wikipedia.org/wiki/Data_Vault_Modeling)
Focused on the business process, the Data Vault, as a data integration architecture, has robust
standards and definitional methods that unite information in order to make sense of it. The
Data Vault model is comprised of three basic table types:
HUB: contains a list of unique business keys, each with its own surrogate key. Metadata
describing the origin of the business key, or record 'source', is also stored to track where and
when the data originated.
LNK (Link): establishes relationships between business keys (typically Hubs, but Links can link
to other Links), essentially describing a many-to-many relationship. Links are often used to deal
with changes in data granularity, reducing the impact of adding a new business key to a linked
Hub.
SAT (Satellite): holds descriptive attributes that can change over time (similar to a Kimball Type
2 slowly changing dimension). Where Hubs and Links form the structure of the data model,
Satellites contain temporal and descriptive attributes, including metadata linking them to their
parent Hub or Link tables. Metadata attributes within a Satellite table recording the date a
record became valid and the date it expired provide powerful historical capabilities, enabling
queries that can go 'back in time'.
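
To make these three table types concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (hub_customer, link_customer_order, sat_customer, and so on) are illustrative assumptions rather than a prescribed standard, but they show the typical shape: business key plus surrogate key and load metadata in the Hub, key references in the Link, and time-stamped descriptive attributes in the Satellite, ending with a simple 'back-in-time' lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# HUB: one row per unique business key, plus a surrogate key and load metadata.
conn.execute("""
    CREATE TABLE hub_customer (
        customer_hkey   TEXT PRIMARY KEY,   -- surrogate key
        customer_number TEXT UNIQUE,        -- the business key
        load_date       TEXT,
        record_source   TEXT
    )""")

# LINK: a many-to-many association between business keys (references to Hubs).
conn.execute("""
    CREATE TABLE link_customer_order (
        customer_order_hkey TEXT PRIMARY KEY,
        customer_hkey       TEXT REFERENCES hub_customer(customer_hkey),
        order_hkey          TEXT,           -- would reference a hub_order table
        load_date           TEXT,
        record_source       TEXT
    )""")

# SATELLITE: descriptive attributes that change over time, attached to a Hub.
conn.execute("""
    CREATE TABLE sat_customer (
        customer_hkey TEXT REFERENCES hub_customer(customer_hkey),
        load_date     TEXT,                 -- date the record became valid
        load_end_date TEXT,                 -- date it expired (NULL = current)
        name          TEXT,
        email         TEXT,
        record_source TEXT,
        PRIMARY KEY (customer_hkey, load_date)
    )""")

# A 'back-in-time' query: the satellite row that was valid on a given date.
point_in_time = "2021-06-30"
row = conn.execute("""
    SELECT name, email FROM sat_customer
    WHERE customer_hkey = ?
      AND load_date <= ?
      AND (load_end_date IS NULL OR load_end_date > ?)""",
    ("some-hub-key", point_in_time, point_in_time)).fetchone()
```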
There are several key advantages to the Data Vault approach:

- Simplifies the data ingestion process
- Removes the cleansing requirement of a Star Schema
- Instantly provides auditability for HIPAA and other regulations
- Puts the focus on the real problem instead of programming around it
- Easily allows for the addition of new data sources without disruption to existing schema

Simply put, the Data Vault is both a data modeling technique and methodology which
accommodates historical data, auditing, and tracking of data.
“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework”
Bill Inmon

Adaptable
Through the separation of business keys (as they are generally static) and the associations
between them from their descriptive attributes, a Data Vault confronts the problem of change
in the environment. Using these keys as the structural backbone of the data warehouse, all
related data can be organized around them. These Hubs (business keys), Links (associations),
and Satellites (descriptive attributes) support a highly adaptable data structure while
maintaining a high degree of data integrity. Dan Linstedt often correlates the Data Vault to a
simplistic view of the brain, where neurons are associated with Hubs and Satellites and
dendrites are Links (vectors of information); some Links are like synapses (vectors in the
opposite direction). They can be created or dropped on the fly as business relationships change,
automatically morphing the data model as needed without impact to the existing data
structures. Problem #1 Solved!
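
As a rough illustration of that adaptability, the sketch below (reusing the hypothetical customer tables from the earlier example) shows how a newly discovered business relationship is absorbed simply by creating a new Link table; nothing that already exists is altered or reloaded.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical existing Data Vault

# A new relationship between customers and stores appears: add a Link only.
conn.execute("""
    CREATE TABLE IF NOT EXISTS link_customer_store (
        customer_store_hkey TEXT PRIMARY KEY,
        customer_hkey       TEXT,   -- points at the existing hub_customer
        store_hkey          TEXT,   -- points at the existing hub_store
        load_date           TEXT,
        record_source       TEXT
    )""")
# No ALTER TABLE on any Hub or Satellite; existing loads and queries keep working.
```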

Big Data

Data Vault 2.0 arrived on the scene in 2013 and incorporates seamless
integration of Big Data technologies along with methodology, architecture, and best practice
implementations. Through this adoption, very large amounts of data can easily be incorporated
into a Data Vault designed to store data using products like Hadoop, Infobright, MongoDB, and
many other NoSQL options. By eliminating the cleansing requirements of a Star Schema design,
the Data Vault excels when dealing with huge data sets, decreasing ingestion times and enabling
the parallel insertions that leverage the power of Big Data systems. Problem #2 Solved!
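
One way to picture why this helps at scale: Data Vault 2.0 favors surrogate keys derived by hashing the business keys, so Hub, Link, and Satellite loads no longer wait on each other to look up sequence-generated keys and can run in parallel. The sketch below is a simplified assumption of that pattern; the hash choice (MD5) and the load functions are illustrative, not prescriptive.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_key(*business_keys: str) -> str:
    """Deterministic surrogate key derived only from the business key(s)."""
    return hashlib.md5("||".join(business_keys).upper().encode()).hexdigest()

source_rows = [
    {"customer_number": "C-1001", "order_id": "O-5001"},
    {"customer_number": "C-1002", "order_id": "O-5002"},
]

def load_hub_customer(rows):
    return [{"customer_hkey": hash_key(r["customer_number"]),
             "customer_number": r["customer_number"]} for r in rows]

def load_hub_order(rows):
    return [{"order_hkey": hash_key(r["order_id"]),
             "order_id": r["order_id"]} for r in rows]

def load_link_customer_order(rows):
    # The link key is computable without waiting for either hub load to finish.
    return [{"link_hkey": hash_key(r["customer_number"], r["order_id"]),
             "customer_hkey": hash_key(r["customer_number"]),
             "order_hkey": hash_key(r["order_id"])} for r in rows]

# Every key is derived from the data itself, so all three loads can run
# against the same batch of source rows at the same time.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(f, source_rows)
               for f in (load_hub_customer, load_hub_order, load_link_customer_order)]
    hub_customers, hub_orders, links = [f.result() for f in futures]
```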

Simplification
Crafting an effective and efficient Data Vault model can be done quickly once you understand
the basics of the three table types: Hub, Satellite, and Link! Identifying the business keys first and
defining the Hubs is always the best place to start. From there, Hub-Satellites represent source
table columns that can change, and finally Links tie it all together. Remember that it is also
possible to have Link-Satellite tables too. Once you've got these concepts, it's easy. After you've
completed your Data Vault model, the next common step is to build the ETL data integration
process to populate it. While a Data Vault data model is not limited to EDW/BI solutions,
anytime you need to get data out of some data source and into some target, a data integration
process is generally required. Talend's mission is to connect the data-driven enterprise.

With its suite of integration software, Talend simplifies the development process, reduces the
learning curve, and decreases total cost of ownership with a unified, open, and predictable ETL
platform. A proven ETL technology, Talend can certainly be used to populate and maintain a
robust EDW/BI system built upon a Data Vault data model. Problem #3 Solved!
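
As a rough sketch of the kind of logic such an ETL process implements for a Hub, whether hand-written or generated by a tool like Talend: take the distinct business keys from a source extract and insert only the ones the Hub has not seen before, stamped with a load date and record source. All names below are hypothetical.

```python
from datetime import datetime, timezone

existing_hub_keys = {"C-1001"}     # business keys already in hub_customer
source_extract = [
    {"customer_number": "C-1001", "name": "Ada"},
    {"customer_number": "C-1003", "name": "Grace"},
]

# Insert-only pattern: new business keys get a load date and record source;
# keys already present in the Hub are left untouched.
new_hub_rows = [
    {
        "customer_number": key,
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": "CRM.customers",
    }
    for key in {r["customer_number"] for r in source_extract}
    if key not in existing_hub_keys
]
# new_hub_rows now holds only C-1003.
```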

Your Business
The Data Vault essentially defines the ontology of an enterprise in that it describes the business
domain and the relationships within it. Processing business rules must occur before populating a
Star Schema; with a Data Vault you can push them downstream, post EDW ingestion. An
additional Data Vault philosophy is that all data is relevant, even if it is wrong. Dan Linstedt
suggests that data being wrong is a business problem, not a technical one. I agree! An EDW is
really not the right place to fix (cleanse) bad data. The simple premise of the Data Vault is to
ingest 100% of the source data 100% of the time: good, bad, or ugly. Highly relevant in today's
world, auditability and traceability of all the data in the data warehouse thus become a standard
requirement. This data model is architected specifically to meet the needs of today's EDW/BI
systems. Problem #4 Solved!
“To understand the Data Vault is to understand the business”
(http://danlinstedt.com)
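
A small sketch of that premise, with hypothetical names throughout: every source row, good or bad, lands in the vault stamped with its record source and load timestamp, and a business rule (here, a made-up "email must contain @" check) is applied only downstream when shaping data for reporting, so the raw, auditable history is never altered.

```python
from datetime import datetime, timezone

source_rows = [
    {"customer_number": "C-1001", "email": "ada@example.com"},
    {"customer_number": "C-1002", "email": "not-an-email"},   # "bad" data
]

# Raw, auditable satellite load: nothing is filtered or cleansed here.
sat_customer = [
    {**row,
     "load_date": datetime.now(timezone.utc).isoformat(),
     "record_source": "webshop.customers"}
    for row in source_rows
]

# Downstream, post-ingestion: business rules are applied when presenting data
# (for example, when building a reporting mart), not when loading the vault.
reporting_view = [r for r in sat_customer if "@" in r["email"]]
```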

Flexible
The Data Vault methodology is based on SEI/CMMI Level 5 best practices and includes many of
its components, combining them with best practices from Six Sigma, TQM, and Agile SDLC.
Data Vault projects have short, controlled release cycles and can consist of a production release
every two or three weeks, automatically adopting the repeatable, consistent, and measurable
project practices expected at CMMI Level 5. When new data sources need to be added, similar
business keys are likely; new Hubs, Satellites, and Links can be added and then further linked to
the existing Data Vault structures without any change to the existing data model. Problem #5
Solved!

Conclusion
In conclusion, the Data Vault modeling and methodology addresses the elements of the
problems we identified above:
- It adapts to a changing business environment
- It supports very large data sets
- It simplifies the EDW/BI design complexities
- It increases usability by business users because it is modeled after the business domain
- It allows for new data sources to be added without impacting the existing design
This technological advancement is already proving to be highly effective and efficient. Easy to
design, build, populate, and change, the Data Vault is a clear winner. Very Cool! Do you want
one?
Visit http://learndatavault.com or http://www.keyldv.com/lms for much more on Data Vault
modeling and methodology.

While you're at it, download a free trial of Talend Cloud Integration Platform to see what your
data can really do.
