IDERA Creating A Data Fabric For Analytical Data
May/2021
Sponsored By:
ER/Studio
Table of Contents
Introduction
What is a Data Fabric?
Data Fabric Processes
Implementing the Data Fabric
Wrap-Up
Introduction
If 2020 – the year of the pandemic – taught us anything, it was that
change happens and it happens suddenly, unexpectedly, and with
significant impact on every aspect of our world. 2020 also taught us that,
to counter these seismic changes, our enterprises must be agile. They
must be able to change their entire set of objectives and goals, along
with supporting operating processes and decision-making capabilities,
almost overnight.
For decision-making specifically, agility comes from the ability of
business users to easily find and use analytical data and analytical assets
for the many decisions – strategic and tactical – that must be made
every day. Unfettered access to analytical data comes in the shape of
a data fabric. Data fabric is a new term for an old idea – an all-
encompassing analytics architecture that is easily understood and
accessed, easily implemented and maintained, and that ensures all
analytic data can flow easily throughout the entire enterprise.
A tall order, but thankfully, with advances in data management,
storage, analysis, and access technologies, the creation of a data fabric
is not only possible but mandatory for enterprises in today's
post-pandemic marketplaces.
This document begins by defining what a data fabric is, along with its
benefits. The Extended Analytics Architecture (XAA) is used to illustrate
the analytical components within the fabric, followed by a diagram of
how data must flow (either physically or virtually) from one component
to another seamlessly, thus supporting access to and the creation of the
ultimate analytical assets.
The next section walks the reader through the set of processes or
capabilities needed to give the user a sane and rational way to access
the environment, understand and create needed analytical assets, and
quickly make decisions with confidence. Within each process, the
technological functionality needed to create and maintain a fully
functioning data fabric environment is discussed.
The last section of the paper discusses challenges to think about when
implementing a data fabric for all analytical needs. Using the proper set
of technologies can mitigate, if not eliminate, many of these challenges.
To understand the XAA and the Data Fabric in general, here are brief
descriptions for each component, starting from the bottom:
1. Operational systems – the internal applications that run the day-to-
day operations of the enterprise. Within these systems are callable BI
Services (Embedded BI) generally called from the EDW. Examples
include fraud detection, location-based offers, and contact center
optimizations.
2. Real-time analysis platform – the first analytical component, which
analyzes streams of data (transactions, IoT streams, etc.) coming into
the enterprise in real time. Examples include traffic flow optimizations,
web event analyses, and correlations of otherwise unrelated data streams
(e.g., weather effects on campaigns).
3. Other internal and external structured and multi-structured data –
sources of data that are not in the normal streams for this architecture
and include IoT data, social media data, and purchased data.
4. Data integration platform – the process of extracting structured data,
transforming it to a standard format, and loading it (ETL or ELT) into the
EDW. The process also invokes data quality processing where needed;
the result is the "trusted" data stored in the EDW. (A short code sketch
contrasting this loading path with the data refinery path appears after
this list.)
5. Data refinery – the process of ingesting raw structured and multi-
structured data and distilling it into useful formats in the Investigative
computing platform for advanced analyses by data scientists. It is also
called data prep or "data wrangling".
6. Traditional EDW – the second analytical component, where routine
analyses, reports, KPIs, customer analyses, etc., are produced on a
regular basis. The EDW is considered the production analytics
environment, using trusted, reliable data.
7. Investigative computing platform (ICP) – the third analytical
component used for data exploration, data mining, modeling, cause
and effect analyses, and general, unplanned investigations of data.
It is also known as the data lake and is the “playground” of data
scientists and others having unknown or unexpected queries.
8. Analytic tools and applications – the variety of technologies that
create the reports, perform the analyses, display the results, and
increase the productivity of analysts and data scientists.
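To make the two loading paths concrete, here is a minimal Python sketch. The data classes and their routing are inferred from the component descriptions above and are purely illustrative; they are not part of any formal XAA specification.

# Illustrative routing of incoming data through the XAA's two loading paths.
# The data classes and mappings below are examples, not a fixed specification.
ROUTES = {
    "operational transactions": ("data integration platform", "EDW"),
    "IoT sensor streams":       ("data refinery",             "ICP"),
    "social media data":        ("data refinery",             "ICP"),
    "purchased external data":  ("data refinery",             "ICP"),
}

def describe_route(data_class: str) -> str:
    """Describe how one class of data reaches its analytical repository."""
    component, repository = ROUTES[data_class]
    return f"{data_class} -> {component} -> {repository}"

for data_class in ROUTES:
    print(describe_route(data_class))

The routing decision itself, with trusted, standardized data flowing to the EDW and raw, multi-structured data flowing to the ICP, is implemented by the integration and refinery technologies discussed later in this paper.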
Figure 2 depicts the two forms of data flow: physical flows
and virtual ones. The physical flows come from the data integration
platform and the data refinery, which deposit their data into the EDW
and the ICP, respectively. Through data virtualization, physically
separated data and analytical assets can be brought together so that
they appear to be stored together for analysis or sharing. In this case,
though, no data is actually moved.
Figure 2: Data Flows in the Data Fabric
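The difference between the two flow types can be shown in a few lines of code. The sketch below is a minimal illustration using Python's built-in sqlite3 module; the database, table, and column names are invented. The physical flow copies rows into an EDW table, while the virtual flow defines a view over the attached source so the data appears unified without being moved.

import sqlite3

# One connection playing the role of the EDW, with a source system attached.
# (In-memory databases; all names are illustrative.)
edw = sqlite3.connect(":memory:")
edw.execute("ATTACH DATABASE ':memory:' AS source")

# A table populated by an operational source system.
edw.execute("CREATE TABLE source.orders (order_id INTEGER, amount REAL)")
edw.executemany("INSERT INTO source.orders VALUES (?, ?)", [(1, 120.0), (2, 75.5)])

# Physical flow: rows are extracted and copied into an EDW table.
edw.execute("CREATE TABLE edw_orders AS SELECT order_id, amount FROM source.orders")

# Virtual flow: a temporary view exposes the source data in place; no rows move.
edw.execute("CREATE TEMP VIEW v_orders AS SELECT order_id, amount FROM source.orders")

print(edw.execute("SELECT COUNT(*) FROM edw_orders").fetchone())  # copied rows
print(edw.execute("SELECT COUNT(*) FROM v_orders").fetchone())    # read in place

Running the sketch returns the same row count from the copied table and from the view, even though only the physical flow duplicated any data.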
For the business community, the ideal situation is one in which they can
quickly and easily access the data they need and then analyze it. How it
is integrated and where it is stored may
not be of great interest. Their usage of the environment should be made
as easy and uncomplicated as possible.
Their processes for using the environment consist of the following:
1. Discover – The business community needs access to a data catalog
function that is an easy-to-use entry point where they can quickly
ascertain what data is available, in unambiguous, non-technical
language that is relevant to the business, and what analytical assets
exist (reports, visualizations, dashboard widgets, advanced predictive
and other models, etc.). The catalog should supply browsable
ontologies that give this community the context of all data and assets.
It should then show the assets that contain the data with useful
information such as who created the data or asset, how current it is,
what its quality measurements are, what source supplied it, what
technology created the asset, and so on. The data catalog should
include the meaning of business rules and explanations of security
and privacy policies. And finally, the catalog should be able to
recommend other data or assets similar to the requested one(s), much
like a cohort recommendation would ("people like you also liked
these items…"). A minimal sketch of such a catalog entry and lookup
appears after this list.
2. Data Availability – Again from the catalog, a business user can quickly
determine whether the data is available for their decision-making. If it
is not, they must submit a request to the technical personnel to bring
that data into the environment. If the need is
critical, the technical personnel may create a temporary workaround
(e.g., a documented one-time extract or virtualized access to data)
to satisfy their need until the data is made available through the
proper channels (the technical process). Once the data is made
available, the temporary workaround is removed.
If the data or asset is available, then the business users must determine
whether they already have permission to access it or whom they should
contact for permission to access the needed information. If a request
is made, a governance process must be in place to ensure that the
request is valid, that it is appropriate for the requester, and that all
security and privacy policies are enforced.
3. Analyze – Once the business user has permission to access the
information, the next step is to use the information in a decision-
making capacity. The business user may create their own analytical
asset using the data or access an existing asset, perhaps tweaking it
(e.g., changing certain parameters) to fit their specific needs.
Again, the business community should ensure that any new analytical
asset is noted in the data catalog for others to discover.
4. Iterate – After completing their analysis, the business user may
continue to examine data and assets in their area or find other
information by returning to the catalog. If the catalog supports invoking
the analytical technology from within the catalog itself, then the
business user does not need the extra step of returning to the catalog –
they are already there.
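To make the Discover step concrete, here is a minimal sketch of what a catalog entry and its lookup might look like. It is written in Python for illustration only; the field names, sample entries, and the tag-overlap rule used for the "assets like this one" recommendation are assumptions, not a description of any particular catalog product.

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One data set or analytical asset registered in the catalog (illustrative fields)."""
    name: str
    description: str          # business language, not technical jargon
    owner: str                # who created the data or asset
    source: str               # which system supplied it
    last_refreshed: str       # how current it is
    quality_score: float      # e.g., 0.0 to 1.0 from data quality checks
    asset_type: str           # "dataset", "report", "dashboard", "model", ...
    tags: set = field(default_factory=set)

CATALOG = [
    CatalogEntry("Monthly churn report", "Customers lost per month by region",
                 "Analytics team", "EDW", "2021-04-30", 0.97, "report",
                 {"customer", "churn", "region"}),
    CatalogEntry("Campaign response model", "Predicted response to e-mail campaigns",
                 "Data science team", "ICP", "2021-04-15", 0.88, "model",
                 {"customer", "campaign", "marketing"}),
]

def discover(keyword: str) -> list:
    """Return entries whose name, description, or tags mention the keyword."""
    kw = keyword.lower()
    return [e for e in CATALOG
            if kw in e.name.lower() or kw in e.description.lower()
            or any(kw in t for t in e.tags)]

def similar_to(entry: CatalogEntry, limit: int = 3) -> list:
    """Recommend other entries sharing the most tags ('people also liked...')."""
    others = [e for e in CATALOG if e is not entry]
    return sorted(others, key=lambda e: len(e.tags & entry.tags), reverse=True)[:limit]

hits = discover("customer")
for hit in hits:
    print(hit.name, "-", hit.owner, "-", hit.last_refreshed)
print("Similar assets:", [e.name for e in similar_to(hits[0])])

In a real catalog the search and recommendation logic would be far richer, but the metadata carried by each entry (owner, source, freshness, quality, and a business-friendly description) is what lets a business user decide quickly whether an asset can be trusted for a decision.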
The catalog can perform certain analyses, such as data lineage
and impact analysis. Through its AI/ML capabilities, it can identify
sensitive data and present recommendations to business users.
Idera's ER/Studio, together with the company's partners, fulfills this role
of the data catalog.
2. Data modeling – Each repository in the architecture must have its
schema documented through a data modeling technology. The
different levels of data models (logical and physical) are used in
designing the EDW and the ICP. Data models of the operational
systems are used in the data integration and prep technologies as
well. Data modeling technology is also key to supplying much of the
information found in the data catalog, including any changes to
database design, the existence of data and its location, definitions,
and other glossary items. It is vital that data models are connected to
the Business Glossary to ensure a well-managed data catalog.
Idera’s ER/Studio has been around for many years and is a proven
modern modeling technology that allows data architects to connect
models into a data catalog.
3. Data Integration – A key factor in the creation of the first repository of
analytical data (the EDW) is data integration technology, which
performs the heavy lifting of extracting the data from its sources,
transforming it into a single, standard version, and loading it into the
data warehouse. This ETL or ELT process is what creates the trusted data
used in many production reports and analytics. A best practice here is
to automate as much of this process as you can using technologies like
Idera's WhereScape, which can automate the entire process from
curation through design and deployment of the EDW, producing
complete documentation immediately. A hand-rolled sketch of this
loading pattern appears after this list.
4. Data preparation – The creation of the second repository of analytical
data, the data lake or ICP, is supported by the data prep technology
that extracts raw data from sources and reformats, lightly integrates,
and loads it into the repository for exploration or experimentation. This
repository is meant for the unplanned, general queries performed by
data scientists and the like. Idera's Qubole technology is an example
of such a data prep tool that automates this process and produces
the much-needed documentation for those using this repository. A small
sketch of this distillation step also follows this list.
5. Data virtualization – The ability to bring data together virtually rather
than physically moving it around the architecture is a great boon
to the analytical personnel. Access to all data, regardless of its
location, is a major step toward data democratization. The ease with
which data can be virtualized is also a great productivity tool for the
technical personnel. It can also be used to prototype new solutions or
additions to the environment.
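For readers who want to see the ETL pattern of item 3 in miniature, the sketch below hand-rolls the extract, transform, and load steps using only Python's standard library. The source feeds, table names, and standardization rules are invented for illustration; products such as WhereScape exist precisely to automate, manage, and document this kind of work at scale.

import sqlite3
from datetime import datetime

# --- Extract: rows as they arrive from two sources, each in its own format.
source_a = [{"id": "A-1", "sale_date": "04/12/2021", "amount_usd": "1,200.50"}]
source_b = [{"id": "B-7", "sale_date": "2021-04-13", "amount_usd": "310"}]

def standardize(row: dict, date_format: str) -> tuple:
    """Transform a raw row into the single standard format loaded into the EDW."""
    sale_date = datetime.strptime(row["sale_date"], date_format).date().isoformat()
    amount = float(row["amount_usd"].replace(",", ""))
    return (row["id"], sale_date, amount)

# --- Transform: bring both feeds to one schema and one set of conventions.
clean_rows = ([standardize(r, "%m/%d/%Y") for r in source_a] +
              [standardize(r, "%Y-%m-%d") for r in source_b])

# --- Load: the standardized, "trusted" rows land in the warehouse table.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE sales (sale_id TEXT, sale_date TEXT, amount REAL)")
edw.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

print(edw.execute("SELECT * FROM sales ORDER BY sale_date").fetchall())

The point of the sketch is only to show where trust is created: in the transformation to a single standard format before the load.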
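The distillation step described in item 4 can be sketched just as briefly. The raw JSON events and the fields kept below are assumptions made for illustration; real data prep tooling adds profiling, lineage, and documentation on top of this basic reshaping before the rows land in the ICP.

import json
import sqlite3

# Raw, multi-structured events as they might arrive from a web or IoT stream
# (the payloads are illustrative).
raw_events = [
    '{"device": "sensor-12", "ts": "2021-05-01T08:00:00", "reading": {"temp_c": 21.4}}',
    '{"device": "sensor-12", "ts": "2021-05-01T08:05:00", "reading": {"temp_c": 21.9}}',
    '{"device": "sensor-40", "ts": "2021-05-01T08:00:00"}',   # incomplete event
]

def distill(line: str):
    """Flatten one raw event into a tabular row; drop events missing a reading."""
    event = json.loads(line)
    reading = event.get("reading")
    if reading is None:
        return None
    return (event["device"], event["ts"], reading["temp_c"])

rows = [r for r in (distill(e) for e in raw_events) if r is not None]

# Lightly integrated rows land in the investigative computing platform (data lake).
icp = sqlite3.connect(":memory:")
icp.execute("CREATE TABLE sensor_readings (device TEXT, ts TEXT, temp_c REAL)")
icp.executemany("INSERT INTO sensor_readings VALUES (?, ?, ?)", rows)
print(icp.execute("SELECT device, COUNT(*) FROM sensor_readings GROUP BY device").fetchall())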
Wrap-Up
The Extended Analytics Architecture is an example of a Data Fabric that
encompasses all forms of data, welcomes all analysts who access that
data to create analytical assets, and supports the technical personnel
responsible for its implementation and maintenance. It is a big
undertaking that requires many different technologies working together
in harmony. Successful environments greatly enhance the enterprise's
fact-based decision making, create a much sharper competitive
stance, and improve marketing, communications, operations, and
customer relationships.
Before embarking on this endeavor, here are some things to think about:
No single technology or technology company has the complete set of
capabilities listed in this paper. Some come very close to having it all, but
it is important to understand where there may be gaps or holes in the
offerings. If that is the case, partnerships are critical. These partnerships
are far more than just a handshake; they require a deep understanding
of each company's offering, and the technologies must interface
seamlessly with one another.
For any architecture to work, the enterprise must stand solidly behind the
initiative creating the new environment. That means the executives may
have to step in when rogue attempts are made to go around the
architectural standards and components. Silos created as workarounds
must be eliminated when the workaround is no longer needed.
Making the business community’s interfaces simple and easily
understood can make it more complicated for the technical personnel
creating the infrastructure supporting these interfaces.
Much of the value of the data fabric depends on the robustness of the
information gathered in the data catalog. Out-of-date, stale, or just plain
inaccurate metadata leads to distrust and disuse.
Finally, simply fork-lifting legacy analytic components (like an aging data
warehouse) into this fabric may be expedient but may cause problems
when integrating them into the whole fabric. If possible, these legacy
components should be reviewed and redesigned.
Many technology companies claim they can do it all. Few can. Pick a
vendor that meets most of your technological needs and has strong
partnerships with others to complete the architecture.