Rise of The Knowledge Graph
Toward Modern Data Integration and the Data Fabric Architecture
Executive Summary
While data has always been important to business across industries, in
recent years the essential role of data in all aspects of business has become
increasingly clear. The availability of data in everyday life—from the
ability to find any information on the web in the blink of an eye to the
voice-driven support of automated personal assistants—has raised the
expectations of what data can deliver for businesses. It is not uncommon for
a company leader to say, “Why can’t I have my data at my fingertips, the
way Google does it for the web?”
This is where a structure called a knowledge graph comes into play.
A knowledge graph is a combination of two things: business data in a
graph, and an explicit representation of knowledge. Businesses manage data
so that they can understand the connections between their customers,
products or services, features, markets, and anything else that impacts the
enterprise. A graph represents these connections directly, allowing us to
analyze and understand the relationships that drive business forward.
Knowledge provides background information such as what kinds of things
are important to the company and how they relate to one another. An
explicit representation of business knowledge allows different data sets to
share a common reference. A knowledge graph combines the business data
and the business knowledge to provide a more complete and integrated
experience with the organization’s data.
What does a knowledge graph do? To answer that question, let’s consider
an example. Knowledge graph technology allows Google to include oral
surgeons in a list when you ask for “dentists”; Google manages the data of
all businesses, their addresses, and what they do in a graph. The fact that
“oral surgeons” are a kind of “dentist” is knowledge that Google combines
with this data to present a fully integrated search experience. Knowledge
graph technology is essential for achieving this kind of data integration.
We’ll start by looking at how enterprises currently use data, and how that
has been changing over the past couple of decades.
Introduction
Around 2010, a sea change occurred with respect to how we think about
and value data. The next decade saw the rise of the chief data officer in
many enterprises, and later the data scientist joined other categories of
scientists and engineers as an important contributor to both the sum of
human knowledge and the bottom line. Google capitalized on the
unreasonable effectiveness of data, shattering expectations of what was
possible with it, while blazing the trail for the transformation of enterprises
into data-driven entities.
Data has increasingly come to play a significant role in everyday life as
more decision making becomes data-directed. We expect the machines
around us to know things about us: our shopping habits, our tastes, our
preferences (sometimes to a disturbing extent). Data is used in the
enterprise to optimize production, product design, product quality, logistics,
and sales, and even as the basis of new products. Data has made its way into
the daily news, propelling our understanding of business, public health, and
political events. The past decade has truly seen a data revolution.
More than ever, we expect all of our data to be connected, and connected to
us, regardless of its source. We don’t want to be bothered with gathering
and connecting data; we simply want answers that are informed by all the
data that can be available. We expect data to be smoothly woven into the
very fabric of our lives. We expect our data to do for the enterprise what the
World Wide Web has done for news, media, government, and everything
else.
But such a unified data experience does not happen by itself. While the
final product appears seamless, it is the result of significant efforts by data
engineers, as well as crowds of data contributors.
When data from all over the enterprise, and even the industry, is
woven together to create a whole that is greater than the sum of its
parts, we call this a data fabric.
Figure 1. Parallel history of knowledge-based technology and data management technology, merging
into the data fabric.
How do we get from the current state of affairs, where our data is locked
within specific applications, to a data fabric, where data can interact
throughout the enterprise?
The temptation is to treat this problem like any other application
development challenge: find the right technology, build an application, and
solve it. Data integration has been approached as if it is a one-off
application, but we can’t just build a new integration application each time
we have a data unification itch to scratch. Instead, we need to rethink the
way we approach data in the enterprise; we have to think of the data assets
as valuable in their own right, even separate from any particular application.
When we do this, our data assets can serve multiple applications, past,
present, and future. This is how we make a data architecture truly scalable;
it has to be durable from one application to the next.
Data strategists who want to replicate the success of everyday search
applications in the enterprise need to understand the combination of
technological and cultural change that led to this success.
Weaving a data fabric is at its heart a community effort, and the quality of
the data experience depends on how well that community’s contributions
are integrated. A community of data makes very specific technological
demands. So it comes as no surprise that the data revolution came to life
only when certain technological advances came together. These advances
are:
Distributed data
This refers to data that is not in one place, but distributed across the
enterprise or even the world. Many enterprise data strategists don’t
think they have a distributed data problem. They’re wrong—in any but
the most trivial businesses, data is distributed across the enterprise, with
different governance structures, different stakeholders, and different
quality standards.
Semantic metadata
This tells us what our data and the relationships that connect them
mean. We don’t need to be demanding about meaning to get more value
from it; we just need enough meaning to allow us to navigate from one
data set to another in useful ways.
Connected data
This refers to an awareness that no data set stands on its own. The
meaning of any datum comes from its connection to other data, and the
meaning of any data set comes from its connections to other data.
Semantic Systems
Knowledge graphs draw heavily on semantic nets. A semantic net is, as its
name implies, a network of meaningful concepts. Concepts are given
meaningful names, as are their interlinkages. Figure 3 shows an example of
a semantic net about animals, their attributes, their classifications, and their
environments. Various kinds of animals are related to one another with
labeled links. The idea behind a semantic net is that it is possible to
represent knowledge about some domain in this form, and use it to drive
expert-level performance for answering questions and other tasks.
Data Representation
One of the drivers of the information revolution was the ability of
computers to store and manage massive amounts of data; from an early
date, database systems were the main driver of information technology in
business.
Early data management systems were based on a synergy between how
people and machines organize data. When people organize data on their
own, there is a strong push toward managing it in a tabular form; similar
things should be described in a similar way, the thinking goes. Library
index cards, for example, all have a book title, an author, and a Dewey
Decimal classification, and every customer has a contact address, a name, a
list of things they order, etc. At the same time, technology for dealing with
orderly, grouped, and linked tabular data, in the form of relational databases
(RDBMS), was fast, efficient, and easily optimized. This led vendors to
advertise these systems with the claim that "they work in tables, just the way
people think!"
Tables turned out to be amenable to a number of analytic approaches, as
online analytical processing systems allowed analysts to collect, integrate,
and then slice and dice data in a variety of ways. This provided a multitude
of insights beyond what was apparent in the initial discrete data sets. A
whole discipline of formal data modeling, now also known as schema on
write, grew up, based on the premise that data can be represented as
interlinked tables.
RDBMS are particularly good at managing well-structured data at an
application level; that is, managing data and relationships that are pertinent
to a particular well-defined task, and a particular application that helps a
person accomplish that specific task. Their main drawback is inflexibility:
the data model that governs storage, and that supports the queries needed to
retrieve the data, must be designed up front. But businesses need data from multiple applications to
be aggregated, for reporting, analytics, and strategic planning purposes.
Thus enterprises have become aware of a need for enterprise-level data
management.
Efforts to extend the RDBMS approach to accumulate enterprise-level data
culminated in reporting technology that went by the name enterprise data
warehouse (EDW), based on the analogy that packing a lot of tables into a
single place for consistent management is akin to packing large amounts of
goods into a warehouse. The deployment of EDWs was not as widespread
as the use of tabular representations in transaction-oriented applications
(OLTP), as fewer EDWs were required to sweep up the data from many
applications. They were, however, far larger and more complex
implementations. Eventually, EDWs fell out of favor with some businesses due
to high costs, high risk, inflexibility, and relatively poor support for
entity-rich data models. Despite these issues, EDW continues to
be the most successful technology for large-scale structured data integration
for reporting and analytics.
Data managers who saw the drawbacks of EDW began to react against its
dominance. Eventually this movement picked up the name NoSQL,
attracting people who were disillusioned with the ubiquity of tabular
representations in data management, and who wanted to move beyond SQL,
the structured query language that was used with tabular representations.
The name NoSQL appeared to promise a new approach, one that didn’t use
SQL at all. Very quickly, adherents of NoSQL realized that there was
indeed a place for SQL in whatever data future they were imagining, and
the name “NoSQL” was redefined to mean “Not Only SQL”—to allow for
other data paradigms to exist alongside SQL as well as to provide
interoperability with popular business intelligence software tools.
The NoSQL movement included a wide range of different data paradigms,
including document stores, object stores, data lakes, and even graph stores.
These quickly gained popularity with developers of data-backed
applications and data scientists, as they enabled rapid application
development and deployment. Enterprise data guru Martin Fowler
conjectured that the popularity of these data stores among developers
stemmed to a large extent from the fact that these schemaless stores allowed
developers to make an end run around the business analysts and enterprise
data managers, who wanted to keep control of enterprise data. These data
managers needed formal data schemas, controlled vocabularies, and
consistent metadata in order to use their traditional relational database and
modeling tools. The schemaless data systems provided robust data
backends without the need for enterprise-level oversight. This was
wonderful for developers often frustrated by the rigidity and the difficulty
of modeling complex representations offered by RDBMS schema, but not
so wonderful for the enterprise data managers.
This movement resulted in a large number of data-backed applications, but
at the expense of an enterprise-level understanding of what data and
applications were managing the enterprise. Similarly, data scientists found
that NoSQL systems better suited to mass data, such as Hadoop and HDFS
storage, gave them fast, direct access to the data they needed (if they could
find it), rather than waiting for it to eventually end up in an EDW. This
approach was often described as "schema on read."
We can read this as describing three people, whom we can identify by the
first column. Each cell reflects different information about them. We can
reflect this in a graph in a straightforward way, as seen in Figure 5.
Figure 5. The same data from Figure 4 shown as a graph, represented visually.
Each row in the table becomes a starburst in the graph. In this sense, a
graph is a lowest common denominator for data representation; whatever
you can represent in a table, a document, a list, or whatever, you can also
represent as a graph.
The power of graph data, and the thing that made it so popular with AI
researchers back in the 1960s and with developers today, is that a graph
isn’t limited to a particular shape. As we saw in Figure 2, a graph can
connect different types of things in a variety of ways. There’s no restriction
on structure of the graph; it can include loops and self-references. Multiple
links can point to the same resource, or from the same resource. Two links
with different labels can connect the same two nodes. With a graph, the
sky’s the limit.
All sorts of networked data can be represented easily in this way. In finance,
companies are represented in what is called a legal hierarchy, a networked
structure of companies and the ownership/control relationships between
them. In biochemistry, metabolic pathways are represented as networks of
basic chemical reactions. As one begins to think about it, it becomes
apparent that almost all data can be represented as a graph, as any data that
describes a number of different concepts or entity types with relationships
that link them is a natural network.
Figure 7. Two tables covering some common information. How can you combine these?
Figure 8. Tables simply aren’t easy to combine. Doing so with these two produces something that isn’t
even a proper table. How do we line up columns? Rows?
Figure 10. Merging two graphs when they have a node in common.
Anti-money laundering
In finance today, money laundering is big business. On a surprisingly large
scale, so-called bad actors conceal ill-gotten gains in the legitimate banking
system. Despite widespread regulation for detecting and managing money
laundering, only a small fraction of cases are detected.
In general, money laundering is carried out in several stages. This typically
involves the creation of a large number of legal organizations (corporations,
LLCs, trust funds, etc.) and passing the money from one to another,
eventually returning it, with an air of legitimacy, to the original owner.
Graph data representations are particularly well suited to tracking money
laundering because of their unique capabilities. First, the known
connections between individuals and corporations naturally form a network;
one organization owns another, a particular person sits on the board or is an
officer of a corporation, and individual persons are related to one another
through family, social, or professional relationships. Tracking money
laundering uses all of these relationships.
Second, certain patterns of money laundering are well known to financial
investigators. These patterns can be represented in a straightforward manner
with graph queries and typically include deep paths: one corporation is a
subsidiary, part owner, or controlling party of another corporation, over and
over again (remember Kevin Bacon?), and transactions pass money from
one of these corporations to another, often many steps deep. Cracking a
money-laundering activity involves finding these long paths, matching
some pattern.
Finally, the data involved in these patterns typically comes from multiple
sources. A number of high-profile cases were detected from data in the
Panama Papers, which, when combined with other published information
(corporate registrations, board memberships, citizenship, arrest records,
etc.), provided the links needed to detect money laundering.
The application of graph data in finance has achieved a number of results
that were infeasible using legacy data management techniques.
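
To make the deep-path idea concrete, here is a minimal sketch in Python using the open source rdflib toolkit and a SPARQL property path. The example.com namespace, the controls and transferredTo properties, and the tiny data set are all invented for illustration and are far simpler than real ownership and transaction data.

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix ex: <http://example.com/aml/> .   # hypothetical namespace
ex:AcmeHoldings ex:controls ex:ShellA .
ex:ShellA       ex:controls ex:ShellB .
ex:AcmeHoldings ex:transferredTo ex:ShellA .
ex:ShellA       ex:transferredTo ex:ShellB .
ex:ShellB       ex:transferredTo ex:AcmeHoldings .
""")

# A deep-path query: follow control and transfer links to arbitrary depth,
# looking for money that leaves an entity and eventually returns to it.
query = """
PREFIX ex: <http://example.com/aml/>
SELECT DISTINCT ?origin ?shell WHERE {
  ?origin ex:controls+ ?shell .         # chain of ownership/control
  ?origin ex:transferredTo+ ?shell .    # money flowed out along the chain
  ?shell  ex:transferredTo+ ?origin .   # ...and came back to the origin
}
"""
for row in g.query(query):
    print(f"Possible round trip via {row.shell} back to {row.origin}")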
Collaborative filtering
In ecommerce, it is important to be able to determine what products a
particular shopper is likely to buy. Advertising combs to bald men or
diapers to teenagers is not likely to result in sales. Egregious errors of this
sort can be avoided by categorizing products and customers, but highly
targeted promotions are much more likely to be effective.
A successful approach to this problem is called collaborative filtering. The
idea of collaborative filtering is that we can predict what one customer will
buy by examining the buying habits of other customers. Customers who
bought this item also bought that item. You bought this item—maybe you’d
like that item, too?
Graph data makes it possible to be more sophisticated with collaborative
filtering; in addition to filtering based on common purchases, we can filter
based on more elaborate patterns, including features of the products (brand
names, functionality, style) and information about the shopper (stated
preferences, membership in certain organizations, subscriptions). High-
quality collaborative filtering involves analysis of complex data patterns,
which is only practical with graph-based representations. Just as in the case
of money laundering, much of the data used in collaborative filtering is
external to the ecommerce system; product information, brand specifics,
and user demographics are all examples of distributed data that must be
integrated.
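
As a hedged illustration, the following sketch expresses a "customers who bought this also bought that" pattern, narrowed by one product feature (brand), as a SPARQL query run with rdflib. All of the URIs, property names, and sample purchases are invented for the example.

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix ex: <http://example.com/shop/> .   # hypothetical namespace
ex:alice ex:bought ex:item1 ; ex:bought ex:item2 .
ex:bob   ex:bought ex:item1 .
ex:item1 ex:brand ex:Acme .
ex:item2 ex:brand ex:Acme .
""")

# "Customers who bought what you bought also bought..." restricted to the
# same brand -- one example of a richer pattern than purchases alone.
query = """
PREFIX ex: <http://example.com/shop/>
SELECT DISTINCT ?suggestion WHERE {
  ex:bob  ex:bought ?item .
  ?other  ex:bought ?item .
  ?other  ex:bought ?suggestion .
  ?item       ex:brand ?brand .
  ?suggestion ex:brand ?brand .
  FILTER(?other != ex:bob)
  FILTER NOT EXISTS { ex:bob ex:bought ?suggestion }
}
"""
for row in g.query(query):
    print(f"Suggest {row.suggestion} to ex:bob")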
Identity resolution
Identity resolution isn’t really a use case in its own right, but rather a high-
level capability of graph data. The basic problem of identity resolution is
that, when you combine data from multiple sources, you need to be able to
determine when an identity in one source refers to the same identity in
another source. We saw an example of identity resolution already in Figure
8; how do we know that Chris Pope in one table refers to the same Chris
Pope in the other? In that example, we used the fact that they had the same
name to infer that they were the same person. Clearly, in a large-scale data
system, this will not be a robust solution. How do we tell when two entities
are indeed the same?
One approach to this is to combine cross-references from multiple sources.
Suppose we had one source that includes the Social Security number (SSN)
of individuals, along with their passport number (when they have one).
Another data source includes the SSN along with employee numbers. If we
combine all of these together, we can determine that Chris Pope (with a
particular passport number) is the same person as Chris Pope (with a
particular employee number). It is not uncommon for this sort of cross-
referencing to go several steps before we can resolve the identity of a
person or corporation.
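
Here is a minimal sketch of that cross-referencing step, assuming two hypothetical sources loaded into one graph. The namespaces, property names, and identifiers are illustrative placeholders.

from rdflib import Graph

g = Graph()
# Two hypothetical sources, loaded into the same graph.
g.parse(format="turtle", data="""
@prefix hr:  <http://example.com/hr/> .       # source 1: HR system
@prefix gov: <http://example.com/passport/> . # source 2: passport records
hr:emp42    hr:name "Chris Pope" ; hr:ssn "123-45-6789" ; hr:employeeNumber "42" .
gov:person7 gov:name "Chris Pope" ; gov:ssn "123-45-6789" ; gov:passport "X9876543" .
""")

# Resolve identities by cross-referencing the shared SSN rather than the name.
query = """
PREFIX hr:  <http://example.com/hr/>
PREFIX gov: <http://example.com/passport/>
SELECT ?employee ?citizen WHERE {
  ?employee hr:ssn  ?ssn .
  ?citizen  gov:ssn ?ssn .
}
"""
for row in g.query(query):
    print(f"{row.employee} and {row.citizen} appear to be the same person")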
Identity resolution has clear applications in money laundering (how do we
determine that the money has returned to the same person who started the
money laundering path?), and fraud detection (often, fraud involves
masquerading as a different entity with intent to mislead). A similar issue
happens with drug discovery: if we want to reuse a result from an
experiment that was performed years ago, can we guarantee that the test
was done on the same chemical compound that we are studying today? But
identity resolution is also important for internal data integration. If we want
to support a customer 360 application (that is, gathering all information
about a customer in a single place), we need to resolve the identity of that
customer across multiple systems. Identity resolution requires the ability to
merge data from multiple sources and to trace long paths in that data. Graph
data is well suited to this requirement.
Industry codes
The US Census Bureau maintains a list called the NAICS (North
American Industry Classification System). This is a list of industries
that a company might be involved in. NAICS codes have a wide variety
of uses: for example, banks use them to know their customers better,
and funding agents use them to find appropriate candidates for projects.
Product classifications
The United Nations maintains the UNSPSC, the UN Standard Products
and Services Code, a set of codes for describing products and services.
This is used by many commercial organizations to match products to
manufacturing and customers.
All of these examples can benefit from graph representations; each of them
is structured in a multilayer hierarchy of general categories (or countries, in
the case of the geography standards), with subdivisions. In the case of
NAICS and UNSPSC, the hierarchies are several layers deep. It is typical of
such industry standards to have hundreds or even thousands of coded entities,
along with some structure to organize them.
With literally tens of thousands of companies using these externally
controlled reference data structures, there is a need to publish them as
shared resources. In many cases, these have been published in
commonplace but nonstandard formats such as comma-separated files. But
if we want to represent standardized data in a consistent way, we need a
standard language for writing down a graph. What are the requirements for
such a language?
Figure 12. Two graphs with similar information that look very different. We already saw part (A) in
Figure 3. Part (B) has exactly the same relationships between the same entities as (A), but is laid out
differently. What criteria should we use to say that these two are the “same” graph?
Technology Independence
When an enterprise invests in any information resource—documents, data,
software—an important concern is the durability of the underlying
technology. Will my business continue to be supported by the technology I
am using for the foreseeable future? But if I do continue to be supported,
does that chain me to a particular technology vendor, so that I can never
change, allowing that vendor to potentially institute abusive pricing or
terms going forward? Can I move my data to a competitor’s system?
For many data technologies, this sort of flexibility is possible in theory, but
not in practice. Shifting a relational database system from one major vendor
to another is an onerous and risky task that can be time-consuming and
expensive. If, on the other hand, we have a standard way of determining
when the graph in one system is the same as the graph in another, and a
standard way to write a graph down and load it again, then we can transfer
our data, with guaranteed fidelity, from one vendor system to another. In
such a marketplace, vendors compete on features, performance, support,
customer service, etc., rather than on lock-in, increasing innovation for the
entire industry.
Once we have identified the triples, we can sort them in any order, such as
alphabetical order, as shown in Figure 14, without affecting the description
of the graph.
Figure 14. The triples can be listed in any order.
While there are of course more technical details to RDF, this is basically all
you need to know: RDF represents a graph by identifying each node and
link with a URI (i.e., a web-global identifier), and breaks the graph down
into its smallest possible parts, the triple.
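
As a small illustration of this idea, the following Python sketch uses the open source rdflib library (one of several RDF toolkits) to build a three-triple graph. The example.com URIs and the worksFor property are invented for the example; only FOAF's name property is a real published term.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import FOAF

EX = Namespace("http://example.com/people/")   # illustrative URIs

g = Graph()

# Each statement is a single triple: (node, link, node or literal value),
# with every node and link identified by a URI.
g.add((EX.chrisPope, FOAF.name, Literal("Chris Pope")))
g.add((EX.chrisPope, EX.worksFor, EX.acmeCorp))
g.add((EX.acmeCorp, FOAF.name, Literal("Acme Corp")))

# A graph is just a set of triples; the order in which they are stored or
# printed never changes the graph they describe.
for subject, link, obj in g:
    print(subject, link, obj)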
Figure 15. Data about authors, their gender, and the topics of their publications (A). The pattern (B)
finds “Female authors who have published in Science.”
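
The pattern in part (B) of the figure can be written directly as a graph query. The following sketch is one hedged possibility in SPARQL, run here through rdflib; the ex: property names and the tiny sample data stand in for the figure's actual contents.

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix ex: <http://example.com/pubs/> .   # hypothetical vocabulary
ex:ada    ex:gender ex:Female ; ex:authorOf ex:paper1 .
ex:grace  ex:gender ex:Female ; ex:authorOf ex:paper2 .
ex:paper1 ex:publishedIn ex:Science .
ex:paper2 ex:publishedIn ex:Nature .
""")

# The graph pattern "female authors who have published in Science":
query = """
PREFIX ex: <http://example.com/pubs/>
SELECT ?author WHERE {
  ?author ex:gender      ex:Female .
  ?author ex:authorOf    ?paper .
  ?paper  ex:publishedIn ex:Science .
}
"""
for row in g.query(query):
    print(row.author)   # only ex:ada matches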
Graph data on its own provides powerful capabilities that have made it a
popular choice of application developers, who want to build applications
that go beyond what has been possible with existing data management
paradigms, in particular, relational databases. While graph data systems
have been very successful in this regard, they have not addressed many of
the enterprise-wide data management needs that many businesses are facing
today. For example, with the advent of data regulations like GDPR and the
California Consumer Privacy Act (CCPA), enterprises need to have an
explicit catalog of data; they need to know what data they have and where
they can find it. They also need to be able to align data with resources
outside the enterprise, either to satisfy regulations or to be able to expand
their analyses to include market data beyond their control. Graphs can do a
lot, but accomplishing these goals requires another innovation—the explicit
management of knowledge.
What Is a Vocabulary?
A vocabulary is simply a controlled set of terms for a collection of things.
Vocabularies can have very general uses, like lists of countries or states,
lists of currencies, or lists of units of measurement. Vocabularies can be
focused on a specific domain, like lists of medical conditions or lists of
legal topics. Vocabularies can have very small audiences, like lists of
product categories in a single company. Vocabularies can also be quite
small—even just a handful of terms—or very large, including thousands of
categories. Regardless of the scope of coverage or the size and interests of
the audience, all vocabularies are made up of a fixed number of items, each
with some identifying code and common name.
Some examples of vocabularies:
A list of the states in the United States
There are 50 of them, and each one has some identifying information,
including the name of the state and the two-letter postal code. This
vocabulary can be used to organize mailing addresses or to identify the
location of company offices. This list is of general interest to a wide
variety of businesses; anyone who needs to manage locations of entities
in the United States could make use of this vocabulary.
The preceding vocabularies are of general interest, and could be (and often
are) shared by multiple businesses. Vocabularies can also be useful within a
single enterprise, such as:
A list of customer loyalty levels
An airline provides rewards to returning customers, and organizes this
into a loyalty program. There are a handful of loyalty levels that a
customer can reach; some of them are based on travel in the current year
(and might have names like “Silver,” “Gold,” “Platinum,” and
“Diamond”); others for lifetime loyalty (“Million miler club”). The
airline maintains a small list of these levels, with a unique identifier for
each and a name.
A list of genders
Even apparently trivial lists can be, and often are, managed as controlled
vocabularies. A bank might want to allow a personal client to specify
gender as Male or Female, but also as an unspecified value that is not
one of those (“Other”), or allow them not to respond at all
(“Unspecified”). A medical clinic, on the other hand, might want to
include various medically significant variations of gender in their list.
Lists of this sort can reflect knowledge about the world on which there is an
agreement (the list of US states is not very controversial), but often reflect a
statement of policy. This is particularly common in vocabularies that are
specific to a single enterprise. The list of customer loyalty levels is specific
to that airline, and represents its policy for rewarding returning customers.
The list of regulatory requirements reflects this bank’s interpretation of the
reporting requirements imposed by a regulation; this might need to change
if a new court case establishes a precedent that is at odds with this
classification. Even lists of gender can be a reflection of policy; a bank may
not recognize nonbinary genders, or even allow a customer to decline to
provide an answer, but it must deal with any consequences of such a policy.
Controlled vocabularies, such as the examples described previously, provide
several advantages:
Disambiguation
If multiple data sets refer to the same thing (a state within the United
States, or a topic in law), a controlled vocabulary provides an
unambiguous reference point. If each of them uses the same vocabulary,
and hence the same keys, then there is no confusion that “ME” refers to
the state of Maine.
Standardizing references
When someone designs a data set, they have a number of choices when
referring to some standard entity; they could spell out the name “Maine”
or use a standard code like “ME.” In this example, having a standard
vocabulary can provide a policy for using the code “ME.”
Expressing policy
Does an enterprise want to recognize a controversial nation as an
official country? Does it want to recognize more genders than just two?
A managed vocabulary expresses this kind of policy by including all the
distinctions that are of interest to the enterprise.
What Is an Ontology?
We have already seen, in our discussion of controlled vocabularies, a
distinction between an enterprise-wide vocabulary and the representations
of that vocabulary in various data systems. The same tension happens on an
even larger scale with the structure of the various data systems in an
enterprise; each data system embodies its own reflection of the important
entities for the business, with important commonalities repeated from one
system to the next.
We’ll demonstrate an ontology with a simple example. Suppose you wanted
to describe a business. We’ll use a simple example of an online bookstore.
How would you describe the business? One way you might approach this
would be to say that the bookstore has customers and products.
Furthermore, there are several types of products—books, of course, but also
periodicals, videos, and music. There are also accounts, and the customers
have accounts of various sorts. Some are simple purchase accounts, where
the customer places an order, and the bookstore fulfills it. Others might be
subscription accounts, where the customer has the right to download or
stream some products.
All these same things—customers, products, accounts, orders,
subscriptions, etc.—exist in the business. They are related to one another in
various ways; an order is for a product, for a particular customer. There are
different types of products, and the ways in which they are fulfilled are
different for each type. An ontology is a structure that describes all of these
types of things and how they are related to one another.
More generally, an ontology is a representation of all the different kinds of
things that exist in the business, along with their interrelationships. An
ontology for one business can be very different from an ontology for
another. For example, an ontology for a retailer might describe customers,
accounts and products. For a clinic, there are patients, treatments and visits.
The simple bookstore ontology is intended for illustrative purposes only; it
clearly has serious gaps, even for a simplistic understanding of the business
of an online bookstore. For example, it has no provision for order payment
or fulfillment. It includes no consideration of multiple orders or quantities
of a particular product in an order. It includes no provision for an order
from one customer being sent to the address of another (e.g., as a gift). But,
simple as it is, it shows some of the capabilities of an ontology, in
particular:
An ontology can describe how commonalities and differences
among entities can be managed. All products are treated in the
same way with respect to placing an order, but they will be treated
differently when it comes to fulfillment.
An ontology can describe all the intermediate steps in a
transaction. If we just look at an interaction between a customer
and the online ordering website, we might think that a customer
requests a product. But in fact, the website will not place a request
for a product without an account for that person, and, since we will
want to use that account again for future requests, each order is
separate and associated with the account.
All of these distinctions are represented explicitly in an ontology. For
example, see the ontology in Figure 16. The first thing you’ll notice about
Figure 16 is probably that it looks a lot like graph data; this is not an
accident. As we saw earlier, graphs are a sort of universal data
representation mechanism; they can be used for a wide variety of purposes.
One purpose they are suited for is to represent ontologies, so our ontologies
will be represented as graphs, just like our data. In the following section,
when we combine ontologies with graph data to form a knowledge graph,
this uniformity in representation (i.e., both data and ontologies are
represented as graphs) simplifies the combination process greatly. The
nodes in this diagram indicate the types of things in the business (Product,
Account, Customer); the linkages represent the connections between things
of these types (an Account belongs to a Customer; an Order requests a
Product). Dotted lines in this and subsequent figures indicate more specific
types of things: Books, Periodicals, Videos, and Audio are more specific
types of Product; Purchase Account and Subscription are more specific
types of Account.
Figure 16. A simple ontology that reflects the enterprise data structure of an online bookstore.
Put yourself in the shoes of a data manager, and have a look at Figure 16;
you might be tempted to say that this is some sort of summarized
representation of the schema of a database. This is a natural observation to
make, since the sort of information in Figure 16 is very similar to what you
might find in a database schema. One key feature of an ontology like the
one in Figure 16 is that every item (i.e., every node, every link) can be
referenced from outside the ontology. In contrast to a schema for a
relational database, which only describes the structure of that database and
is implicit in that database implementation, the ontology in Figure 16 is
represented explicitly for use by any application, in the enterprise or
outside. A knowledge graph relies on the explicit representation of
ontologies, which can be used and reused, both within the enterprise and
beyond.
Just as was the case for vocabularies, we can have ontologies that have a
very general purpose and could be used by many applications, as well as
very specific ontologies for a particular enterprise or industry. Some
examples of general ontologies include:
The Financial Industry Business Ontology (FIBO)
The Enterprise Data Management Council has produced an ontology
that describes the business of the finance industry. FIBO includes
descriptions of financial instruments, the parties that participate in those
instruments, the markets that trade them, their currencies and other
monetary instruments, and so on. FIBO is offered as a public resource,
for describing and harmonizing data across the financial industry.
Data catalog
Many modern enterprises have grown in recent years through mergers and
acquisitions of multiple companies. After the financial crisis of 2008, many
banks merged and consolidated into large conglomerates. Similar mergers
happened in other industries, including life sciences, manufacturing, and
media. Since each component company brought its own enterprise data
landscape to the merger, this resulted in complex combined data systems. An
immediate result of this was that many of the new companies did not know
what data they had, where it was, or how it was represented. In short, these
companies didn’t know what they knew.
The first step toward a solution to this problem is creating a data catalog.
Just as with a product catalog, a data catalog is simply an annotated list of
data sources. But the annotation of a data source includes information about
the entities it represents and the connections between them—that is, exactly
the information that an ontology holds.
Some recent uses for a data catalog have put an emphasis on privacy
regulations such as GDPR and CCPA. Among the guarantees that these
regulations provide is the “right to be forgotten,” meaning an individual
may request that personal information about them be removed entirely from
the enterprise data record. In order to comply with such a request, an
enterprise has to know where such information is stored and how it is
represented. A data catalog provides a road map for satisfying such
requests.
Data harmonization
When a large organization has many data sets (it is not uncommon for a
large bank to have many thousands of databases, with millions of columns),
how can we compare data from one to another? There are two issues that
can confuse this situation. The first is terminology. If you ask people from
different parts of the organization what a “customer” is, you’ll get many
answers. For some, a customer is someone who pays money; for others, a
customer is someone they deliver goods or services to. For some, a
customer might be internal to the organization; for others, a customer must
be external. How do we know which meaning of a word like “customer” a
particular database is referring to?
The second issue has to do with the relationships between entities. For
example, suppose you have an order for a product that is being sent to
someone other than your account holder (say, as a gift). What do you call
the person who is paying for the order? What do you call the person who is
receiving the order? Different systems are likely to refer to these
relationships by different names. How can we know what they are referring
to?
An explicit representation of business knowledge disambiguates
inconsistencies of this sort. This kind of alignment is called data
harmonization; we don’t change how these data sources refer to these terms
and relationships, but we do match them to a reference knowledge
representation.
Data validation
As an enterprise continues to do business, it of course gathers new data.
This might be in the form of new customers, new sales to old customers,
new products to sell, new titles (for a bookstore or media company), new
services, new research, ongoing monitoring, etc. A thriving business will
generate new data all the time.
But this business will also have data from its past business: information
about old products (some of which may have been discontinued), order
history from long-term customers, and so on. All of this data is important
for product and market development, customer service, and other aspects of
a continuing business.
It would be nice if all the data we have ever collected had been organized
the same way, and collected with close attention to quality. But the reality
for many organizations is that there is a jumble of data, a lot of which is of
questionable quality. An explicit knowledge model can express structural
information about our data, providing a framework for data validation. For
example, if our terminology knowledge says that “gender” has to be one of
M, F, or “not provided,” we can check data sets that claim to specify gender.
Values like “0,” “1,” or “ ” are suspect; we can examine those sources to see
how to map these values to the controlled values.
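
To make this concrete, here is a minimal, purely illustrative sketch of that kind of check; the allowed codes and the sample records are invented, not drawn from any real system.

# Validate incoming values against the controlled vocabulary for gender.
ALLOWED_GENDER_CODES = {"M", "F", "not provided"}

incoming_records = [
    {"name": "Chris Pope", "gender": "M"},
    {"name": "Pat Smith",  "gender": "0"},   # suspect value
    {"name": "Sam Jones",  "gender": " "},   # suspect value
]

for record in incoming_records:
    if record["gender"] not in ALLOWED_GENDER_CODES:
        print(f"Suspect gender value {record['gender']!r} for {record['name']}; "
              "map it to a controlled value or flag it for review")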
Having an explicit representation of knowledge also allows us to make sure
that different validation efforts are consistent; we want to use the same
criteria for vetting new data that is coming in (say, from an order form on a
website) as we do for vetting data from earlier databases. A shared
knowledge representation provides a single point of reference for validation
information.
altLabel
An alternative lexical label for a resource
notation
A string of characters such as “T58.5” or “303.4833” used to uniquely
identify a concept; also known as a classification code
In this example, we have used the prefLabel to provide the official name of
the state; we use altLabel to specify some common names that don’t match
the official name (in the case of Rhode Island, the long name including
“Providence Plantations” was the official name until November 2020; we
list it as an “alternative” name since there are certainly still some data sets
that refer to it that way).
A notation in SKOS is a string used to identify the entity; here, we have
used this for the official US postal code for the state. The postal service
guarantees that these are indeed unique, so these qualify as an appropriate
use of notation.
We have also added into our graph a link using a property we have called
has type.1 This allows us to group all 50 of these entities together and tell
our knowledge customers what type of thing these are; in this case, they are
US states.
Any data set that wants to refer to a state can refer to the knowledge
representation itself (by using the URI in the graph in Figure 17, e.g.,
State21), or it can use the notation (which is guaranteed to be unambiguous,
as long as we know that we are talking about a state), or the official name,
or any of the alternate names (which are not guaranteed to be unique). All
of that information, including the caveats about which ones are guaranteed
to be unique and in what context, is represented in the vocabulary, which in
turn is represented in SKOS.
There are of course more details to how SKOS works, but this example
demonstrates the basic principles. Terms are indexed as URIs in a graph,
and each of them is annotated with a handful of standard annotations. These
annotations are used to describe how data sets in the enterprise refer to the
controlled terms.
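
As a hedged sketch of what one vocabulary entry might look like, the following Python fragment uses rdflib's SKOS namespace to describe a single state. The example.com URIs (such as State39) and the USState grouping class are illustrative stand-ins for the identifiers in Figure 17.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.com/states/")   # illustrative URIs

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

# One entry from a controlled vocabulary of US states.
state = EX.State39                       # illustrative identifier
g.add((state, RDF.type, EX.USState))     # the "has type" grouping link
g.add((state, SKOS.prefLabel, Literal("Rhode Island")))
g.add((state, SKOS.altLabel,
       Literal("State of Rhode Island and Providence Plantations")))
g.add((state, SKOS.notation, Literal("RI")))   # the official postal code

print(g.serialize(format="turtle"))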
Standardizing Concepts: RDFS
RDFS is the schema language for RDF; that is, it is the standard way to talk
about the structure of data in RDF. Just like SKOS, RDFS uses the
infrastructure of RDF. This means that RDFS itself is represented as a
graph, and everything in RDFS has a URI and can be referenced from
outside.
RDFS describes the types of entities in a conceptual model and how they
are related; in other words, it describes an ontology like the one in Figure
16. Since RDFS is expressed in RDF, each type is a URI. Using Figure 16
as an example, this means that we have a URI for Customer, Order,
Account, Product, etc., but also that we have a URI for the relationships
between these things—requests, on behalf of, and belongs to are all URIs.
In RDFS, the types of things are called Classes. There are 10 classes in
Figure 16; they are shown in blue ovals. We see some business-specific
links between classes (like the link labeled requests between Order and
Product). We also see some unlabeled links, shown with dashed lines.
These are special links that have a name in the RDFS standard (as opposed
to names from the business, like requests, on behalf of, and belongs to); that
name is subclass of. In Figure 16, subclass of means that everything that is
a Purchase Account is also an Account, everything that is a Subscription is
also an Account, everything that is a Book is also a Product, and so on.
How does an ontology like this answer business questions? Let’s consider a
simple example. Chris Pope is a customer of our bookstore, and registers
for a subscription to the periodical Modeler’s Illustrated. Pope is a
Customer (that is, his type, listed in the ontology, is Customer). The thing
he purchased has a type given in the ontology as well; it is a Subscription.
Now for a (very simple) business question: what is the relationship between
Chris’s subscription and Chris? There’s no direct link between Subscription
and Customer in the diagram. But since Subscription is a more specific
Account, Chris’s subscription is also an Account. How is an Account related
to a Customer? This relationship is given in the ontology; it is called
belongs to. So the answer to our question is that Chris’s subscription
belongs to Chris.
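
Here is a hedged sketch of how that piece of reasoning can be captured with RDFS and a SPARQL property path, using rdflib. The ex: namespace and property names such as belongsTo are invented stand-ins for the ontology in Figure 16.

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.com/bookstore/> .   # hypothetical namespace

# A slice of the ontology: a Subscription is a more specific kind of Account,
# and an Account belongs to a Customer.
ex:Subscription rdfs:subClassOf ex:Account .
ex:belongsTo    rdfs:domain ex:Account ; rdfs:range ex:Customer .

# A slice of the data: Chris Pope and his subscription.
ex:chrisPope         rdf:type ex:Customer .
ex:chrisSubscription rdf:type ex:Subscription ; ex:belongsTo ex:chrisPope .
""")

# Because Subscription is a subclass of Account, a query that asks for
# Accounts (walking rdfs:subClassOf with a property path) finds the
# subscription, and belongsTo links it back to Chris.
query = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.com/bookstore/>
SELECT ?account ?customer WHERE {
  ?account rdf:type/rdfs:subClassOf* ex:Account .
  ?account ex:belongsTo ?customer .
}
"""
for row in g.query(query):
    print(f"{row.account} belongs to {row.customer}")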
Now let’s see how a structure like this can help with data harmonization.
Suppose we have an application that just deals with streaming video and
audio. It needs to know the difference between video and audio, but doesn’t
need to know about books or periodicals at all (since we never stream
those). Another application deals with books and how to deliver them; this
application knows about book readers and even about paper copies of
books. There’s not a lot of need or opportunity for harmonization so far.
But let’s look a bit closer. Both of these applications have to treat these
things as Products; that is, they need to understand that it is possible to have
an Order request one, and that eventually, the delivery is on behalf of an
account, which belongs to a customer. These things are common to books,
periodicals, videos, and audio. The RDFS structure expresses this
commonality by saying that Books, Periodicals, Videos, and Audio are all
subclasses of Product.
Like SKOS, RDFS is a simple representation and can be used to represent
simple ontologies. Fortunately, a little bit of ontology goes a long way; we
don’t need to represent complex ontologies in order to satisfy the data
management goals of the enterprise.
Between the two of them, SKOS and RDFS cover the bases for explicit
knowledge representation in the enterprise. SKOS provides a mechanism
for representing vocabularies, while RDFS provides a mechanism for
representing conceptual models. Both of them are built on RDF, so every
concept or term is identified by a URI and can be referenced from outside.
This structure allows us to combine graph data with explicit knowledge
representation. In other words, this structure allows us to create a
knowledge graph.
Figure 18. Ontology of an online bookstore, showing relations from Book to number of pages and
from Video to runtime.
Figure 20. The same data from Figure 19, now represented as a graph.
The labels on the arrows in Figure 20 are the same as the column headers in
the table in Figure 19. Notice that the N/A values are not shown at all,
since, unlike a tabular representation, a graph does not insist that every
property have a value.
What is the connection between the data and the ontology? We can link the
data graph in Figure 20 to the ontology graph in Figure 16 simply by
connecting nodes in one graph to another, as shown in Figure 21. There is a
single triple that connects the two graphs, labeled has type. This triple
simply states that SKU1234 is an instance of the class Product. In many
cases, combining knowledge and data can be as simple as this; the rows in a
table correspond directly to instances of a class, whereas the class itself
corresponds to the table. This connection can be seen graphically in Figure
21: the data is in the bottom of the diagram (SKU1234 and its links), the
ontology in the top of the diagram (a copy of Figure 16), and a single link
between them, shown in bold in the diagram and labeled with “has type.”
But even in this simple example, we have some room for refinement. Our
product table includes information about the format of the product. A quick
glance at the possible values for format suggests that these actually
correspond to different types of Product, represented in the ontology as
classes in their own right, and related to the class Product as subclasses. So,
instead of just saying that SKU1234 is a Product, we will say, more
specifically, that it is a Book. The result is shown in Figure 22.
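
A minimal sketch of that construction, again using rdflib, follows. The URIs, the title, and the page count are illustrative placeholders rather than the actual contents of Figures 21 and 22.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.com/bookstore/")   # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Ontology fragment: Book is a more specific type of Product.
g.add((EX.Book, RDFS.subClassOf, EX.Product))

# Data fragment: one row of the product table, as a starburst of triples.
g.add((EX.SKU1234, EX.title, Literal("An Example Book Title")))  # placeholder title
g.add((EX.SKU1234, EX.numberOfPages, Literal(320)))              # placeholder value

# The single "has type" link that joins the data graph to the ontology graph.
# Saying Book (rather than just Product) captures what the format column told us.
g.add((EX.SKU1234, RDF.type, EX.Book))

print(g.serialize(format="turtle"))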
There are a few lessons we can take from this admittedly simplistic
example. First, a row from a table can correspond to an instance of more
than one class; this just means that more than one class in the ontology can
describe it. But more importantly, when we are more specific about the
class that the record is an instance of, we can be more specific about the
data that is represented. In this example, the ontology includes the
knowledge that Books have pages (and hence, numbers of pages), whereas
Videos have runtimes. Armed with this information, we could determine
that SKU2468 (The Modeled Falcon) has an error; it claims to be a video,
but it also has specified a number of pages. Videos don’t have pages, they
have runtimes. The ontology, when combined with the data, can detect data
quality issues.
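
The following hedged sketch shows one way such a check might be written as a query; the namespace, property names, and the page count attached to SKU2468 are illustrative.

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.com/bookstore/> .   # hypothetical namespace
ex:SKU2468 rdf:type ex:Video ;
           ex:title "The Modeled Falcon" ;
           ex:numberOfPages 223 .                # suspicious for a Video
""")

# Knowledge says Videos have runtimes, not pages; flag any Video that
# nevertheless claims a number of pages.
query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.com/bookstore/>
SELECT ?item ?pages WHERE {
  ?item rdf:type ex:Video ;
        ex:numberOfPages ?pages .
}
"""
for row in g.query(query):
    print(f"Data quality issue: {row.item} is a Video but has {row.pages} pages")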
Figure 22. Data and knowledge in one graph. In this case, we have interpreted the format field as
indicating more specifically what type of product the SKU is an instance of. We include SKU2468 as
an instance of Video, as well as SKU1234 as an instance of Book.
In this example, we have started with a table, since it is one of the simplest
and most familiar ways to represent data. This same construction works
equally well for other data formats, such as XML and JSON documents and
relational databases (generalizing from the simple, spreadsheet-like table
shown in this example to include foreign keys and database schemas). If we
have another data source that lists books, or videos, or any of the things we
already see here, we can combine them into an ever-larger graph, based on
the same construction that gave us the graph in Figure 22.
Self-describing data
When we can map our metadata knowledge directly to the data, we can
describe the data in business-friendly terms, but also in a machine-readable
way. The data becomes self-describing as its meaning travels with the data,
in the sense that we can query the knowledge and the data all in one place.
Customer 360
Knowledge graphs play a role in anything 360, really: product 360,
competition 360, supply chain 360, etc. It is quite common in a large
enterprise to have a wide variety of data sources that describe customers.
This might be the result of mergers and acquisitions, or simply because
various data systems date to a time when the business was simpler and do
not cover the full range of customers that the business deals with today. The
same can be said about products, supply chains, and pretty much anything
the business needs to know about.
We have already seen how an explicit representation of knowledge can
provide a catalog of data sources in an enterprise. A data catalog tells us
where we go to find all the information about a customer: you go here to
find demographic information, somewhere else to find purchase history, and
still somewhere else to find profile information about user preferences. The
ontology then provides a sort of road map through the landscape of data
schemas in the enterprise. The ontology allows the organization to know
what it knows, that is, to have an explicit representation of the knowledge in
the business. When we combine that road map with the data itself, as we do
in a knowledge graph, we can extend those capabilities to provide insights
not just about the structure of the data but about the data itself.
A knowledge graph links customer data to provide a complete picture of
their relationship with the business—all accounts, their types, purchase
histories, interaction records, preferences, and anything else. This facility is
often called “customer 360” because it allows the business to view the
customer from every angle.
This is possible with a knowledge graph because the explicit knowledge
harmonizes the metadata, clearing a path for a graph query (in a language
like SPARQL) to retrieve all the data about a particular individual, with an
eye to serving them better.
Right to privacy
A knowledge graph builds on top of the capabilities of a data catalog. As
we discussed earlier, a request to be forgotten as specified by GDPR or
CCPA requires some sort of data catalog, to find where appropriate data
might be kept. Having a catalog that indicates where sensitive data might be
located in your enterprise data is the first step in satisfying such a request,
but to complete the request, we need to examine the data itself. Just because
a database has customer PII does not mean a particular customer’s PII is in
that database; we need to look at the actual instances themselves. This is
where the capabilities of a knowledge graph extend the facility of a simple
data catalog. In addition to an enterprise data catalog, the knowledge graph
includes the data from the original sources as well, allowing it to find which
PII is actually stored in each database.
Sustainable extensibility
The use cases we have explored all have an Achilles’ heel: how do you
build the ontology, and how do you link it to the databases in the
organization? The value of having a knowledge graph that links all the data
in the enterprise should be apparent by now. But how do you get there? A
straightforward strategy whereby you build an ontology, and then
painstakingly map it to all the data sets in the enterprise, is attractive in its
simplicity but isn’t very practical, since the value of having the knowledge
graph doesn’t begin to show up until a significant amount of data has been
mapped. This sort of delayed value makes it difficult to formulate a
business case.
A much more attractive strategy goes by the name sustainable extensibility.
It is an iterative approach, where you begin with a simple ontology and map
a few datasets to it. Apply this small knowledge graph to one of the many
use cases we’ve outlined here, or any others that will bring quick,
demonstrable business value and provide context. Then extend the
knowledge graph along any of several dimensions: refine the ontology to
make distinctions that are useful for organizing the business (as we saw in
Figures 21 and 22); map the ontology to new data sets; or extend the
mapping you already have to old data sets, possibly by enhancing the
ontology. Each of these enhancements should follow some business need.
Maybe a new line of business wants to make use of the data in the
knowledge graph, or perhaps someone with important corporate data wants
to make it available to other parts of the company. At each stage, the
enhancement to the ontology or to the mappings should provide incremental
increase in value to the enterprise. This dynamic is extensible because it
extends the knowledge graph, either through new knowledge or new data. It
is sustainable because the extensions are incremental and can continue
indefinitely.
Data mesh
This is an architecture for managing data as a distributed network of
self-describing data products, where data provisioning is taken as
seriously as any other product the enterprise offers.
Data-centric revolution
Some data management analysts2 see the issues with enterprise data
management we have described here, and conclude that there must be a
significant change in how we think about enterprise data. The change is
so significant that they refer to it as a revolution in data management.
The fundamental observation of the data-centric revolution is that
business applications come and go, but the data of a business retains its
value indefinitely. Emphasizing the role of durable data in an enterprise
is a fundamental change in how we view enterprise data architecture.
FAIR data
Going beyond just the enterprise, the FAIR data movement (findable,
accessible, interoperable, reusable data) outlines practices for data
sharing on a global scale that encourages an interoperable data
landscape.
Let’s take a look at how this new paradigm deals with our simple example
of NAICS codes. A large part of the value of a standardized coding system
like NAICS is the fact that it is a standard; making ad hoc changes to it
damages that value. But clearly, many users of the NAICS codes have
found it useful to extend and modify the codes in various ways. The NAICS
codes have to be flexible in the face of these needs; they have to
simultaneously satisfy the conflicting needs of standardization and
extensibility. Our data landscape needs to be able to satisfy these
apparently contradictory requirements in a consistent way.
The NAICS codes have many applications in an enterprise, which means
that the reusable NAICS data set will play a different role in combination
with other data sets in various settings. A flexible data landscape will need
to express the relationship between NAICS codes and other data sets; is the
code describing a company and its business, or a market, or is it linked to a
product category?
The problems with management of the NAICS codes become evident when
we compare the typical way they are managed with a data-centric view. The
reason why we have so many different representations of NAICS codes is
that each application has a particular use for them, and hence maintains
them in a form that is suitable for that use. An XML-based application will
keep them as a document, a database will embed them in a table, and a
publisher will have them as a spreadsheet for review by the business. Each
application maintains them separately, without any connection between
them. There is no indication about whether these are the same version,
whether one extends the codes, and in what way. In short, the enterprise
does not know what it knows about NAICS codes, and doesn’t know how
they are managed.
If we view NAICS codes as a data product, we expect them to be
maintained and provisioned like any product in the enterprise. They will
have a product description (metadata), which will include information about
versions. The codes will be published in multiple forms (for various uses);
each of these forms will have a service level agreement, appropriate to the
users in the enterprise.
There are many advantages to managing data as a product in this way.
Probably the most obvious is that the enterprise knows what it knows: there
are NAICS codes, and we know what version(s) we have and how they
have been extended. We know that all the versions, regardless of format or
publication, are referring to the same codes. Furthermore, these resources
are related to the external standard, so we get the advantages of adhering to
an industry standard. They are available in multiple formats, and can be
reused by many parts of the enterprise. Additionally, changes and updates to
the codes are done just once, rather than piecemeal across many resources.
The vision of a data fabric, data mesh, or data-centric enterprise is that
every data resource will be treated this way. Our example of NAICS was
intentionally very simple, but the same principles apply to other sorts of
data, both internal and external. A data fabric is made up of an extensible
collection of data products of this sort, with explicit metadata describing
what they are and how they can be used. In the remainder of this report, we
will focus on the data fabric as the vehicle for this distributed data
infrastructure; most of our comments would apply equally well to any of
the other approaches.
References
Darrow, Barb. “So Long Google Search Appliance”. Fortune, February 4,
2016.
Dehghani, Zhamak. “How to Move Beyond a Monolithic Data Lake to a
Distributed Data Mesh”. martinfowler.com, May 20, 2019.
Dehghani, Zhamak. “Data Mesh Principles and Logical Architecture”.
martinfowler.com, December 3, 2020.
Neo4j News. “Gartner Identifies Graph as a Top 10 Data and Analytics
Technology Trend for 2019”. February 18, 2019.
Weinberger, David. “How the Father of the World Wide Web Plans to
Reclaim It from Facebook and Google”. Digital Trends, August 16, 2016.
Wikipedia. “Database design”. Last edited December 20, 2020.
Wikipedia. “Network effect”. Last edited March 3, 2021.
1 This link is often labeled simply as type, but that can be confusing in examples like this, so
we clarify that this is a relationship by calling it has type.
2 Such as Dave McComb in The Data-Centric Revolution (Technics Publications, 2019).
About the Authors
Sean Martin, CTO, and founder of Cambridge Semantics has been on the
leading edge of technology innovation for over two decades and is
recognized as an early pioneer of next-generation enterprise software,
semantic, and graph technologies. Sean’s focus on Enterprise Knowledge
Graphs offers fresh approaches to solving data integration, application
development, and communication problems previously found to be
extremely difficult to address.
Before founding Cambridge Semantics, Sean spent fifteen years with IBM
Corporation, where he was a founder and the technology visionary for the
IBM Advanced Internet Technology Skunkworks group and had an
astonishing number of internet "firsts" to his credit.
Sean has written numerous patents and authored a number of peer-reviewed
Life Sciences journal articles. He is a native of South Africa, has lived for
extended periods in London, England, and Edinburgh, Scotland, but now
makes his home in Boston, MA.
Ben Szekely is chief revenue officer and cofounder of Cambridge
Semantics, Inc. Ben has impacted all sides of the business from developing
the core of the Anzo Platform to leading all customer-facing functions
including sales and customer success. Ben’s passion is working with
partners and customers to identify, design, and execute high-value solutions
based on knowledge graph.
Before joining the founding team at Cambridge Semantics, Ben worked as
an advisory software engineer at IBM with Sean Martin on early research
projects in Semantic Technology. He has a BA in math and computer science
from Cornell University and an SM in computer science from Harvard
University.
Dean Allemang is the founder of Working Ontologist LLC and coauthor of
Semantic Web for the Working Ontologist. He has been a key promoter of
and contributor to semantic technologies since their inception two decades
ago. He has been involved in successful deployments of these technologies
in many industries, including finance, media, and agriculture. Dean’s
approach to technology combines a strong background in formal
mathematics with a desire to make technology concepts accessible to a
broader audience, thereby making them more applicable to business
applications.
Before setting up Working Ontologist, a consulting firm specializing in
industrial deployments of semantic technologies, he worked with
TopQuadrant Inc., one of the first product companies to focus exclusively
on Semantic products. He has a PhD in computer science from The Ohio
State University, and an MSc in Mathematics from the University of
Cambridge.