Compliments of
Getting Your Data Ready for AI
Governing Principles for Fast Self-Service Data Preparation
Kate Shoup
This work is part of a collaboration between O’Reilly and IBM. See our statement of
editorial independence.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Getting Your Data
Ready for AI, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-04239-6
Getting Your Data Ready for AI
Abstract
This report briefly discusses an aspect of artificial intelligence (AI) and data science that is critical but rarely addressed: data preparation. The report first provides a brief overview of the disciplines of AI and data science. It then outlines a typical AI workflow, with a focus on the phases involved in data wrangling, before defining the challenges associated with those phases. Finally, it presents various solutions to these challenges, with special emphasis on IBM Watson Studio. Along the way, you will gain insights into the various types of data used for AI, the different flavors of AI (including machine learning), and possible long-term ramifications of the growing use of these technologies.
machine learning, in which machines become capable of learning without input from humans—few organizations have implemented this technology in any significant way. According to the report, “only about one in five companies has incorporated AI in some offerings or processes,” and “only one in 20 companies has extensively incorporated AI in offerings or processes.” Moreover, “less than 39% of all companies have an AI strategy in place.”
Impediments to Growth
One impediment to the growth of AI is the onerous process associated with developing AI models—a job performed by data scientists, who are precious resources indeed. Particularly burdensome are the phases of this process that involve accessing, labeling, and transforming data—also known as data wrangling or ETL, which is short for extract, transform, and load. Indeed, it is often reported that data scientists, whose job is to develop and deploy AI models, spend as much as 80% of their time on these phases. Perhaps worse (at least from the point of view of the data scientist), according to IBM Watson machine learning product manager Armand Ruiz, 57% of data scientists complain that data wrangling is the most tedious—and therefore their least favorite—part of their job.

The resulting bottleneck not only prevents data scientists from focusing on the parts of their job that they like and that yield real business value, but also slows the adoption and implementation of AI and the realization of its many tangible benefits, including accelerated research and discovery, enriched customer interactions, reduced operational costs, increased efficiency, and higher revenue.
1. Davenport, Thomas H., and D. J. Patil. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90 (2012): 70–76.
A Typical AI Workflow
When it comes to the workflow used by data scientists for AI
projects, “there is some difficulty in describing ‘typical,’” says Jarmul.
“The data that you’re working with and the algorithms or models
that you are using can substantially affect what you need to do to get
your data prepared.” Complicating matters, there are various flavors
of AI systems, including supervised learning systems, unsupervised
learning systems, and reinforcement learning systems. Regardless,
most AI projects typically involve the following steps:
Data scientists generally perform the first four steps of this process,
whereas DevOps professionals typically handle steps 5 and 6.
Types of data
There are three main types of data:
Structured data
Structured data is data that can be (or already has been) easily organized into a spreadsheet or relational database. For example, data in a sales record is structured data.
Unstructured data
Unstructured data is data that does not fit easily into a spreadsheet or relational database. Audio files, video files, and PDF files are examples of unstructured data.
Semi-structured data
This type of data is essentially a hybrid of structured and unstructured data. In other words, it’s unstructured data that has structured data attached to it in the form of metadata. Examples of semi-structured data include comma-separated values (CSV) files and Twitter messages.
These distinctions are important, says Castrounis, “because they’re
directly related to the type of database technologies and storage
required, the software and methods by which the data is queried and
processed, and the complexity of dealing with the data.”
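To make these categories concrete, here is a minimal Python sketch, with hypothetical file names and fields, of how each type of data might be loaded:

    import json
    import pandas as pd

    # Structured: a sales record fits naturally into rows and columns.
    sales = pd.read_csv("sales_records.csv")              # hypothetical file

    # Semi-structured: a tweet-like JSON object pairs free text
    # (unstructured) with structured metadata fields.
    with open("tweet.json") as f:                         # hypothetical file
        tweet = json.load(f)
    text, created_at = tweet["text"], tweet["created_at"]

    # Unstructured: an audio file is raw bytes, with no inherent rows,
    # columns, or fields to query.
    with open("call_recording.wav", "rb") as f:           # hypothetical file
        audio_bytes = f.read()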
In addition to these various types of data, there are different data formats—many of which might be incompatible. For example, data harvested from web pages might not play nicely with data pulled from mobile devices, which in turn might clash with data culled from an on-premises database, and so on.
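As an illustration of such a disparity, the sketch below (column names and values are hypothetical) reconciles web and mobile event records that disagree on field names and timestamp formats:

    import pandas as pd

    # Hypothetical: the same kind of event captured by two systems that
    # disagree on column names and timestamp formats.
    web = pd.DataFrame({"user": ["a1"], "ts": ["2018-06-01T12:00:00Z"]})
    mobile = pd.DataFrame({"user_id": ["a1"], "timestamp": [1527854400]})

    # Harmonize names and timestamps before combining the two sources.
    web = web.rename(columns={"ts": "timestamp"})
    web["timestamp"] = pd.to_datetime(web["timestamp"], utc=True)
    mobile = mobile.rename(columns={"user_id": "user"})
    mobile["timestamp"] = pd.to_datetime(mobile["timestamp"], unit="s", utc=True)

    combined = pd.concat([web, mobile], ignore_index=True)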
Data provenance
Before you can use any data that you deem relevant in your AI
project, you must trace its provenance—where it came from, how it
was collected, who collected it, and under what conditions. Data
provenance, says Katharine Jarmul, “is incredibly important,
because it’s going to inform your data science team how they should
treat the data and what they can use from it.” It might be the case that, due to privacy concerns, some data is off limits. “You have to
figure out from a legal standpoint and a utilitarian standpoint what
data you can use for the problem you want to solve,” says Jarmul.
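One lightweight way to act on this advice is to attach an explicit provenance record to every dataset. The sketch below is illustrative only; the fields and values are hypothetical, not a prescribed schema:

    from dataclasses import dataclass

    # Illustrative only: a minimal provenance record a team might attach
    # to a dataset before anyone builds on it. Fields are hypothetical.
    @dataclass
    class Provenance:
        source: str          # where the data came from
        collected_by: str    # who collected it
        method: str          # how it was collected
        conditions: str      # consent, license, or other constraints

    clickstream = Provenance(
        source="web analytics pipeline",
        collected_by="marketing engineering",
        method="server-side event logging",
        conditions="users consented to analytics; no PII retained",
    )

    # A simple gate: refuse to use data whose provenance doesn't
    # document consent (the legal standpoint Jarmul describes).
    if "consented" not in clickstream.conditions:
        raise ValueError("provenance does not document user consent")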
Labeling Data
“Any attempt to manage and organize information,” observe IBM’s Jay Limburn and Paul Taylor in a 2017 blog post, “depends on two things: data and metadata.” They explain: “The data is the information itself, while the metadata describes the information’s attributes, such as what structure it is stored in, where it is stored, how to find it, who created it, where it came from, and what it can be used for.” Applying this metadata—in other words, labeling the data—is a critical step in the data-prep workflow. You might also need to “look over the data and mark a particular thing that you’re trying to study,” explains Jarmul—often called the target variable.
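As a simple illustration (the dataset, metadata fields, and column names are all hypothetical), here is how a team might record that metadata and mark the target variable in Python:

    import pandas as pd

    # Hypothetical churn dataset: metadata describes the information's
    # attributes; "churned" is marked as the target variable.
    df = pd.DataFrame({
        "tenure_months": [3, 28, 11],
        "monthly_spend": [49.0, 20.0, 75.5],
        "churned": [1, 0, 1],
    })

    metadata = {
        "structure": "tabular (pandas DataFrame)",
        "source": "billing database",          # hypothetical
        "created_by": "data engineering team",
        "usable_for": "churn prediction",
    }

    target = "churned"                  # the thing you're trying to study
    X = df.drop(columns=[target])       # features
    y = df[target]                      # labels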
Transforming Data
After you access and label data, the data usually goes through a series of transformations. These transformations might include removing noise, standardizing the data, and so on, depending on what type of model you want to build. Standardizing data essentially means fixing any disparities in the data. Often, you’ll be working with data from different types of sources—meaning it might not match up. It’s up to you to figure out how to accommodate these data disparities and to bring the data into some sort of cohesive model.
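As an illustration of both steps, fixing a disparity and then standardizing, consider this sketch of hypothetical temperature readings logged in mixed units:

    import pandas as pd

    # Hypothetical readings from two sources: one logs temperature in
    # Celsius, the other in Fahrenheit. First, fix the disparity.
    df = pd.DataFrame({"temp": [21.0, 70.0, 23.5], "unit": ["C", "F", "C"]})
    f_rows = df["unit"] == "F"
    df.loc[f_rows, "temp"] = (df.loc[f_rows, "temp"] - 32) * 5 / 9
    df["unit"] = "C"

    # Then standardize: rescale to mean 0 and standard deviation 1, a
    # common transformation before training many kinds of models.
    df["temp_std"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()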
Solutions
One way to ease the data-wrangling bottleneck is to try to address it
up front. Katharine Jarmul champions this approach. “Suppose you
have an application,” she explains, “and you’ve decided that you
want to use activity on your application to figure out how to build a
useful predictive model later on to predict what the user wants to do
next. If you already know you’re going to collect this data, and you
already know what you might use it for, you could work with your
developers to figure out how you can create transformations as you
ingest the data.” Jarmul calls this prescriptive data science, which
stands in contrast to the much more common approach: reactionary
data science.
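What might that look like in practice? Here is a minimal sketch of ingest-time preparation; the event fields and normalization rules are hypothetical stand-ins, not Jarmul’s specific method:

    from datetime import datetime, timezone

    # Sketch of prescriptive data prep: because the team already knows
    # these events will feed a predictive model, each one is normalized
    # at ingest rather than months later. All names are hypothetical.
    def ingest_event(raw: dict, store: list) -> None:
        event = {
            "user": raw["user_id"].strip().lower(),  # normalize IDs up front
            "action": raw.get("action", "unknown"),  # guarantee the field exists
            "ts": datetime.now(timezone.utc).isoformat(),
        }
        store.append(event)  # stand-in for a real database or queue

    events = []
    ingest_event({"user_id": "  Alice42 ", "action": "add_to_cart"}, events)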
Maybe it’s too late in the game for that. In that case, there are any number of data catalogs to help data scientists access and prepare data. A data catalog centralizes information about available data in one location, enabling users to access it in a self-service manner. “A good data catalog,” writes analytics expert Jen Underwood in a 2017 blog post, “serves as a searchable business glossary of data sources and common data definitions gathered from automated data discovery, classification, and cross-data source entity mapping.” According to a 2017 article by Gartner, “demand for data catalogs is soaring as organizations struggle to inventory distributed data assets to facilitate data monetization and conform to regulations.” Examples of data catalogs include the following:
Discover data
Watson Knowledge Catalog facilitates the discovery and ingestion of data by enabling users to search for the data that they need in a single, centralized portal (see Figure 1-2). A recommendation engine connects users with relevant data the same way Netflix “unlocks” new TV shows based on other shows you’ve watched, explains Limburn. (Ruiz calls this “Spotify for data.”) When you find data that you want to use, you click the “Add to Catalog” button to add it to your project dataset—like using the “Add to Cart” button you see on ecommerce sites like Amazon.
Classify data
When you add a data asset to a project in Watson Knowledge Catalog, it is automatically indexed and classified, essentially automating the “labeling data” step in the data science workflow (see Figure 1-3). In addition, say Limburn and Taylor, “Users can add tags and comments to explain what information each dataset contains, and why it is useful.” They can also rate datasets using a star system.
Govern data
Watson Knowledge Catalog is “underpinned by an intelligent
and robust governance framework that ensures its users comply
with corporate data governance policies,” writes IBM’s Susanna
Tai in a 2017 blog post. This framework allows for the secure
sharing of data assets across the organization through the use of
well-defined access control policies.
Next, you want to filter the dataset to show open records only.
Again, you can do this by way of the Operation menu or by coding
it yourself. In this case, let’s go the second route.
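Data Refinery has its own syntax for coded operations, which isn’t reproduced here. Purely to illustrate the logic of this filtering step, here is the equivalent in pandas, assuming a hypothetical status column:

    import pandas as pd

    # Illustration only: not Data Refinery's own operation syntax.
    # Assuming a hypothetical "status" column, keep open records only.
    df = pd.read_csv("violations.csv")  # hypothetical dataset export
    open_records = df[df["status"] == "Open"]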
After you enter all the operations you want to apply to your dataset,
click Run to execute the operations in order. You can choose to
write the results of this action back to your database or save it in the
format of your choice. As Data Refinery cleans and shapes your
data, you can continue to monitor and analyze its progress in the
Control Panel. (You can also schedule new or existing runs here.)
Alternatively, you can continue working in Watson Studio.
Conclusion
If what the experts say is true—that AI represents the next wave of digital disruption; that its impact will rival that of earlier general-purpose technologies like the steam engine, electricity, and the internal combustion engine; that, in the words of Google CEO Sundar Pichai, it will be “more important than humanity’s mastery of fire or electricity”—it follows that organizations that effectively employ AI will enjoy a critical advantage over organizations that don’t. And yet, at present, relatively few organizations do this—in part because of problems posed by wrangling data.
That’s where self-service data science tools like Watson Studio come in. These tools help to eliminate the bottleneck associated with data wrangling. This not only frees data scientists—who are in limited supply—to focus on the parts of their jobs that bring more value, but might also hasten the widespread adoption of AI. When this happens, expect early adopters to enjoy a significant advantage over firms that lag behind.