The Six Principles of AI-Ready Data
Executive summary
• This document outlines six principles for ensuring that data is ready for use with
artificial intelligence (AI).
• These principles are as follows: data has to be diverse, timely, accurate, secure,
discoverable, and easily consumable by machines.
• The document also describes the AI Trust Score, which helps assess how well
your data adheres to these principles.
Introduction
Artificial intelligence (AI) is expected to greatly improve industries like
healthcare, manufacturing, and customer service, leading to higher-quality
experiences for customers and employees alike. Indeed, AI technologies
like machine learning (ML) have already helped data practitioners produce
mathematical predictions, generate insights, and improve decision-making.
Furthermore, emerging AI technologies like generative AI (GenAI) can create
strikingly realistic content that has the potential to enhance productivity
in virtually every aspect of business.
The six principles for AI-ready data
It would be foolish to believe that you could just throw data at various AI initiatives and expect
magic to happen, but that’s what many practitioners do. While this approach might seem to
work for the first few AI projects, data scientists end up spending more and more time correcting and preparing the data as projects mature.
Additionally, data used for AI has to be high-quality and precisely prepared for these intelligent
applications. This means spending many hours manually cleaning and enhancing the data
to ensure accuracy and completeness, and organizing it in a way that machines can easily
understand. Also, this data often requires extra information — like definitions and labels — to
enrich semantic meaning for automated learning and to help AI perform tasks more effectively.
Therefore, the sooner data can be prepared for downstream AI processes, the greater the
benefit. Using prepped, AI-ready data is like giving a chef pre-washed and chopped vegetables
instead of a whole sack of groceries — it saves effort and time and helps ensure that the final
dish is promptly delivered. The diagram below defines six critical principles for ensuring the
“readiness” of data and its suitability for AI use.
Figure: The six principles of AI-ready data. Diverse, Timely, Accurate, Secure, Discoverable, and Consumable data surrounds the ML/LLM at the center.
Diverse

Our first principle focuses on providing a wide variety of data to AI models. Greater data diversity reduces bias, making AI applications less likely to make unfair decisions.
Diverse data means you don’t build your AI models on narrow and siloed datasets. Instead,
you draw from a wide range of data sources spanning different patterns, perspectives,
variations, and scenarios relevant to the problem domain. This data could be well-structured
and live in the cloud or on-premises. It could also exist on a mainframe, database, SAP system,
or software as a service (SaaS) application. Alternatively, the source data could be unstructured
and live as files or documents on a corporate drive.
It’s essential to acquire diverse data in various forms for integration into your ML and
GenAI applications.
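As a rough illustration of this principle, the short Python sketch below pulls data from three different kinds of sources into one training view; the file names, database, table, and join key are all hypothetical.

```python
import json
import sqlite3

import pandas as pd

# Hypothetical sources: a CSV export, an operational database table,
# and a JSON dump exported from a SaaS application.
orders = pd.read_csv("exports/orders.csv")                       # structured file

with sqlite3.connect("operational.db") as conn:                  # on-prem database
    customers = pd.read_sql_query("SELECT * FROM customers", conn)

with open("saas/tickets.json") as f:                             # SaaS export
    tickets = pd.json_normalize(json.load(f))

# Combine the silos so the model sees patterns, perspectives, and
# scenarios from more than one system.
combined = (
    orders
    .merge(customers, on="customer_id", how="left")
    .merge(tickets, on="customer_id", how="left")
)
print(combined.shape)
```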
Timely

To ensure timely data, build and deploy low-latency, real-time data pipelines for your AI initiatives. Change data capture (CDC) is often used to deliver timely data from relational database systems, while stream capture handles data originating from IoT devices that require low-latency processing. Once the data is captured, target repositories are updated and changes are continuously applied in near-real time, keeping the data as fresh as possible.
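For the apply step, a minimal sketch (not any particular CDC tool) might poll a hypothetical change_log table and upsert each change into the target so it stays fresh:

```python
import sqlite3

def apply_changes(conn: sqlite3.Connection, last_seen: str) -> str:
    """Apply captured changes to the target table in commit order.

    Assumes a hypothetical change_log table with (op, id, name, updated_at);
    real CDC tools read the source database's transaction log instead.
    """
    rows = conn.execute(
        "SELECT op, id, name, updated_at FROM change_log "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for op, row_id, name, updated_at in rows:
        if op == "D":                                   # delete captured at the source
            conn.execute("DELETE FROM customers_target WHERE id = ?", (row_id,))
        else:                                           # insert/update becomes an upsert
            conn.execute(
                "INSERT INTO customers_target (id, name) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
                (row_id, name),
            )
        last_seen = updated_at
    conn.commit()
    return last_seen

# In production this would run continuously (or be event-driven) so the
# target repository is updated in near-real time:
#     while True:
#         last_seen = apply_changes(conn, last_seen)
```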
Accurate

Data accuracy has three aspects. The first is profiling source data to understand its characteristics, completeness, distribution, redundancy, and shape. Profiling is also commonly known as exploratory data analysis, or EDA.
The final aspect is enabling data lineage and impact analysis — with tools for data engineers
and scientists that highlight the impact of potential data changes and trace the origin of data
to prevent accidental modification of the data used by AI models.
High-quality, accurate data ensures that models can identify relevant patterns and
relationships, leading to more precise decisions, generation, and predictions.
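As a minimal sketch of that first aspect, a few lines of pandas can surface completeness, distribution, and redundancy for a hypothetical customers.csv file:

```python
import pandas as pd

df = pd.read_csv("customers.csv")            # hypothetical source extract

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),          # column characteristics
    "missing_pct": df.isna().mean() * 100,   # completeness
    "unique_values": df.nunique(),           # cardinality / redundancy
})
print(profile)

print(df.describe())                         # distribution of numeric columns
print("duplicate rows:", df.duplicated().sum())
```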
Secure

Three tactics can help you automate data security at scale, since it's nearly impossible to do manually. Data classification detects, categorizes, and labels data, feeding the next stage. Data protection defines policies like masking, tokenization, and encryption to obfuscate the data. Finally, data security defines access control policies that describe who can access the data. The three concepts work together as follows: first, privacy tiers are defined and data is tagged with a security designation of sensitive, confidential, or restricted. Next, a protection policy is applied to mask restricted data. Finally, an access control policy limits access rights.
These three tactics protect your data and are crucial for improving the overall trust in your
AI system and safeguarding its reputational value.
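A toy sketch of the three tactics working together, with hypothetical column tags, a hash-based masking rule, and a two-role access policy, might look like this:

```python
import hashlib

# 1. Classification: tag columns with a privacy tier (hypothetical tags).
classification = {
    "customer_id": "confidential",
    "email": "restricted",
    "ssn": "restricted",
    "region": "sensitive",
}

# 2. Protection: a masking policy that obfuscates restricted values.
def mask(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def protect(row: dict) -> dict:
    return {
        col: mask(str(val)) if classification.get(col) == "restricted" else val
        for col, val in row.items()
    }

# 3. Access control: roles mapped to the tiers they may read.
access_policy = {
    "analyst": {"sensitive", "confidential"},
    "admin": {"sensitive", "confidential", "restricted"},
}

def readable_columns(role: str) -> list:
    allowed = access_policy.get(role, set())
    return [col for col, tier in classification.items() if tier in allowed]

row = {"customer_id": "C-1001", "email": "ana@example.com", "ssn": "123-45-6789", "region": "EMEA"}
print(protect(row))                  # restricted fields are masked
print(readable_columns("analyst"))   # an analyst cannot read restricted columns
```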
Discoverable

Unsurprisingly, good metadata practices lie at the center of discoverability. Aside from the
technical metadata associated with AI datasets, business metadata and semantic typing
must also be defined. Semantic typing provides extra meaning for automated systems, while
additional business terms deliver extra context to aid human understanding. A best practice
is to create a business glossary that maps business terms to technical items in the datasets,
ensuring a common understanding of concepts. AI-assisted augmentation can also be used
to automatically generate documentation and add business semantics from the glossary.
Finally, all the metadata is indexed and made searchable via a data catalog.
Figure: Three metadata practices. Detect and understand data meaning to provide more context; describe data in business terms to provide clarity, consistency, and productivity; index and organize metadata to make data findable and usable.
This approach ensures that the data is directly discoverable, applicable, practical, and
significant to the AI task at hand.
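To make the glossary idea concrete, here is a small, hypothetical sketch that maps business terms to technical columns and builds a searchable index, loosely standing in for a data catalog:

```python
# Hypothetical glossary: business terms mapped to technical dataset columns.
glossary = {
    "Customer Lifetime Value": {
        "dataset": "sales.orders_agg", "column": "clv_usd",
        "definition": "Projected revenue over the whole customer relationship.",
    },
    "Churn Flag": {
        "dataset": "crm.customers", "column": "is_churned",
        "definition": "True if the customer cancelled in the last 12 months.",
    },
}

# Tiny "catalog index": every word in the term and definition points back
# to the glossary entry so people and tools can search by business language.
index = {}
for term, meta in glossary.items():
    for word in (term + " " + meta["definition"]).lower().split():
        index.setdefault(word, set()).add(term)

def search(query: str) -> set:
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(search("projected revenue"))   # -> {'Customer Lifetime Value'}
```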
Consumable

Training data formats depend heavily on the underlying ML infrastructure.
Traditional ML systems are disk-based, and much of the data scientist workflow focuses on
establishing best practices and manual coding procedures for handling large volumes of files.
More recently, lakehouse-based ML systems have used a database-like feature store, and the
data scientist workflow has transitioned to SQL as a first-class language. As a result, well-formed,
high-quality, tabular data structures are the most consumable and convenient data format
for ML systems.
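As a small sketch of that SQL-first, tabular pattern, the example below uses an in-memory SQLite table purely as a stand-in for a lakehouse feature store and produces one well-formed feature row per entity:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")      # stand-in for a governed lakehouse table
conn.executescript("""
    CREATE TABLE orders (customer_id TEXT, amount REAL, is_returned INTEGER);
    INSERT INTO orders VALUES ('C1', 120.0, 0), ('C1', 35.5, 1), ('C2', 80.0, 0);
""")

# SQL as a first-class language: aggregate raw events into a tidy,
# tabular training set with one row per customer.
training_df = pd.read_sql_query("""
    SELECT customer_id,
           COUNT(*)         AS order_count,
           SUM(amount)      AS total_spend,
           AVG(is_returned) AS return_rate
    FROM orders
    GROUP BY customer_id
""", conn)

print(training_df)
```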
Figure: A governed, secure pipeline ingests structured and unstructured sources (for example JSON and PDF) and produces trusted training and prediction data created with pushdown SQL.
A common method for integrating your corporate data into an LLM-based application is retrieval-augmented generation (RAG). The technique generally uses text information derived from
unstructured, file-based sources such as presentations, mail archives, text documents, PDFs,
transcripts, etc. The text is then split into manageable chunks and converted into a numerical
representation used by the LLM in a process known as embedding. These embeddings are then
stored in a vector database such as Chroma, Pinecone, or Weaviate. Interestingly, many traditional databases, such as PostgreSQL, Redis, and SingleStoreDB, also support vectors.
Moreover, cloud platforms like Databricks, Snowflake, and Google BigQuery have recently
added support for vectors, too.
Figure: A RAG pipeline. Structured and unstructured documents (for example JSON and PDF) are copied to the cloud, split into chunks, embedded, and stored in a vector store (for example Snowflake, Databricks, Postgres, Elasticsearch*, Redis*, Pinecone*, or Neo4j*); retrieved context is then passed to the large language model (LLM), which answers the user's question in a chatbot.
Whether your source data is structured or unstructured, Qlik’s approach ensures that
quality data is readily consumable for your GenAI, RAG, or LLM-based applications.
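As a bare-bones sketch of the chunk, embed, store, and retrieve steps, the code below uses a hash-based embed_text placeholder in place of a real embedding model and a plain Python list in place of a vector database:

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Placeholder embedding: a hashed bag-of-words so the sketch is self-contained.
    In practice this would call an embedding model or API."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Split source documents into manageable chunks.
document = ("Qlik Talend prepares AI-ready data. Retrieval-augmented generation "
            "finds the most relevant chunks and passes them to the LLM as context.")
chunks = [document[i:i + 80] for i in range(0, len(document), 80)]

# 2-3. Create embeddings and store them (a list stands in for a vector database).
vector_store = [(chunk, embed_text(chunk)) for chunk in chunks]

# 4. Retrieve the chunks most similar to the question; an LLM would then
#    generate an answer from this retrieved context.
def retrieve(question: str, k: int = 2) -> list:
    q = embed_text(question)
    ranked = sorted(vector_store, key=lambda item: float(item[1] @ q), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("What does retrieval-augmented generation do?"))
```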
The AI Trust Score

The AI Trust Score assigns a separate dimension to each principle and then aggregates the dimension values into a composite score, a quick, reliable shortcut for assessing your data's AI readiness. Additionally, because enterprise data changes continually, the trust score is regularly checked and frequently recalibrated to track data readiness trends.
Figure 9. The AI Readiness Trust Score. An example composite with Diverse 75%, Timely 80%, Accurate 75%, Secure 25%, Discoverable 85%, and LLM-Ready 30%.
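The exact aggregation isn't described here; as a hedged sketch, a simple weighted average over the example dimension scores in Figure 9 would be computed like this:

```python
# Example dimension scores from Figure 9 (percentages).
scores = {
    "Diverse": 75, "Timely": 80, "Accurate": 75,
    "Secure": 25, "Discoverable": 85, "LLM-Ready": 30,
}

# Hypothetical equal weights; real weights would reflect business priorities.
weights = {dim: 1.0 for dim in scores}

composite = sum(scores[d] * weights[d] for d in scores) / sum(weights.values())
print(f"AI Trust Score: {composite:.1f}%")   # 61.7% with equal weights
```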
Conclusion

So, whether you're creating warehouses or lakes for insightful analytics, modernizing operational
data infrastructures for business efficiency, or using multi-cloud data for artificial intelligence
initiatives, Qlik Talend can show you the way.
Figure 10. Qlik Talend Enterprise Data Fabric for AI & Analytics
About Qlik
Qlik transforms complex data landscapes into actionable insights, driving strategic business outcomes. Serving
over 40,000 global customers, our portfolio leverages advanced, enterprise-grade AI/ML and pervasive data
quality. We excel in data integration and governance, offering comprehensive solutions that work with diverse data
sources. Intuitive and real-time analytics from Qlik uncover hidden patterns, empowering teams to address complex
challenges and seize new opportunities. Our AI/ML tools, both practical and scalable, lead to better decisions, faster.
As strategic partners, our platform-agnostic technology and expertise make our customers more competitive.
qlik.com
© 2024 QlikTech International AB. All rights reserved. All company and/or product names may be trade names, trademarks and/or registered trademarks of the respective owners with which they are associated.
For the full list of Qlik trademarks please visit: https://www.qlik.com/us/legal/trademarks
24-DQ-0001-01-WHM