Unlocking AI Advancement:
The Superiority of Lakehouse
Architectures for Hosting
Enterprise Knowledge Graphs
Weimo Liu | CEO at PuppyGraph
Weimo Liu
PuppyGraph / CEO
Dr. Weimo Liu is the CEO and co-founder of PuppyGraph.
He was previously a software engineer on Google's F1 team
and a research scientist at TigerGraph. In both roles,
he specialized in advancing query languages and engines.
Dr. Liu earned his PhD from GWU and his BS from
Fudan University. He also serves as a program committee
member and reviewer for conferences and journals
such as TKDE, KDD, and SIGSPATIAL.
Confidential | 2
Agenda
1. The Challenges of Adopting LLMs in Enterprises
2. Why Graph RAG Can Help
3. Example with GPT Model & IMDB Data
4. How To Build a Knowledge Graph For LLMs (using IMDB Data &
OpenAI’s GPT Model)
5. Advantages of Hosting Knowledge Graphs On Lakehouse
The Challenges of Adopting LLMs in Enterprises
● It is not easy to query your private data with OpenAI’s GPT models:
○ GPT models are not trained on the private data
○ Enterprises also don’t want to share any private data with OpenAI
● ChatGPT can hallucinate when answering data-oriented questions:
○ It provides wrong answers
○ It gives a long block of text that doesn’t really answer the question
ChatGPT Hallucination
Examples with The
IMDB Data
ChatGPT Hallucination Examples with IMDB Data
Without Graph RAG With Graph RAG
Why Graph RAG?
What is Graph RAG?
Graph RAG = RAG x Knowledge Graph
● Graph RAG builds on the concept of RAG by
leveraging knowledge graphs (KGs).
● Graph RAG allows integration of the
structured data from KGs into the LLM’s
processing, providing a more nuanced and
informed basis for the model’s responses.
A simple knowledge graph example
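The Graph RAG idea above can be sketched in a few lines of Python. This is only an illustration, not PuppyGraph's implementation: the triples, the name-matching retrieval in `retrieve_facts`, and the prompt template in `build_prompt` are all hypothetical, and the final call to an LLM is left out.

```python
# Minimal Graph RAG sketch: retrieve facts from a tiny knowledge graph,
# then put them in the prompt so the LLM answers from data, not from memory.
# All names and data below are hypothetical illustrations.

# Knowledge graph stored as (subject, relation, object) triples.
TRIPLES = [
    ("Alice", "acted_in", "Movie A"),
    ("Alice", "acted_in", "Movie B"),
    ("Bob", "directed", "Movie A"),
]


def retrieve_facts(question, triples):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question, triples):
    """Combine the retrieved graph facts with the user question."""
    facts = retrieve_facts(question, triples)
    fact_lines = "\n".join(
        f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts
    )
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{fact_lines}\n"
        f"Question: {question}"
    )


prompt = build_prompt("Which movies did Alice act in?", TRIPLES)
print(prompt)  # this string would be sent to the LLM
```

In a real deployment the retrieval step would be a graph query against the knowledge graph rather than a substring match over a Python list.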
GTC March 2024 Keynote with NVIDIA CEO Jensen Huang
DeepLearning.ai Full Course with Andrew Ng
How to Build Your Own
AI Chatbot With Private
Data & LLM
Tech Stack
● LLM: OpenAI’s GPT-3.5
● Dataset: IMDB
● Data Storage: Apache Iceberg
● Knowledge Graph: PuppyGraph
IMDB Dataset Table Schema

title_basics
  column             type
  tconst             string
  titleType          string
  primaryTitle       string
  originalTitle      string
  isAdult            boolean
  startYear          Int64
  endYear            Int64
  runtimeMinutes     Int64
  genres             string

title_principals
  column             type
  id                 string
  tconst             string
  ordering           Int64
  nconst             string
  category           string
  job                string
  characters         string

name_basics
  column             type
  nconst             string
  primaryName        string
  birthYear          Int64
  deathYear          Int64
  primaryProfession  string
  knownForTitles     string
IMDB Dataset Graph Schema
The IMDB knowledge graph has two types of
vertices:
1. `person`. This type of vertex can be directors,
producers or actors/actresses. It has 3 major
attributes: `primaryName`, `birthYear` and
`deathYear`.
2. `title`. This type of vertex can be movies, TV
episodes or TV movies, etc. It has 3 major
attributes: `titleType`, `primaryTitle` and
`startYear`.
The IMDB knowledge graph has one type of edge:
`cast_and_crew`. Note that this edge is directed,
pointing from `title` vertices to `person` vertices.
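As a rough sketch, the graph schema above can be derived from rows shaped like the three IMDB tables. The sample rows are made up, and using `title_principals` as the source of `cast_and_crew` edges (with `tconst` and `nconst` as endpoints) is an assumption inferred from the column names, not something the slides state.

```python
# Sketch: derive `title` vertices, `person` vertices and `cast_and_crew`
# edges from table rows. Sample rows are hypothetical; column names follow
# the IMDB table schema.

title_basics = [
    {"tconst": "tt1", "titleType": "movie",
     "primaryTitle": "Movie A", "startYear": 2001},
]
name_basics = [
    {"nconst": "nm1", "primaryName": "Alice",
     "birthYear": 1970, "deathYear": None},
]
title_principals = [
    {"id": "1", "tconst": "tt1", "nconst": "nm1", "category": "actress"},
    {"id": "2", "tconst": "tt1", "nconst": "nm404", "category": "actor"},
]

# `title` vertices keyed by tconst, keeping the 3 major attributes.
titles = {
    r["tconst"]: {k: r[k] for k in ("titleType", "primaryTitle", "startYear")}
    for r in title_basics
}

# `person` vertices keyed by nconst, keeping the 3 major attributes.
persons = {
    r["nconst"]: {k: r[k] for k in ("primaryName", "birthYear", "deathYear")}
    for r in name_basics
}

# `cast_and_crew` edges, directed from title to person.
# Assumption: rows with an unknown endpoint are skipped.
cast_and_crew = [
    (r["tconst"], r["nconst"])
    for r in title_principals
    if r["tconst"] in titles and r["nconst"] in persons
]

print(cast_and_crew)
```

Nothing is copied into a separate store here: the vertices and edges are just views over the existing rows, which is the idea behind hosting the graph directly on the lakehouse.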
Knowledge Graph
The Queries Running On Backend
Graph Queries on Knowledge Graph
1. g.V().has('person', 'primaryName', 'Tom Hanks').in().has('titleType', 'movie').order().by('startYear').range(8, 9).values('primaryTitle')
2. g.V().has('person', 'primaryName', 'Jackie Chan').in().out().groupCount().by('primaryName').order(local).by(values, desc).next()
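To make the two traversals easier to read, here is a plain-Python sketch of the same logic over a toy in-memory graph. The data and the helper names (`titles_of`, `nth_movie`, `collaborator_counts`) are hypothetical; a real deployment would send the Gremlin queries to the graph engine instead. Note that `range(8, 9)` is zero-based, so query 1 returns the ninth movie by `startYear`, and that `in().out()` in query 2 also counts the starting person.

```python
from collections import Counter

# Hypothetical toy data; cast_and_crew edges point from title to person.
titles = {
    "tt1": {"titleType": "movie", "primaryTitle": "First", "startYear": 1990},
    "tt2": {"titleType": "movie", "primaryTitle": "Second", "startYear": 1995},
    "tt3": {"titleType": "tvSeries", "primaryTitle": "Show", "startYear": 1992},
}
persons = {"nm1": "Alice", "nm2": "Bob", "nm3": "Carol"}
edges = [
    ("tt1", "nm1"), ("tt1", "nm2"),
    ("tt2", "nm1"), ("tt2", "nm2"),
    ("tt3", "nm1"), ("tt3", "nm3"),
]


def titles_of(name):
    """in(): follow edges backwards from matching persons to their titles."""
    ids = {nconst for nconst, n in persons.items() if n == name}
    return [t for t, p in edges if p in ids]


def nth_movie(name, n):
    """Query 1 pattern: movies ordered by startYear, then range(n, n + 1)."""
    movies = [titles[t] for t in titles_of(name)
              if titles[t]["titleType"] == "movie"]
    movies.sort(key=lambda m: m["startYear"])
    return [m["primaryTitle"] for m in movies[n:n + 1]]


def collaborator_counts(name):
    """Query 2 pattern: in().out(), groupCount by name, sorted descending."""
    mine = titles_of(name)
    counts = Counter(persons[p] for t in mine for t2, p in edges if t2 == t)
    return counts.most_common()


print(nth_movie("Alice", 1))
print(collaborator_counts("Alice"))
```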
Why Host Knowledge Graphs
On Lakehouse vs. On A
Graph Database
Knowledge Graphs On Lakehouse
● Large Scale
● No Data Duplication
● Integrated with the Existing Data Pipeline
● Total Data Control
Large Scale
1. Real-world knowledge graphs are huge, such as the
Google Knowledge Graph and the Wikipedia knowledge
graph.
2. Enterprise knowledge graphs are large as well, and
can easily reach billions of edges.
3. A small knowledge graph cannot provide enough
information, so its ROI is relatively low.
4. It’s hard for graph databases to handle large amounts of
data. The complexity and cost can be very high.
No Data Duplication
1. The data in your lakehouse is often already a
knowledge graph by nature.
a. Think about consolidated customer info for an
e-commerce site (customer, product, credit card,
purchase history, etc.): they are all connected!
2. When the data size is big, ETL is hard, complex, and
expensive.
3. Maintaining another CDC (change data capture) pipeline
can be very costly.
Integrated With The Existing Data Pipeline
1. The data is not used only by the LLM. The existing data
pipeline keeps working as before, and there is no additional
system complexity.
2. You can also run graph data analytics for use cases such as
anti-fraud, cybersecurity, and e-commerce.
3. The connections within your data can also be
visualized for insight.
Total Data Control
1. Eliminate the manual work and hassle of graph database
setup and permissions. Reusing your existing data lakehouse
permissions simplifies data management and saves your
engineering resources for more impactful tasks.
2. Your data stays exclusively within your controlled
environment, so you retain full security and control over it.
LLM vendors such as OpenAI cannot access your data.
Summary
● It is not necessary to duplicate the data to build knowledge graphs. Your data in the lakehouse is already a knowledge graph.
● The data is not used only by the LLM, and the existing data pipeline keeps working as before. You can also run graph data analytics for use cases such as anti-fraud, cybersecurity, and e-commerce.
● A larger knowledge graph has more value, but it is hard for a graph database to handle a large knowledge graph.
● You can query your private data. Data governance is already set up in the lakehouse, so you don’t need to do it again.
Thanks!
[email protected]