How To Create A Knowledge Map From 500k Email Messages Using LLM - by Ben Goosman - Kineviz - Nov, 2024 - Medium
How To Create A Knowledge Map From 500k Email Messages Using LLM - by Ben Goosman - Kineviz - Nov, 2024 - Medium
Open in app
45
Search
Get unlimited access to the best of Medium for less than $1/week. Become a member
In this post, I’ll show how we generated an explainable, interactive knowledge map
from 500,000 publicly available email messages in under 12 hours.
define the entities and relationships of interest (that is, an ontology or schema),
apply the schema to documents in bulk to build a knowledge map in the form of a
connected graph, and
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 1/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
this serves to cluster in space the nodes which are related. In that sense, 3D graph
representation is akin to how we think and move through the physical world–an
embodied way of working with data.
SightXR provides a step-by-step interface to guide the user through the necessary
tasks.
First, we created a knowledge map from the 500k Enron emails overnight using
SightXR. Enron data includes the email messages, and email addresses of the sender
and receivers.
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 2/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
A schema for the entities and relationships extracted for our initial Enron knowledge
map is shown below. In general, entities can be things like Persons, Organizations,
Locations, and Events, and the relationships that connect entities can describe
information like “Alice KNOWS Bob”.
Each row of the source csv has a `filename` and `message`. For each message, we
ask gpt-4o-mini to “Find relationships involving entities of types Person,
Organization, Location, Event in the text provided.” We also use some basic email
parsing code to extract the email addresses involved in the message.
With the resulting knowledge map in place, we added Semantic Search, which
enabled fuzzy search over everything.
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 3/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
Semantic Search
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 4/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
The resulting knowledge map lets us explore the connections in the knowledge map,
and review the topics as well as the actual emails involved.
For the technically adept out there, I’ve listed some of the technologies we used.
In essence, Elixir / Oban enable high worker count and rate limiting by distributing
one LLM request per worker. I chose Elixir because it makes it easier to write highly
concurrent programs with lightweight “processes” (BEAM process). The Oban
framework for Elixir made it easy to queue up hundreds of thousands of jobs to
process each email message, one by one.
Azure OpenAI’s gpt-4o-mini deployments allow 2 million tokens per minute, which
sounds like a lot but can quickly be used up. We used two of those. To avoid rate limit
errors, I configured Oban to allow only 150 in-flight LLM requests at a time. Elixir’s
lightweight processes made it possible to have such a high worker count. On the
other hand, I was running into memory limits with my Python / python-rq /
supervisord implementation, because of more heavyweight processes used by each
Python worker. I couldn’t launch 150 Python workers, because of the memory
consumed by the libs loaded in each worker.
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 5/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
After exploring the initial knowledge map, we also wanted to extract the “chain”, or
sequence of email messages embedded within each individual email file. For
example, a single row of the Enron csv might contain a thread of 4 messages, with
complex reply / forward relationships. Such chains give a better sense of the
dynamics of the communication: who was driving the conversation, who was
included, and at what point in time.
Parsing these deterministically from the Enron csv seemed daunting, so we used
gpt-4o to manage that. The resulting knowledge map makes it possible to
investigate communication over time among key Enron employees.
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 6/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
This extraction process worked reasonably well, although since time zones often
must be inferred, one needs to implement sanity checks for date-time parsing.
Moreover, the last reply of a chain might have the time zone, but the embedded
messages may not.
The language model gets it right most of the time. We found gpt-4o can perform well
for email chain extraction, while gpt-4o-mini fails. LLMs are rapidly evolving, and
each has its own capabilities, so the choice of model will affect results. Again, that’s a
compelling reason to be able to revisit the source information.
Thanks for reading! Visit kineviz.com or email me at ben at kineviz.com for more
information. You can also reach me at our Discord here.
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 7/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
Follow
I’m a senior software engineer at Kineviz, where I help create data workflows and visualizations. I’m really
deep into dance and music production too!
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 8/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 9/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
Re-serving the Plaintiff — How SightXR helps you find banal details in the
Epstein documents in…
We feed 4,728 pages of Epstein documents into SightXR, which uses AI to automatically
generate thousands of observations relating entities…
Feb 3
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 10/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
Sixing Huang
Oct 30 110 1
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 11/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
2d ago 21
Lists
Work 101
26 stories · 186 saves
MODERN MARKETING
195 stories · 919 saves
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 12/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
AI Rabbit in CodeX
Oct 27 779 12
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 13/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium
Oct 23 4.2K 22
Nayan Paul
Root Cause Analysis Use Case with the new O1 Reasoning Model
This blog has below 3 sections :
6d ago 3 1
Ferry Djaja
Oct 31 143 2
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 15/15