0% found this document useful (0 votes)
45 views15 pages

How To Create A Knowledge Map From 500k Email Messages Using LLM - by Ben Goosman - Kineviz - Nov, 2024 - Medium

Uploaded by

陳賢明
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
45 views15 pages

How To Create A Knowledge Map From 500k Email Messages Using LLM - by Ben Goosman - Kineviz - Nov, 2024 - Medium

Uploaded by

陳賢明
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz

| Kineviz | Nov, 2024 | Medium

Open in app

45
Search

Get unlimited access to the best of Medium for less than $1/week. Become a member

How to create a Knowledge Map from 500k


email messages using LLM
Ben Goosman · Follow
Published in Kineviz
5 min read · 4 days ago

Listen Share More

Email correspondence holds key information about what has happened in a


business, and crucially, who was responsible for key decisions. But for most
enterprises, there’s simply too much email. A human cannot read through it all,
extract relevant information, and discover key connections in any reasonable amount
of time. Also, any solution needs to be interactive and explainable: we want to be able
to zoom out and get a sense of everything and also zoom in to individual emails.

In this post, I’ll show how we generated an explainable, interactive knowledge map
from 500,000 publicly available email messages in under 12 hours.

We used Kineviz SightXR, which lets a user

define the entities and relationships of interest (that is, an ontology or schema),

apply the schema to documents in bulk to build a knowledge map in the form of a
connected graph, and

immediately build the knowledge map for exploration.

A knowledge map makes it possible to connect a large amount of related information


and navigate through it efficiently. It helps us focus on one thing at a time, because
information is encoded into nodes and edges which have a sense of space. Nodes
have a size and position, and edges represent the connections between nodes, and

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 1/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

this serves to cluster in space the nodes which are related. In that sense, 3D graph
representation is akin to how we think and move through the physical world–an
embodied way of working with data.

SightXR knowledge mapping captures specific entities of interest and their


relationships, yet preserves visibility into source documents. That’s important,
because while Large Language Models (LLMs) can be used to extract information
efficiently, we also have to be able to explain and validate the results.

Building the Knowledge Map


The Enron corpus is a useful starting point for bulk analysis of text, since it contains
over 500k email messages generated by 158 corporation employees over a period of
several years. It’s open to public use, and is conveniently in CSV format.

SightXR provides a step-by-step interface to guide the user through the necessary
tasks.

First, we created a knowledge map from the 500k Enron emails overnight using
SightXR. Enron data includes the email messages, and email addresses of the sender
and receivers.

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 2/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

A schema for the entities and relationships extracted for our initial Enron knowledge
map is shown below. In general, entities can be things like Persons, Organizations,
Locations, and Events, and the relationships that connect entities can describe
information like “Alice KNOWS Bob”.

Each row of the source csv has a `filename` and `message`. For each message, we
ask gpt-4o-mini to “Find relationships involving entities of types Person,
Organization, Location, Event in the text provided.” We also use some basic email
parsing code to extract the email addresses involved in the message.

With the resulting knowledge map in place, we added Semantic Search, which
enabled fuzzy search over everything.

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 3/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Semantic Search

Finally, we implemented a Classification workflow, to filter emails by criteria like


“discusses the company’s future strategy”.

Classifying emails if they match “discusses the company’s future strategy”

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 4/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

The resulting knowledge map lets us explore the connections in the knowledge map,
and review the topics as well as the actual emails involved.

For the technically adept out there, I’ve listed some of the technologies we used.

Basic architecture of the email ingestion pipeline

In essence, Elixir / Oban enable high worker count and rate limiting by distributing
one LLM request per worker. I chose Elixir because it makes it easier to write highly
concurrent programs with lightweight “processes” (BEAM process). The Oban
framework for Elixir made it easy to queue up hundreds of thousands of jobs to
process each email message, one by one.

Azure OpenAI’s gpt-4o-mini deployments allow 2 million tokens per minute, which
sounds like a lot but can quickly be used up. We used two of those. To avoid rate limit
errors, I configured Oban to allow only 150 in-flight LLM requests at a time. Elixir’s
lightweight processes made it possible to have such a high worker count. On the
other hand, I was running into memory limits with my Python / python-rq /
supervisord implementation, because of more heavyweight processes used by each
Python worker. I couldn’t launch 150 Python workers, because of the memory
consumed by the libs loaded in each worker.

Parsing Email Chains

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 5/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

After exploring the initial knowledge map, we also wanted to extract the “chain”, or
sequence of email messages embedded within each individual email file. For
example, a single row of the Enron csv might contain a thread of 4 messages, with
complex reply / forward relationships. Such chains give a better sense of the
dynamics of the communication: who was driving the conversation, who was
included, and at what point in time.

Advanced: parsing email chains

Parsing these deterministically from the Enron csv seemed daunting, so we used
gpt-4o to manage that. The resulting knowledge map makes it possible to
investigate communication over time among key Enron employees.

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 6/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

This extraction process worked reasonably well, although since time zones often
must be inferred, one needs to implement sanity checks for date-time parsing.
Moreover, the last reply of a chain might have the time zone, but the embedded
messages may not.

The language model gets it right most of the time. We found gpt-4o can perform well
for email chain extraction, while gpt-4o-mini fails. LLMs are rapidly evolving, and
each has its own capabilities, so the choice of model will affect results. Again, that’s a
compelling reason to be able to revisit the source information.

Thanks for reading! Visit kineviz.com or email me at ben at kineviz.com for more
information. You can also reach me at our Discord here.

Knowledge Graph Elixir Etl Llm Visualization

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 7/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Follow

Written by Ben Goosman


97 Followers · Editor for Kineviz

I’m a senior software engineer at Kineviz, where I help create data workflows and visualizations. I’m really
deep into dance and music production too!

More from Ben Goosman and Kineviz

Ben Goosman in Kineviz

Visualizing Jira in GraphXR


Being a fully remote, globally distributed company, I have limited visibility into my colleagues’
day-to-day activities. What are they…

Aug 31, 2022 40

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 8/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Alex Law in Kineviz

No-code graph authoring for Linkedin network analysis


Beginners guide to GraphXR for market research

Dec 22, 2022 16

Weidong Yang in Kineviz

Mapping JSON to graph with GraphXR


JSON is a very popular and versatile data format. Its nested tree structure is essentially a graph.
In fact JSON is often used to store…

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 9/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Aug 17, 2022 69

Ben Goosman in Kineviz

Re-serving the Plaintiff — How SightXR helps you find banal details in the
Epstein documents in…
We feed 4,728 pages of Epstein documents into SightXR, which uses AI to automatically
generate thousands of observations relating entities…

Feb 3

See all from Ben Goosman

See all from Kineviz

Recommended from Medium

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 10/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Sixing Huang

Tailor a Multi-Model Chatbot for a Multi-Model DuckDB


A Human-in-the-loop chatbot capable of SQL, graph, vector, and full-text queries

Oct 30 110 1

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 11/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Makbule Gulcin Ozsoy in Neo4j Developer Blog

Introducing the Neo4j Text2Cypher (2024) Dataset


Authors: Makbule Gulcin Ozsoy, Leila Messallem, Jon Besga

2d ago 21

Lists

data science and AI


40 stories · 279 saves

Work 101
26 stories · 186 saves

MODERN MARKETING
195 stories · 919 saves

Natural Language Processing


1801 stories · 1414 saves

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 12/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

AI Rabbit in CodeX

Has Anthropic Claude just wiped out an entire industry?


If you have been following the news, you may have read about a new feature (or should I call it a
product) in the Claude API — it is…

Oct 27 779 12

Abdur Rahman in Stackademic

Python is No More The King of Data Science


5 Reasons Why Python is Losing Its Crown

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 13/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

Oct 23 4.2K 22

Nayan Paul

Root Cause Analysis Use Case with the new O1 Reasoning Model
This blog has below 3 sections :

6d ago 3 1

Ferry Djaja

Build an Intelligent Document Processing with Confidence Scores with


GPT-4o
https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 14/15
2024/11/10 晚上11:41 How to create a Knowledge Map from 500k email messages using LLM | by Ben Goosman | Kineviz | Nov, 2024 | Medium

An Intelligent Document Processing (IDP) provides actionable insights through confidence


scores, allowing you to evaluate process…

Oct 31 143 2

See more recommendations

https://medium.com/kineviz/how-to-create-a-knowledge-map-from-500k-email-messages-using-llm-8b194fc875c9 15/15

You might also like