How To Create A Private ChatGPT With Your Own Data
With the rise of Large Language Models (LLMs) like ChatGPT and GPT-4, many organizations
are asking whether they can train a private ChatGPT on their corporate data. But is this
feasible, and can these language models actually offer such capabilities?
In this article, I will discuss the architecture and data requirements needed to create
“your private ChatGPT” that leverages your own data. We will explore the advantages of
this technology and how you can overcome its current limitations.
Disclaimer: this article provides an overview of architectural concepts that are not specific to
Azure but are illustrated using Azure services since I am a Solution Architect at Microsoft.
Fine-tuning the model on your own data might seem like the obvious route, but it comes with significant drawbacks:

- Factual correctness and traceability: where does the answer come from?
- Costs: new documents require retraining of the model, and you pay for model hosting.

This makes it extremely hard, close to impossible, to use fine-tuning for the
purpose of Question Answering (QA). How can we overcome these limitations and still
benefit from these LLMs?
A brain (represents knowledge) and AI (computer) separated from each other — (Image by DALL·E 2)
It might seem like a good idea to feed all documents to the model at run-time, but
this isn’t feasible due to the context limit (measured in tokens) that can be processed
at once. For example, GPT-3 supports up to 4K tokens, and GPT-4 up to 8K or 32K tokens.
Since pricing is per 1000 tokens, using fewer tokens can help to save costs as well.
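As a quick sanity check, you can count tokens up front. Below is a minimal sketch using OpenAI's tiktoken library; the file name is just an example:

import tiktoken

# Tokenizer matching the target model.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens(text: str) -> int:
    """Number of tokens the model will consume for this text."""
    return len(encoding.encode(text))

with open("employee_handbook.txt") as f:
    print(num_tokens(f.read()))  # typically far more than a 4K context window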
Instead, separate your knowledge base from the language model and only retrieve the relevant pieces of content at question time:

1. User asks a question
2. Application finds the most relevant text that (most likely) contains the answer
3. A concise prompt with the question and the relevant document text is sent to the LLM
4. User receives an answer, or a predefined ‘no answer’ response
(Image by author)
This approach is often referred to as grounding the model. The application provides
additional context to the language model so it can answer the question based on
relevant sources.
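In pseudocode, this grounding flow could look like the sketch below; search_index, build_prompt and llm are placeholders for your own components, not a specific API:

# Hypothetical grounding flow; `search_index`, `build_prompt` and `llm`
# are placeholders for your own components, not a specific API.
def answer_question(question: str) -> str:
    # 1. Find the document chunks most likely to contain the answer.
    sources = search_index.query(question, top=3)
    # 2. Build a concise prompt that contains only those sources as context.
    prompt = build_prompt(question, sources)
    # 3. Let the language model answer, grounded in that context.
    return llm.complete(prompt)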
Now that you understand the high-level architecture required to build such a
scenario, it is time to dive into the technicalities.
3. Retrieve the most relevant data
Context is key. To ensure the language model has the right information to work with,
we need to build a knowledge base that can be used to find the most relevant
documents through semantic search. This will enable us to provide the language model
with the right context, allowing it to generate the right answer.
We can start by simply splitting the document per page, or by using a text splitter that
splits on a set token length. When we have our documents in a more accessible format,
it is time to create a search index that can be queried by providing it with a user
question.
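As an illustration, a simple token-based splitter could look like this sketch (again using tiktoken; the chunk size of 512 tokens is an arbitrary choice):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def split_by_tokens(text: str, chunk_size: int = 512) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size tokens."""
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]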
In addition to these chunks, you should add metadata to your index. Store the
original source and page number so you can link an answer back to the original
document, and store any additional metadata that can be used for access control and filtering.
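For example, each chunk could be stored as a document along these lines; the field names are illustrative, not a required schema:

# Illustrative index document; field names are an assumption, not a fixed schema.
chunk_document = {
    "id": "handbook-p12-c03",
    "content": "In-network deductibles are $500 for employee ...",
    "source": "employee_handbook.pdf",    # link the answer back to the original file
    "page": 12,                           # for citations in the response
    "department": "HR",                   # usable for filtering
    "allowed_groups": ["all-employees"],  # usable for access control
}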
An embedding is a vector (list) of floating point numbers. The distance between two vectors
measures their relatedness. Small distances suggest high relatedness and large distances
suggest low relatedness. [1]
If you want to leverage the latest semantic models and have more control over your
search index, you can use the text embedding models from OpenAI. For every chunk,
you will need to precompute an embedding and store it.
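A minimal sketch with the openai Python package (the v0.x API current at the time of writing); for Azure OpenAI you would configure the api_type, api_base and a deployment name instead:

import openai  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    """Turn a piece of text into an embedding vector."""
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response["data"][0]["embedding"]

# Precompute one embedding per chunk and store it alongside the chunk;
# chunk_documents is a list of index documents like the one sketched earlier.
for doc in chunk_documents:
    doc["embedding"] = embed(doc["content"])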
On Azure you can store these embeddings in a managed vector store like Azure
Cache for Redis (RediSearch), or in a vector database like the open-source Weaviate or
the managed Pinecone service. At application run-time, you first turn the user question into an
embedding, so you can compare the cosine similarity between the question embedding and
the document embeddings generated earlier.
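A sketch of this ranking step with plain numpy, reusing the hypothetical embed function and chunk documents from the earlier sketches:

import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the question, then rank the stored chunks by similarity.
question_embedding = embed("What is the deductible for the employee plan?")
ranked = sorted(
    chunk_documents,
    key=lambda doc: cosine_similarity(question_embedding, doc["embedding"]),
    reverse=True,
)
top_sources = ranked[:3]  # the most relevant chunks to put in the prompt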
(Image by author)
A few strategies can increase the chance that the right information ends up in a chunk:

- Use a sliding window; chunking per page or per token can have the unwanted effect of losing context. Use a sliding window so chunks share overlapping content, increasing the chance that the most relevant information is captured in a single chunk (see the sketch after this list).
- Provide more context; a very structured document with sections that nest multiple levels deep (e.g. section 1.3.3.7) can benefit from extra context such as the chapter and section titles. You can parse these sections and add this context to every chunk.
- Summarize; create chunks that contain a summary of a larger document section. This captures the most essential text and brings it all together in one chunk.
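A sliding-window variant of the earlier splitter sketch could look like this; the 64-token overlap is an arbitrary choice:

def split_with_overlap(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Token-based splitter where consecutive chunks share `overlap` tokens,
    so information at a chunk boundary still appears intact in one chunk."""
    tokens = encoding.encode(text)  # `encoding` from the earlier tiktoken sketch
    step = chunk_size - overlap
    return [
        encoding.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), step)
    ]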
4. Design your prompt

Designing your prompt is how you “program” the model, usually by providing some
instructions or a few examples. [2]
In your prompt, make clear that the model should be concise and only use data
from the provided context. When it cannot answer the question, it should give a
predefined ‘no answer’ response. The output should include a footnote (citation)
referencing the original document, so the user can verify its factual accuracy by
checking the source.
"You are an intelligent assistant helping Contoso Inc employees with their healthc
"Use 'you' to refer to the individual asking the questions even if they ask with '
"Answer the following question using only the data provided in the sources below.
"For tabular information return it as an html table. Do not return markdown format
"Each source has a name followed by colon and the actual information, always inclu
"If you cannot answer using the sources below, say you don't know. " + \
"""
###
Question: 'What is the deductible for the employee plan for a visit to Overlake in
Sources:
info1.txt: deductibles depend on whether you are in-network or out-of-network. In-
info2.pdf: Overlake is in-network for the employee plan.
info3.pdf: Overlake is the name of the area that includes a park and ride near Bel
info4.pdf: In-network institutions include Overlake, Swedish and others in the reg
Answer:
In-network deductibles are $500 for employee and $1000 for family [info1.txt] and
###
Question: '{q}'?
Sources:
{retrieved}
Answer:
"""
Don’t forget to set a low temperature via your parameters if you want a more repetitive
and deterministic response. Increasing the temperature will result in more unexpected
or creative responses.
This prompt is eventually used to generate a response via the (Azure) OpenAI API. If
you use the gpt-35-turbo model (ChatGPT) you can pass the conversation history in
every turn to be able to ask clarifying questions or use other reasoning tasks (e.g.
summarization). A great resource to learn more about prompt engineering is dair-
ai/Prompt-Engineering-Guide on GitHub.
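Putting these pieces together, the final call could look like the sketch below (again the v0.x openai API; system_prompt, conversation_history and user_message are placeholders, and on Azure OpenAI you would pass engine="<deployment-name>" instead of model):

# Minimal sketch of the final chat completion call (openai v0.x API).
messages = [{"role": "system", "content": system_prompt}]
messages += conversation_history  # earlier turns enable clarifying questions
messages.append({"role": "user", "content": user_message})  # question + retrieved sources

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.1,  # low temperature -> more deterministic, repeatable answers
    max_tokens=500,
)
answer = response["choices"][0]["message"]["content"]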
5. Next steps
In this article, I discussed the architecture and design patterns needed to build such an
implementation, without delving into the specifics of the code. These patterns are
widely used today, and the following projects and notebooks can serve as
inspiration to help you start building such a solution.
The ChatGPT Retrieval Plugin lets ChatGPT access up-to-date information. For now, this
only supports the public ChatGPT, but hopefully the capability to add plugins will
come to the ChatGPT API (OpenAI + Azure) in the future.
Eventually, you can look into extending ‘your own ChatGPT’ by linking it to more
systems and capabilities via tools like LangChain or Semantic Kernel. The possibilities
are endless.
Conclusion
In conclusion, relying solely on a language model to generate factual text is a mistake.
Fine-tuning a model won’t help either, as it won’t give the model any new knowledge
and doesn’t provide you with a way to verify its responses. To build a Q&A engine on top
of an LLM, separate your knowledge base from the large language model, and only
generate answers based on the provided context.
(Image by author)
If you enjoyed this article, feel free to connect with me on LinkedIn, GitHub or Twitter.
References
[1] Embeddings — OpenAI API. March 2023, https://platform.openai.com/docs/guides/embeddings
[2] Completions — OpenAI API. March 2023, https://platform.openai.com/docs/guides/completion
[3] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P.,
Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., Zhang, Y. “Sparks of
Artificial General Intelligence: Early experiments with GPT-4” (2023), arXiv:2303.12712
[4] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L.,
Cancedda, N., Scialom, T. “Toolformer: Language Models Can Teach Themselves to Use
Tools” (2023), arXiv:2302.04761
[5] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R.,
Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., Scialom, T.
“Augmented Language Models: a Survey” (2023),
arXiv:2302.07842