How To Build Your Own Custom ChatGPT Bot With Custom Knowledge Base
Step-by-step guide on how to feed your ChatGPT bot with custom data sources
ChatGPT has become an integral tool that many people use daily to automate various tasks. If you have used ChatGPT for any length of time, you have probably realized that it can provide wrong answers and has limited to zero context on some niche topics. This raises the question of how we can tap into ChatGPT to bridge the gap and give it more custom data. Wouldn't it be nice if you could selectively choose your data sources and feed that information into ChatGPT, conversing with your data with ease?
The most obvious way is to paste the relevant content into the prompt as context every time you ask a question.
The issue with this approach is that the model has a limited context window; it can only accept approximately 4,097 tokens for GPT-3. You will soon run into a wall here, as it's also quite a manual, tedious process to always have to paste in the content.
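To see how quickly pasted content eats into that budget, you can count tokens before sending a prompt. Here is a minimal sketch using the tiktoken library (the library choice and file name are illustrative, not part of the original setup):
import tiktoken

# Encoding used by the GPT-3 davinci family; the model accepts ~4,097 tokens
# shared between the prompt and its completion.
enc = tiktoken.encoding_for_model("text-davinci-003")

pasted_context = open("my_notes.txt").read()  # hypothetical data source
print(f"{len(enc.encode(pasted_context))} tokens of the ~4,097 budget used")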
LlamaIndex connects your existing data sources and data types with available data connectors (for example, APIs, PDFs, docs, SQL, etc.). It enables you to employ LLMs by offering indexes over your structured and unstructured data. These indices facilitate in-context learning by removing typical boilerplate and pain points: they preserve context in an accessible form for quick insertion.
Dealing with prompt restrictions (a 4,097-token limit for GPT-3 Davinci and an 8,192-token limit for GPT-4) when the context is too large becomes much more manageable. LlamaIndex also tackles the text-splitting issue by giving users a way to interact with the index, and it abstracts the process of extracting relevant parts from the documents and feeding them into the prompt.
Prerequisites
Before we start, make sure you have access to the following:
An OpenAI API key, which can be found on the OpenAI website. You can use your Gmail account for single sign-on.
How It Works
1. Create a document data index with LlamaIndex.
2. Ask your question in natural language.
3. LlamaIndex retrieves the pertinent chunks from the index and adds them to the prompt.
4. After that, you may ask ChatGPT, given the fed-in context.
Create a new folder for your Python project, which you can call mychatbot, preferably with a virtual environment or conda environment, and install the required dependencies.
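For example, assuming Python 3 and the PyPI package names llama-index and langchain (plus the Google client libraries used later in this guide):
python -m venv mychatbot
source mychatbot/bin/activate
pip install llama-index langchain google-auth-oauthlib google-api-python-client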
Next, we'll import the libraries in Python and set up your OpenAI API key in
a new main.py file.
# Import paths assume an early llama_index release; later versions
# reorganized several of these modules.
import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from llama_index import GPTSimpleVectorIndex, GoogleDocsReader

os.environ['OPENAI_API_KEY'] = 'SET-YOUR-OPEN-AI-API-KEY'
In the above snippet, we are explicitly setting the environment variable for clarity, as the LlamaIndex package implicitly requires access to OpenAI. In a typical production environment, you can put your keys in environment variables, a vault, or whatever secrets management service your infra can access.
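For instance, a minimal sketch of reading the key from the environment instead of hardcoding it (continuing the imports above; the error message is illustrative):
# Fail fast if the key is missing rather than at the first OpenAI call
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is not set")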
def authorize_gdocs():
    google_oauth2_scopes = [
        "https://www.googleapis.com/auth/documents.readonly"
    ]
    cred = None
    # Reuse previously cached credentials if they exist
    if os.path.exists("token.pickle"):
        with open("token.pickle", 'rb') as token:
            cred = pickle.load(token)
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            # Silently refresh an expired token
            cred.refresh(Request())
        else:
            # First run: launch the browser-based OAuth consent flow
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", google_oauth2_scopes)
            cred = flow.run_local_server(port=0)
        # Cache the credentials for subsequent runs
        with open("token.pickle", 'wb') as token:
            pickle.dump(cred, token)
To enable the Google Docs API and fetch the credentials in the Google Console, you can follow these steps:
1. Go to the Google Cloud Console and sign in with your Google account.
2. Create a new project if you haven't already. You can do this by clicking on
the "Select a project" dropdown menu in the top navigation bar and
selecting "New Project." Follow the prompts to give your project a name
and select the organization you want to associate it with.
3. Once your project is created, please select it from the dropdown menu in
the top navigation bar.
4. Go to the "APIs & Services" section from the left-hand menu and click on
the "+ ENABLE APIS AND SERVICES" button at the top of the page.
5. Search for "Google Docs API" in the search bar and select it from the
results list.
6. Click the "Enable" button to enable the API for your project.
7. Click on the "OAuth consent screen" menu, create the consent screen, and give your app a name, e.g., "mychatbot." Then enter the support email, save, and add scopes.
You must also add test users since this Google app will not be approved yet.
This can be your own email.
You will then need to set up credentials for your project to use the API. To do
this, go to the "Credentials" section from the left-hand menu and click
"Create Credentials." Select "OAuth client ID" and follow the prompts to set
up your credentials.
Once your credentials are set up, you can download the JSON file and store it
in the root of your application, as illustrated below:
Once you have set up your credentials, you can access the Google Docs API
from your Python project.
Go to your Google Docs, open up a few of them, and copy the unique document ID visible in your browser's URL bar; it's the long string after /document/d/ in a URL of the form https://docs.google.com/document/d/<gdoc-id>/edit.
Copy out the gdoc IDs and paste them into your code below. You can index any number of gdocs so that ChatGPT has context access to your custom knowledge base. We will use the GoogleDocsReader plugin from the LlamaIndex library to load your documents.
loader = GoogleDocsReader()
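Continuing from the loader above, a minimal sketch of fetching the documents and building the index, assuming the early llama_index API and a placeholder document ID:
authorize_gdocs()  # run the OAuth flow defined earlier

# Hypothetical ID; replace with the IDs copied from your own docs
gdoc_ids = ['your-gdoc-id-here']

documents = loader.load_data(document_ids=gdoc_ids)

# Build a simple vector index over the document chunks
index = GPTSimpleVectorIndex(documents)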
If you wish to save and load the index on the fly, you can use the following function calls. This speeds things up, since subsequent runs fetch from a pre-saved index instead of making API calls to external sources.
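A sketch of those calls, assuming the early GPTSimpleVectorIndex API and an arbitrary file name:
# Persist the index to disk so later runs skip re-fetching and re-embedding
index.save_to_disk('index.json')

# Reload the saved index from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')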
Querying the index and getting a response can be achieved by running the code below. The code can easily be extended into a REST API that connects to a UI where you can interact with your custom data sources via the GPT interface.
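A minimal interactive loop, again assuming the early llama_index API (the prompt text mirrors the console output shown below):
# Simple REPL: the index retrieves relevant chunks for each prompt and
# forwards them to the model along with the question
while True:
    prompt = input("Type prompt...")
    response = index.query(prompt)
    print(response)
    print(f"last_token_usage={index.llm_predictor.last_token_usage}")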
Suppose we have a Google Doc with details about me, information that's readily available if you search for me publicly on Google. We will first interact with vanilla ChatGPT to see what output it generates without injecting a custom data source.
INFO:google_auth_oauthlib.flow:"GET /?state=oz9XY8CE3LaLLsTxIz4sDgrHha4fEJ&code=4/0AWtgzh4LlIfmCMEa0t36dse_xoS0fXFeEWKHFiouzTvz4Qwr7T2Pj6anb-GiZ__Wg
INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 175 tokens
Type prompt...who is timothy mugayi hint he is a writer on medium
Type prompt...Given you know who timothy mugayi is write an interesting introduction about him
Timothy Mugayi is an experienced and accomplished professional with a wealth of knowledge in engineering, coding, and mentoring. He is currently an
last_token_usage=330
It can now infer answers using a new custom data source, accurately
producing the following output.
Type prompt...Write a cover letter for timothy mugayi for an upwork python project to build a custom ChatGPT bot with access to external data source
INFO:root:> [query] Total LLM token usage: 436 tokens
INFO:root:> [query] Total embedding token usage: 30 tokens
I am writing to apply for the Python project to build a custom ChatGPT bot with access to external data sources. With over 15 years of experience in
I am currently an Engineering Manager at OVO (PT Visionet Internasional), a subsidiary of GRAB. I have extensive experience in Python and have been
I am confident that I can deliver a high-quality product that meets the requirements of the project. I am also available to discuss the project furt
Sincerely,
Timothy Mugayi
last_token_usage=436
Type prompt...
LlamaIndex will internally accept your prompt, search the index for pertinent chunks, and then pass both your prompt and those chunks to the ChatGPT model. The procedures above demonstrate a fundamental first use of LlamaIndex and GPT for answering questions. Yet, there is much more you can do. You are only limited by your creativity: you can configure LlamaIndex to use a different large language model (LLM), use a different type of index for different tasks, or update old indices with a new index programmatically.
Here is an example of changing the LLM model explicitly. This time, we tap into another Python package that comes bundled with LlamaIndex, called langchain.
...
index = GPTSimpleVectorIndex(
documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
)
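The elided portion above defines the predictor and prompt helper. A plausible reconstruction, assuming the early llama_index and langchain APIs (the model name and chunking parameters are illustrative):
from langchain import OpenAI
from llama_index import LLMPredictor, PromptHelper

# Wrap a specific OpenAI model via LangChain
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# Describe the prompt constraints so LlamaIndex can split text accordingly
max_input_size = 4097   # model context window
num_output = 256        # tokens reserved for the answer
max_chunk_overlap = 20  # overlap between consecutive chunks
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)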
If you want to keep tabs on your OpenAI free or paid credits, you can
navigate to the OpenAI dashboard and check how much credit is left.
Creating an index, inserting into an index, and querying an index will use
tokens. Hence, it's always important to ensure you output token usage for
tracking purposes when building your custom bots.
last_token_usage = index.llm_predictor.last_token_usage
print(f"last_token_usage={last_token_usage}")
Final Thoughts
ChatGPT combined with LlamaIndex can help you build a customized ChatGPT chatbot that can infer knowledge from its own document sources. While ChatGPT and other LLMs are pretty powerful, extending the LLM provides a much more refined experience and opens up the possibility of building conversational-style chatbots for real business use cases like customer support assistance or even spam classifiers. Given that we can feed in real-time data, we can also mitigate one limitation of ChatGPT models: that they are trained only up to a certain period.
For the complete source code, you can refer to this GitHub repo.
If you are looking to build custom ChatGPT bots that understand your
domain, drop a message in the comments section and let's connect.