Large Language Model Based Search Tool Prototype

This document proposes a prototype search tool that uses a large language model to make research at NASA more efficient. Built with the PandasAI and OpenAI libraries in Google Colab, it allows natural-language queries of NASA's NTRS dataset. A bug in PandasAI limited the use of custom prompts, but the tool demonstrates searching for authors and publications.


LARGE LANGUAGE MODEL BASED SEARCH TOOL PROTOTYPE

NASA is a research-based institution; its primary mission is to conduct research and development in aeronautics, space exploration, and related fields. Research is a very time-consuming process, but with the right tools it can be made more efficient. One such tool is a search tool: the ability to find the data you are looking for quickly, using plain English or even roughly formatted queries, reduces time spent and increases productivity. Here I suggest a prototype search tool that uses a large language model to interact with NASA's NTRS dataset.

My solution is based on an OpenAI LLM and a relatively new library called PandasAI. Using these two libraries, along with a few others, I have built a Google Colab based search tool that talks to NASA's NTRS dataset.

The techniques and tools I have used can be incorporated into NASA's existing search website, used as a standalone app, or even deployed as a chatbot on the search site.

I will now describe what the two main libraries do and what did and did not work for me. OpenAI offers an API platform that provides its latest models for building applications. PandasAI is a Python library that integrates generative AI capabilities into Pandas, the popular data analysis and manipulation tool. It is designed to be used in conjunction with Pandas and makes data analysis conversational, allowing users to ask questions of their data in natural language. It can also display dataframes and plot graphs and bar charts.
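A minimal sketch of how such a setup can be wired together, assuming `pandasai` and `openai` are installed and an OpenAI API key is available; the PandasAI API has changed between versions, so names like `SmartDataframe` and the `config` keyword may differ in other releases:

```python
import pandas as pd

def build_search_tool(df: pd.DataFrame, api_key: str):
    """Wrap a pandas DataFrame so it can be queried in plain English."""
    # Imports are deferred because pandasai is an optional dependency here.
    from pandasai import SmartDataframe
    from pandasai.llm import OpenAI

    llm = OpenAI(api_token=api_key)
    return SmartDataframe(df, config={"llm": llm})

# Usage (requires a real API key and network access):
# sdf = build_search_tool(ntrs_df, "sk-...")
# sdf.chat("How many unique authors are in the dataframe?")
```

The `chat` call sends the question plus the dataframe's schema to the model, which generates and runs pandas code to produce the answer.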

However, I found a bug in PandasAI that limited my options for developing a more powerful solution: its custom_prompt is not working. Whenever I tried to pass a function to custom_prompt to build more complex queries, PandasAI gave me a generic solution, because it cannot build complex queries from short sentences like "find who is the lead author in the astronomy field". The workaround I found is to send the prompt directly to the chat function, making the prompt very detailed and including some examples.
The downside of this bug is that I was not able to use embeddings or any vector store; however, to the best of my knowledge the developers are working on the bug, and it may be fixed soon.
I tried to use Gradio to display the dataframe, but PandasAI's SmartDataframe did not seem to work with it; I do not know whether the fault lies with my code or with PandasAI. Streamlit can be used instead, as PandasAI provides middleware for it.
Now, a little about the code and the dataset. I have used the ntrs-public-metadata.json.gz dataset, which mixes different structures across its columns: some are plain strings or simple lists, while others have dictionaries nested inside lists. Out of its 28 columns I have used four to build research-related queries: 'authorAffiliations', 'organization', 'subjectCategories', and 'Stitype'. These address problem statements such as who is the lead author in a particular field, what types of publications an author has published, and what organization he or she belongs to. More columns, such as 'curated' and 'publishing date', can also be added; alternatively, instead of building another dataframe, the methods I have used can be applied to the original dataframe. I used the pandas explode function to expand multiple authors into separate rows, and a custom function to convert subjectCategories into strings.
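The preprocessing described above can be sketched in plain pandas. The column names are taken from the text, but the sample data is illustrative; in the real NTRS dump, 'authorAffiliations' holds nested dictionaries rather than bare name strings:

```python
import pandas as pd

# Toy stand-in for the NTRS metadata: list-valued author and subject columns.
raw = pd.DataFrame({
    "authorAffiliations": [["A. Smith", "B. Jones"], ["C. Lee"]],
    "subjectCategories": [["Astronomy", "Physics"], ["Astronomy"]],
})

# explode: one row per author, so per-author counts are easy to compute.
df = raw.explode("authorAffiliations").reset_index(drop=True)

# Convert the list-valued subjectCategories into a plain string so that
# str.contains-style filters work on the column.
df["subjectCategories"] = df["subjectCategories"].apply(
    lambda cats: ", ".join(cats) if isinstance(cats, list) else str(cats)
)
```

After this, `df` has three rows (one per author), and each row's subjects are a single comma-joined string.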

PROMPTS:
Below are a few example prompts I have used for testing:
1. how many rows are in {x1} and {data2}
2. How many unique authors are in {x1}
3. A complex query that addresses the question of who is the lead author in a particular
field: "use the provided dataframe to find the rows where the subject is
astronomy you can use the following query as an example
x1.query(subjectCategories.str.contains(Astronomy)). save them in another
data frame. you can use the code
x1.query(subjectCategories.str.contains(Physics)) to generate python code.
find the value counts of each author in this new dataframe. the following code
shows an example code
x1[x1[subjectCategories]==Astronomy][authorAffiliations].value_counts().
provide the counts of each author as output"
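The pandas code the detailed prompt asks the model to generate can be written directly, assuming the preprocessed one-author-per-row frame with string-valued subjectCategories described earlier (the frame name `x1` comes from the prompts; the data below is illustrative):

```python
import pandas as pd

# Illustrative preprocessed frame: one author per row, subjects as strings.
x1 = pd.DataFrame({
    "authorAffiliations": ["A. Smith", "B. Jones", "A. Smith", "C. Lee"],
    "subjectCategories": ["Astronomy", "Astronomy", "Astronomy", "Physics"],
})

# Rows where the subject mentions Astronomy.
astro = x1[x1["subjectCategories"].str.contains("Astronomy")]

# Per-author publication counts within that field; the top entry is the
# most prolific ("lead") author in Astronomy.
lead_counts = astro["authorAffiliations"].value_counts()
print(lead_counts.idxmax())
```

Note that quoting matters: the prompt's `str.contains(Astronomy)` only works because the model fills in the quotes when it turns the prompt into real code.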
