Assignment For Applied AI Engineer (RAG Pipeline) Role


About the company:

We are building a stealth venture studio focused on the application layer of the AI revolution.
Later this year, we will go public in the media about what we are building. You can see a draft
video of our vision here.

We are currently building multiple projects at the forefront of AI:


- OpenMic AI - SaaS for Conversational AI
- SNR Audio - Foundational AI model for Text-to-Speech & Speech-to-Text
- PoolCompute - GPU Marketplace
- Stealth Computer Vision Idea - Coming soon!
- Stealth SaaS for Outbound sales - Coming Soon!

About the founders:

Kaushik Tiwari (Columbia University ‘17): The 2nd Thiel Fellow from India after Ritesh
Aggarwal, Kaushik is a second-time founder with a previous exit at the intersection of
healthcare and fintech. His work has been featured in Axios, Forbes, American Banker, and
others, and he has experience both as an exited founder and as a venture investor.

Saumik Tiwari (Trinity College ‘20): He is a Y Combinator-backed founder with a previous exit
in fintech. He is an active venture investor; you can read about some of his top investments,
such as Andromeda Surgical and Electric Air, among others.
Assignment for Applied AI Engineer (RAG Pipeline) Role

Assignment: Unstructured Data Handling and Retrieval-Augmented Generation (RAG)
Pipeline Implementation

Assignment Overview:

In this assignment, you will work with a dataset of unstructured text documents. Your task is
to process this data, store it in a vector database, and implement a Retrieval-Augmented
Generation (RAG) pipeline that generates meaningful responses to user queries. The goal is
to demonstrate the end-to-end flow of handling unstructured data: embedding it, storing it
efficiently, and retrieving relevant information for text generation.

Here is the dataset.

Problem Statement:
You are provided with a collection of unstructured documents (e.g., resumes). Your task is to:

1. Ingest and preprocess the unstructured data by applying advanced chunking
techniques based on semantic similarity or topic modeling. Avoid naive chunking
methods.
2. Generate embeddings from the preprocessed data using open-source models such as
SentenceTransformers.
3. Store the embeddings in a vector database (e.g., Milvus, FAISS, Pinecone, or Chroma)
using indexing methods such as FLAT or IVF to enable efficient similarity and
semantic search.
4. Implement a Retrieval-Augmented Generation (RAG) pipeline that retrieves relevant
chunks from the database in response to a user query and generates a coherent
response using a language model.
5. Develop a REST API (using Flask or FastAPI) to expose this pipeline, allowing users to
input queries and receive generated responses based on the retrieved information.

Note: Using query expansion to enhance retrieval, together with hybrid retrieval methods
that combine BM25 (Best Match 25) with BERT/bi-encoder retrievers such as DPR or Spider,
would be a bonus.
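The hybrid retrieval idea in the note above can be sketched without any external libraries. The snippet below implements a toy BM25 scorer and a bag-of-words cosine similarity as a stand-in for a real bi-encoder (in practice you would use SentenceTransformers or DPR embeddings), then fuses the two score lists with min-max normalization and a weighted sum. All names here (`hybrid_search`, `alpha`, the toy corpus) are illustrative, not part of the assignment.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Minimal BM25 over pre-tokenized documents (sparse/lexical signal)."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                       # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bow_embed(tokens, vocab):
    # Stand-in for a real bi-encoder such as SentenceTransformers.
    return [tokens.count(w) for w in vocab]

def hybrid_search(query, docs, alpha=0.5):
    """Fuse BM25 and dense scores; returns doc indices, best first."""
    q = query.lower().split()
    dt = [d.lower().split() for d in docs]
    vocab = sorted({w for d in dt for w in d} | set(q))
    sparse = bm25_scores(q, dt)
    qv = bow_embed(q, vocab)
    dense = [cosine(qv, bow_embed(d, vocab)) for d in dt]

    def norm(xs):                        # min-max normalize to [0, 1]
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    fused = [alpha * s + (1 - alpha) * e for s, e in zip(norm(sparse), norm(dense))]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

A common starting point is `alpha = 0.5`; tuning the sparse/dense weight on held-out queries fits naturally under the optimization bonus.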

Assignment Task:
1. Unstructured Data Preprocessing:
● Load the unstructured documents (e.g., resumes) from the corpus.
● Clean and preprocess the data, removing noise, handling missing data, and
tokenizing the text.
● Segment the data into meaningful chunks based on advanced NLP techniques
like semantic similarity, topic modeling, or NER to ensure context relevance.
2. Embedding Generation:
● Select an appropriate pre-trained language model (e.g.,
SentenceTransformers, BERT, or GPT embeddings) to convert the preprocessed
chunks into vector embeddings.
● Ensure the embeddings capture the semantic meaning of the text for effective
retrieval.
3. Vector Database Integration:
● Set up a vector database such as Milvus, FAISS, Pinecone, or Chroma.
● Store the generated embeddings and associated metadata in the vector
database.
● Use advanced indexing methods like FLAT or IVF to optimize the search for
relevant chunks during retrieval. (Optional)
4. Retrieval-Augmented Generation (RAG) Pipeline:
● Retrieval Step: Convert a user query into an embedding and use the vector
database to retrieve the most relevant chunks of text using hybrid search
methods (e.g., BM25, BERT/DPR).
● Generation Step: Pass the retrieved documents to a language model (e.g., GPT
or any open-source model) to generate an accurate, context-aware response.
5. API Development:
● Develop a REST API using Flask or FastAPI:
○ The API should accept user queries (in text format).
○ The API should implement the RAG pipeline to retrieve relevant
information and generate a response.
○ Return the generated response as the API output.
○ Test the API with multiple queries to ensure it handles various input types
effectively.
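Tying the five tasks above together, here is a minimal, dependency-free sketch of the pipeline's shape. The embedder and vector store are deliberate stand-ins: a real submission would replace `embed` with a SentenceTransformers model, replace `VectorStore` with FAISS/Milvus/Chroma, and have `generate` call an actual LLM rather than just assembling the prompt. The Flask/FastAPI layer would then wrap `rag_answer` in a POST endpoint.

```python
import math
from dataclasses import dataclass

def embed(text, vocab):
    # Stand-in embedder: a real pipeline would call something like
    # SentenceTransformer("all-MiniLM-L6-v2").encode(text) here.
    toks = text.lower().split()
    return [toks.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

@dataclass
class VectorStore:
    """Toy in-memory store standing in for FAISS/Milvus/Chroma."""
    vocab: list
    vectors: list
    chunks: list

    @classmethod
    def from_chunks(cls, chunks):
        vocab = sorted({w for c in chunks for w in c.lower().split()})
        return cls(vocab, [embed(c, vocab) for c in chunks], chunks)

    def search(self, query, k=2):
        # Retrieval step: embed the query, rank chunks by similarity.
        q = embed(query, self.vocab)
        ranked = sorted(self.chunks,
                        key=lambda c: cosine(q, embed(c, self.vocab)),
                        reverse=True)
        return ranked[:k]

def generate(query, contexts):
    # Generation step: a real pipeline would send this prompt to an LLM;
    # here we only assemble it to show the shape of the step.
    context = "\n".join(f"- {c}" for c in contexts)
    return (f"Answer the question using only the context.\n"
            f"Context:\n{context}\nQuestion: {query}")

def rag_answer(store, query):
    return generate(query, store.search(query))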

Deliverables:
1. Codebase: Include ingestion, preprocessing, chunking, embedding generation, vector
database setup, and RAG pipeline implementation.
2. Vector Database Setup: Provide documentation on the vector database configuration
and embedding storage process.
3. API Documentation: Include API endpoints, sample queries, and expected responses.
4. Report: Detail your approach to data handling, embedding generation, RAG pipeline
development, and API creation. Include challenges and solutions.

Evaluation Criteria:
● Preprocessing: Efficiency in cleaning and chunking the data.
● Embedding and Database: Accuracy of embeddings and database integration.
● RAG Pipeline: Quality and relevance of generated responses.
● API: Performance and correctness of the API.
● Code and Report Quality: Organization, documentation, and problem-solving
approaches.

Bonus:
● The retrieved data should be re-ranked based on relevance and similarity to the query.
● In the Generation step, the retrieved and re-ranked data should be passed to an LLM.
● Optimize the RAG pipeline for better accuracy or speed.
● Deploy the API on a cloud platform (e.g., AWS, GCP, or Azure) and provide a public
endpoint.
● The participant can create a user interface using frameworks like Streamlit or Gradio to
demonstrate the functionality of the developed system.

Submission Guidelines:

● Submit your work as a ZIP file or a link to a GitHub repository.


● Include a README file that explains the overall structure of your submission, the steps
to set up and run the solutions, and any additional information that may be useful for the
evaluation.
● Ensure that your code is well-organized, commented, and adheres to best practices.
● If you are using any third-party libraries or frameworks, please list them in the README
file.

You might also like