Backend Engineering Take-Home Assignment
Assignment
Overview
In this assignment, you will build a Retrieval Augmented Generation (RAG)
system that leverages Weaviate as its vector database. The primary goal is to
design a performant system that efficiently retrieves answers from uploaded
documents in various formats. This project will assess your ability to develop a
robust backend system focused on data ingestion, embedding generation,
indexing, and retrieval.
Project Requirements
1. Document Ingestion & Embedding Generation
● Supported Formats:
○ PDF
○ DOCX
○ JSON
○ TXT
● Ingestion Pipeline:
○ Document Upload: Implement functionality to upload documents
in any of the above formats.
○ Embedding Creation: For each uploaded document, generate
embeddings using an appropriate model (e.g., OpenAI’s
text-embedding models or Hugging Face models).
○ Note: Uploading a new document with the same name as an
existing one should clear the embeddings stored for the earlier
version and replace them with the new document's embeddings.
○ Storage: Store the generated embeddings within Weaviate.
○ Automation: Develop an automated pipeline that:
1. Monitors for new document uploads.
2. Processes and generates embeddings.
3. Indexes the embeddings in Weaviate.
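The replace-on-re-upload rule above can be sketched as follows. This is a minimal illustration, not a Weaviate integration: `InMemoryIndex`, `replace_document`, and `toy_embed` are hypothetical names, the dictionary stands in for a Weaviate collection, and the embedding function is a placeholder for a real model.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a Weaviate collection, keyed by document name."""

    # Maps document name -> list of (chunk_text, embedding) pairs.
    records: dict[str, list[tuple[str, list[float]]]] = field(default_factory=dict)

    def replace_document(self, name: str, chunks: list[str], embed) -> int:
        """Drop every record stored under `name`, then store the new chunks."""
        self.records.pop(name, None)  # clear earlier embeddings, if any
        self.records[name] = [(chunk, embed(chunk)) for chunk in chunks]
        return len(self.records[name])


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding (length and vowel count); a real model goes here."""
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


index = InMemoryIndex()
index.replace_document("policy.pdf", ["old chunk one", "old chunk two"], toy_embed)
index.replace_document("policy.pdf", ["new chunk"], toy_embed)
print([text for text, _ in index.records["policy.pdf"]])  # ['new chunk']
```

With a real vector database the same idea applies: delete all objects tagged with the document name, then insert the freshly embedded chunks, so a query never mixes old and new versions.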
2. Question-Answer API Endpoint
● Functionality:
○ Provide APIs to ingest a document and to update a document.
Note: An update is equivalent to re-uploading the entire document
with the changes. This is done via the API; no UI is needed,
though one is welcome if you can spin it up quickly.
○ Create an API endpoint that accepts queries against individual
documents.
○ The system should retrieve the most relevant text snippet(s) from
the queried document stored in Weaviate.
● Response Requirements:
○ Return the answer with associated metadata such as:
■ Snippet of the retrieved text.
■ Document ID or other relevant identifiers.
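One possible shape for the query path and its response is sketched below. It assumes chunks are already embedded and held in memory; `StoredChunk`, `query_document`, the field names, and the sample sentences are all hypothetical, and in a deployed system the filtering and ranking would be delegated to Weaviate rather than done in Python.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredChunk:
    document_id: str
    chunk_index: int
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_document(chunks, document_id, query_vector, top_k=1):
    """Rank only the chunks of one document and return snippets with metadata."""
    candidates = [c for c in chunks if c.document_id == document_id]
    ranked = sorted(
        candidates, key=lambda c: cosine(c.embedding, query_vector), reverse=True
    )
    return {
        "document_id": document_id,
        "results": [
            {
                "snippet": c.text,
                "chunk_index": c.chunk_index,
                "score": round(cosine(c.embedding, query_vector), 4),
            }
            for c in ranked[:top_k]
        ],
    }


# Made-up sample data with two-dimensional embeddings for readability.
chunks = [
    StoredChunk("doc-a", 0, "Reports may be filed anonymously.", [1.0, 0.0]),
    StoredChunk("doc-a", 1, "The policy is reviewed annually.", [0.0, 1.0]),
    StoredChunk("doc-b", 0, "Unrelated text.", [1.0, 0.0]),
]
response = query_document(chunks, "doc-a", [0.9, 0.1])
print(response["results"][0]["snippet"])
```

The filter on `document_id` runs before ranking, so a chunk from another document can never be returned even if its embedding is closer to the query.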
3. Performance Optimization
● Efficient Retrieval:
○ Ensure that the RAG system is optimized for quick and accurate
retrieval.
○ Implement best practices such as:
■ Document chunking for handling large documents.
■ Precomputed embeddings to reduce latency during query
time.
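Document chunking can be illustrated with a fixed-size, overlapping character window. This is only one of several reasonable strategies (sentence-based or token-based splitting are common alternatives); the function name `chunk_text` and the default sizes are assumptions chosen for the example.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once this window reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk would be embedded once at ingestion time and stored, so at query time only the question itself needs to be embedded.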
4. Deployment
● Platform:
○ Deploy the application on a cloud platform of your choice (e.g.,
AWS, GCP, Azure, Render, or Railway).
● Accessibility:
○ Provide publicly accessible endpoints for testing.
○ Include clear deployment instructions in your documentation.
Deliverables
1. Code Repository:
Evaluation Criteria
● Correctness & Completeness:
○ Does the system correctly ingest documents and answer queries?
● Code Quality & API Design:
○ Is the code well-organized, maintainable, and documented?
● Deployment:
○ Is the application successfully deployed and easily testable via
public endpoints?
● Bonus Features:
○ Are the extended JSON data aggregation capabilities
implemented effectively?
● Clarity & Documentation:
○ Are the provided instructions clear, comprehensive, and easy to
follow?
Thank you for taking on this assignment. It is designed to evaluate your skills
in backend system development, data ingestion, indexing, and retrieval.
Should you have any questions during the process, please do not hesitate to
reach out.
Good luck!
Note: Below are the sample files that can be used to generate embeddings.
whistleblower-policy-ba-revised.pdf
example.json