Tutorial: Building a RAG AI ChatBot API with Python FastAPI and Open Source LLMs
Source: https://arxiv.org/pdf/2401.05856
Open Source LLMs
● Pros
○ Cost-Effective: No licensing fees or per-use costs.
○ Customizable:
■ Full control over model parameters and architecture.
■ Ability to fine-tune on specific datasets.
○ Transparency:
■ Access to source code and model internals.
■ Facilitates auditing and understanding model behavior.
● Cons
○ Resource Requirements:
■ Requires significant computational power, especially GPUs.
■ Higher upfront infrastructure costs.
○ Maintenance: Responsibility for updates, bug fixes, and security patches.
○ Performance: May lag behind the latest advancements in paid models.
Paid LLMs (e.g., OpenAI)
● Pros
○ Ease of Use:
■ Simple API integrations with extensive documentation.
■ Quick setup without the need for specialized infrastructure.
○ Performance: Access to cutting-edge models with superior capabilities.
○ Support:
■ Professional customer service.
■ Regular model updates and improvements.
● Cons
○ Cost:
■ Usage fees based on number of requests or tokens processed.
■ Costs can escalate with high-volume usage.
○ Data Privacy:
■ Sending data to third-party servers may raise compliance issues.
■ Potential concerns over proprietary data exposure.
○ Customization: Limited ability to modify or fine-tune models beyond provided parameters.
Use Cases and Recommendations
● Open Source LLMs
○ Cost Considerations:
■ Initial Setup: Investment in hardware or cloud infrastructure.
■ Operational Costs: Ongoing expenses for maintenance and energy consumption.
○ Best Suited For:
■ Projects requiring full control over the model.
■ Applications needing to process sensitive data in-house.
■ Organizations with the technical expertise to manage ML infrastructure.
● Paid LLMs
○ Cost Considerations:
■ Usage-Based Pricing: Charges per API call or per token (e.g., OpenAI's GPT-4).
■ Scalability Costs: Expenses grow proportionally with usage.
○ Best Suited For:
■ Rapid development and deployment needs.
■ Access to state-of-the-art model performance.
■ Teams without extensive ML infrastructure experience.
Challenges and Best Practices
● Challenges
○ Hardware: GPUs or high-performance CPUs are necessary for reasonable inference times.
○ Expertise Needed:
■ Requires knowledge in machine learning deployment and optimization.
■ System administration skills for infrastructure setup and management.
○ Scalability Issues: Difficulty handling increased user requests without performance degradation.
● Best Practices
○ Cloud Services: Use GPU-enabled instances from providers like AWS, GCP, Azure.
○ Optimized Implementations:
■ Utilize efficient libraries (e.g., LLaMA.cpp) for better performance on less powerful hardware.
○ Containerization:
■ Deploy models using Docker for consistent environments and easier management.
○ Monitoring and Logging:
■ Implement tools to track performance metrics and system health.
■ Use services like Prometheus, Grafana, or CloudWatch.
Running Open Source LLMs 1
● LLaMA.cpp (https://github.com/ggerganov/llama.cpp)
○ Description: Lightweight implementation for running LLaMA models on CPUs.
○ Pros:
■ Low resource requirements.
■ Can run on consumer-grade hardware.
○ Cons:
■ Slower inference compared to GPU solutions.
■ May not handle large models effectively.
● vLLM (https://docs.vllm.ai/)
○ Description: Optimized inference and serving engine for LLMs.
○ Pros:
■ High throughput and low latency.
■ Supports dynamic batching.
○ Cons:
■ Requires GPUs for optimal performance.
■ More complex setup.
Running Open Source LLMs 2
● HuggingFace (https://huggingface.co/)
○ Description: Comprehensive library for state-of-the-art models.
○ Pros:
■ Wide range of pre-trained models.
■ Active community and support.
○ Cons:
■ May need adaptation for production-scale deployment.
■ Can be resource-intensive.
● Ollama (https://ollama.com/)
○ Description: Platform simplifying the deployment of large models.
○ Pros:
■ Streamlined setup process.
■ Focus on simplifying ML deployment.
○ Cons:
■ Less flexibility in customization.
■ May have limitations in model choices.
Running Open Source LLMs 3
● LM Studio (https://lmstudio.ai/)
○ Description: User-friendly interface for running language models locally.
○ Pros:
■ Easy to use GUI.
■ Good for experimentation.
○ Cons:
■ Not ideal for backend API integrations.
■ Limited scalability.
Why Build Your Own API with FastAPI
● Customization:
○ Tailor API endpoints to specific application requirements.
● Scalability:
○ Optimize performance to handle increased traffic.
● Integration:
○ Seamlessly connect with other services (databases, frontends, etc.).
● Control:
○ Full oversight over data processing and security.
● Flexibility:
○ Implement custom authentication and authorization mechanisms.
○ Adjust middleware and request handling as needed.
Live Demo Agenda
1. Building the API with FastAPI
a. Setting Up the Environment
b. Creating Basic API Endpoints
c. Integrating Open Source LLMs
2. Managing and Parsing Documents
a. Creating the Document Upload Endpoint
b. Parsing Text Documents
c. Parsing PDF Documents with OCR
3. Database Integration and Management
a. Setting Up PostgreSQL
b. Integrating the Database with FastAPI
c. Testing the FastAPI Application
4. Configuring the LLM Model
a. Understanding Model Parameters
b. Adjusting Configurations for Optimal Performance
Setting Up the Environment
● Prerequisites
○ Python 3.7+ installed
○ Basic Python programming knowledge
○ Familiarity with command-line interface (CLI)
○ Git installed for version control
○ An IDE or text editor (e.g., VSCode, PyCharm)
● Creating a Virtual Environment
○ Create Virtual Environment: python -m venv env
○ Activate the Environment:
■ macOS/Linux: source env/bin/activate
■ Windows: env\Scripts\activate
● Installing Required Packages with UV
○ https://github.com/astral-sh/uv
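○ Example install command (the deck does not list the exact packages; this set is inferred from the code used later, and uvicorn is assumed as the ASGI server for running FastAPI):
uv add fastapi uvicorn transformers torch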
Building the API with FastAPI 1
● Creating Basic API Endpoints
○ Create main.py: The main application file.
○ Import FastAPI: from fastapi import FastAPI
○ Initialize the App: app = FastAPI()
● Creating the Root Endpoint
○ Define Root Endpoint:
@app.get("/")
def read_root():
    return {"message": "Welcome to the RAG AI ChatBot API"}
○ Explanation:
■ This endpoint responds to GET requests at the root URL.
■ Returns a welcome message as a JSON response.
Building the API with FastAPI 2
● Creating the Generate Endpoint
○ Define Generate Endpoint:
@app.post("/generate")
def generate_response(user_input: str):
    # Placeholder for response generation
    return {"response": f"Echo: {user_input}"}
○ Explanation:
■ This endpoint responds to POST requests at /generate.
■ Receives user_input as a string.
■ Currently echoes back the input; to be updated with LLM integration.
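○ Running and testing (a sketch assuming uvicorn is installed; because user_input is a plain str parameter, FastAPI treats it as a query parameter):
uvicorn main:app --reload
curl -X POST "http://localhost:8000/generate?user_input=Hello"
# Expected response: {"response":"Echo: Hello"}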
Integrating Open Source LLMs 1
● Loading the Model and Tokenizer
○ Import Libraries:
from transformers import AutoModelForCausalLM, AutoTokenizer
○ Load Pre-trained Model and Tokenizer:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
Integrating Open Source LLMs 2
● Updating the Generate Endpoint
○ Modify generate_response Function:
@app.post("/generate")
def generate_response(user_input: str):
    inputs = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
    outputs = model.generate(inputs, max_length=500, pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": text}
○ Explanation:
■ Encoding Input:
● Adds end-of-sequence token to user input.
● Converts text to tensor format for the model.
■ Generating Output:
● Uses the model to generate a response.
● Specifies maximum length and padding behavior.
■ Decoding Output:
● Converts generated tokens back to human-readable text.
● Skips special tokens during decoding.
Integrating Open Source LLMs 3
● Understanding Model Parameters
○ max_length:
■ Caps the total sequence length (prompt plus generated tokens).
■ Prevents overly long responses.
○ pad_token_id:
■ Specifies the token used for padding sequences.
■ Ensures consistent sequence lengths.
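○ Tuning sketch (all keyword arguments below are standard Hugging Face generate() parameters, but the sampling values are illustrative assumptions, not settings from the deck):
outputs = model.generate(
    inputs,
    max_length=500,                        # cap on total length (prompt + generated tokens)
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token, so reuse EOS
    do_sample=True,                        # sample instead of greedy decoding
    temperature=0.7,                       # lower values make output more deterministic
    top_p=0.9,                             # nucleus sampling cutoff
)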
Managing and Parsing Documents 1
● Creating the Document Upload Endpoint
○ Add File Upload Support:
from fastapi import File, UploadFile
@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    contents = await file.read()
    # Process the contents
    return {"filename": file.filename, "status": "Processed"}
○ Explanation:
■ upload_document Endpoint:
● Asynchronously handles file uploads.
● Uses UploadFile for efficient file handling.
■ Processing Steps:
● Reads contents of the uploaded file.
● Placeholder for parsing and storing the document.
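○ One possible way to fill in the processing placeholder is to dispatch on file extension to the parsing helpers defined in the following slides (this dispatch logic is an assumption, not shown in the deck):
@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    contents = await file.read()
    if file.filename.lower().endswith(".pdf"):
        text = parse_pdf_document(contents)    # defined in "Managing and Parsing Documents 3"
    else:
        text = parse_text_document(contents)   # defined in "Managing and Parsing Documents 2"
    return {"filename": file.filename, "status": "Processed", "characters": len(text)}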
Managing and Parsing Documents 2
● Parsing Text Documents
○ Function to Parse Text Content:
def parse_text_document(contents):
    text = contents.decode("utf-8")
    # Further processing (e.g., cleaning, tokenization)
    return text
○ Explanation:
■ Decoding Content:
● Converts byte data to a string using UTF-8 encoding.
■ Additional Processing:
● Prepare text for embedding (e.g., removing noise).
Managing and Parsing Documents 3
● Parsing PDF Documents with OCR
○ Add Dependencies:
uv add PyPDF2 pytesseract pillow
○ Function to Parse PDF Content:
import PyPDF2
from PIL import Image
import pytesseract
import io
def parse_pdf_document(contents):
    pdf_reader = PyPDF2.PdfReader(io.BytesIO(contents))
    text = ""
    for page in pdf_reader.pages:
        # extract_text() may return None or "" for pages without an embedded text layer
        text += page.extract_text() or ""
    # If text extraction fails, use OCR
    if not text.strip():
        images = convert_pdf_to_images(contents)
        for image in images:
            text += pytesseract.image_to_string(image)
    return text
Managing and Parsing Documents 4
● Explanation of PDF Parsing
○ PDF Text Extraction:
■ Uses PyPDF2 to extract text directly from PDF pages.
○ Handling Non-Text PDFs:
■ Some PDFs are scanned images without embedded text.
■ If direct extraction yields no text, fallback to OCR.
○ OCR Process:
■ Converts PDF pages to images.
■ Uses pytesseract to perform OCR on images.
■ Extracts text from images for further processing.
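○ convert_pdf_to_images is called in the parsing code but not defined in the deck; a minimal sketch using pdf2image (an assumed extra dependency that also requires the Poppler system utilities):
uv add pdf2image
from pdf2image import convert_from_bytes

def convert_pdf_to_images(contents):
    # Render each PDF page as a PIL image so pytesseract can OCR it
    return convert_from_bytes(contents)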
Database Integration and Management
● Setting Up PostgreSQL
○ Install PostgreSQL:
■ Download and install from the official website.
○ Create a New Database:
psql -U postgres
CREATE DATABASE rag_chatbot;
○ Explanation:
■ psql: PostgreSQL command-line interface.
■ Database Creation: Initializes a new database for the application.
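○ The agenda also covers integrating the database with FastAPI; a minimal sketch using SQLAlchemy (an assumed dependency, with placeholder credentials) to store parsed documents:
uv add sqlalchemy psycopg2-binary
from sqlalchemy import create_engine, text

# Placeholder connection string; adjust user, password, and host to your setup
engine = create_engine("postgresql+psycopg2://postgres:password@localhost/rag_chatbot")

with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE IF NOT EXISTS documents (id SERIAL PRIMARY KEY, filename TEXT, content TEXT)"
    ))

def save_document(filename: str, content: str):
    # Store a parsed document so it can later be retrieved as RAG context
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO documents (filename, content) VALUES (:f, :c)"),
            {"f": filename, "c": content},
        )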
What's Next? 1
● RAG Algorithm Enhancement
○ Advanced Retrieval Techniques:
■ Implement BM25, TF-IDF, or hybrid retrieval combining semantic and keyword search (see the BM25 sketch at the end of this slide).
○ Contextual Embeddings:
■ Use models that consider conversation history for more coherent responses.
○ Feedback Mechanisms:
■ Incorporate user feedback to refine retrieval and response generation.
■ Adjust weights or retrain models based on interactions.
● Finding Best and Optimized Open Source LLMs
○ Model Comparison: Evaluate models like GPT-J, BLOOM, LLaMA-based models.
○ Resource Considerations: Balance model size and performance with hardware capabilities.
○ Optimization Techniques:
■ Quantization: Reduce model size by decreasing precision (e.g., from 32-bit to 8-bit).
■ Pruning: Remove redundant weights to streamline the model.
○ Benchmarking:
■ Test models in your specific use case to determine the best fit.
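● BM25 retrieval sketch using the rank_bm25 package (an assumed dependency; the corpus and query below are illustrative):
uv add rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "FastAPI exposes the chatbot over HTTP",
    "PostgreSQL stores the parsed documents",
    "GPT-2 generates the chatbot responses",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "which database stores documents".split()
# Retrieve the best-matching documents to prepend to the prompt as context
top_docs = bm25.get_top_n(query, corpus, n=2)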
What's Next? 2
● Authentication and Deployment
○ Authentication Methods
■ OAuth 2.0:
● Industry-standard protocol for authorization.
● Allows third-party applications to access user data without exposing credentials.
■ JSON Web Tokens (JWT):
● Stateless authentication mechanism.
● Encode user information securely in tokens.
○ Deployment Strategies
■ Containerization (Docker): Package applications with all dependencies for consistency.
■ Orchestration (Kubernetes):
● Manage containerized applications across multiple hosts.
● Provides scalability and high availability.
■ Serverless Platforms:
● Use services like AWS Lambda or Google Cloud Functions.
● Automatically scale with demand; reduce server management overhead.
What's Next? 3
● Production-Ready System (1)
○ Scalability
■ Auto-Scaling:
● Configure systems to adjust resource allocation based on load.
■ Load Balancing:
● Distribute incoming traffic to prevent overloading a single instance.
○ Monitoring and Logging
■ Monitoring Tools:
● Prometheus, Datadog for metrics collection.
● Real-time alerts for system issues.
■ Logging Solutions:
● ELK Stack (Elasticsearch, Logstash, Kibana) for centralized log management.
● Facilitates debugging and performance tuning.
What's Next? 4
● Production-Ready System (2)
○ Security Best Practices
■ Encryption: Use HTTPS/TLS for secure data transmission.
■ Input Validation: Prevent injection attacks by validating and sanitizing inputs.
■ Rate Limiting:
● Protect against denial-of-service attacks.
● Control the number of requests a client can make in a given time frame.
○ Compliance and Data Privacy
■ Regulatory Compliance:
● Adhere to GDPR, CCPA, or other relevant regulations.
● Implement consent mechanisms where necessary.
■ Data Anonymization:
● Remove personally identifiable information from datasets.
■ User Consent and Control:
● Provide options for users to manage their data and privacy settings.
Thank You!
If you have any other questions, you can reach me through the following channels:
LinkedIn: https://linkedin.com/in/IrvanFza
Website: https://irvan.cc
Community: https://linktr.ee/Meetap