3rd Draft
List of Figures
1. ER Diagram
3. Component Diagram
4. Agile Model
5. Waterfall Model
6. Spiral Model
List of Figures
1. Introduction
1.1 Purpose of the Project
1.2 Project Objective
1.3 Project Scope
1.4 Overview of the Project
1.5 Problem Area Description
2. System Analysis
2.1 Existing System
2.2 Proposed System
2.3 Overview
3. Feasibility Report
3.1 Operational Feasibility
3.2 Technical Feasibility
3.3 Financial and Economic Feasibility
4. System Requirement Specifications
4.1 Functional Requirements
4.2 Non-Functional Requirements
4.3 System Components
4.4 System Interaction
4.5 Constraints
4.6 User Roles
4.7 Module Description
5. SDLC Methodologies
6. Software Requirement
7. Hardware Requirement
8. System Design
9. Process Flow
9.1 ER Diagram
10. Data Flow Diagram
10.1 DFD Level 0 & Level 1
10.2 DFD Level…
10.3 UML Diagram
10.4 Use Case Description
10.5 Use Case Diagram
10.6 Component Diagram
In today's digital age, a vast amount of information is stored and shared in the
form of PDF documents. These documents often contain valuable data,
research findings, reports, or manuals that are essential for various purposes,
such as academic research, business operations, or personal knowledge
acquisition. However, navigating through lengthy PDF files and extracting
relevant information can be a daunting and time-consuming task, especially
when dealing with complex or technical content.
The PDF-CHAT application aims to revolutionize the way users interact with and
extract information from PDF documents. By leveraging the power of natural
language processing (NLP) and large language models (LLMs), the application
provides an intuitive and user-friendly interface that allows users to ask
questions about the content of a PDF using natural language.
The application employs advanced text chunking algorithms to break down the
PDF content into smaller, manageable chunks, making it easier to process and
generate semantic representations using OpenAI embeddings. These
embeddings capture the contextual meaning and relationships within the text,
enabling the LLM to understand the content and provide relevant and accurate
responses to user queries.
One of the key advantages of the PDF-CHAT application is its ability to handle
complex and technical PDF documents with ease. By leveraging the power of
LLMs and their vast knowledge base, the application can provide insightful
responses even for specialized or niche subject areas, making it a valuable tool
for researchers, professionals, and anyone seeking to extract and comprehend
information from PDF documents efficiently.
PURPOSE:
The primary purpose of the PDF-CHAT application is to revolutionize the way
users interact with and extract information from PDF documents. It aims to
address the challenges associated with navigating through lengthy and complex
PDF files, which often contain valuable information that can be difficult to
locate and comprehend manually.
One of the key purposes of the application is to provide users with a natural
and intuitive way to access the information contained within PDF documents.
By leveraging natural language processing (NLP) and large language models
(LLMs), the application enables users to ask questions about the PDF content
using natural language, eliminating the need for complex search queries or
extensive manual scanning.
OVERVIEW:
The PDF-CHAT application is a powerful and innovative solution that combines
several cutting-edge technologies to provide users with a seamless and
intuitive experience for extracting information from PDF documents. At its
core, the application leverages natural language processing (NLP) and large
language models (LLMs) to enable users to ask questions about the content of
a PDF using natural language.
Once a PDF file is uploaded, the application employs advanced text chunking
algorithms to break down the PDF content into smaller, manageable chunks.
This chunking process ensures efficient processing and generation of semantic
representations, even for large and complex PDF documents. The chunked text
is then fed into the OpenAI embeddings component, which generates high-
dimensional vector representations of the text, capturing the contextual
meaning and relationships within the content.
These embeddings serve as the input for the LangChain LLM component, which
integrates with powerful language models like GPT-3 or other state-of-the-art
models. LangChain acts as an abstraction layer, facilitating the communication
between the application and the LLM, allowing for seamless integration and
customization of the language model used for generating responses.
When a user asks a question through the Streamlit interface, the application
processes the query and retrieves the most relevant embeddings from the PDF
content. These embeddings are then passed to the LLM, which generates a
contextual and informative response based on its understanding of the content
and the user's question.
SCOPE:
The PDF-CHAT application has a broad scope, encompassing a range of
functionalities and features that together provide a comprehensive solution
for extracting information from PDF documents.
Furthermore, traditional search and indexing methods for PDFs often rely on
keyword-based searches, which can be limiting and may fail to capture the
nuances and contextual information present in the content. This can result in
irrelevant or incomplete search results, further compounding the challenges of
extracting relevant information from PDFs.
The PDF-CHAT application aims to address these problems by leveraging state-
of-the-art natural language processing (NLP) and large language model (LLM)
technologies. By enabling users to ask questions about the PDF content using
natural language, the application eliminates the need for complex search
queries or extensive manual scanning. Additionally, the application's ability to
understand the contextual meaning and relationships within the PDF text
through advanced text chunking and semantic embeddings ensures that
relevant and accurate information is retrieved, saving users valuable time and
effort.
While these existing systems provide some means for accessing and retrieving
information from PDF documents, they have significant limitations in terms of
efficiency, contextual understanding, and usability. The proposed PDF chat app
aims to address these limitations by leveraging advanced natural language
processing techniques, vector embeddings, and language models to provide a
more intuitive and intelligent way of interacting with PDF content through
natural language queries.
Drawbacks of Existing Systems:
Inefficient and Time-Consuming: Manual searching and keyword-based
searches can be extremely time-consuming, especially when dealing with large
volumes of PDF documents or complex information needs.
Lack of Context and Semantic Understanding: Keyword-based searches and
traditional search engines often lack the ability to understand the context and
semantic meaning of the content, leading to incomplete or irrelevant results.
Limited Natural Language Interaction: Most existing systems do not support
natural language queries, forcing users to formulate precise keyword-based
queries, which may not accurately represent their information needs.
Rigid and Inflexible: Existing systems can be rigid and inflexible, making it
difficult to accommodate evolving information needs or adapt to new
document formats or data sources.
High Maintenance Overhead: Dedicated search engines or document
management systems often require significant setup, configuration, indexing,
and ongoing maintenance efforts, increasing the overall operational costs and
resource requirements.
2. PROPOSED SYSTEM:
Frontend (Streamlit):
- User Interface (UI): The Streamlit framework is used to build a responsive and
modern web-based user interface, providing a seamless and intuitive
experience for users.
- File Upload: Users can easily upload one or more PDF files to the system
through the UI. The interface may include features such as file previews,
progress indicators, and support for various PDF file formats and encodings.
- Query Input: Users can enter natural language queries related to the
uploaded PDF content through a text input field or a voice input interface
(optional).
- Answer Display: The generated answers from the backend are displayed to
the users in a clear and readable format within the UI.
- Additional Features (optional): The UI may incorporate additional features like
bookmarking, annotating, or highlighting relevant sections of the PDF for
future reference, providing feedback on answer quality, accessing personalized
features based on search history and preferences, and more.
Backend (LangChain and Python):
- PDF Processing Module: This module handles the loading, parsing, and text
extraction from the uploaded PDF files. It supports various PDF file formats,
encodings, and character sets, while preserving the logical structure and
formatting of the content.
- Text Splitting Module: The extracted text content is split into smaller chunks
or passages using techniques like character-based splitting or token-based
splitting. This module ensures that the text chunks maintain context and
coherence for effective processing.
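As a rough illustration of character-based splitting with overlap, a minimal splitter might look like the sketch below. The chunk size and overlap values are illustrative defaults, not the application's actual settings; in the real system, LangChain's `CharacterTextSplitter` plays this role with richer options such as splitting on separators.

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Character-based splitting with overlap, so context at the end of
    one chunk also appears at the start of the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

chunks = split_text("word " * 200, chunk_size=100, overlap=20)
```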
- Embedding Generation Module: This module generates vector embeddings
(numerical representations) for the text chunks and the user's query using pre-
trained embedding models like OpenAI's `text-embedding-ada-002` or Hugging
Face's `sentence-transformers`. These embeddings capture the semantic
meaning and context of the text.
- Vector Store Module: The generated embeddings are stored and indexed in a
vector database like FAISS, Weaviate, or Milvus. This module handles efficient
similarity search and retrieval operations on the vector data.
- Retrieval Module: Based on the user's query, this module performs vector
similarity search on the indexed embeddings to retrieve the most relevant text
chunks from the vector store. It may implement techniques like top-k retrieval,
semantic search, and query expansion for improved retrieval accuracy.
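The embed-store-retrieve path described by these three modules can be sketched in a self-contained way, with a toy hash-based embedding standing in for a real model like `text-embedding-ada-002` and a plain in-memory list standing in for FAISS; everything here is illustrative, not the production implementation.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    # Toy stand-in for a real embedding model: hash each token
    # into a bucket of a fixed-size count vector.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorStore:
    """In-memory stand-in for a vector database such as FAISS."""
    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.items.append((text, toy_embed(text)))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        q = toy_embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = VectorStore()
for chunk in ["cats are mammals", "python is a language", "dogs are mammals"]:
    store.add(chunk)
results = store.top_k("are cats mammals", k=2)
```

A real vector database adds approximate-nearest-neighbour indexing so this search stays fast at millions of vectors, but the interface (add vectors, query top-k by similarity) is the same.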
- Language Model Module: This module integrates with advanced language
models like OpenAI's GPT-3 or other natural language generation models. It
handles communication with the language model APIs or hosted services and
generates natural language answers based on the retrieved text chunks and the
user's query.
- Answer Generation Module: This module combines the retrieved text chunks
and the user's query to generate coherent and contextual answers. It may
implement techniques like answer summarization, extraction, and refinement
to provide concise and relevant responses.
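At its simplest, the combination step assembles the retrieved chunks and the user's question into one prompt for the language model. The template wording below is an illustrative assumption, not the application's actual prompt.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Join retrieved chunks into a numbered context block and append
    # the user's question; the LLM completes the "Answer:" line.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt("What is FAISS?",
                      ["FAISS is a similarity-search library.",
                       "It indexes dense vectors."])
```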
- API Integration Module: This module handles communication with external
APIs like the OpenAI API or other third-party services. It manages API
authentication, rate limiting, error handling, and provides a unified interface
for interacting with external services.
- Caching and Persistence Module (optional): This module implements caching
mechanisms to improve response times and reduce the computational load for
frequently accessed PDF content or commonly asked queries. It may also
handle persistent storage of PDF content, embeddings, and other data for long-
term use, supporting various storage solutions like Redis, PostgreSQL, or cloud-
based services.
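A caching layer of this kind can be sketched with the standard library alone. In this toy version, `functools.lru_cache` memoizes answers per (document, query) pair; a production build would more likely use Redis as described above, and the function body here is only a stand-in for the full pipeline.

```python
from functools import lru_cache

llm_calls = []  # records how often the "expensive" path actually runs

@lru_cache(maxsize=256)
def cached_answer(doc_id: str, query: str) -> str:
    # Stand-in for the full retrieve-and-generate pipeline;
    # only cache misses reach this body.
    llm_calls.append((doc_id, query))
    return f"answer to {query!r} for document {doc_id}"

first = cached_answer("report.pdf", "What is the scope?")
second = cached_answer("report.pdf", "What is the scope?")  # served from cache
```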
- Error Handling and Logging Module: This module implements robust error
handling mechanisms for graceful error management and logging of relevant
information for debugging, monitoring, and auditing purposes.
- Authentication and Authorization Module (optional): If required, this module
handles user authentication and authorization mechanisms on the backend,
managing user data and access control policies, and integrating with the
frontend authentication module for seamless user management.
- Proven Technologies: The PDF chat app leverages proven and widely
adopted technologies, such as Python, LangChain, Streamlit, and the
OpenAI API. These technologies have established communities, extensive
documentation, and ongoing support, reducing the technical risks
associated with the development and deployment of the system.
- Availability of Resources: The required hardware and software
resources for developing and deploying the PDF chat app are readily
available. The system can be developed using standard development
environments and tools, and can be deployed on various infrastructures,
including cloud platforms or on-premises servers.
- Integration Capabilities: The system's modular architecture and the use
of well-defined APIs and industry-standard data formats ensure seamless
integration with external services and APIs, such as the OpenAI API, cloud
storage services, and logging/monitoring services.
- Scalability and Performance: The system design incorporates scalability
and performance considerations, such as the use of vector databases for
efficient similarity search and retrieval, caching mechanisms for improved
response times, and the ability to leverage distributed computing or
cloud-based resources for handling large workloads.
- Security Considerations: The system specifications address security
concerns by including provisions for input validation, secure data transfer
(HTTPS), access control mechanisms, and data encryption. These
measures help mitigate potential security risks and ensure the protection
of sensitive data and user privacy.
3. Financial and Economic Feasibility:
- Development Costs: The development costs for the PDF chat app are
expected to be moderate, as it leverages open-source libraries and
frameworks (e.g., Python, LangChain, Streamlit) and utilizes cloud-based
services (e.g., OpenAI API) with pay-as-you-go pricing models. This
reduces upfront costs and allows for better cost control and scalability.
- Cost Savings: The PDF chat app has the potential to provide cost savings
by streamlining information retrieval and knowledge management
processes within organizations. By enabling users to quickly and efficiently
access relevant information from PDF documents through natural
language queries, the system can improve productivity and reduce the
time and resources spent on manual searching and information gathering
tasks.
- Return on Investment (ROI): While the ROI may vary depending on the
specific use case and organizational context, the potential benefits of the
PDF chat app, such as improved productivity, enhanced knowledge
management, and better decision-making capabilities, can translate into
tangible cost savings and increased efficiency, ultimately contributing to a
positive ROI over time.
- Scalability and Flexibility: The system's scalable architecture and
modular design allow for flexible deployment options, ranging from small-
scale on-premises installations to large-scale cloud-based deployments.
This flexibility enables organizations to choose the most cost-effective
deployment option based on their specific needs and budgets.
Based on the feasibility analysis, the PDF chat app built using LangChain,
Streamlit, and the OpenAI API appears to be operationally, technically,
and financially/economically feasible. The system leverages proven
technologies, addresses scalability and performance concerns,
incorporates security and compliance considerations, and offers potential
cost savings and operational efficiencies. However, it's essential to
perform a detailed cost-benefit analysis and risk assessment specific to
the organization's requirements and constraints before proceeding with
the development and deployment of the system.
SOFTWARE REQUIREMENT SPECIFICATION
1-FUNCTIONAL REQUIREMENTS:
c. Query Input:
- Allow users to enter text queries related to the uploaded PDF content.
- Support natural language queries with varying levels of complexity and
ambiguity.
- Implement query preprocessing techniques (e.g., stopword removal,
stemming, lemmatization) for improved retrieval accuracy.
- Provide query suggestions or autocomplete functionality (optional).
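The preprocessing requirement above can be sketched as follows. The stopword list is a small illustrative subset, and a real pipeline would add stemming or lemmatization (e.g. via NLTK or spaCy) rather than stopping at tokenization.

```python
import string

# Illustrative subset; real systems use a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "to", "in", "what"}

def preprocess_query(query: str) -> list[str]:
    # Lowercase, strip punctuation, tokenize, and drop stopwords.
    cleaned = query.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

tokens = preprocess_query("What is the scope of the project?")
```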
d. Information Retrieval:
- Perform full-text search and retrieval of relevant information from the PDF
content based on the user query.
- Utilize vector embeddings and similarity search techniques for efficient and
accurate retrieval.
- Support retrieval of multiple relevant text chunks or passages.
- Implement query refinement or expansion mechanisms to handle
ambiguous or broad queries.
e. Answer Generation:
- Generate natural language answers to user queries using a language model
(e.g., OpenAI's GPT-3).
- Combine the retrieved relevant text chunks and the user query to generate
coherent and contextual answers.
- Implement answer summarization techniques to provide concise and
focused responses.
- Support answer generation in multiple languages (optional).
f. User Interface:
- Provide an intuitive and user-friendly interface for interacting with the
system.
- Display the generated answers in a clear and readable format.
- Allow users to view the relevant text chunks or passages used to generate
the answer.
- Implement features for bookmarking, annotating, or highlighting relevant
sections of the PDF for future reference.
- Support voice queries and voice-based answer generation for improved
accessibility (optional).
g. Search History and Personalization:
- Maintain a history of user queries and generated answers.
- Allow users to review and revisit previous queries and answers.
- Implement personalization features based on user preferences and search
history (e.g., customized suggestions, tailored results).
2-NON-FUNCTIONAL REQUIREMENTS:
b. Scalability:
- The system should be designed to scale horizontally and vertically to
accommodate increasing numbers of users, PDF files, and queries.
- Utilize distributed or cloud-based architectures to scale computing
resources (e.g., CPU, RAM, storage) as needed.
- Implement load balancing and auto-scaling mechanisms to distribute the
workload across multiple servers or instances.
- The system should be able to scale its storage capacity and vector database
to handle large volumes of PDF content and embeddings.
c. Reliability:
- The system should be highly available and fault-tolerant, with minimal
downtime or service disruptions.
- Implement redundancy and failover mechanisms to ensure uninterrupted
service in case of hardware or software failures.
- Implement robust error handling and logging mechanisms to track and
troubleshoot issues effectively.
- Regularly perform backups and have disaster recovery plans in place to
protect against data loss or system failures.
d. Security:
- Implement proper input validation and sanitization to prevent potential
security threats like SQL injection, cross-site scripting (XSS), or code injection
attacks.
- Ensure secure data transfer through the use of HTTPS and encrypted
communication channels.
- Implement access control mechanisms and user
authentication/authorization to protect sensitive data and system resources.
- Regularly monitor and update the system to address newly discovered
security vulnerabilities or threats.
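As one concrete example of input validation on the upload path, an uploaded file name can be sanitized before it ever touches the filesystem. This is a minimal sketch under the assumption that files are stored by name; production code would also check MIME type and file size.

```python
import re

def safe_filename(name: str) -> str:
    # Drop any directory components (guards against path traversal
    # such as "../../etc/passwd"), then whitelist the remaining chars.
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    return base or "upload.pdf"
```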
e. Usability:
- The user interface should be intuitive, responsive, and user-friendly,
adhering to established design principles and guidelines.
- Provide clear instructions, tooltips, and error messages to guide users
through the system.
- Implement accessibility features (e.g., keyboard navigation, screen reader
compatibility) to cater to users with disabilities.
- Ensure consistent and predictable behavior across different platforms and
devices (e.g., desktop, mobile).
f. Maintainability:
- Adopt modular and loosely coupled architecture to facilitate easier
maintenance and future enhancements.
- Follow coding standards, best practices, and guidelines to ensure readable,
well-documented, and maintainable codebase.
- Implement automated testing (unit, integration, and end-to-end) to ensure
code quality and catch regressions early.
- Utilize version control systems and continuous integration/continuous
deployment (CI/CD) pipelines to streamline development and deployment
processes.
g. Compatibility:
- The system should be compatible with a wide range of PDF file formats and
versions.
- Ensure cross-browser compatibility for the web-based user interface.
- Support multiple operating systems and architectures (e.g., Windows,
macOS, Linux) for server-side components.
- Regularly test and update the system to ensure compatibility with new
software and hardware releases.
h. Extensibility:
- Design the system with extensibility in mind, allowing for easy integration of
new features, modules, or third-party services.
- Implement well-defined APIs and interfaces to facilitate integration with
other systems or applications.
- Adopt industry-standard data formats and protocols to ensure
interoperability and ease of integration.
3. System Components:
a. Frontend:
- User Interface (UI) Module:
- Responsible for rendering the web-based user interface using Streamlit.
- Provides components for file upload, query input, answer display, and
other UI elements.
- Implements user interaction logic and event handling.
- Integrates with the backend APIs for data exchange and communication.
- Authentication and Authorization Module (optional):
- Handles user authentication and authorization mechanisms.
- Implements features like user registration, login, password management,
and session management.
- Integrates with the backend for user data management and access control.
b. Backend:
- PDF Processing Module:
- Handles PDF file loading, parsing, and text extraction.
- Supports various PDF file formats and encodings.
- Extracts text content while preserving logical structure and formatting.
- Splits the PDF text into smaller chunks for efficient processing.
- Text Preprocessing Module:
- Performs text cleaning and preprocessing operations.
- Handles tasks like stopword removal, stemming, lemmatization, and
tokenization.
- Prepares the text data for embedding generation and retrieval processes.
- Embedding Generation Module:
- Generates embeddings (numerical representations) for text chunks and
user queries.
- Utilizes pre-trained embedding models like OpenAI's `text-embedding-ada-
002` or Hugging Face's `sentence-transformers`.
- Supports efficient batch processing of embeddings for large datasets.
- Vector Store Module:
- Manages the storage and indexing of embeddings in a vector database.
- Supports various vector database solutions like FAISS, Weaviate, or Milvus.
- Handles efficient similarity search and retrieval operations.
- Retrieval Module:
- Performs vector similarity search and retrieval of relevant text chunks
based on the user query.
- Implements techniques like top-k retrieval, semantic search, and query
expansion.
- Utilizes the vector store and embedding generation modules for efficient
retrieval.
- Language Model Module:
- Integrates with language models like OpenAI's GPT-3 or other natural
language generation models.
- Handles communication with language model APIs or hosted services.
- Generates natural language answers based on the retrieved text chunks
and user query.
- Answer Generation Module:
- Combines the retrieved text chunks and user query to generate coherent
and contextual answers.
- Implements techniques like answer summarization, extraction, and
refinement.
- Utilizes the language model module for answer generation.
- API Integration Module:
- Handles communication with external APIs like OpenAI's GPT-3 API or other
third-party services.
- Manages API authentication, rate limiting, and error handling.
- Provides a unified interface for interacting with external services.
- Caching and Persistence Module (optional):
- Implements caching mechanisms for improved performance and reduced
response times.
- Handles persistent storage of PDF content, embeddings, and other data for
long-term use.
- Supports various storage solutions like Redis, PostgreSQL, or cloud-based
services.
- Error Handling and Logging Module:
- Implements error handling mechanisms for graceful error management.
- Logs relevant information for debugging, monitoring, and auditing
purposes.
- Integrates with logging and monitoring tools or services.
- Authentication and Authorization Module (optional):
- Handles user authentication and authorization mechanisms on the
backend.
- Manages user data and access control policies.
- Integrates with the frontend authentication module for seamless user
management.
4. System Interaction:
1. User Interactions:
- File Upload: The user interacts with the frontend UI to select and upload
one or more PDF files to the system.
- Query Input: The user enters a text query related to the uploaded PDF
content through the UI.
- Answer Display: The generated answer is displayed to the user through the
frontend UI.
- Additional Interactions (optional): Users may interact with features like
bookmarking, annotating, or highlighting relevant sections of the PDF,
providing feedback on answer quality, or accessing personalized features based
on their search history and preferences.
2. Frontend-Backend Interactions:
- File Upload Request: The frontend UI sends a request to the backend API
with the uploaded PDF file(s).
- Query Request: The frontend UI sends the user's query to the backend API.
- Answer Response: The backend API responds with the generated answer,
which is displayed in the frontend UI.
- Authentication and Authorization (optional): The frontend UI communicates
with the backend API for user authentication and authorization, sending
credentials or tokens for secure access to protected resources or features.
5. Infrastructure Interactions:
- Web Server: The frontend UI is hosted and served by a web server, enabling
users to access the application through their web browsers.
- Application Server: The backend components, including the Python
application and APIs, run on an application server or set of servers.
- Vector Database: The vector store module interacts with a dedicated vector
database solution (e.g., FAISS, Weaviate, Milvus) for storing and indexing
embeddings.
- Caching and Storage (optional): The caching and persistence module
interacts with dedicated caching solutions (e.g., Redis) and persistent storage
solutions (e.g., PostgreSQL) for caching and long-term data storage.
- Load Balancing (optional): If multiple application servers are deployed, a
load balancer distributes incoming traffic across the servers for improved
scalability and availability.
5. Constraints:
f. Resource Constraints:
- The system may be constrained by the available computational resources,
such as CPU, RAM, and storage capacity.
- Optimize resource utilization through techniques like parallel processing,
distributed computing, or leveraging cloud-based resources.
- Implement resource monitoring and management strategies to ensure
efficient utilization and avoid resource exhaustion.
g. Integration Constraints:
- The system may need to integrate with existing systems, databases, or third-
party services, which may impose constraints on data formats, protocols, and
integration methods.
- Ensure compatibility with industry standards and best practices for seamless
integration and interoperability.
- Develop well-defined APIs and interfaces to facilitate integration with
external systems or future enhancements.
6. User Roles:
a. End User:
- Can upload PDF files to the system.
- Can enter text queries related to the uploaded PDF content.
- Can view the generated answers to their queries.
- Can provide feedback on the quality and relevance of the generated
answers (optional).
- Can access additional features like bookmarking, annotating, or highlighting
relevant sections of the PDF (optional).
- Can access personalized features based on their search history and
preferences (optional).
b. Administrator:
- Responsible for system configuration, maintenance, and monitoring.
- Can manage user accounts and access privileges (if user management is
implemented).
- Can access and analyze system logs and usage metrics.
- Can perform system updates, backups, and data management tasks.
- Can configure system settings, such as API keys, rate limits, and resource
allocation.
- Can monitor and troubleshoot system issues and performance bottlenecks.
SDLC METHODOLOGIES:
1. Agile Methodology:
Agile is a popular and widely adopted methodology that emphasizes iterative
development, continuous feedback, and collaboration. It is well-suited for
projects with dynamic requirements and frequent changes. For the PDF-CHAT
application, you could follow the Scrum framework, which is a specific
implementation of Agile.
2. Waterfall Methodology:
The Waterfall methodology is a traditional, sequential approach where each
phase of the project must be completed before moving to the next phase. It
follows a linear progression from requirements gathering to design,
implementation, testing, and deployment.
- Advantages: Well-defined stages, structured approach, and clear
documentation.
- Potential Drawbacks: Inflexible to changing requirements, lack of early
feedback, and difficulty in addressing defects discovered late in the project.
3. Incremental Development:
This methodology involves developing the application in incremental cycles,
with each cycle delivering a working version of the software with a subset of
the complete requirements. It combines elements of the Waterfall and Iterative
methodologies.
- Advantages: Early and continuous delivery of working software, risk
mitigation, and ability to adapt to changing requirements.
- Key Practices: Requirements prioritization, iterative development, and
continuous integration.
4. Spiral Methodology:
The Spiral methodology is a risk-driven approach that combines elements of
the Waterfall and Iterative methodologies. It follows a spiral pattern, with each
iteration involving planning, risk analysis, development, and evaluation phases.
- Advantages: Risk management, early prototyping, and ability to adapt to
changing requirements.
- Key Practices: Risk analysis, prototyping, and continuous feedback.
Minimum Requirements:
Hardware:
- CPU: 2 cores (4 logical processors)
- RAM: 4 GB
- Storage: 20 GB of free disk space
Software:
- Operating System: Windows 10 or later, macOS 10.15 or later, or a Linux
distribution
- Python: Python 3.7 or later
- Web Browser: Modern web browser (Chrome, Firefox, Safari, Edge)
Dependencies:
- LangChain: `pip install langchain`
- Streamlit: `pip install streamlit`
- OpenAI Python Library: `pip install openai`
- PDF Processing Library: `pip install pypdf` or `pip install pdfplumber`
- Vector Database: `pip install faiss-cpu` or `pip install weaviate-client`
Recommended Requirements:
Hardware:
- CPU: 4 cores (8 logical processors) or better
- RAM: 8 GB or more
- Storage: 50 GB or more of free disk space (depending on the size and
number of PDF files)
Software:
- Operating System: Windows 10 or later, macOS 10.15 or later, or a Linux
distribution
- Python: Python 3.8 or later
- Web Browser: Modern web browser (Chrome, Firefox, Safari, Edge)
Dependencies:
- LangChain: `pip install langchain`
- Streamlit: `pip install streamlit`
- OpenAI Python Library: `pip install openai`
- PDF Processing Library: `pip install pdfplumber` (more advanced PDF
processing)
- Vector Database: `pip install weaviate-client` (more scalable and
advanced vector database)
- GPU (optional): If you plan to use GPU acceleration for the Language
Model and vector embeddings, you'll need a CUDA-compatible GPU and
the appropriate CUDA and cuDNN libraries installed.
Additional Recommendations:
SYSTEM DESIGN:
The high-level design focuses on the overall system architecture, major components, and
their interactions. It provides a bird's-eye view of the system without diving into
implementation details.
1. Frontend:
- Streamlit UI: The frontend will be built using Streamlit, a Python library for creating
interactive web applications. It will provide a user-friendly interface for uploading PDF files
and entering queries.
- File Upload: The UI will allow users to upload one or more PDF files for processing.
- Query Input: The UI will provide a text input field for users to enter their queries.
2. Backend:
- PDF Processing: LangChain's `UnstructuredPDFLoader` will be used to load and parse the
PDF file(s) into text format.
- Text Splitting: LangChain's `CharacterTextSplitter` or `RecursiveCharacterTextSplitter` will
be used to split the PDF text into smaller chunks (or "Documents") for efficient processing.
- Embeddings: LangChain's embedding module (e.g., `OpenAIEmbeddings` or
`HuggingFaceInstructEmbeddings`) will be used to generate embeddings (numerical
representations) of the text chunks and the user query.
- Vector Store: A vector store (e.g., LangChain's `FAISS` or `Chroma`) will be used to store
and index the embeddings for efficient retrieval.
- Retriever: LangChain's retriever (e.g., `VectorDBQARetriever` or `ConvAIRetriever`) will be
used to retrieve the most relevant text chunks based on the user query.
- Language Model: OpenAI's text completion API (e.g., `text-davinci-003`) will be used as
the Language Model to generate answers based on the retrieved text chunks and the user
query.
- Answer Generation: The retrieved text chunks and the user query will be passed to the
Language Model to generate an answer.
3. Data Flow:
- The user uploads PDF file(s) and enters a query through the Streamlit UI.
- The backend processes the PDF file(s), generates embeddings for the text chunks and the
query, and stores them in the vector store.
- The retriever retrieves the most relevant text chunks from the vector store based on the
user query.
- The Language Model generates an answer based on the retrieved text chunks and the
user query.
- The answer is displayed in the Streamlit UI.
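The data flow above can be sketched end-to-end in plain Python, with a toy bag-of-words similarity standing in for the real embeddings, vector store, and Language Model (every function below is an illustrative stand-in, not the actual LangChain API):

```python
def answer_query(pdf_text, query):
    """End-to-end sketch of the data flow: chunk -> embed -> retrieve -> answer."""
    # 1. Split the PDF text into fixed-size chunks
    chunks = [pdf_text[i:i + 50] for i in range(0, len(pdf_text), 50)]

    # 2. "Embed" chunks and query (toy embedding: set of lowercase words)
    def embed(text):
        return set(text.lower().split())

    # 3. Retrieve the chunk sharing the most words with the query
    query_words = embed(query)
    best_chunk = max(chunks, key=lambda chunk: len(embed(chunk) & query_words))

    # 4. Generate the answer from the retrieved context; the real system
    #    passes the chunk and the query to the Language Model here
    return f"Based on the document: {best_chunk.strip()}"

print(answer_query("Refunds are processed within 14 days of purchase.",
                   "How are refunds processed?"))
```

In the real application, step 2 calls the embedding model, step 3 queries the vector store, and step 4 prompts the LLM; only the overall shape of the flow is the same.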
Low-Level Design:
The low-level design focuses on the implementation details of each component, including
data structures, algorithms, and specific libraries or frameworks used.
1. Frontend:
- Streamlit UI:
- Use Streamlit's `st.file_uploader` to allow users to upload PDF files.
- Use Streamlit's `st.text_input` to get the user's query.
- Display the generated answer using `st.write`.
2. Backend:
- PDF Processing:
- Use LangChain's `UnstructuredPDFLoader` to load and parse the PDF file(s) into text
format.
- Handle multiple PDF files by iterating over the list of uploaded files.
- Text Splitting:
- Use LangChain's `CharacterTextSplitter` or `RecursiveCharacterTextSplitter` to split the
PDF text into smaller chunks.
- Determine an appropriate chunk size (e.g., 1000 characters) and chunk overlap (e.g., 200
characters) to ensure context preservation.
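The chunk size and chunk overlap idea can be illustrated with a simplified splitter; this is a sketch of the concept only, not LangChain's actual implementation:

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size characters, where
    consecutive chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("x" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks))                          # 4 chunks (starts at 0, 800, 1600, 2400)
print(chunks[0][-200:] == chunks[1][:200])  # True: overlapping context is shared
```

The overlap ensures that a sentence straddling a chunk boundary appears whole in at least one chunk, which is what "context preservation" refers to above.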
- Embeddings:
- Use LangChain's `OpenAIEmbeddings` or `HuggingFaceInstructEmbeddings` to generate
embeddings for the text chunks and the user query.
- Determine the appropriate embedding model (e.g., `text-embedding-ada-002` for
OpenAI) based on performance and cost considerations.
- Vector Store:
- Use LangChain's `FAISS` or `Chroma` vector store to store and index the embeddings.
- Configure the vector store parameters (e.g., index type, dimension) for optimal
performance.
- Retriever:
- Use LangChain's `VectorDBQARetriever` or `ConvAIRetriever` to retrieve the most
relevant text chunks based on the user query.
- Configure the retriever parameters (e.g., search quality, number of results) based on
performance and accuracy requirements.
- Language Model:
- Use OpenAI's text completion API (e.g., `text-davinci-003`) as the Language Model for
answer generation.
- Configure the Language Model parameters (e.g., temperature, max tokens) based on
desired output characteristics.
- Answer Generation:
- Use LangChain's `RetrievalQA` chain to combine the retriever and the Language Model
for generating answers.
- Configure the chain parameters (e.g., chain type, prompt template) based on the
desired behavior.
3. Additional Considerations:
- Error Handling: Implement error handling mechanisms for various scenarios, such as
invalid file formats, failed API requests, or other exceptions.
- Caching and Persistence: Consider caching or persisting the vector store and embeddings
to improve performance for subsequent queries on the same PDF file(s).
- Scalability: Evaluate the scalability requirements and consider using distributed or
serverless architectures for handling large volumes of PDF files or queries.
- Security: Implement appropriate security measures, such as input validation, API key
management, and secure data transfer (e.g., HTTPS).
- User Experience: Enhance the user experience by providing progress indicators, file
validation feedback, and helpful error messages.
- Logging and Monitoring: Implement logging and monitoring mechanisms to track
application performance, identify bottlenecks, and troubleshoot issues.
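Several of the points above (input validation, error handling, logging) can be sketched together using only the standard library; the validation rule and the messages here are illustrative, not the application's actual code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pdf_chat")

def safe_process_pdf(raw_bytes):
    """Validate an upload and log failures instead of crashing the app."""
    try:
        # PDF files start with the "%PDF" magic bytes
        if not raw_bytes.startswith(b"%PDF"):
            raise ValueError("uploaded file does not look like a PDF")
        logger.info("Processing %d bytes", len(raw_bytes))
        return {"ok": True, "size": len(raw_bytes)}
    except ValueError as exc:
        logger.error("PDF processing failed: %s", exc)
        return {"ok": False, "error": str(exc)}

print(safe_process_pdf(b"%PDF-1.7 dummy content")["ok"])  # True
print(safe_process_pdf(b"GIF89a not a pdf")["ok"])        # False
```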
1. Vector Store Selection:
- The choice of vector store depends on factors such as dataset size, scalability
requirements, performance needs, and deployment environment (local or cloud).
2. Embeddings Selection:
- LangChain supports several embedding models, including OpenAI's `text-embedding-ada-
002` and Hugging Face's `sentence-transformers` models.
- `text-embedding-ada-002` is a high-performance and efficient embedding model
provided by OpenAI, suitable for most use cases.
- Hugging Face's `sentence-transformers` models, such as `all-MiniLM-L6-v2` and `all-
mpnet-base-v2`, are also popular choices and can be used with LangChain's
`HuggingFaceInstructEmbeddings`.
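Whichever embedding model is chosen, retrieval ultimately ranks chunks by vector similarity. A minimal illustration with hand-picked 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the chunk names are made up):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings for three text chunks (values chosen for illustration)
chunk_embeddings = {
    "chunk about pricing":      [1.0, 0.0, 0.0],
    "chunk about installation": [0.0, 1.0, 0.0],
    "chunk about licensing":    [0.5, 0.5, 0.0],
}
query_embedding = [0.9, 0.1, 0.0]  # a query close to the "pricing" direction

# Rank chunks by similarity to the query, as the retriever does
ranked = sorted(chunk_embeddings,
                key=lambda name: cosine_similarity(query_embedding,
                                                   chunk_embeddings[name]),
                reverse=True)
print(ranked[0])  # "chunk about pricing" is the most relevant
```

FAISS performs the same kind of ranking (by L2 distance rather than cosine similarity) with heavily optimized index structures.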
DFD Level 0:
DFD Level 2:
Entity Relationship Diagram:
Component Diagram:
TECHNOLOGY DESCRIPTION
PYTHON
What is Python?
Python is a high-level, general-purpose programming language that
emphasizes code readability and simplicity. It was created by Guido van
Rossum in the late 1980s and first released in 1991. Python's design
philosophy emphasizes writing code that is easy to read and understand,
making it an excellent choice for beginners as well as experienced
developers.
Installing Python
Windows:
1. Go to the official Python website
(https://www.python.org/downloads/windows/) and download the latest
version of Python for Windows.
2. Run the installer and follow the on-screen instructions. Make sure to
check the "Add Python to PATH" option during the installation process.
3. After installation, open the command prompt and type `python --
version` to verify that Python has been installed correctly.
macOS:
1. Visit the official Python website
(https://www.python.org/downloads/mac-osx/) and download the latest
version of Python for macOS.
2. Run the installer package and follow the on-screen instructions.
3. After installation, open the terminal and type `python3 --version` to
verify that Python has been installed correctly.
Linux:
Python is often pre-installed on most Linux distributions, but you may
need to install a specific version or update it manually. The process varies
depending on your distribution, but here are the general steps:
1. Open the terminal.
2. Check if Python is already installed by typing `python3 --version`.
3. If Python is not installed or if you need a different version, use your
distribution's package manager to install or update Python. For example,
on Ubuntu or Debian, you can use `sudo apt-get install python3`.
4. After installation, verify the installation by typing `python3 --version`.
Python comes with a vast standard library that provides a wide range of
functionality out of the box. Additionally, there are thousands of third-
party modules and libraries available in the Python Package Index (PyPI)
that extend Python's capabilities even further. Here are some of the most
popular and widely-used modules in Python:
Beautiful Soup: Beautiful Soup is a Python library for web scraping, used
to parse HTML and XML documents. It provides a simple and intuitive way
to navigate and search the parse tree, extract data from HTML and XML
files, and handle malformed markup with ease.
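A minimal Beautiful Soup usage sketch (assuming the `beautifulsoup4` package is installed; the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p class='intro'>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree and extract text
print(soup.h1.text)                               # Title
print(soup.find("p", class_="intro").get_text())  # Hello world
```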
Here's a sample Python script that demonstrates the use of some of the
modules mentioned above:

```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Historical sales data
data = pd.DataFrame({"year": [2016, 2017, 2018, 2019],
                     "sales": [100, 120, 135, 150]})

# Fit a simple linear regression model
model = LinearRegression()
model.fit(data[["year"]], data["sales"])

# Make predictions
future_years = np.array([[2020], [2021], [2022]])
future_sales = model.predict(future_years)

# Plot the historical data and the predictions
plt.scatter(data["year"], data["sales"])
plt.plot(future_years, future_sales, "r--")
plt.show()
```
Advantages of Python
1. Easy to Learn and Read: Python has a simple and clean syntax that
follows the principles of readability and minimalism. Its code is easy to
understand and write, even for beginners, making it a great language for
learning programming concepts.
7. Large and Active Community: Python has a large and active community
of developers, which contributes to its continuous growth and
improvement. This community provides extensive documentation,
tutorials, and support forums, making it easier for developers to learn and
solve problems.
Disadvantages of Python
STREAMLIT
Development History
Streamlit was developed to democratize data science and machine
learning by providing a simple yet powerful interface for creating
interactive web applications. While the exact date of its development is
not specified in the provided sources, it has evolved significantly since its
inception, with numerous updates and features added over time to
enhance its capabilities and usability.
Deployment Options
Streamlit provides several options for deploying and sharing Streamlit
apps:
Streamlit Community Cloud: A free platform for deploying and sharing
Streamlit apps.
Streamlit Sharing: A service for deploying, managing, and sharing public
Streamlit apps for free.
Streamlit in Snowflake: An enterprise-class solution for housing data and
apps in a unified, global system.
Getting Started
SYNTAX:
To import the Streamlit library in your Python file:
import streamlit as st
• To run the Streamlit app, navigate to the directory where your Python
file is located in your command prompt or terminal, and run the
command:
streamlit run your_file_name.py
(replace `your_file_name.py` with the actual name of your Python file).
Resources
• Streamlit Gallery
• Streamlit Documentation
Conclusion
Streamlit is a powerful tool for anyone involved in data science, machine
learning, or data analysis, offering a straightforward way to create
interactive web applications. Its ease of use, combined with the flexibility
and power of Python, makes it an essential tool in the data scientist's
toolkit. Whether you're a beginner looking to explore data or an
experienced professional wanting to deploy a machine learning model,
Streamlit has something to offer.
LangChain
GETTING STARTED
Installation
To install LangChain run:
pip install langchain
Building with LangChain
LangChain enables building applications that connect external sources of
data and computation to LLMs. In this quickstart, we will walk through a
few different ways of doing that. We will start with a simple LLM chain,
which just relies on information in the prompt template to respond. Next,
we will build a retrieval chain, which fetches data from a separate
database and passes that into the prompt template. We will then add in
chat history, to create a conversation retrieval chain. This allows you to
interact in a chat manner with this LLM, so it remembers previous
questions. Finally, we will build an agent - which utilizes an LLM to
determine whether or not it needs to fetch data to answer questions. We
will cover these at a high level, but there are a lot of details to all of these!
We will link to relevant docs.
LLM Chain
We'll show how to use models available via API, like OpenAI, and local
open source models, using integrations like Ollama.
pip install langchain-openai
export OPENAI_API_KEY="..."
We can then initialize the model (importing `ChatOpenAI` from the
`langchain_openai` package):
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
If you'd prefer not to set an environment variable you can pass the key in
directly via the api_key named parameter when initiating the OpenAI LLM
class:
After defining a prompt with `ChatPromptTemplate`, we can combine the
prompt and the model into a simple LLM chain:
We can now invoke it and ask the same question. It still won't know the
answer, but it should respond in a more proper tone for a technical
writer!
chain.invoke({"input": "how can langsmith help with testing?"})
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()
We can now invoke it and ask the same question. The answer will now be
a string (rather than a ChatMessage).
Conclusion
LangChain represents a significant advancement in the field of LLM
application development, offering a comprehensive framework that
simplifies every stage of the LLM application lifecycle. Its open-source
nature, coupled with a suite of powerful libraries and components, makes
it an invaluable tool for developers looking to leverage the power of LLMs
in their applications. With its focus on streamlining development,
productionization, and deployment, LangChain stands as a testament to
the future of LLM-powered applications.
Large Language Model
A large language model (LLM) is a deep learning algorithm that can
perform a variety of natural language processing (NLP) tasks. Large
language models use transformer models and are trained using massive
datasets — hence, large. This enables them to recognize, translate,
predict, or generate text or other content.
Large language models also have large numbers of parameters, which are
akin to memories the model collects as it learns from training. Think of
these parameters as the model’s knowledge bank.
Large language models might give us the impression that they understand
meaning and can respond to it accurately. However, they remain a
technological tool and as such, large language models face a variety of
challenges.
Hallucinations: A hallucination is when an LLM produces an output that is
false, or that does not match the user's intent. For example, claiming that it is
human, that it has emotions, or that it is in love with the user. Because large
language models predict the next syntactically correct word or phrase, they
can't wholly interpret human meaning. The result can sometimes be what is
referred to as a "hallucination."
Security: Large language models present important security risks when not
managed or surveyed properly. They can leak people's private information,
participate in phishing scams, and produce spam. Users with malicious intent
can reprogram AI to their ideologies or biases, and contribute to the spread of
misinformation. The repercussions can be devastating on a global scale.
Bias: The data used to train language models will affect the outputs a given
model produces. As such, if the data represents a single demographic, or
lacks diversity, the outputs produced by the large language model will also
lack diversity.
Consent: Large language models are trained on massive scraped datasets — some
of which might not have been obtained consensually. When scraping data
from the internet, large language models have been known to ignore copyright
licenses, plagiarize written content, and repurpose proprietary content without
getting permission from the original owners or artists. When it produces
results, there is no way to track data lineage, and often no credit is given to
the creators, which can expose users to copyright infringement issues.
They might also scrape personal data, like names of subjects or
photographers from the descriptions of photos, which can compromise
privacy. LLMs have already run into lawsuits, including a prominent one by
Getty Images, for violating intellectual property.
Scaling: It can be difficult and time- and resource-consuming to scale and
maintain large language models.
Deployment: Deploying large language models requires deep learning, a
transformer model, distributed software and hardware, and overall technical
expertise.
An API (Application Programming Interface) is a set of rules and protocols that allow
different software applications to communicate and interact with each other. It defines the
ways in which one application can access and use the services or data provided by another
application or system.
Common use cases for APIs include:
1. Web Services: APIs enable different web applications or websites to share data and
functionalities, allowing for seamless integration and communication between them.
2. Mobile App Development: APIs provide a way for mobile apps to interact with
remote servers or databases, enabling features such as accessing user data, processing
payments, or integrating with third-party services.
3. Software Integration: APIs facilitate the integration of different software systems or
components, enabling them to exchange data and functionality, enhancing
interoperability and reducing the need for custom development.
4. Data Sharing: APIs allow organizations to securely share data with partners,
developers, or customers, enabling them to build applications or services on top of
that data.
5. Internet of Things (IoT): APIs play a crucial role in IoT systems by enabling
communication and data exchange between various devices, sensors, and platforms.
6. Cloud Services: Cloud service providers, such as Amazon Web Services (AWS),
Google Cloud Platform (GCP), and Microsoft Azure, offer APIs that allow developers
to access and utilize their services programmatically.
7. Machine Learning and AI: APIs can be used to integrate machine learning models
or artificial intelligence capabilities into applications, enabling features like natural
language processing, image recognition, or predictive analytics.
Here's an example of how to make a GET request to an API endpoint and retrieve the
response data using Python's requests library:

import requests

url = "https://api.example.com/data"

# Optional query parameters and headers
params = {
    "key1": "value1",
    "key2": "value2"
}
headers = {
    "Authorization": "Bearer your_api_token"  # illustrative auth header
}

response = requests.get(url, params=params, headers=headers)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")
Here's an example of how to make a POST request to an API endpoint with a JSON payload:

import requests
import json

url = "https://api.example.com/create"

# Request payload
payload = {
    "email": "[email protected]"
}

response = requests.post(url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")
Now, let's dive into the OpenAI API for text generation:
OpenAI's API provides access to their language models, including GPT-3 (Generative Pre-
trained Transformer 3), which is a powerful natural language processing model capable of
generating human-like text. The API allows developers to integrate text generation
capabilities into their applications or services.
Some use cases for the OpenAI API for text generation include:
1. Content Generation: Generating articles, stories, essays, scripts, or any other form of
written content based on prompts or inputs.
2. Creative Writing: Assisting with creative writing tasks, such as generating plot ideas,
character descriptions, or dialogue.
3. Language Translation: Translating text from one language to another, leveraging the
model's understanding of context and language structure.
9. Data Augmentation: Generating synthetic training data for machine learning models
by creating variations of existing text samples.
Here's an example of calling the Chat Completions endpoint with the `openai`
Python library (v1.x interface), where `client` is an `OpenAI` instance that
reads the API key from the environment:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
py2pdf is a Python library that allows you to convert HTML content to PDF documents. It
utilizes the versatile wkhtmltopdf rendering engine, which is based on the Qt WebKit engine,
providing a reliable and robust conversion process. This library simplifies the task of
generating PDF files from HTML templates, making it an ideal choice for web developers,
report generation applications, and any scenario where you need to create PDF documents
programmatically. With its straightforward API and customization options, py2pdf
streamlines the process of transforming HTML content into professional-looking PDF files.
Here's a detailed example of how to implement the `py2pdf` library in a Python project to
convert HTML content to PDF files:
```bash
pip install py2pdf
```
Next, we'll create a new Python file, e.g., `html_to_pdf.py`, and add the following code:
You can customize the HTML content, styles, and conversion options according to your
requirements.
Once you have `wkhtmltopdf` installed, you can run the `html_to_pdf.py` script, and it will
generate a PDF file named `output.pdf` in the same directory.
Here are some additional options you can use with the `htmltopdf` function:
- `output_path`: Specify the path (directory) where the output PDF file should be saved.
- `stylesheet`: Provide a CSS file or a list of CSS files to apply styles to the HTML content.
- `header_html`: Specify HTML content to be included as a header on each page.
- `footer_html`: Specify HTML content to be included as a footer on each page.
- `toc`: Generate a table of contents for the PDF document.
- `cover`: Specify an HTML file or a URL to be used as the cover page.
- `orientation`: Set the orientation of the PDF document to either "Portrait" or "Landscape".
You can find more information about the available options and their usage in the `py2pdf`
documentation: https://py2pdf.readthedocs.io/en/latest/
2. Faiss-cpu:
Faiss-cpu is a CPU-based version of the Faiss (Facebook AI Similarity Search) library, which is
a powerful tool for efficient similarity search and clustering of dense vector embeddings. It
provides high-performance and scalable algorithms for searching, indexing, and comparing
large collections of high-dimensional vectors. Faiss-cpu is particularly useful in applications
involving natural language processing, computer vision, and recommendation systems,
where similarity search is a crucial component. Despite being a CPU-based implementation,
Faiss-cpu still offers impressive performance and can be integrated into various machine
learning pipelines and applications that require efficient vector similarity computations.
I can provide an example of how to use the `faiss-cpu` library in a Python project. Faiss
(Facebook AI Similarity Search) is a library for efficient similarity search and clustering of
dense vectors. Here's an example implementation:
Next, we'll create a new Python file, e.g., `faiss_example.py`, and add the following code:

import numpy as np
import faiss

# Sample data
num_vectors = 1000
vector_dim = 128
vectors = np.random.rand(num_vectors, vector_dim).astype('float32')

# Create index
index = faiss.IndexFlatL2(vector_dim)

# Add vectors to the index
index.add(vectors)

# Create a random query vector
query_vector = np.random.rand(1, vector_dim).astype('float32')

# Number of nearest neighbors to retrieve
k = 5

# Perform the similarity search
distances, indices = index.search(query_vector, k)

print("Indices of nearest neighbors:", indices)
print("Distances:", distances)
1. We import the necessary libraries: `numpy` for working with arrays, and `faiss` for
similarity search and clustering.
2. We create a sample dataset of `num_vectors` random vectors, each with `vector_dim`
dimensions, using NumPy.
3. We create a `faiss.IndexFlatL2` index, which is a flat index that computes L2 (Euclidean)
distances between vectors.
4. We add the sample vectors to the index using the `index.add()` method.
5. We create a random query vector to search for similar vectors.
6. We specify the number of nearest neighbors (`k`) to retrieve for the query vector.
7. We perform the similarity search using the `index.search()` method, providing the query
vector and the number of nearest neighbors to retrieve.
8. The `index.search()` method returns two arrays: `distances` and `indices`. `distances`
contains the distances between the query vector and each of the retrieved nearest
neighbors, while `indices` contains the indices of the nearest neighbor vectors in the original
dataset.
9. We print the indices and distances of the `k` nearest neighbors to the query vector.
This example demonstrates how to create an index, add vectors to the index, and perform
similarity search using the `faiss-cpu` library.
You can customize the code to work with your own dataset and vector representations.
Additionally, you can explore different index types provided by Faiss, such as `IndexIVFFlat`
for larger datasets or `IndexHNSWFlat` for approximate nearest neighbor search.
Faiss also supports GPU acceleration through the `faiss-gpu` package, which can significantly
improve performance for large-scale similarity search tasks.
3. Altair:
```bash
pip install altair
```
Next, we'll create a new Python file, e.g., `altair_example.py`, and add the following code:
```python
import altair as alt
import pandas as pd

# Sample datasets
data = pd.DataFrame({"category": ["A", "B", "C"], "value": [10, 25, 15]})
source = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2]})

# Bar chart
bar_chart = alt.Chart(data).mark_bar().encode(x='category', y='value')

# Scatter plot
scatter_plot = alt.Chart(source).mark_point().encode(
    x='x',
    y='y'
)

# Display both charts
bar_chart.show()
scatter_plot.show()
```
1. We import the necessary libraries: `altair` for creating visualizations and `pandas` for
working with data.
2. We create a sample dataset using a Pandas DataFrame.
3. We create a simple bar chart using the `alt.Chart` function from Altair. We specify the data
source (`data`), the mark type (`mark_bar()`), and the encoding (`encode()`) for the x and y
axes.
4. We create another sample dataset for a scatter plot.
5. We create a scatter plot using the `alt.Chart` function, specifying the data source
(`source`), the mark type (`mark_point()`), and the encoding for the x and y axes.
6. We display the bar chart and scatter plot using the `show()` method.
When you run this script, it will display two visualizations: a bar chart and a scatter plot.
You can customize the visualizations by using different mark types (e.g., `mark_line()`,
`mark_area()`, `mark_circle()`), adjusting the encoding, adding titles, legends, and other
visual properties.
Altair provides a powerful and expressive syntax for creating a wide range of visualizations,
from simple charts to complex, interactive dashboards. You can find more examples and
documentation at https://altair-viz.github.io/.
CODING
Graphical User Interface (GUI):
history.py:
This part of the code deals with the chat history during the session:
import streamlit as st
from langchain.memory import ConversationBufferMemory

class ChatHistory:
    def __init__(self):
        self.history = st.session_state.get(
            "history",
            ConversationBufferMemory(memory_key="chat_history", return_messages=True),
        )
        st.session_state["history"] = self.history

    def default_greeting(self):
        # Default greeting shown when a new chat starts
        return "Hello! Ask me anything about the PDF."

    def reset(self):
        st.session_state["history"].clear()
        st.session_state["reset_chat"] = False
class Layout:
def show_header(self):
"""
Displays the header of the app
"""
st.markdown(
"""
<h1 style='text-align: center;'>PDFChat, A New way to interact with your
pdf! </h1>
""",
unsafe_allow_html=True,
)
def show_api_key_missing(self):
"""
Displays a message if the user has not entered an API key
"""
st.markdown(
"""
<div style='text-align: center;'>
<h4>Enter your <a href="https://platform.openai.com/account/api-keys" target="_blank">OpenAI API key</a> to start chatting </h4>
</div>
""",
unsafe_allow_html=True,
)
def prompt_form(self):
"""
Displays the prompt form
"""
with st.form(key="my_form", clear_on_submit=True):
user_input = st.text_area(
"Query:",
placeholder="Ask me anything about the PDF...",
key="input",
label_visibility="collapsed",
)
submit_button = st.form_submit_button(label="Send")
import streamlit as st
class Sidebar:
MODEL_OPTIONS = ["gpt-3.5-turbo", "gpt-4"]
TEMPERATURE_MIN_VALUE = 0.0
TEMPERATURE_MAX_VALUE = 1.0
TEMPERATURE_DEFAULT_VALUE = 0.0
TEMPERATURE_STEP = 0.01
    @staticmethod
    def about():
        # "about" is assumed to be an expandable sidebar section
        about = st.sidebar.expander("About")
        sections = [
            " ",
        ]
        for section in sections:
            about.write(section)
def model_selector(self):
model = st.selectbox(label="Model", options=self.MODEL_OPTIONS)
st.session_state["model"] = model
@staticmethod
def reset_chat_button():
if st.button("Reset chat"):
st.session_state["reset_chat"] = True
st.session_state.setdefault("reset_chat", False)
def temperature_slider(self):
temperature = st.slider(
label="Temperature",
min_value=self.TEMPERATURE_MIN_VALUE,
max_value=self.TEMPERATURE_MAX_VALUE,
value=self.TEMPERATURE_DEFAULT_VALUE,
step=self.TEMPERATURE_STEP,
)
st.session_state["temperature"] = temperature
    def show_options(self):
        # Renders the sidebar widgets: about, reset button, model selector,
        # temperature slider
        ...
class Utilities:
    @staticmethod
    def load_api_key():
        """
        Loads the OpenAI API key from the .env file or from the user's input
        and returns it
        """
        if os.path.exists(".env") and os.environ.get("OPENAI_API_KEY") is not None:
            user_api_key = os.environ["OPENAI_API_KEY"]
        else:
            # Otherwise ask the user for the key in the sidebar
            user_api_key = st.sidebar.text_input(
                label="Your OpenAI API key", type="password"
            )
        return user_api_key
    @staticmethod
    def handle_upload():
        """
        Handles the file upload and displays the uploaded file
        """
        uploaded_file = st.sidebar.file_uploader("upload", type="pdf",
                                                 label_visibility="collapsed")
        if uploaded_file is None:
            st.sidebar.info("Upload a PDF file to get started.")
        return uploaded_file
@staticmethod
def setup_chatbot(uploaded_file, model, temperature):
"""
Sets up the chatbot with the uploaded file, model, and temperature
"""
embeds = Embedder()
with st.spinner("Processing..."):
uploaded_file.seek(0)
file = uploaded_file.read()
vectors = embeds.getDocEmbeds(file, uploaded_file.name)
chatbot = Chatbot(model, temperature, vectors)
st.session_state["ready"] = True
return chatbot
app.py:
This is the main executable file that is executed with the command
streamlit run app.py
import os
import streamlit as st

if __name__ == '__main__':
    # Instantiate the UI helpers defined in the gui modules
    layout, sidebar, utils = Layout(), Sidebar(), Utilities()

    layout.show_header()
    user_api_key = utils.load_api_key()
    if not user_api_key:
        layout.show_api_key_missing()
    else:
        os.environ["OPENAI_API_KEY"] = user_api_key
        pdf = utils.handle_upload()
        if pdf:
            sidebar.show_options()
            try:
                history = ChatHistory()
                chatbot = utils.setup_chatbot(
                    pdf, st.session_state["model"], st.session_state["temperature"]
                )
                st.session_state["chatbot"] = chatbot
                if st.session_state["ready"]:
                    history.initialize(pdf.name)
                    # prompt_container and response_container are Streamlit
                    # containers created elsewhere in the file
                    with prompt_container:
                        is_ready, user_input = layout.prompt_form()
                    if st.session_state["reset_chat"]:
                        history.reset()
                    if is_ready:
                        output = st.session_state["chatbot"].conversational_chat(user_input)
                    history.generate_messages(response_container)
            except Exception as e:
                st.error(f"{e}")
                st.stop()
    sidebar.about()
chatbot.py:
import streamlit as st
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

class Chatbot:
    def conversational_chat(self, query):
        # ... build a ConversationalRetrievalChain from the chat model and the
        # vector store, run the query against it, and ...
        return result["answer"]
embeddings.py
import os
import pickle
import tempfile
class Embedder:
def __init__(self):
self.PATH = "embeddings"
self.createEmbeddingsDir()
def createEmbeddingsDir(self):
"""
Creates a directory to store the embeddings vectors
"""
if not os.path.exists(self.PATH):
os.mkdir(self.PATH)
    def getDocEmbeds(self, file, original_filename):
        # ... compute the embeddings for the uploaded file (or load them from
        # the cache directory) and ...
        return vectors
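The caching idea behind the `Embedder` class can be sketched with the standard library alone: compute a value once, pickle it into the embeddings directory, and reuse the pickle on later calls. The function names below are illustrative, not the project's actual API:

```python
import os
import pickle
import tempfile

def cached_compute(key, compute_fn, cache_dir):
    """Return compute_fn()'s result, caching it on disk under `key`."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.pkl")
    if os.path.exists(path):      # cache hit: skip recomputation
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()         # cache miss: compute and store
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

cache_dir = tempfile.mkdtemp()
first = cached_compute("doc1", lambda: [1, 2, 3], cache_dir)
second = cached_compute("doc1", lambda: [9, 9, 9], cache_dir)  # served from cache
print(second)  # [1, 2, 3] (the cached value, not the second lambda's result)
```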
.gitignore
# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# AWS User-specific
.idea/**/aws.xml
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# SonarLint plugin
.idea/sonarlint/
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# Celery stuff
celerybeat-schedule
celerybeat.pid
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
requirements.txt
# ChatPDF/chatbot.py: 2,3,4
# ChatPDF/embedding.py: 5,6,7
# ChatPDF/gui/history.py: 4
# ChatPDF/notebook/pdf_chat.ipynb: 1,3,10,11,19,20,21,22
langchain==0.0.153
# ChatPDF/app.py: 3
# ChatPDF/chatbot.py: 1
# ChatPDF/gui/history.py: 1
# ChatPDF/gui/layout.py: 1
# ChatPDF/gui/sidebar.py: 3
streamlit==1.22.0
# ChatPDF/gui/history.py: 5
streamlit_chat_media==0.0.4
pypdf==3.8.1
openai==0.27.5
tiktoken==0.3.3
faiss-cpu==1.7.4
TESTING
1. Unit Testing:
- Unit tests are designed to test individual units or components of the
application in isolation.
- For the PDF-CHAT application, unit tests can be written to verify the
functionality of individual modules such as text chunking algorithms, OpenAI
embedding generation, LangChain LLM integration, and user interface
components.
- Unit tests help catch bugs early in the development process and facilitate
code refactoring and maintainability.
- Tools like pytest (for Python), Jest (for JavaScript), and JUnit (for Java) can be
used to write and run unit tests.
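As a concrete illustration, a pytest-style unit test for the text chunking step might look like the sketch below. The chunk_text function here is a simplified, self-contained stand-in for the application's actual chunking algorithm, written only to show the testing pattern:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    # Split text into fixed-size chunks that overlap by `overlap` characters,
    # mimicking the sliding-window chunking applied to PDF content
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

def test_chunks_cover_whole_text():
    text = "".join(str(i % 10) for i in range(260))
    chunks = chunk_text(text, chunk_size=100, overlap=20)
    # Every chunk except the last has the requested size
    assert all(len(c) == 100 for c in chunks[:-1])
    # Dropping the overlapping prefix of each later chunk rebuilds the input
    rebuilt = chunks[0] + "".join(c[20:] for c in chunks[1:])
    assert rebuilt == text
```

Running pytest against a file containing such tests verifies the chunking invariants (full coverage, correct overlap) in isolation from the rest of the pipeline.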
2. Integration Testing:
- Integration tests verify the interaction and communication between
different components or modules of the application.
- In the case of PDF-CHAT, integration tests can be performed to ensure that
the text chunking, embedding generation, and LLM components work together
seamlessly to generate accurate responses.
- Integration tests can also be used to validate the integration between the
backend and frontend components, such as testing the data flow from file
upload through the LangChain backend logic to the Streamlit UI.
- Tools like Selenium or Cypress can be used for end-to-end integration testing
of the application's user interface and backend integration.
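The chunk-embed-retrieve interaction can be integration-tested without calling external services by substituting cheap fakes for the expensive pieces. In the sketch below, a toy bag-of-words "embedding" stands in for the OpenAI embedding call and a token-overlap score stands in for FAISS similarity search; only the wiring between the stages is what the test actually checks:

```python
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; stands in for the OpenAI embedding call
    return Counter(text.lower().split())

def similarity(a, b):
    # Count of shared tokens; stands in for FAISS vector similarity
    return sum((a & b).values())

def retrieve(chunks, question):
    # Return the chunk whose embedding best matches the question's
    q = embed(question)
    return max(chunks, key=lambda c: similarity(embed(c), q))

def test_pipeline_returns_relevant_chunk():
    chunks = [
        "Payments are due on the first day of each month.",
        "The warranty covers manufacturing defects for two years.",
    ]
    assert "warranty" in retrieve(chunks, "How long does the warranty last?")
```

Swapping the fakes for the real embedding and vector-store components turns the same test into a full (slower, networked) integration test.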
3. Functional Testing:
- Functional tests validate the application against specified requirements and
user scenarios.
- For PDF-CHAT, functional tests can be designed to test the core
functionalities, such as uploading PDF files, asking questions, displaying
responses, and handling edge cases or error scenarios.
- Automated functional tests can simulate user actions and verify the
expected outputs, ensuring that the application behaves as intended.
- Tools like Selenium WebDriver or Appium can be used for automating
functional tests across different browsers, devices, and platforms.
4. Performance Testing:
- Performance tests evaluate the application's behavior and response times
under different load conditions, such as high user traffic or large PDF files.
- For PDF-CHAT, performance tests can measure the application's response
times for processing PDFs, generating embeddings, querying the LLM, and
rendering responses in the UI.
- Load testing tools like Apache JMeter, Locust, or k6 can be used to simulate
different levels of concurrent users and measure the application's performance
metrics.
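Before reaching for a full load-testing tool, response times for individual pipeline stages can be measured with a small timing harness like the one below. The answer_fn parameter would be the real question-answering call in practice; here the harness itself is the point:

```python
import time
import statistics

def measure_latency(answer_fn, queries, repeats=3):
    # Time answer_fn(query) over several runs and summarise the samples
    samples = []
    for _ in range(repeats):
        for q in queries:
            t0 = time.perf_counter()
            answer_fn(q)
            samples.append(time.perf_counter() - t0)
    samples.sort()
    return {
        "mean_s": statistics.mean(samples),
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
    }
```

Reporting a high percentile (p95) alongside the mean matters here because LLM and embedding calls have long-tailed latencies that an average alone would hide.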
5. Security Testing:
- Security tests assess the application's resilience against potential
vulnerabilities and attacks, such as SQL injection, cross-site scripting (XSS), or
unauthorized access attempts.
- For PDF-CHAT, security tests can focus on testing the file upload
functionality, user input validation, and protection against potential attacks or
malicious PDF content.
- Tools like OWASP ZAP or Burp Suite can be used for security testing and
identifying vulnerabilities.
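A first line of defence for the upload path is simple server-side validation before any parsing happens. The sketch below shows the idea with an illustrative size cap and a magic-byte check; it is not a substitute for deeper inspection of the PDF's contents:

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # illustrative cap, not the app's real limit

def is_acceptable_pdf(data: bytes) -> bool:
    # Reject oversized uploads before any parsing happens
    if len(data) > MAX_UPLOAD_BYTES:
        return False
    # A genuine PDF starts with the "%PDF-" magic bytes; this catches files
    # merely renamed to .pdf, though deeper parsing checks are still advisable
    return data[:5] == b"%PDF-"
```

Security tests can then feed this validator oversized files, renamed HTML files, and malformed headers to confirm each is rejected before the parser ever sees it.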
6. Usability Testing:
- Usability tests evaluate the application's user interface and user experience,
identifying areas for improvement in terms of ease of use, navigation, and
accessibility.
- For PDF-CHAT, usability tests can involve observing users interacting with the
application, gathering feedback on the interface design, and identifying any
usability issues or pain points.
- Tools like UserTesting, Hotjar, or moderated usability testing sessions can be
employed to gather usability data and insights.
7. Compatibility Testing:
- Compatibility tests ensure that the application functions correctly across
different platforms, browsers, devices, and configurations.
- For PDF-CHAT, compatibility tests can involve testing the application on
various operating systems (Windows, macOS, Linux), different web browsers
(Chrome, Firefox, Safari, Edge), and mobile devices with varying screen sizes
and resolutions.
- Tools like BrowserStack or SauceLabs can be used for cross-browser and
cross-device compatibility testing.
8. Regression Testing:
- Regression tests are performed to ensure that existing features continue to
work as expected after introducing new changes, bug fixes, or enhancements
to the application.
- For PDF-CHAT, regression tests can be automated to verify that the core
functionality, such as PDF processing, question-answering, and UI interactions,
remain intact after each code change or update.
- Regression test suites can be built using test automation frameworks like
Selenium or pytest and integrated into the continuous integration/continuous
deployment (CI/CD) pipeline.
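One lightweight way to automate such regression checks is a golden-file comparison: the first run records the expected output, and every later run is compared against it. The helper below is a minimal sketch of the pattern (the function and directory names are illustrative, not part of the application):

```python
import json
import pathlib

def matches_golden(name, result, golden_dir="golden_answers"):
    # Compare a result against a stored "golden" answer; the first run
    # records the golden file, later runs detect regressions against it
    path = pathlib.Path(golden_dir) / f"{name}.json"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(result, sort_keys=True))
        return True
    return json.loads(path.read_text()) == result
```

Checking the golden files into version control makes any behavioural drift in PDF processing or question-answering show up as a failing test in the CI/CD pipeline.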
By incorporating these various testing types into the development process, you
can ensure the quality, reliability, and robustness of the PDF-CHAT application,
while also identifying and addressing any potential issues or defects early on.
Additionally, adopting a test-driven development (TDD) approach and
integrating testing into the CI/CD pipeline can further streamline the testing
process and ensure a high-quality product delivery.
OUTPUT SCREENS
Run the application with the given command in the terminal.
There is a collapsible navigation bar with options such as Rerun and Settings.
On the sidebar, a dialog box prompts for your API key to start the chat.
Once the key is verified, an option to upload a PDF appears as shown below.
Upload any PDF that you want to interact with.
After uploading, a new chat window appears as shown, where you can chat
about the contents of your PDF. There is also a slider on the sidebar to adjust
the "Temperature" of the LLM, i.e. how creative its answers are allowed to be.
At the end there is an option to reset the chat once you are done.
CONCLUSION
The PDF-CHAT application is a groundbreaking solution that revolutionizes the
way users interact with and extract information from PDF documents. By
leveraging cutting-edge technologies in natural language processing, machine
learning, and user interface design, the application provides an intuitive and
efficient means of navigating through complex PDF content.
FUTURE ENHANCEMENTS
1. Multi-Language Support:
Enhance the application to support multiple languages for both the PDF
content and the user interface. This would involve integrating language
detection algorithms, incorporating multilingual language models, and enabling
language selection options for users, making the application accessible to a
broader global audience.
These future enhancements would not only improve the functionality and user
experience of the PDF-CHAT application but also broaden its applicability and
appeal across various domains and use cases, further solidifying its position as
a powerful and innovative tool for information retrieval and knowledge
management.
BIBLIOGRAPHY
1. Gillies, S. (2022). "Introducing ChatGPT and the AI revolution." Nature,
613(7942), 13. https://doi.org/10.1038/d41586-023-00446-w
2. Honnibal, M., & Montani, I. (2017). "spaCy 2: Natural language understanding
with Bloom embeddings, convolutional neural networks and incremental
parsing." To appear, 7(1), 411-420. https://spacy.io/
3. Johnson, J., Douze, M., & Jégou, H. (2021). "Billion-scale similarity search
with GPUs." IEEE Transactions on Big Data, 7(3), 535-547.
https://doi.org/10.1109/TBDATA.2019.2921572
4. Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage
Search via Contextualized Late Interaction over BERT." Proceedings of the 43rd
International ACM SIGIR Conference on Research and Development in
Information Retrieval, 39-48. https://doi.org/10.1145/3397271.3401081
5. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... &
Riedel, S. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive
NLP Tasks." Advances in Neural Information Processing Systems, 33, 9459-9474.
https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V.
(2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv
preprint arXiv:1907.11692. https://arxiv.org/abs/1907.11692
7. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019).
"Language models are unsupervised multitask learners." OpenAI blog, 1(8), 9.
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
8. Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings
using Siamese BERT-Networks." Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992.
https://doi.org/10.18653/v1/D19-1410
9. Wenzina, R. (2021). "PDF Parsing in Python." In Advanced Guide to Python 3
Programming (pp. 289-312). Apress, Berkeley, CA.
https://doi.org/10.1007/978-1-4842-6044-5_10