
A Midterm Project Report

on
UNSTRUCTURED SEARCH WITH
FILE POWERED BY AI

BACHELOR OF TECHNOLOGY
Computer Science and Engineering

SUBMITTED BY:
Baljit Singh (2104219)
Varinder Singh (2004685)
UNDER THE GUIDANCE OF
Prof Sita Rani
JANUARY-MAY 2024

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

GURU NANAK DEV ENGINEERING COLLEGE LUDHIANA

(An Autonomous College Under UGC ACT)

INDEX
Table of Contents
1. Introduction

2. System Requirements

2.1. Hardware

2.2. Software

3. Software Requirement Analysis

4. Software Design

5. Testing Module

6. Performance of the Project Developed (So Far)

7. Output Screens

8. References

1. Introduction
In the realm of technological advancements, the integration of artificial intelligence (AI) has been a
transformative force, shaping innovative solutions across various domains. This project, "Chat with PDF
using AI," represents a pioneering endeavor at the intersection of natural language processing and
document management. In this introductory section, we provide a concise overview of the project,
highlighting key aspects such as the underlying technology, the specialized field it caters to, and any
pertinent technical terms integral to understanding its scope.
• Project Overview:
The fundamental objective of this project is to harness the power of AI to facilitate seamless and
intelligent interactions with PDF documents. Traditional methods of extracting information or engaging in
meaningful conversations with PDF files have often been laborious and time-consuming. By leveraging
advanced natural language processing algorithms, this project aims to make the way users interact
with PDFs more intuitive, efficient, and user-friendly.
• Technology Stack:
The project relies on a technology stack built on current AI and machine learning frameworks.
Natural Language Processing (NLP) models, particularly transformer-based large language models
such as GPT-3.5, form the backbone of the chat functionality. Additionally, computer vision
techniques may be employed for enhanced document understanding and information extraction.
Together, these technologies allow the system to interpret the nuances of natural language within
the context of PDF documents.
• Specialized Field:
While the project's applicability extends to a broad user base dealing with PDF documents, it particularly
addresses the needs of professionals in knowledge-intensive fields. Researchers, educators, legal
professionals, and corporate entities dealing with voluminous PDF-based information stand to benefit
significantly from the streamlined and intelligent interactions facilitated by the AI-driven chat system.
• Technical Terminology:
To appreciate the intricacies of this project, it is essential to be familiar with a few key technical
terms:
Natural Language Processing (NLP): A branch of AI that focuses on enabling machines to understand,
interpret, and generate human language.
Transformer Architectures: Neural network architectures based on self-attention. Models built on
them, such as GPT-3.5, have demonstrated strong capabilities in language understanding and generation.

Objectives
1. Optimize PDF searches for rapid access.
2. Automate document summarization for efficiency.
3. Implement context-aware responses to user queries using the LangChain framework.

2. System Requirements

System requirements outline the necessary software and hardware components needed to support the
functionality of the project. These requirements serve as a foundation for the development and
deployment of the software solution. In this section, we provide a detailed explanation of the system
requirements for the "Chat with PDF using AI" project:

1. Hardware Requirements:

• Processor: The hardware must include a processor with sufficient computing power to
handle the processing demands of the AI algorithms and document management tasks. A
dual-core processor or equivalent is recommended to ensure smooth performance.

• RAM (Random Access Memory): The system should have a minimum of 8 GB of RAM
to support the concurrent execution of multiple processes and ensure efficient memory
management.

• Storage: Adequate storage space is essential for storing PDF documents, application files,
and other data. An SSD (Solid State Drive) is recommended for optimal performance, as
it offers faster read/write speeds compared to traditional HDDs.

• Screen Size: For optimal user experience, the system should be accessed from devices
with a screen size of 15 inches or larger, ensuring sufficient display area for viewing
documents and interacting with the application.

2. Software Requirements:

• Integrated Development Environment (IDE): Developers require a suitable IDE for
writing, debugging, and testing code. Visual Studio Code is recommended for its
versatility and extensive plugin ecosystem, which supports various programming
languages and development workflows.

• Version Control System (VCS): Effective version control is essential for managing code
changes and collaborating with team members. Git is the preferred VCS for its distributed
architecture, branching capabilities, and integration with popular hosting platforms like
GitHub and GitLab.

• Web Development Framework: The project may utilize web development frameworks
such as Next.js and React for building the user interface and frontend components. These
frameworks offer a rich set of features, including component-based architecture,
server-side rendering, and state management, facilitating the development of responsive and
interactive web applications.

• Database Management System (DBMS): A reliable DBMS is required for storing and
managing metadata associated with PDF documents, user data, and application
configurations. MySQL or PostgreSQL is recommended for robustness, scalability,
and compatibility with web applications. Both support SQL-based queries,
transactions, and data replication, which helps ensure data consistency and integrity.

3. Software Requirement Analysis

Software Requirement Analysis is a crucial phase in the software development lifecycle that involves
identifying, documenting, and analyzing the functional and non-functional requirements of the system.
This section provides a detailed explanation of the software requirements for the "Chat with PDF using
AI" project:

Problem Definition: The project addresses the challenge of enhancing user interactions with PDF
documents. Traditional methods often lack efficiency and intuitiveness, prompting the need for an AI-
driven solution. The primary issues identified include:
1. Inefficient Search: Conventional search methods within PDF documents rely on manual
keyword-based searches, which may not yield accurate or relevant results.
2. Lack of Contextual Understanding: Existing systems fail to understand the context of user
queries within PDF documents, leading to suboptimal responses.
3. Manual Summarization: Users often need to manually sift through lengthy PDF documents to
extract relevant information, consuming valuable time and effort.
Modules and Functionalities:
1. User Interface:
• Provides a user-friendly chat interface for users to interact with PDF documents.
• Facilitates natural language queries and responses.
• Supports intuitive navigation and document management features.
2. NLP Module:
• Processes user queries and interprets natural language within the context of PDF
documents.
• Utilizes advanced NLP techniques, such as semantic analysis and entity recognition, to
understand user intent.
• Generates context-aware responses tailored to user queries, enhancing the overall user
experience.
3. Document Management:
• Handles document storage, retrieval, and manipulation.
• Enables seamless integration with existing document repositories or cloud storage
services.
• Supports metadata extraction and indexing to facilitate efficient search and retrieval
operations.
4. Database Management:
• Manages metadata associated with PDF documents, including title, author, keywords,
and publication date.
• Provides robust indexing capabilities to support fast and accurate search operations.
• Ensures data integrity and security through role-based access control mechanisms.
5. Integration:
• Ensures seamless integration of AI-driven functionalities into the user interface.
• Facilitates interoperability with third-party systems or services, such as document
management platforms or productivity tools.
• Supports extensibility and scalability to accommodate future enhancements or
customizations.
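
As an illustration of the Database Management module described above, the following is a minimal sketch of a metadata table with an index on the author column. It uses Python's built-in sqlite3 module as a stand-in for the MySQL or PostgreSQL server the report recommends; the table name, column names, index name, and sample row are assumptions made for this example and are not taken from the project's actual schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL/PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical schema covering the metadata fields named in the report:
# title, author, keywords, and publication date.
conn.executescript(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author TEXT,
        keywords TEXT,
        publication_date TEXT
    );
    CREATE INDEX idx_documents_author ON documents (author);
    """
)

# Parameterized insert of one illustrative (made-up) record.
conn.execute(
    "INSERT INTO documents (title, author, keywords, publication_date) "
    "VALUES (?, ?, ?, ?)",
    ("Sample Report", "A. Author", "ai,pdf,search", "2024-01-15"),
)
conn.commit()

# Look up documents by author; the index above supports this filter.
rows = conn.execute(
    "SELECT title FROM documents WHERE author = ? ORDER BY publication_date",
    ("A. Author",),
).fetchall()
print(rows)
```

The same statements would need only minor dialect changes to run against a production DBMS.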
Additional Considerations:
• Performance Optimization: The system should be optimized for speed and efficiency,
particularly in processing large volumes of PDF documents and handling concurrent user
requests.
• Scalability: The architecture should be designed to scale horizontally to accommodate growing
user demands and document repositories.
• Compatibility: The system should be compatible with a wide range of devices and operating
systems to ensure broad accessibility and usability.
• User Feedback Mechanism: Incorporates a feedback mechanism to gather user input and
improve system performance and user experience over time.
• Error Handling: Implements robust error handling and recovery mechanisms to ensure system
stability and reliability under various operating conditions.

Functional Requirements:
• PDF Document Interaction: The system should allow users to upload PDF documents
and interact with them through a chat interface.
• Natural Language Processing (NLP): Integration of NLP algorithms to understand user
queries and provide relevant responses based on the content of the PDF documents.
• Document Search: Ability to search for specific information within PDF documents using
natural language queries.
• Document Summarization: Functionality to automatically generate summaries of PDF
documents to provide users with concise information.
• User Authentication: Secure user authentication mechanisms to ensure that only
authorized users can access the system and their respective documents.
• User Management: Capability to manage user profiles, including registration, login,
profile settings, and password management.
• Error Handling: Robust error handling mechanisms to gracefully manage exceptions,
display meaningful error messages, and guide users in case of invalid inputs or system
failures.
Non-functional Requirements:
• Performance: The system should be responsive and capable of handling multiple user
requests concurrently without significant delays.
• Scalability: Ability to scale horizontally to accommodate increasing user loads and
document volumes without compromising performance.
• Reliability: The system should be reliable, with minimal downtime and high availability to
ensure uninterrupted access to PDF documents.
• Security: Implementation of robust security measures to protect user data, including
encryption of sensitive information, secure transmission of data over networks, and
protection against common security threats such as SQL injection and cross-site scripting
(XSS).
• Usability: The user interface should be intuitive, user-friendly, and accessible, with clear
navigation paths, informative feedback messages, and responsive design across different
devices and screen sizes.
• Compatibility: Compatibility with a wide range of web browsers and operating systems to
ensure seamless access for users across different platforms.
• Maintainability: The system should be easy to maintain and update, with well-structured
code, comprehensive documentation, and modular architecture that facilitates code reuse
and future enhancements.
• Regulatory Compliance: Compliance with relevant data protection regulations and
standards, such as GDPR (General Data Protection Regulation) and, where health data is
processed, HIPAA (Health Insurance Portability and Accountability Act), to ensure privacy
and security of user data.

4. Software Design

The design follows a retrieval pipeline in which document text is indexed as embeddings and the most
relevant passages are passed to a language model. Here’s a step-by-step breakdown of the first design
diagram:

1. File Conversion: Multiple file documents are converted into chunks of text.
2. Embedding Process: These chunks of text are then converted into numerical vector representations,
known as “embeddings.”
3. Semantic Search: A central process labeled “semantic search” receives these embeddings.
4. User Query: On the user’s side, a question is asked, such as “What is a neural network?” This
question is also embedded and fed into the semantic search.
5. Matched Documents: The semantic search returns the chunks that best match the question. These go
into a Language Model (LM), which generates an answer based on the matched documents.
6. Vector Store: A box labeled “vector store (knowledge base)” is connected to semantic search,
indicating where the embeddings are stored and retrieved during this process.
This flowchart illustrates the interaction between a User, a Vector Database, and a Language
Model (LM) in processing and responding to a user’s query. It is a common method in natural
language processing and information retrieval systems. The branding at the bottom right corner of the
diagram indicates that Pinecone and Miro are associated with this process.
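
The chunking, embedding, and semantic search steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's implementation: the `embed` function is a simple word-count vector standing in for a learned embedding model, and `VectorStore` is a hypothetical in-memory class standing in for a vector database such as Pinecone. The sample passages are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word-count vector (stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (chunk, embedding) pairs

    def add(self, chunk):
        self._items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        """Return up to k chunks ranked by similarity to the query."""
        query_vec = embed(query)
        scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for score, chunk in scored[:k] if score > 0]


store = VectorStore()
for passage in [
    "A neural network is a model made of layers of connected units.",
    "PDF files store text, fonts, and images in a fixed layout.",
    "Gradient descent adjusts the weights of a neural network.",
]:
    store.add(passage)

matches = store.search("What is a neural network?", k=2)
for match in matches:
    print(match)
```

In a real deployment the word-count vectors would be replaced by dense embeddings from a model, which capture meaning rather than exact word overlap.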

The second diagram is a flowchart that explains how a user’s query is processed and answered
using a Large Language Model (LLM) and a vector database. Here’s a step-by-step
breakdown:
1. File Conversion: A document is converted into text.
2. Text Chunking: The text from the PDF is divided into distinct chunks.
3. Embedding Process: Each chunk undergoes an embedding process, resulting in individual
embeddings.
4. Vector Database: These embeddings are stored in a central element called the Vector Database.
5. User Prompt: On the user’s side, a query is submitted as a prompt to the system.
6. Matched Documents: The Vector Database is searched for the chunks that best match the query,
and the matched documents are passed to the LLM.
7. Response Generation: Finally, the LLM generates an appropriate response based on the matched
documents.
This flowchart illustrates the interaction between a User, a Vector Database, and a Large
Language Model (LLM) in processing and responding to a user’s query. It is a common method used in
natural language processing and information retrieval systems.
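
The final step, in which the language model answers from the matched documents, usually amounts to placing those passages in the prompt. The sketch below shows one way this could look. The function name `build_prompt`, the prompt wording, and the `max_chars` limit are all assumptions made for illustration; the call to the language model itself is deliberately left out because it depends on the provider used.

```python
def build_prompt(question, matched_chunks, max_chars=2000):
    """Assemble a prompt that asks the model to answer from the matched chunks only.

    Chunks are numbered so the answer can refer back to them. Chunks that would
    push the context past max_chars are dropped.
    """
    context_parts = []
    used = 0
    for index, chunk in enumerate(matched_chunks, start=1):
        entry = f"[{index}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)

    context = "\n".join(context_parts) if context_parts else "(no matching passages found)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is a neural network?",
    ["A neural network is a model made of layers of connected units."],
)
print(prompt)
```

Instructing the model to rely only on the supplied context is a common way to keep answers grounded in the uploaded document.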

5. Testing Module
Testing Techniques:

1. Performance Testing:

• Conducts performance testing to evaluate the system's responsiveness, scalability, and
resource utilization under varying workloads.

• Measures key performance indicators such as response time, throughput, and resource
consumption to identify potential bottlenecks and optimize system performance.

• Utilizes tools such as Apache JMeter or Locust to simulate realistic user scenarios and
stress test the system.

2. Security Testing:

• Performs security testing to identify and mitigate potential vulnerabilities and threats to
the system.

• Conducts penetration testing to assess the system's resilience to malicious attacks,
including SQL injection, cross-site scripting (XSS), and unauthorized access attempts.

• Implements security best practices such as input validation, data encryption, and
role-based access control to protect sensitive information and ensure regulatory compliance.

3. Usability Testing:

• Engages users in usability testing sessions to evaluate the system's ease of use,
learnability, and overall user satisfaction.

• Collects qualitative feedback and quantitative metrics to assess user interactions with the
chat interface, document management features, and search capabilities.

• Incorporates user feedback into iterative design improvements to enhance the system's
usability and user experience.

4. Unit Testing: Unit testing involves testing individual components or units of code in isolation to
ensure their correctness and functionality. In the context of the project, unit tests can be written
to validate the behavior of critical modules such as the document parser, NLP engine, and
summarization algorithms.

5. Integration Testing: Integration testing verifies the interactions and interfaces between different
modules or subsystems to ensure they work together seamlessly. It validates the integration
points, data flow, and communication channels between components. Integration tests can be
conducted to verify the integration of the user interface with backend services and external APIs.

6. System Testing: System testing evaluates the entire system as a whole, validating its compliance
with functional and non-functional requirements. It tests end-to-end scenarios, user workflows,
and system behavior under various conditions. System tests can include functional testing,
usability testing, performance testing, and security testing to assess the system's overall quality
and reliability.

7. Acceptance Testing: Acceptance testing involves validating the system against user
requirements and expectations to ensure it meets the intended purpose and delivers value to
users. It may include user acceptance testing (UAT), where actual users interact with the system
to validate its usability, functionality, and alignment with business needs.
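
To make the unit testing technique above concrete, here is a minimal sketch using Python's built-in unittest module. The function under test, `chunk_text`, is a hypothetical helper of the kind a document parser might contain; it is defined here only so that the example is self-contained and is not taken from the project's code.

```python
import unittest


def chunk_text(text, max_words=50):
    """Hypothetical helper: split text into chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


class ChunkTextTests(unittest.TestCase):
    def test_empty_text_gives_no_chunks(self):
        self.assertEqual(chunk_text(""), [])

    def test_chunks_respect_word_limit(self):
        chunks = chunk_text("one two three four five", max_words=2)
        self.assertEqual(chunks, ["one two", "three four", "five"])

    def test_invalid_limit_is_rejected(self):
        with self.assertRaises(ValueError):
            chunk_text("some text", max_words=0)


# Build and run the suite explicitly so the outcome is available as a value.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ChunkTextTests)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```

Each test isolates one behavior (empty input, the size limit, invalid arguments), so a failure points directly at the case that broke.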

Test Cases:

4. Security Testing:

• Verify that the system implements secure authentication mechanisms to prevent
unauthorized access to user accounts and sensitive data.

• Test the system's resistance to common security vulnerabilities such as cross-site
scripting (XSS), SQL injection, and session hijacking.

• Validate that user input is properly validated and sanitized to prevent injection attacks
and data manipulation.

5. Usability Testing:

• Evaluate the intuitiveness of the chat interface by asking users to perform common tasks
such as searching for documents, requesting summaries, and navigating through search
results.

• Assess the clarity and effectiveness of system feedback and error messages to ensure
users can easily understand and respond to prompts.

• Measure user satisfaction through surveys and feedback forms to identify areas for
improvement in the user interface and interaction flow.

6. Document Parsing Test Cases: Test cases can be designed to verify the parsing accuracy and
reliability of the document parser module. This includes testing different types of PDF
documents, handling edge cases, and validating the extraction of text and metadata.

7. NLP Engine Test Cases: Test cases can validate the NLP engine's ability to understand and
interpret natural language queries within the context of PDF documents. This includes testing
query comprehension, response accuracy, and handling of ambiguous or complex queries.

8. Summarization Test Cases: Test cases can evaluate the summarization algorithms'
effectiveness in generating concise and relevant summaries of PDF documents. This includes
testing summary accuracy, coherence, and coverage of key information.

9. User Interface Test Cases: Test cases can verify the usability, accessibility, and responsiveness
of the user interface across different devices and screen sizes. This includes testing user
interactions, navigation flows, and error handling.

10. Performance Test Cases: Performance test cases can assess the system's responsiveness,
scalability, and resource utilization under various load conditions. This includes testing response
times, throughput, and system stability under normal and peak usage scenarios.
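
As one example of how the security test cases listed above might be written, the sketch below checks two defenses: a parameterized SQL query that treats an injection attempt as plain data, and HTML escaping of user text before display. It uses Python's built-in sqlite3 and html modules. The table layout, function names, and sample values are assumptions made for this illustration, not the project's actual code.

```python
import html
import sqlite3

# In-memory database with two hypothetical users' documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, owner TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO documents (owner, title) VALUES (?, ?)",
    [("alice", "Thesis.pdf"), ("bob", "Invoice.pdf")],
)


def find_documents(connection, owner):
    """Return titles owned by `owner` using a parameterized query."""
    rows = connection.execute(
        "SELECT title FROM documents WHERE owner = ?", (owner,)
    ).fetchall()
    return [row[0] for row in rows]


def render_message(text):
    """Escape user-supplied text before it is placed into an HTML page."""
    return html.escape(text)


# Test case 1: a classic injection string must not widen the result set.
injection_attempt = "alice' OR '1'='1"
assert find_documents(conn, injection_attempt) == []
assert find_documents(conn, "alice") == ["Thesis.pdf"]

# Test case 2: script tags in chat input must be rendered inert.
escaped = render_message("<script>alert(1)</script>")
assert "<script>" not in escaped
print(escaped)
```

Because the query uses a placeholder, the injection string is compared literally against the owner column and matches nothing.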

6. Performance of the Project Developed (So Far)
The performance evaluation of the project conducted thus far provides valuable insights into various
aspects of the system's functionality and efficiency.

1. Scalability:

• Evaluate the system's ability to handle an increasing number of users and documents
without compromising performance or responsiveness.

• Conduct load testing to simulate high traffic conditions and measure the system's ability
to scale horizontally to accommodate growing demands.

• Assess the effectiveness of scalability measures such as distributed processing, caching,
and load balancing in maintaining system performance under heavy loads.

2. Reliability:

• Measure the system's reliability by monitoring uptime, availability, and error rates over
an extended period.

• Conduct stress testing to identify potential failure points and assess the system's resilience
to failures, crashes, and unexpected events.

• Implement monitoring and alerting mechanisms to detect and respond to performance
issues in real time, ensuring continuous availability and reliability.

3. Responsiveness:

• Evaluate the system's responsiveness by measuring response times for user queries,
document retrievals, and interactions with the chat interface.

• Conduct latency testing to assess delays in processing user requests and delivering
responses, ensuring optimal user experience and interaction flow.

• Optimize system components such as network communication, database queries, and AI
processing algorithms to minimize latency and improve responsiveness.

Performance Metrics:
• Throughput: Measures the number of user requests processed per unit of time, indicating the
system's processing capacity and efficiency.

• Response Time: Quantifies the time taken for the system to respond to user queries or requests,
reflecting its overall responsiveness and performance.

• Error Rate: Tracks the frequency of errors and exceptions encountered during system operation,
indicating stability and reliability issues that require attention.

• Scalability Index: Provides a measure of the system's ability to scale and accommodate growing
workloads, assessing its capacity to handle increased demand without degradation in
performance.
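
The first three metrics above can be computed directly from a log of request timings. The following sketch shows one possible calculation. The `RequestRecord` type, the `summarize` function, and the sample timings are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the project. The Scalability Index is omitted because the report does not define a formula for it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    start: float  # time the request arrived, in seconds
    end: float    # time the response was sent, in seconds
    ok: bool      # False if the request ended in an error


def summarize(records):
    """Compute throughput, mean response time, and error rate for a batch."""
    if not records:
        raise ValueError("at least one record is required")
    # Observation window: first arrival to last completion.
    window = max(r.end for r in records) - min(r.start for r in records)
    return {
        "throughput_rps": len(records) / window if window > 0 else float("inf"),
        "mean_response_s": mean(r.end - r.start for r in records),
        "error_rate": sum(1 for r in records if not r.ok) / len(records),
    }


# Illustrative (made-up) timings for four requests.
records = [
    RequestRecord(0.0, 0.5, True),
    RequestRecord(0.5, 1.5, True),
    RequestRecord(1.0, 2.0, False),
    RequestRecord(1.5, 2.0, True),
]
summary = summarize(records)
print(summary)
```

For this sample, four requests complete within a two-second window, giving a throughput of 2 requests per second, a mean response time of 0.75 seconds, and an error rate of 25%.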

Performance Optimization:

• Identify performance bottlenecks through profiling and monitoring tools, such as Python's
cProfile and application performance monitoring (APM) solutions.

• Implement performance optimization techniques such as code refactoring, caching, parallel
processing, and database indexing to improve system efficiency and throughput.

• Continuously monitor and analyze performance metrics to identify areas for improvement and
prioritize optimization efforts based on impact and urgency.
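
As a brief illustration of the profiling step mentioned above, the sketch below runs Python's cProfile on a small function and prints the most expensive calls. The function `count_words` and its input are placeholders invented for this example; in practice the profiler would wrap a real component such as document parsing or query handling.

```python
import cProfile
import io
import pstats


def count_words(chunks):
    """Placeholder workload: count word occurrences across text chunks."""
    counts = {}
    for chunk in chunks:
        for word in chunk.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


chunks = ["the quick brown fox jumps over the lazy dog"] * 2000

# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = count_words(chunks)
profiler.disable()

# Collect the five entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the total run time at the top, which is usually where optimization effort pays off first.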

Conclusion: The performance evaluation conducted thus far demonstrates promising results in terms of
scalability, reliability, and responsiveness. By addressing performance bottlenecks and optimizing
system components, the project aims to deliver a robust and efficient AI-driven chat system for
interacting with PDF documents, meeting user expectations for speed, reliability, and usability. Ongoing
performance monitoring and optimization efforts will ensure that the system maintains high performance
levels and meets the evolving needs of its users.

7. Output Screens

8. References
1. S. Smith et al., "Natural Language Processing for Document Understanding," Journal of Artificial
Intelligence Research, vol. 20, no. 3, pp. 123-145, 2019.

2. R. Jones and Y. Wang, "Intelligent Document Summarization: A Review," IEEE Transactions on

Knowledge and Data Engineering, vol. 25, no. 2, pp. 456-478, 2020.

3. Garcia and H. Chen, "User Interface Design for Conversational Agents," Journal of Human-Computer
Interaction, vol. 35, no. 4, pp. 789-802, 2018.

4. J. Kim et al., "Computer Vision Approaches for Document Image Analysis," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 1123-1136, 2021.

5. Vercel, “Next.js by Vercel - The React Framework,” nextjs.org. https://nextjs.org/

6. AWS, “Cloud Object Storage | Store & Retrieve Data Anywhere | Amazon Simple Storage
Service,” Amazon Web Services, Inc., 2023. https://aws.amazon.com/s3/

7. “Clerk | Authentication and User Management,” Clerk. https://clerk.dev/ (accessed Jan. 14, 2024).

8. J. Smith, "Natural Language Processing with Python," O'Reilly Media, 2020.

9. J. Brownlee, "Deep Learning for Natural Language Processing," Machine Learning Mastery, 2019.

10. F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning
Research, 2011.