
PUNE INSTITUTE OF COMPUTER TECHNOLOGY, DHANKAWADI, PUNE-43

Mini Project Report – Information Retrieval

‘Text Summarization – Using Transformers and LLMs’

Submitted By
Name: Atharva Litake
Roll no: 41143 Class: BE-1

Name: Mihir Deshpande
Roll no: 41150 Class: BE-1

Under the guidance of


Prof. YOGESH ASHOK HANDGE

DEPARTMENT OF COMPUTER ENGINEERING


Academic Year 2024-25
Contents
1. TITLE

2. PROBLEM DEFINITION

3. LEARNING OBJECTIVES

4. LEARNING OUTCOMES

5. ABSTRACT

6. TECHNICAL DETAILS ABOUT THE PROJECT

7. GLIMPSE OF THE PROJECT

8. CONCLUSION
1. TITLE: Text Summarization – Using Transformers and LLMs

2. PROBLEM DEFINITION:

The world is growing at a rapid pace, and so is networking among individuals. Social media has become one of the essential necessities of life. With the advent of social media applications such as WhatsApp, Twitter, Facebook and many more, people around the globe have grown more interconnected. Regardless of borders and distance, people are now a single click away from each other. Communication is at the heart of all these social media platforms.
Communication among individuals takes place via messages, tweets, emails and other channels. To stay connected and up to date with real-world events, individuals rely on these communication media. These platforms not only help in establishing contact with near and dear ones but also provide updates from various sectors such as science & technology, medicine, agriculture and countless others. They act as a rich source of information from all walks of life. However, with the increasing number of users on these platforms, a huge amount of data is generated. According to one survey, 500 million tweets are posted on Twitter every day, and on average a person receives 121 emails per day. Given the growing number of users and the volume of data generated, it is humanly impossible to analyse and understand every message and extract the information it carries.
As a solution to this problem, text summarization techniques were considered. Text summarization is the process of distilling the most important information from a source to produce an abridged version for a particular user and task. It provides a concise representation of a large amount of data without losing its meaning and context, creating a short and coherent version of the data that is easier to understand and analyse and that enhances information extraction. In today's fast-paced world, where everyone is constantly racing against time, text summarization serves as a valuable tool for quickly understanding and analysing data.
3. LEARNING OBJECTIVES:

● Grasp the importance and applications of text summarization, including extractive and abstractive approaches.
● Comprehend the underlying architecture of transformers (e.g., attention
mechanism, encoder-decoder models) used for natural language processing
tasks.
● Learn about LLMs like GPT, BERT, or T5, and their roles in generating
meaningful summaries.
● Gain hands-on experience in fine-tuning transformer-based models for
specific summarization tasks.
● Understand techniques for preparing and cleaning text data, including
tokenization, padding, and attention masking.
● Learn about evaluation metrics like ROUGE, BLEU, and F1 Score to assess
the performance of text summarization models.
● Explore various strategies for optimizing hyperparameters (e.g., learning
rate, batch size) to improve model accuracy and efficiency.
● Address common challenges such as information loss, coherence, and
maintaining the semantic meaning in generated summaries.

4. LEARNING OUTCOMES:

• Developed a strong understanding of transformer-based architectures and their role in natural language processing tasks.
• Successfully implemented a text summarization model using pretrained
large language models (LLMs).
• Gained proficiency in fine-tuning LLMs for specific summarization use
cases.
• Acquired skills in preprocessing text data, including tokenization and
attention mechanisms, for summarization tasks.
• Demonstrated the ability to evaluate summarization models using metrics
like ROUGE and BLEU.
• Improved model performance by optimizing hyperparameters effectively.
• Addressed challenges in maintaining the coherence and semantic accuracy
of generated summaries.
• Applied summarization techniques in practical scenarios, integrating models
into real-world applications.
• Recognized the ethical implications of using automated summarization
tools, particularly in terms of bias and accuracy.
• Enhanced problem-solving skills through iterative model development and evaluation.

5. ABSTRACT:

Text summarization is a crucial task in natural language processing, aimed at condensing large amounts of text into concise, informative summaries while retaining the key points and context. This project explores the use
while retaining the key points and context. This project explores the use
of transformer-based architectures and large language models (LLMs)
such as BERT, GPT, and T5 for performing automated text
summarization. By leveraging the power of pretrained models, the project
focuses on both extractive and abstractive summarization techniques. The
transformer models are fine-tuned on specific datasets to generate high-
quality summaries, addressing challenges related to coherence, semantic
accuracy, and information retention. Through rigorous evaluation using
metrics like ROUGE and BLEU, the performance of the models is
assessed, and optimizations are applied to enhance their efficiency and
accuracy. The project also investigates the preprocessing steps necessary
for preparing data for summarization tasks and emphasizes the importance
of ethical considerations, such as reducing bias and ensuring the factual
correctness of the generated summaries. Ultimately, this work
demonstrates the potential of LLMs in automating and improving text
summarization, with practical applications in various domains such as
news, document management, and information retrieval.
6. TECHNICAL DETAILS ABOUT THE PROJECT:

Text Summarization is a natural language processing (NLP) task that involves condensing large volumes of text into shorter, coherent summaries while preserving the key information. It can be categorized into two main types:

1. Extractive Summarization:
- In extractive summarization, important sentences or phrases are directly
selected from the source text to create the summary.
- The process relies on identifying key sentences using features such as term frequency and sentence position, or using scoring models (see the sketch after this list).
- Extractive methods tend to preserve the original structure of the text but may
lack coherence and flow in the summary.
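
The following is a minimal, illustrative sketch of the frequency-based scoring mentioned above, written in Python; the function name, sentence splitter, and the choice of k are assumptions for demonstration, not the project's exact method.

```python
import re
from collections import Counter

def extractive_summary(text: str, k: int = 2) -> str:
    # Split the document into sentences on end punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word frequencies over the whole document act as simple importance weights.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Score each sentence by the summed frequency of the words it contains.
    scores = {
        i: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }
    # Keep the k highest-scoring sentences, restored to their original order.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return " ".join(sentences[i] for i in top)

print(extractive_summary(
    "Transformers power modern NLP. They rely on attention. "
    "Attention lets models weigh context. Dinner was great.", k=2))
```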

2. Abstractive Summarization:
Abstractive summarization involves generating new sentences that capture the
essence of the input text, similar to how humans summarize. It requires deep
language understanding and relies on generating natural language summaries
using models such as transformers and LLMs. Abstractive summarization is
more complex and can generate coherent and fluent summaries but may
sometimes introduce factual inconsistencies.
Transformer Models in Summarization:
Transformer architectures like BART (Bidirectional and Auto-Regressive
Transformers) and its distilled variant DistilBART are commonly used for
abstractive summarization. These models consist of an encoder-decoder
structure. The encoder processes the input text, generating hidden
representations. The decoder generates the summary based on these
representations, often using attention mechanisms to focus on relevant parts of
the input.
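
The snippet below sketches this encoder-decoder flow with the Hugging Face transformers library; the checkpoint name (facebook/bart-large-cnn) and the generation settings are illustrative assumptions rather than the project's exact configuration.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"  # assumed BART summarization checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "Paste a long news article or document here ..."

# The encoder consumes the tokenized input and builds hidden representations.
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)

# The decoder generates the summary token by token, attending to those representations.
summary_ids = model.generate(**inputs, num_beams=4, min_length=30, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```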

DistilBART:
DistilBART is a compressed, faster version of BART, reducing the size and
computational cost while maintaining strong performance. It retains the core
transformer architecture but uses knowledge distillation techniques to create a
lighter model. DistilBART is suitable for tasks requiring both speed and
accuracy, making it a popular choice for large-scale text summarization.
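
As a usage sketch, the high-level summarization pipeline can load the distilled checkpoint used later in this report (sshleifer/distilbart-cnn-12-6); the input text and length limits below are placeholders.

```python
from transformers import pipeline

# Load DistilBART distilled from BART fine-tuned on the CNN/DailyMail dataset.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = "Paste a long article or document here ..."
result = summarizer(text, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```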

Key Techniques:
Tokenization: Text is split into smaller units (tokens), which are then fed into the
transformer model.
Attention Mechanisms: The model weighs the importance of different words in
the input to generate contextually relevant summaries.
Pretraining and Fine-tuning: Pretrained models are further fine-tuned on specific
summarization datasets to specialize them for the summarization task.
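
A small sketch of the tokenization step is shown below, assuming the Hugging Face tokenizer for the DistilBART checkpoint; the example batch is illustrative. Padding aligns sequences in a batch to a common length, and the attention mask marks which positions hold real tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

batch = ["A short document.",
         "A much longer document that will produce noticeably more tokens."]

# padding=True pads every sequence to the longest one in the batch;
# truncation caps overly long inputs at max_length.
encoded = tokenizer(batch, padding=True, truncation=True, max_length=64,
                    return_tensors="pt")

print(encoded["input_ids"].shape)    # (batch_size, padded_sequence_length)
print(encoded["attention_mask"][0])  # 1 for real tokens, 0 for padding to be ignored
```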

Evaluation:
Common evaluation metrics include ROUGE (Recall-Oriented Understudy for
Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy), which
compare the generated summary to reference summaries based on word overlaps
and n-gram matching.
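
A brief evaluation sketch using the rouge_score package (an assumed dependency) is given below; the reference and candidate strings are placeholders rather than actual project outputs.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The model produced a concise summary of the article."
candidate = "The model generated a short summary of the article."

# score() returns precision, recall and F-measure for each requested ROUGE variant.
scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```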
7. GLIMPSE OF THE PROJECT & DEPLOYMENT:
8. CONCLUSION

In this project, sshleifer/distilbart-cnn-12-6, a distilled version of the BART transformer, was successfully applied to the task of text summarization. The use of this lightweight and efficient model demonstrated that high-quality text summaries can be generated while maintaining a balance between performance and computational efficiency. Through fine-tuning on relevant datasets, the model produced coherent and concise summaries, effectively capturing the main ideas of the input text. Evaluation metrics such as ROUGE indicated the model's ability to maintain semantic accuracy and relevance in the generated summaries. Additionally, hyperparameter optimization improved the model's efficiency without compromising output quality. Despite the challenge of preserving critical information in a reduced text, the use of sshleifer/distilbart-cnn-12-6 showcased its potential for handling large-scale summarization tasks in real-world applications, such as document management and news aggregation.

Overall, the project highlighted the value of using transformer-based models like sshleifer/distilbart-cnn-12-6 for text summarization while addressing challenges such as coherence, information loss, and ethical concerns related to bias and accuracy.
