
Code

The document outlines a Flask web application that extracts text from PDF files, summarizes it using a fine-tuned T5 model, and converts the summary into audio using Google Text-to-Speech. It includes details on setting up routes, handling file uploads, and processing text for summarization. Additionally, it describes evaluation code for assessing the model's performance using metrics like ROUGE and BLEU.

Uploaded by

Anush Kumar
Copyright
© All Rights Reserved

code-

from flask import Flask, render_template, request, jsonify

Flask is used to build web applications.

render_template loads HTML pages.

request handles form data or file uploads.

jsonify helps return JSON data (like extracted PDF text).

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

This imports PyTorch (the torch package), a popular deep learning library used for building and training machine learning models.

It manages tensors (multi-dimensional arrays, similar to NumPy arrays, that can also live on a GPU) during training and inference.

T5ForConditionalGeneration: This is the T5 model (Text-to-Text Transfer Transformer) pre-trained for conditional text generation tasks such as summarization, translation, and question answering.

T5Tokenizer: This is the tokenizer for the T5 model. It is used to convert input text into tokens (numerical representations that the model understands) and to convert output tokens back into human-readable text.

from gtts import gTTS

gTTS (Google Text-to-Speech) converts summary text into audio (MP3).

import os
from datetime import datetime
from PyPDF2 import PdfReader

os helps manage files and folders.

datetime is used to add timestamps to avoid audio file caching.

PdfReader reads PDF files and extracts text.

app = Flask(__name__)

This line creates a Flask web application.

Flask is a Python framework that helps you build web apps (like websites or APIs).

__name__ tells Flask where it should look for files, routes, etc.

If you are writing this in a file called app.py, then __name__ will be "__main__" when you run it.

app.config['UPLOAD_FOLDER'] = 'uploads'

This line sets a configuration option for your Flask app.

'UPLOAD_FOLDER' is the name of the setting. Flask does not act on it by itself; your own code reads it later to decide where to store files uploaded by users.

'uploads' is the name of the folder where those files will go.

model_name = 'finetuned_t5_summary'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

The value 'finetuned_t5_summary' is the name of your custom model directory.

This folder should contain the fine-tuned T5 model and its configuration
files.

This model was likely trained earlier for a task like summarization.

The second line loads the tokenizer for your model.

A tokenizer’s job is to convert text into numbers (tokens) that the model can understand.

The third line loads the actual T5 model, which has been fine-tuned for a task like summarization.

T5ForConditionalGeneration is a class used when you want the model to generate new text based on input (like creating a summary from a paragraph).

from_pretrained(model_name) tells it to load the model weights and settings from the folder finetuned_t5_summary.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

Checks if a GPU (cuda) is available; if not, uses the CPU.

Moves the model to the selected device.

A GPU is a powerful chip used to handle large, complex calculations quickly, which makes it well suited to machine learning and deep learning tasks.

@app.route('/')

This is a Flask route decorator.

It tells Flask:
📍 "When someone visits the homepage ( / ), run the function below."

The '/' means the root URL (e.g., http://localhost:5000/ or your main website URL).

def index():

This defines a Python function named index .

Flask will run this function when the user visits the route above ( / ).

return render_template('index.html', summary_length=150)

Renders (displays) an HTML file called index.html (from your templates folder).

Passes a variable called summary_length into that HTML file, with the value 150.

The 150 is the default summary length. The /summarize route later uses this value as max_length when generating, so it limits the summary to about 150 tokens (tokens are not the same as words).

@app.route('/extract_text', methods=['POST'])
def extract_text():
    file = request.files.get('pdf_file')

The first line defines a route in your Flask app.

'/extract_text' is the URL path. When someone sends a request to this path (like clicking a button), Flask will run the function below.

methods=['POST'] means this route will only respond to POST requests (not GET requests).

request.files.get('pdf_file') reads the uploaded file from the form field named pdf_file, or gives None if no file was sent.

A POST request is used when you send data to the server (like uploading a file).

📥 “When the user uploads a file and clicks submit, run the extract_text() function.”

if file and file.filename.endswith('.pdf'):

file → Checks if a file was actually uploaded.

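file.filename.endswith('.pdf') → Checks if the uploaded file has a name that ends with .pdf. This looks only at the name, not at the file contents, and it is case-sensitive (a file named REPORT.PDF would be rejected).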
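A slightly stricter check could look like the sketch below. It is only an illustration: the helper name looks_like_pdf and the idea of inspecting the first bytes are assumptions, not part of the original app. It lowercases the name before testing the extension and also checks for the %PDF- signature that PDF files begin with.

```python
def looks_like_pdf(filename: str, head: bytes) -> bool:
    """Return True if the name ends in .pdf (any case) and the data starts with %PDF-."""
    has_pdf_name = filename.lower().endswith(".pdf")
    has_pdf_signature = head.startswith(b"%PDF-")
    return has_pdf_name and has_pdf_signature


# Made-up inputs; in the app, head would be the first few bytes of the upload.
print(looks_like_pdf("REPORT.PDF", b"%PDF-1.7 rest of file"))
print(looks_like_pdf("notes.pdf", b"just plain text"))
```

If you read bytes from the uploaded file to do this, remember to move back to the start of the stream before saving it.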

filepath = os.path.join(app.config['UPLOAD_FOLDER'], file.filename)

This line creates the full path where the uploaded PDF file will be saved.

app.config['UPLOAD_FOLDER'] is the folder where you want to store uploaded files (you set it earlier as 'uploads').

file.filename is the original name of the uploaded file. It is supplied by the user's browser, so it should not be trusted blindly.

os.path.join(...) combines both with the correct path separator, no matter which operating system you’re using (Windows, Linux, etc.). It does not clean up the filename itself.
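Because the filename comes from the user, a name like ../../something.pdf could point outside the uploads folder. Here is a minimal sketch of cleaning the name first. The helper safe_upload_path and its character rules are assumptions made for illustration; in a real Flask app, Werkzeug's secure_filename function is the usual tool for this job.

```python
import os
import re


def safe_upload_path(upload_folder: str, filename: str) -> str:
    """Build a save path that stays inside upload_folder (hypothetical helper)."""
    # Keep only the last path component, treating both / and \ as separators.
    name = filename.replace("\\", "/").split("/")[-1]
    # Replace characters outside a conservative set, then drop leading dots.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name).lstrip(".")
    if not name:
        raise ValueError("filename is empty after cleaning")
    return os.path.join(upload_folder, name)


print(safe_upload_path("uploads", "../../etc/passwd.pdf"))
print(safe_upload_path("uploads", "my report.pdf"))
```

The first call keeps only passwd.pdf, and the second replaces the space with an underscore, so both files land directly inside uploads.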

file.save(filepath)

This line actually saves the uploaded PDF file to the path you just created.

The file is stored on your server or local folder so you can open and read it later.

reader = PdfReader(filepath)

This line creates a PDF reader object from the file you uploaded.

PdfReader is a class from the PyPDF2 or pypdf library (depending on what you’re using).

It opens and reads the PDF file located at filepath.

text = ''

This line creates an empty string named text.

You’ll use this string to collect and store the text from all pages in the PDF.

for page in reader.pages:
    text += page.extract_text() or ''

reader.pages gives you a list of all pages in the PDF.

The for loop goes through each page one by one.

page.extract_text() tries to pull out the text from the current page.

or '' means: If extract_text() returns None (if a page has no text), use an empty
string instead — this avoids errors.
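The same loop can be tried without a real PDF. The sketch below uses a stand-in FakePage class (an assumption made up for this example, not part of PyPDF2) to show how the or '' fallback handles a page with no text. It also joins pages with a newline, which is a small design change from the original code, where pages are glued together with nothing in between.

```python
class FakePage:
    """Stand-in for a PDF page object; only extract_text() is modeled."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def collect_text(pages) -> str:
    """Join the text of all pages, treating None as an empty string."""
    parts = [(page.extract_text() or "") for page in pages]
    return "\n".join(parts).strip()


pages = [FakePage("First page."), FakePage(None), FakePage("Third page.")]
print(collect_text(pages))
```

With the real library you would pass reader.pages instead of the fake list.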

return jsonify({'text': text.strip()})

“Send the cleaned extracted text back to the frontend in JSON format.”

The value is text.strip(), which removes any extra spaces or newlines at the beginning or end of the extracted text.

return jsonify({'error': 'Invalid file'}), 400

“Tell the user the file is not valid and send back an error message with code
400.”

@app.route('/summarize', methods=['POST'])
def summarize():
    input_text = request.form['input_text']
    summary_length = int(request.form.get('summary_length', 150))

This defines the Python function called summarize() that will run when the /summarize route is triggered.

request.form['input_text'] gets the text input that the user typed or pasted into a form field named input_text.

This tries to get the value of a form field named summary_length .

request.form.get('summary_length', 150) means:

If the user submitted a value, use it.

If not, use 150 as the default.

Then we wrap it with int(...) to convert the value from a string to an integer
(since form inputs are text by default).
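One thing to keep in mind: int(...) raises a ValueError if the field contains something that is not a whole number (for example "abc"). A defensive version might look like the sketch below. The helper name parse_summary_length and the bounds 30 and 512 are assumptions chosen for illustration, not values from the original app.

```python
def parse_summary_length(raw, default=150, lowest=30, highest=512):
    """Convert a form value to an int, falling back to default and clamping to a range."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Missing or non-numeric input: use the default instead of crashing.
        return default
    return max(lowest, min(highest, value))


print(parse_summary_length("200"))
print(parse_summary_length("abc"))
print(parse_summary_length("9999"))
```

In the route, this could be called as parse_summary_length(request.form.get('summary_length')).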

if not input_text.strip():
    return render_template('index.html', summary="Please enter some text.",
                           input_text=input_text, summary_length=summary_length)

If the input is empty, reloads the page with a warning.

preprocessed_text = input_text.strip().replace('\n', ' ')

Removes extra spaces from the start/end ( strip() )

Replaces new lines ( \n ) with spaces

t5_input_text = f'summarize: {preprocessed_text}'


Adds the prefix "summarize: ", which is how T5 knows the task is summarization.

original_word_count = len(preprocessed_text.split())
Counts the number of words in the input text.
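The three preprocessing steps (clean, add prefix, count words) can be sketched as one small function. The name build_t5_input is an assumption for this example. It also uses " ".join(text.split()) instead of replace('\n', ' '), which additionally collapses repeated spaces and blank lines into single spaces.

```python
def build_t5_input(raw_text: str):
    """Return the prefixed T5 input string and the word count of the cleaned text."""
    # split() with no arguments splits on any run of whitespace, including newlines.
    cleaned = " ".join(raw_text.split())
    t5_input = f"summarize: {cleaned}"
    word_count = len(cleaned.split())
    return t5_input, word_count


t5_input, word_count = build_t5_input("  First line.\n\nSecond   line.  ")
print(t5_input)
print(word_count)
```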

tokenized_text = tokenizer.encode(
    t5_input_text,
    return_tensors='pt',
    max_length=1024,
    truncation=True
).to(device)

tokenizer.encode(...) : Converts the input text into token IDs.

return_tensors='pt' : Returns it as a PyTorch tensor (needed for model input).

max_length=1024 : Limits input to 1024 tokens.

truncation=True : Cuts off extra tokens if the text is too long.

.to(device) : Moves the tensor to GPU if available, else CPU.

summary_ids = model.generate(
    tokenized_text,
    max_length=summary_length,
    min_length=max(50, int(summary_length * 0.7)),
    no_repeat_ngram_size=3,
    length_penalty=1.0,
    num_beams=5,
    early_stopping=True
)

tokenized_text : The input text in token form.

max_length=summary_length : Max length of the summary.

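min_length=max(50, int(summary_length * 0.7)) : Summary will be at least 70% of the desired length or 50 tokens, whichever is larger. If summary_length is below 50, this makes min_length larger than max_length.

no_repeat_ngram_size=3 : Prevents any sequence of 3 tokens from appearing more than once in the summary.

length_penalty=1.0 : Controls how much beam search prefers longer or shorter summaries; 1.0 is the default value.

num_beams=5 : Uses beam search, keeping the 5 best candidate summaries at each step before picking the best one.

early_stopping=True : Stops the search as soon as 5 complete candidate summaries (one per beam) have been found.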
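Two of these settings are easy to explore in plain Python. The sketch below is only an illustration and both helper names are assumptions. generation_lengths uses the same formula as the app but caps min_length so it never exceeds max_length (which would otherwise happen when the user asks for fewer than 50 tokens). has_repeated_ngram shows the kind of repetition that no_repeat_ngram_size=3 is meant to rule out.

```python
def generation_lengths(summary_length: int, floor: int = 50, ratio: float = 0.7):
    """Return (min_length, max_length) with min_length never above max_length."""
    max_length = summary_length
    min_length = max(floor, int(summary_length * ratio))
    return min(min_length, max_length), max_length


def has_repeated_ngram(tokens, n: int = 3) -> bool:
    """Return True if any run of n consecutive tokens occurs more than once."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


print(generation_lengths(200))
print(generation_lengths(40))
print(has_repeated_ngram("the cat sat and the cat sat".split()))
```

The model works on tokens rather than whole words, so the word-based example here is a simplification.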

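summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
summary_word_count = len(summary.split())

The first line decodes the generated tokens into readable summary text; skip_special_tokens=True leaves out markers such as padding and end-of-sequence tokens.

The second line counts the words in the summary.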

try:
    tts = gTTS(text=summary, lang='en')
    audio_filename = "summary_audio.mp3"
    audio_path = os.path.join("static", audio_filename)
    tts.save(audio_path)
    timestamp = datetime.now().timestamp()
    audio_file_url = f"{audio_path}?v={timestamp}"
except Exception:
    audio_file_url = None

Line 2: Creates speech from text using gTTS (Google Text-to-Speech) in English (lang='en').

Lines 3-4: Set the filename and path where the audio will be saved (inside the static folder).

Line 5: Saves the generated speech as an MP3 file to the given path.

Lines 6-7: Add a timestamp to the URL to avoid browser caching issues, which ensures the latest audio plays.

Lines 8-9: If there's an error (e.g. internet issue, gTTS failure), audio_file_url is set to None so the app won’t break.
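The cache-busting idea can be shown on its own. In this sketch the helper name cache_busted_url and the leading /static/ prefix are assumptions. It builds the URL with forward slashes directly, because os.path.join would produce a backslash on Windows, which is right for a file path but not for a URL.

```python
from datetime import datetime
from typing import Optional


def cache_busted_url(filename: str, now: Optional[datetime] = None) -> str:
    """Return a /static/ URL with a timestamp query parameter (hypothetical helper)."""
    # A different ?v= value on every request makes the browser fetch the file again.
    stamp = (now or datetime.now()).timestamp()
    return f"/static/{filename}?v={stamp}"


print(cache_busted_url("summary_audio.mp3"))
```

The now argument exists only so the function can be tested with a fixed time.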

except Exception as e:
    summary = f"An error occurred: {str(e)}"
    return render_template('index.html', summary=summary, input_text=input_text, summary_length=summary_length)

Line 1: Catches any error that happens in the enclosing try block and stores it in e. (The opening try: line of that block is not shown in these notes.)

Line 2: Creates a message showing what the error was, so the user knows something went wrong.

Line 3: Reloads the page and shows the error message, while keeping the user’s original input and summary length.

return render_template(
    'index.html',
    summary=summary,
    input_text=input_text,
    audio_file=audio_file_url,
    summary_length=summary_length,
    original_word_count=original_word_count,
    summary_word_count=summary_word_count
)

This tells Flask to show the index.html page and send data to it:

1. summary – the final summary text

2. input_text – the original text the user entered

3. audio_file – the MP3 file of the summary (if created)

4. summary_length – the target length of the summary

5. original_word_count – number of words in the input

6. summary_word_count – number of words in the summary

if __name__ == '__main__':

Run this code only when you launch this file directly, not when it's imported as a module.

    os.makedirs("uploads", exist_ok=True)
    os.makedirs("static", exist_ok=True)

1. Creates folders named uploads and static if they don't already exist.

2. exist_ok=True prevents errors if the folders already exist.

    app.run(debug=True)

Starts the Flask web server in debug mode:

Automatically restarts on changes

Shows detailed error messages

Debug mode is meant for development only and should be turned off when the app is deployed.

JavaScript Functions
countWords(text) : Counts words in given text.

Updates word counts live for both input and summary textareas.

copyText() : Copies summary to clipboard.

downloadSummary() : Downloads the summary as a text file.

extractText() : Sends the uploaded PDF to backend, extracts text, and fills the
input textarea.

On page load, initializes word counts.

EVALUATION CODE-

1. Importing Libraries

import pandas as pd
from evaluate import load as load_metric
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.utils.data import DataLoader, Dataset

pandas : For reading the CSV dataset.

evaluate : For loading evaluation metrics (ROUGE, BLEU).

transformers : For loading the T5 model and tokenizer.

torch : For using PyTorch and GPU support.

DataLoader, Dataset : For batching input text during evaluation.

📄 2. Load and Clean Dataset


val_df = pd.read_csv("news_summary_valid_small.csv")[['Text', 'Summary']].dropna()

Loads a CSV file with 'Text' and 'Summary' columns.

Removes any rows with missing values.

🧠 3. Load Fine-tuned Model & Tokenizer


model_path = "finetuned_t5_summary"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

Loads your fine-tuned T5 model and tokenizer from the specified folder.

💻 4. Set Device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

Uses GPU if available.

Sets model to evaluation mode (disables dropout, etc).

📦 5. Define Custom Dataset


class SummaryDataset(Dataset):
    def __init__(self, df):
        self.inputs = df['Text'].tolist()
        self.targets = df['Summary'].tolist()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

Converts DataFrame into a custom dataset.

Each item is a pair: (original text, actual summary).

📚 6. Create DataLoader
dataset = SummaryDataset(val_df)
loader = DataLoader(dataset, batch_size=4)

Wraps your dataset into batches of 4 for processing.
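To see what batching means without PyTorch, here is a minimal plain-Python sketch. The batches function is a made-up helper for illustration, not the DataLoader implementation. Like a DataLoader with its default settings, it keeps the original order and lets the last batch be smaller than the others.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


for batch in batches(list(range(10)), 4):
    print(batch)
```

With 10 items and a batch size of 4, this gives two full batches and one final batch of 2.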

📏 7. Load Evaluation Metrics


rouge = load_metric("rouge")
bleu = load_metric("bleu")

Loads ROUGE and BLEU metrics using Hugging Face evaluate .

✍️ 8. Initialize Lists
generated = []
references = []

generated : Stores model summaries.

references : Stores actual (true) summaries.

🔁 9. Generate Summaries in Batches


for texts, refs in loader:
    inputs = ["summarize: " + text for text in texts]
    encodings = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    summary_ids = model.generate(**encodings, max_length=128, num_beams=4, early_stopping=True)
    preds = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

    generated.extend(preds)
    references.extend(refs)

For each batch:

Prepends "summarize:" to input texts.

Tokenizes inputs and sends to model.

Generates summaries with beam search.

Decodes output tokens to readable text.

Adds to generated and references lists.

📊 10. Calculate ROUGE Scores


rouge_scores = rouge.compute(predictions=generated, references=references, use_stemmer=True)

Compares generated vs reference summaries using ROUGE-1, 2, and L.
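To get a feel for what ROUGE-1 measures, here is a simplified hand-written version. It is a sketch under assumptions: it splits on whitespace, lowercases, and does no stemming, so its numbers will not match the evaluate library exactly. ROUGE-1 counts how many single words the generated summary shares with the reference, then combines precision and recall into an F1 score.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter & Counter keeps each word with the smaller of its two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat", "the cat sat on"))
```

In this example every predicted word is in the reference (precision 1.0) but only half of the reference is covered (recall 0.5), so the F1 score lands between the two.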

🧮 11. Calculate BLEU Score


bleu_scores = bleu.compute(
    predictions=generated,
    references=[[ref] for ref in references]
)

BLEU expects references as list of lists → [[ref]] .
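The list-of-lists shape exists because BLEU allows several reference summaries per prediction. The sketch below is a heavily simplified, single-sentence BLEU that uses only single words (the real metric combines 1-word to 4-word matches over the whole dataset), so treat the function bleu1 as an assumption-laden illustration. It shows the two main ingredients: clipped precision and the brevity penalty for outputs that are too short.

```python
import math
from collections import Counter


def bleu1(prediction: str, references: list) -> float:
    """Simplified BLEU using unigrams only, for one prediction and its references."""
    pred_tokens = prediction.split()
    if not pred_tokens or not references:
        return 0.0
    pred_counts = Counter(pred_tokens)
    # Clip each word by the most times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        max_ref_counts |= Counter(ref.split())
    clipped = sum((pred_counts & max_ref_counts).values())
    precision = clipped / len(pred_tokens)
    # The brevity penalty uses the reference length closest to the prediction length.
    pred_len = len(pred_tokens)
    ref_len = min((len(r.split()) for r in references),
                  key=lambda n: (abs(n - pred_len), n))
    penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return penalty * precision


print(bleu1("the the the", ["the cat"]))
print(bleu1("the cat", ["the cat sat down"]))
```

The first call is punished by clipping (repeating "the" does not earn extra credit), and the second by the brevity penalty (every word matches, but the output is much shorter than the reference).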

🖨️ 12. Print Final Results
print("\n📊 Evaluation Results:")
print(f"ROUGE-1 F1 Score: {rouge_scores['rouge1']:.4f}")
print(f"ROUGE-2 F1 Score: {rouge_scores['rouge2']:.4f}")
print(f"ROUGE-L F1 Score: {rouge_scores['rougeL']:.4f}")
print(f"BLEU Score: {bleu_scores['bleu']:.4f}")

Prints each evaluation metric rounded to four decimal places.

FINE-TUNING CODE-

1. Import Libraries

import pandas as pd, torch, transformers, etc.

➡️ Loads required libraries for:


Handling data ( pandas )

Model training ( torch , transformers )

Progress bar ( tqdm )

File handling ( os )

📄 2. Load and Clean Data


train_df_raw = pd.read_csv("news_summary_train_small.csv")

➡️ Reads training and validation datasets from CSV files (only the training line is shown here).

Then it automatically finds and renames the correct text and summary columns.
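The notes do not show how the columns are detected, so the following is only a guess at the idea, written in plain Python. The function find_columns and both lists of candidate names are assumptions made for this sketch. It looks through the CSV header names, ignoring case, and returns a mapping that renames the matches to Text and Summary.

```python
# Candidate header names are assumptions for illustration only.
TEXT_CANDIDATES = ("text", "article", "content")
SUMMARY_CANDIDATES = ("summary", "headline", "highlights")


def find_columns(columns):
    """Return a rename mapping such as {'Article': 'Text', 'summary': 'Summary'}."""
    lowered = {name.lower().strip(): name for name in columns}
    text_col = next((lowered[c] for c in TEXT_CANDIDATES if c in lowered), None)
    summary_col = next((lowered[c] for c in SUMMARY_CANDIDATES if c in lowered), None)
    if text_col is None or summary_col is None:
        raise ValueError(f"could not find text/summary columns in {list(columns)}")
    return {text_col: "Text", summary_col: "Summary"}


print(find_columns(["id", "Article", "summary"]))
```

With pandas, a mapping like this could be passed to DataFrame.rename(columns=...).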

🧠 3. Load Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")

➡️ Loads a tokenizer that converts text into tokens that the T5 model can
understand.

🧰 4. Create Custom Dataset


class NewsSummaryDataset(Dataset): ...

➡️ A PyTorch Dataset class that:

Prepares input-output pairs

Tokenizes the data

Pads/truncates to fixed length
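The "pads/truncates to fixed length" step can be shown on plain lists of token IDs. This is a sketch, not what the tokenizer does internally: the helper pad_or_truncate is an assumption, and a real tokenizer also takes care of details such as keeping the end-of-sequence token when it cuts. The default pad_id of 0 matches the padding token ID used by T5.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])      # cut off anything past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask tells the model to ignore the padded positions.
    return ids + [pad_id] * padding, mask + [0] * padding


print(pad_or_truncate([5, 6, 7], 5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))
```

Making every example the same length is what allows several of them to be stacked into one tensor per batch.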

🧪 5. Create DataLoaders
train_loader = DataLoader(...)
val_loader = DataLoader(...)

➡️ Wraps datasets into loaders for batch processing during training.


💻 6. Load Model and Set Device

model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.to(device)

➡️ Loads the pre-trained T5 model and moves it to GPU (if available) or CPU.
🧪 7. Setup Optimizer and Scheduler
optimizer = AdamW(...)
scheduler = get_scheduler(...)

➡️ Uses AdamW optimizer and linear learning rate scheduler to control how the
model learns.
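A linear learning rate schedule is simple enough to write by hand. The sketch below assumes an optional warmup phase and made-up step counts; the notes do not say which settings were actually passed to get_scheduler. The learning rate rises linearly from 0 during warmup, then falls linearly back to 0 by the last training step.

```python
def linear_lr(step, warmup_steps, total_steps, base_lr):
    """Learning rate at a given step for a linear warmup + linear decay schedule."""
    if step < warmup_steps:
        # Warmup: grow from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 10, 55, 100):
    print(step, linear_lr(step, warmup_steps=10, total_steps=100, base_lr=1e-3))
```

Setting warmup_steps to 0 gives a plain linear decay that starts at the full learning rate.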

🔁 8. Training Loop
for epoch in range(num_epochs): ...

➡️ Repeats training for multiple epochs:


Feeds batches to the model

Computes loss

Updates weights using backpropagation

Tracks training progress

💾 9. Save Fine-Tuned Model


model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)  # needed so the app can load the tokenizer from the same folder

➡️ Saves the trained model and tokenizer to a folder for future use.
✅ In short:
This code:

Loads news summary data

Fine-tunes the T5 model

Saves the new model for summarizing news articles


🤖 What is AdamW ?
AdamW is an optimizer used to train machine learning models like T5.

🧠 In simple words:
It helps the model learn by adjusting its weights during training.

It's an improved version of Adam optimizer.

The W in AdamW stands for Weight Decay, a technique to prevent overfitting. AdamW shrinks the weights directly at each step instead of mixing the decay into the gradient.

✅ Why use AdamW?

It’s fast and efficient.

It keeps the model from memorizing training data too much (by applying weight decay).

It's the default optimizer in the Hugging Face Trainer and is widely used for training models like T5, BERT, etc.
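For the curious, one AdamW update for a single weight can be written in a few lines of plain Python. This is a teaching sketch with an assumed function name (adamw_step) and common default hyperparameters, not the code used inside PyTorch. The key point is the first line of the update: the weight is shrunk directly (weight decay) before the usual Adam step is applied.

```python
import math


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single weight; returns the new (w, m, v)."""
    w = w - lr * weight_decay * w                  # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad             # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad      # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam step
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adamw_step(w, grad=0.5, m=m, v=v, t=t)
    print(t, w)
```

With a constant positive gradient, the weight moves down a little on every step, partly from the Adam step and partly from the decay.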
