Code
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
This imports PyTorch (the torch package), a popular deep learning library used
for building and training machine learning models.
It's commonly used to manage tensors (multi-dimensional arrays, similar to
NumPy arrays) when training deep learning models.
The T5 tokenizer converts text into tokens (which are numerical representations
that the model understands), and converts output tokens back into human-readable text.
import os
from datetime import datetime
from flask import Flask, render_template, request
from gtts import gTTS
from PyPDF2 import PdfReader
app = Flask(__name__)
Flask is a Python framework that helps you build web apps (like websites or
APIs).
__name__ tells Flask where it should look for files, routes, etc.
If you run this file directly (for example, python app.py ), then __name__ will be "__main__".
app.config['UPLOAD_FOLDER'] = 'uploads'
'uploads' is the name of the folder where uploaded files will be saved.
model_name = 'finetuned_t5_summary'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
This folder should contain the fine-tuned T5 model and its configuration
files.
This model was likely trained earlier for a task like summarization.
A tokenizer’s job is to convert text into numbers (tokens) that the model
can understand.
This line loads the actual T5 model, which has been fine-tuned for a task
like summarization.
A GPU is a powerful chip used to handle large, complex calculations quickly,
which makes it well suited to machine learning and deep learning tasks.
@app.route('/')
It tells Flask:
📍 "When someone visits the homepage ( / ), run the function below."
The '/' means the root URL (e.g., http://localhost:5000/ or your main website
URL).
def index():
Flask will run this function when the user visits the route above ( / ).
Render (display) an HTML file called index.html (from your templates folder).
Pass a variable called summary_length into that HTML file, with the value 150 .
@app.route('/extract_text', methods=['POST'])
def extract_text():
file = request.files.get('pdf_file')
methods=['POST'] means this route will only respond to POST requests (not
GET requests).
A POST request is used when you send data to the server (like
uploading a file).
📥 “When the user uploads a file and clicks submit, run the extract_text()
function.”
This line creates the full path where the uploaded PDF file will be saved.
file.save(filepath)
This line actually saves the uploaded PDF file to the path you just created.
The file is stored on your server or local folder so you can open and read it
later.
reader = PdfReader(filepath)
This line creates a PDF reader object from the file you uploaded.
text = ''
You’ll use this string to collect and store the text from all pages in the PDF.
for page in reader.pages:
    text += page.extract_text() or ''
page.extract_text() tries to pull out the text from the current page.
or '' means: If extract_text() returns None (if a page has no text), use an empty
string instead — this avoids errors.
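The fallback described above can be tried without a real PDF. This is a minimal sketch: FakePage is a hypothetical stand-in written for this example (it is not part of PyPDF2), and collect_text is an illustrative helper name.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page object (not a PyPDF2 class)."""

    def __init__(self, content):
        self._content = content

    def extract_text(self):
        # Mimics a page that may yield no text, e.g. a scanned image.
        return self._content


def collect_text(pages):
    """Concatenate the text of every page, treating missing text as ''."""
    text = ''
    for page in pages:
        # `or ''` swaps None (or an empty result) for an empty string,
        # so the += never tries to add None to a str.
        text += page.extract_text() or ''
    return text


pages = [FakePage("Hello "), FakePage(None), FakePage("world")]
combined = collect_text(pages)
```

Without the `or ''`, the second page would raise a TypeError, because Python cannot add None to a string.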
“Send the cleaned extracted text back to the frontend in JSON format.”
The value is text.strip() , which removes any extra spaces or newlines at the
beginning or end.
“Tell the user the file is not valid and send back an error message with code
400.”
@app.route('/summarize', methods=['POST'])
def summarize():
input_text = request.form['input_text']
summary_length = int(request.form.get('summary_length', 150))
Then we wrap it with int(...) to convert the value from a string to an integer
(since form inputs are text by default).
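Here is a small sketch of that conversion. A plain dict stands in for request.form, and the try/except fallback for non-numeric input is an added assumption; the original line would raise a ValueError in that case.

```python
def parse_summary_length(form, default=150):
    """Read 'summary_length' from a dict-like form and return it as an int."""
    # .get returns the default when the field is missing from the form.
    raw = form.get('summary_length', default)
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Assumption for this sketch: fall back to the default when the
        # value is not a number (e.g. '' or 'abc').
        return default


from_form = parse_summary_length({'summary_length': '200'})
missing = parse_summary_length({})
invalid = parse_summary_length({'summary_length': 'abc'})
```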
if not input_text.strip():
    return render_template('index.html', summary="Please enter some text.",
                           input_text=input_text, summary_length=summary_length)
original_word_count = len(preprocessed_text.split())
Counts the number of words in the input text.
tokenized_text = tokenizer.encode(
t5_input_text,
return_tensors='pt',
max_length=1024,
truncation=True
).to(device)
summary_ids = model.generate(
tokenized_text,
max_length=summary_length,
min_length=max(50, int(summary_length * 0.7)),
no_repeat_ngram_size=3,
length_penalty=1.0,
num_beams=5,
early_stopping=True
)
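One detail worth checking in the call above is how min_length relates to max_length. With the formula max(50, int(summary_length * 0.7)), a requested length below 50 gives a minimum that is larger than the maximum, which is a contradictory request. The sketch below uses the same formula and adds a clamp; the clamp and the helper name generation_lengths are assumptions for illustration, not part of the original code.

```python
def generation_lengths(summary_length, floor=50, ratio=0.7):
    """Return (min_length, max_length) for generation, keeping min <= max."""
    max_length = summary_length
    # Same formula as in the notes: at least `floor` tokens,
    # or 70% of the requested length, whichever is larger.
    min_length = max(floor, int(summary_length * ratio))
    # Extra guard (an assumption, not in the original): never exceed max_length.
    min_length = min(min_length, max_length)
    return min_length, max_length


short_bounds = generation_lengths(40)     # a very short summary request
default_bounds = generation_lengths(150)  # the default length used in the app
```

For a target of 40 this gives a minimum of 40 instead of 50, so the two limits no longer conflict.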
try:
    tts = gTTS(text=summary, lang='en')
    audio_filename = "summary_audio.mp3"
    audio_path = os.path.join("static", audio_filename)
    tts.save(audio_path)
    timestamp = datetime.now().timestamp()
    audio_file_url = f"{audio_path}?v={timestamp}"
except Exception:
    audio_file_url = None
folder).
Line 5: Saves the generated speech as an MP3 file to the given path.
Lines 6-7: Adds a timestamp to the URL to avoid browser caching issues, so the
latest audio plays.
Line 9: If there's an error (e.g. internet issue, gTTS failure), it sets audio_file_url to None.
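The timestamp trick works because the browser treats a URL with a different query string as a different resource. A minimal sketch follows; the helper name cache_busted_url and its optional `now` argument are hypothetical, added only so the example can be checked with fixed times.

```python
from datetime import datetime


def cache_busted_url(path, now=None):
    """Append a ?v=<timestamp> query string so each new file gets a new URL."""
    # `now` exists only to make the example repeatable; the app itself
    # would simply use the current time.
    now = now or datetime.now()
    return f"{path}?v={now.timestamp()}"


first = cache_busted_url("static/summary_audio.mp3", datetime(2024, 1, 1, 12, 0, 0))
second = cache_busted_url("static/summary_audio.mp3", datetime(2024, 1, 1, 12, 0, 5))
```

The file name is the same in both URLs, but the `v` value differs, so a browser that cached the first URL will still fetch the second one.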
except Exception as e:
    summary = f"An error occurred: {str(e)}"
    return render_template('index.html', summary=summary, input_text=input_text, summary_length=summary_length)
Line 1: Catches any error that happens in the try block and stores it in e .
Line 2: Creates a message showing what the error was, so the user knows something
went wrong.
Line 3: Reloads the page and shows the error message, while keeping the user's
original input and summary length.
return render_template(
    'index.html',
    summary=summary,
    input_text=input_text,
    audio_file=audio_file_url,
    summary_length=summary_length,
    original_word_count=original_word_count,
    summary_word_count=summary_word_count
)
This tells Flask to show the index.html page and send data to it.
4. summary_length – the target length of the summary
if __name__ == '__main__':
Run this code only when you launch this file directly, not when it's imported
as a module.
os.makedirs("uploads", exist_ok=True)
os.makedirs("static", exist_ok=True)
Creates folders named uploads and static if they don't already exist.
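The effect of exist_ok=True is easy to see in a throwaway directory. This is a small sketch; the temporary directory is used here only so the example does not touch a real project folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "uploads")

    os.makedirs(target, exist_ok=True)  # first call creates the folder
    os.makedirs(target, exist_ok=True)  # second call is a harmless no-op
    created = os.path.isdir(target)

    # Without exist_ok=True, an existing folder raises FileExistsError.
    try:
        os.makedirs(target)
        raised = False
    except FileExistsError:
        raised = True
```

This is why the app can call os.makedirs on every start without checking first whether the folders are there.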
app.run(debug=True)
Starts the Flask web server in debug mode, which reloads the server when the
code changes and shows detailed error pages.
JavaScript Functions
countWords(text): Counts the words in the given text.
Updates word counts live for both input and summary textareas.
extractText(): Sends the uploaded PDF to the backend, extracts the text, and
fills the input textarea.
EVALUATION CODE
1. Importing Libraries
import pandas as pd
from evaluate import load as load_metric
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.utils.data import DataLoader, Dataset
Loads your fine-tuned T5 model and tokenizer from the specified folder.
💻 4. Set Device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
def __len__(self):
    return len(self.inputs)
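A map-style PyTorch dataset is defined by two methods: __len__ (how many examples there are) and __getitem__ (how to fetch one by index). The sketch below shows that pair in plain Python. SimpleSummaryDataset and its `targets` field are hypothetical names for illustration; the real SummaryDataset in these notes is not shown in full and most likely returns tokenized tensors rather than raw strings.

```python
class SimpleSummaryDataset:
    """Hypothetical stand-in for SummaryDataset, holding raw text pairs."""

    def __init__(self, inputs, targets):
        if len(inputs) != len(targets):
            raise ValueError("inputs and targets must have the same length")
        self.inputs = list(inputs)
        self.targets = list(targets)

    def __len__(self):
        # Tells the loader how many examples exist.
        return len(self.inputs)

    def __getitem__(self, idx):
        # Returns one (article, reference summary) pair by position.
        return self.inputs[idx], self.targets[idx]


dataset_example = SimpleSummaryDataset(
    ["long article one", "long article two"],
    ["summary one", "summary two"],
)
```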
📚 6. Create DataLoader
dataset = SummaryDataset(val_df)
loader = DataLoader(dataset, batch_size=4)
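With batch_size=4 the loader hands the examples to the model four at a time. The pure-Python sketch below shows only the grouping idea, not how DataLoader is implemented; the helper name `batches` is hypothetical.

```python
def batches(items, batch_size):
    """Yield consecutive groups of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Ten examples split into groups of four: the last group holds the leftovers.
grouped = list(batches(list(range(10)), 4))
```

The last batch is smaller when the dataset size is not a multiple of the batch size. DataLoader behaves the same way by default, because its drop_last option defaults to False.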
✍️ 8. Initialize Lists
generated = []
references = []
e)
generated.extend(preds)
references.extend(refs)
🖨️ 12. Print Final Results
print("\n📊 Evaluation Results:")
print(f"ROUGE-1 F1 Score: {rouge_scores['rouge1']:.4f}")
print(f"ROUGE-2 F1 Score: {rouge_scores['rouge2']:.4f}")
print(f"ROUGE-L F1 Score: {rouge_scores['rougeL']:.4f}")
print(f"BLEU Score: {bleu_scores['bleu']:.4f}")
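To get a feel for what a ROUGE-1 F1 score measures, here is a simplified version computed by hand. It is only a sketch: it splits on whitespace and lowercases, while the rouge metric loaded through the evaluate library does its own tokenization, so real scores can differ. The function name rouge1_f1 is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: overlap of single words between two texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter & Counter keeps each shared word with its smaller count.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted words that match
    recall = overlap / len(ref_tokens)      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


perfect = rouge1_f1("the cat sat", "the cat sat")
partial = rouge1_f1("the cat", "the cat sat on the mat")
```

ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L uses the longest common subsequence instead of single words.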
1. Import Libraries
import pandas as pd, torch, transformers, etc.
File handling (os)
Then it automatically finds and renames the correct text and summary
columns.
🧠 3. Load Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
➡️ Loads a tokenizer that converts text into tokens that the T5 model can
understand.
🧪 5. Create DataLoaders
train_loader = DataLoader(...), val_loader = DataLoader(...)
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.to(device)
➡️ Loads the pre-trained T5 model and moves it to GPU (if available) or CPU.
🧪 7. Setup Optimizer and Scheduler
optimizer = AdamW(...), scheduler = get_scheduler(...)
➡️ Uses AdamW optimizer and linear learning rate scheduler to control how the
model learns.
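A linear schedule can be written as a small formula: the learning rate rises during an optional warmup phase, then falls in a straight line to zero by the last step. The sketch below shows that shape in plain Python. The warmup and step counts are assumed values, since the notes elide the arguments passed to get_scheduler.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for a linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup: grow from 0 up to base_lr.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay: shrink from base_lr down to 0 at the final step.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


# Assumed values for illustration: 100 steps in total, no warmup.
start_lr = linear_lr(0, 1e-4, 0, 100)
middle_lr = linear_lr(50, 1e-4, 0, 100)
end_lr = linear_lr(100, 1e-4, 0, 100)
```

Lowering the learning rate over time lets the model take large steps early on and smaller, more careful steps near the end of training.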
🔁 8. Training Loop
for epoch in range(num_epochs): ...
Computes loss
model.save_pretrained(model_path)
➡️ Saves the trained model and tokenizer to a folder for future use.
✅ In short:
This code:
🤖 What is AdamW?
AdamW is an optimizer used to train machine learning models like T5.
🧠 In simple words:
It helps the model learn by adjusting its weights during training.
It keeps the model from memorizing training data too much (by applying
weight decay).
It's the default optimizer in the Hugging Face Transformers Trainer and is
widely used for training models like T5 and BERT.
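For readers who want to see the idea in numbers, here is one AdamW update for a single weight, written in plain Python. It is a minimal sketch of the usual update rule with decoupled weight decay, not the PyTorch implementation, and the default hyperparameter values below are assumptions chosen for illustration.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single scalar weight.

    m and v are the running averages of the gradient and squared gradient;
    t is the step number, starting at 1.
    """
    # Weight decay: shrink the weight directly, separate from the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Update the running averages.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction: the averages start at 0, so early values are scaled up.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam step: move against the gradient, scaled by its recent size.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient, only the weight decay changes the weight.
decayed, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
# With no weight decay, the first step moves the weight by about lr.
stepped, _, _ = adamw_step(1.0, 1.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
```

The first example shows the "W" in AdamW: the weight shrinks a little on every step even when the gradient is zero, which is what discourages the model from memorizing the training data.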