
PDF File Extraction

Uploaded by

mahesh Kumar
Copyright: © All Rights Reserved

Problem statement

1. Summarization
2. Table Extraction
3. Exact Value Extraction
Objective
• Summarization: Develop an automated text summarization system to
generate concise and coherent summaries from lengthy documents.
• Table Extraction: Create an efficient algorithm to identify and extract
tables from various document formats.
• Exact Value Extraction: Implement a method to accurately extract
numerical data and relevant information from the identified tables.
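
As a minimal sketch of the exact value extraction objective, the snippet below parses numbers out of table cells that have already been extracted as strings. The names parse_number and numeric_column, the regular expression, and the choice of Decimal (to avoid floating-point rounding) are illustrative assumptions, not part of the original slides.

```python
import re
from decimal import Decimal
from typing import List, Optional

# Assumed number format: optional minus sign, digits with optional
# thousands separators, optional decimal part (e.g. "-1,234.50").
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_number(cell: Optional[str]) -> Optional[Decimal]:
    """Return the first number found in a cell, or None if there is none."""
    if not cell:
        return None
    match = NUMBER_PATTERN.search(cell)
    if match is None:
        return None
    # Drop thousands separators before converting; Decimal keeps the
    # digits exactly as written instead of rounding to a binary float.
    return Decimal(match.group().replace(",", ""))


def numeric_column(table: List[List[Optional[str]]],
                   column_index: int,
                   has_header: bool = True) -> List[Optional[Decimal]]:
    """Parse one column of a table (a list of rows) into numbers."""
    rows = table[1:] if has_header else table
    return [parse_number(row[column_index])
            for row in rows if column_index < len(row)]
```

Cells without a number come back as None, so missing values stay visible instead of silently becoming zero.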
Summarizing PDF

1. Hugging Face Transformers: Use the transformers library from Hugging Face, which provides access to pre-trained models.
2. BART Model: Use the BART model (facebook/bart-base) for summarization; its autoregressive decoder generates the summary token by token.
3. Extract Text: Extract the text of the first page of the PDF using pdfplumber.
4. Tokenization: Tokenize the text with the BART tokenizer to prepare the model input.
5. Summarization: Generate a summary with the BART model and decode the generated token IDs into the final text.
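
The steps above could be sketched roughly as follows. The function names, the beam-search settings, and the token limits are assumptions chosen for illustration. Note that facebook/bart-base is a pre-trained checkpoint that has not been fine-tuned for summarization, so a fine-tuned checkpoint such as facebook/bart-large-cnn will typically give more useful summaries.

```python
def extract_first_page_text(pdf_path: str) -> str:
    """Return the text of the first page, or "" if the page has no text."""
    import pdfplumber  # third-party; imported lazily: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        if not pdf.pages:
            return ""
        # extract_text() can return None for pages without a text layer.
        return pdf.pages[0].extract_text() or ""


def normalize_whitespace(text: str) -> str:
    """Collapse line breaks and repeated spaces left over from PDF layout."""
    return " ".join(text.split())


def summarize(text: str,
              model_name: str = "facebook/bart-base",
              max_input_tokens: int = 1024,    # assumed input limit
              max_summary_tokens: int = 128) -> str:  # assumed output limit
    """Tokenize the text, generate a summary with BART, and decode it."""
    # third-party; imported lazily: pip install transformers torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Truncate long inputs so they fit the model's maximum input length.
    inputs = tokenizer(normalize_whitespace(text), return_tensors="pt",
                       truncation=True, max_length=max_input_tokens)
    summary_ids = model.generate(inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"],
                                 num_beams=4,
                                 max_length=max_summary_tokens,
                                 early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# Example usage ("document.pdf" is a hypothetical file name):
# print(summarize(extract_first_page_text("document.pdf")))
```

Only the first page is read, as in the steps above; longer documents would need to be split into chunks that fit the input limit.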
Extracting Tables from PDF

1. Use the pdfplumber library: Use pdfplumber to extract tables from the PDF document.
2. Page-by-page processing: Process each page of the PDF individually to extract its text content and tables.
3. Table extraction: Call pdfplumber's extract_tables() method on each page to extract its tables.
4. Storing results: Append the tables extracted from each page to a list, producing a list of tables for all pages.
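
A minimal sketch of this page-by-page loop is shown below. The function names extract_tables_by_page and flatten are illustrative assumptions; extract_tables() is the pdfplumber page method named above and returns each table as a list of rows.

```python
from typing import List, Optional

# A table is a list of rows; each row is a list of cell strings (or None).
Table = List[List[Optional[str]]]


def extract_tables_by_page(pdf_path: str) -> List[List[Table]]:
    """Return one list of tables per page, in page order."""
    import pdfplumber  # third-party; imported lazily: pip install pdfplumber

    tables_per_page: List[List[Table]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Pages without tables contribute an empty list.
            tables_per_page.append(page.extract_tables())
    return tables_per_page


def flatten(tables_per_page: List[List[Table]]) -> List[Table]:
    """Merge the per-page lists into a single list of tables."""
    return [table for page_tables in tables_per_page for table in page_tables]


# Example usage ("document.pdf" is a hypothetical file name):
# all_tables = flatten(extract_tables_by_page("document.pdf"))
```

Keeping the per-page grouping preserves where each table came from; flatten is only needed when the page of origin does not matter.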
Generating Python Code from PDF
• Process Each Line: Iterate through each line of text extracted from the
PDF.
• Custom Function: Create a function that generates Python code with
one print statement per line of text.
• Skip Empty Lines: Exclude empty lines from the output code to avoid
unnecessary print statements.
• Customization: Adapt the code generation to specific requirements,
such as filtering lines by keyword or changing the formatting.
• Output Code: Store the generated Python code in a variable for
further use or presentation.
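
One way to sketch these steps is shown below. The function name generate_print_code and the optional keyword filter are assumptions for illustration; the input is the text already extracted from the PDF.

```python
from typing import Optional


def generate_print_code(text: str, keyword: Optional[str] = None) -> str:
    """Return Python source with one print statement per non-empty line.

    If keyword is given, only lines containing it (case-insensitive)
    are included.
    """
    statements = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        if keyword is not None and keyword.lower() not in stripped.lower():
            continue  # skip lines that do not match the filter
        # repr() escapes quotes and backslashes so the generated code
        # remains valid Python.
        statements.append(f"print({stripped!r})")
    return "\n".join(statements)


# Store the generated code in a variable for further use.
generated_code = generate_print_code("Total: 5\n\nName: sample")
```

The returned string can be written to a .py file or displayed as is.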
