---
title: Extracting Text Within a Specific Rectangle in PDF Documents
description: Learn how to extract text from a specified rectangle area within PDF pages using RadPdfProcessing.
type: how-to
page_title: How to Extract Text from a Specified Area in PDF Pages
slug: extract-text-specific-rectangle-pdf-radpdfprocessing
tags: radpdfprocessing, pdf, textfragment, cropbox, extract, text
res_type: kb
category: knowledge-base
ticketid: 1653594
---

Environment

| Version | Product | Author |
| ---- | ---- | ---- |
| 2024.2.426 | RadPdfProcessing | Desislava Yordanova |

Description

Learn how to extract the text from specific rectangular areas within PDF pages.

Solution

To extract text from a specific rectangle or crop box within a PDF page, use the [TextFragment]({%slug radpdfprocessing-model-textfragment%}) class together with the [MatrixPosition]({%slug radpdfprocessing-concepts-position%}) object exposed through its Position property. The following code snippet loads a PDF document, defines a rectangle for the area from which text should be extracted, and iterates over the text fragments on each page. A fragment's text is written to the output when the fragment's origin (the OffsetX and OffsetY values of its position matrix) lies inside the rectangle, so a fragment that starts outside the rectangle is skipped even if part of it extends into the area.

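        // The namespaces below assume the .NET Framework build of RadPdfProcessing,
        // where Rect comes from System.Windows; adjust them for other builds.
        using System.Diagnostics;
        using System.IO;
        using System.Linq;
        using System.Windows;
        using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
        using Telerik.Windows.Documents.Fixed.Model;
        using Telerik.Windows.Documents.Fixed.Model.Data;
        using Telerik.Windows.Documents.Fixed.Model.Text;

        static void Main(string[] args)
        {
            string originalFilePath = @"WinForms PdfViewer.pdf";
            PdfFormatProvider provider = new PdfFormatProvider();
            RadFixedDocument croppedDocument = provider.Import(File.ReadAllBytes(originalFilePath));

            // The target rectangle is derived from the size of the first page and reused for every page.
            var pageSize = croppedDocument.Pages.First().Size;
            Rect middleRectangle = new Rect(pageSize.Width / 2, pageSize.Height / 3, pageSize.Width, pageSize.Height / 3);

            foreach (RadFixedPage currentPage in croppedDocument.Pages)
            {
                foreach (var contentElement in currentPage.Content)
                {
                    // Only text fragments carry text; skip paths, images, and other content.
                    TextFragment textFragment = contentElement as TextFragment;
                    if (textFragment == null)
                    {
                        continue;
                    }

                    // Ignore fragments that contain nothing but whitespace.
                    string currentText = textFragment.Text;
                    if (string.IsNullOrWhiteSpace(currentText))
                    {
                        continue;
                    }

                    // Guard against positions that are not matrix-based.
                    MatrixPosition position = textFragment.Position as MatrixPosition;
                    if (position != null && middleRectangle.Contains(position.Matrix.OffsetX, position.Matrix.OffsetY))
                    {
                        Debug.Write(currentText);
                    }
                }
            }
        }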

The screenshot below shows the middle part of the page that the rectangle covers:

Rectangle with text in PdfProcessing

The detected text is printed in the Output console:

Extracted text in PdfProcessing
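
The containment check can also be tried in isolation, without a PDF file. The following is a minimal sketch, not part of RadPdfProcessing: `Region`, `RegionFilter`, `MiddleBand`, and `Filter` are hypothetical names introduced for illustration, and each fragment is reduced to its text plus the X and Y offsets of its origin. It assumes the same rectangle proportions as the snippet above and shows that only the fragment origin decides whether the text is kept.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Rect type used in the article.
public readonly struct Region
{
    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public double Height { get; }

    public Region(double x, double y, double width, double height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    // True when the point lies inside the region, edges included.
    public bool Contains(double px, double py) =>
        px >= X && px <= X + Width && py >= Y && py <= Y + Height;
}

public static class RegionFilter
{
    // Same proportions as the article's rectangle: it starts at half the page
    // width and one third of the page height, and is one third of the height tall.
    public static Region MiddleBand(double pageWidth, double pageHeight) =>
        new Region(pageWidth / 2, pageHeight / 3, pageWidth, pageHeight / 3);

    // Keeps the text of every non-blank fragment whose origin lies inside the region.
    public static List<string> Filter(
        Region region,
        IEnumerable<(string Text, double X, double Y)> fragments)
    {
        var kept = new List<string>();
        foreach (var (text, x, y) in fragments)
        {
            if (string.IsNullOrWhiteSpace(text))
            {
                continue;
            }

            if (region.Contains(x, y))
            {
                kept.Add(text);
            }
        }

        return kept;
    }
}
```

For a page of 600 by 900 units, `MiddleBand` yields a region that starts at (300, 300). A fragment whose origin is at (320, 450) is kept, while one that starts at (100, 450) is dropped even if its text runs into the region. If partially overlapping fragments should be included as well, the check would have to compare the fragment's full bounds against the region instead of its origin alone.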

See Also

  • [RadPdfProcessing Documentation]({%slug radpdfprocessing-overview%})
  • [TextFragment]({%slug radpdfprocessing-model-textfragment%})
  • [MatrixPosition]({%slug radpdfprocessing-concepts-position%})