Reading text from a specific rectangular area in C# and VB.NET

Extracting text by coordinates from PDF documents is useful for tasks such as data extraction, form field analysis, and content filtering. You define the coordinates of the area that contains the desired text, and only the text inside that area is retrieved. This helps automate data processing, document analysis, and information retrieval.

Below is a step-by-step guide to extracting text at given coordinates from PDF documents using PDF .Net.

Input file: Asset Recovery Evaluation.pdf


Step-by-Step Guide

Create a New Project
Open Visual Studio and create a new Console Application project.
Add PDF.Net Reference
Download the PDF.Net library and add it to your project. You can do this by right-clicking on your project in the Solution Explorer, selecting "Add Reference," and browsing to the PDF.Net DLL.
Write the Code to Extract Content
The sample below reads the text located inside a given rectangular area on the first page of a PDF document:

Complete code


using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

namespace Sample
{
    class Sample
    {
        /// <summary>
        /// Read text from a specific rectangular area of the first page.
        /// </summary>
        /// <remarks>
        /// Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/read-text-from-pdf-files.php
        /// </remarks>
        static void Main(string[] args)
        {
            // Path to the input PDF file
            string pdfFile = Path.GetFullPath(@"..\..\..\Asset Recovery Evaluation.pdf");


            // Define the boundaries of the rectangular area in page coordinates.
            // Note that areaBottom is smaller than areaTop: Y values grow upward.
            float areaLeft = 320;
            float areaRight = 440;
            float areaBottom = 734;
            float areaTop = 750;

            try
            {
                // Load the PDF document
                using (var document = PdfDocument.Load(pdfFile))
                {
                    // Retrieve the first page object
                    var page = document.Pages[0];

                    // Retrieve text content elements that are inside the specified area on the first page
                    var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
                    while (contentEnumerator.MoveNext())
                    {
                        if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                        {
                            var textElement = (PdfTextContent)contentEnumerator.Current;
                            var bounds = textElement.Bounds;
                            contentEnumerator.Transform.Transform(bounds);

                            if (bounds.Left > areaLeft && bounds.Right < areaRight && bounds.Bottom > areaBottom && bounds.Top < areaTop)
                            {
                                Console.WriteLine(textElement.ToString());
                            }
                        }
                    }
                }
                Console.WriteLine("Text extraction from the specified area completed successfully!");
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: " + ex.Message);
            }

        }
    }
}
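
The boundary values in the sample are plain numbers, so it helps to know how to derive them from a measurement on the page. The following is only a sketch: it assumes the default PDF convention of points (1/72 inch) measured from the bottom-left corner of the page, and it assumes a US Letter page height of 792 points. The helper name FromTopLeftMm and the measurements passed to it are hypothetical and are not part of the PDF .Net library.

```csharp
using System;

static class AreaUnits
{
    // Default PDF user space: 1 point = 1/72 inch, and 1 inch = 25.4 mm.
    const double PointsPerMm = 72.0 / 25.4;

    // Hypothetical helper: converts a rectangle measured in millimetres from the
    // top-left corner of the page into left/right/bottom/top values in points
    // measured from the bottom-left corner.
    public static (double Left, double Right, double Bottom, double Top) FromTopLeftMm(
        double xMm, double yMm, double widthMm, double heightMm, double pageHeightPt)
    {
        double left = xMm * PointsPerMm;
        double right = (xMm + widthMm) * PointsPerMm;
        double top = pageHeightPt - yMm * PointsPerMm;
        double bottom = top - heightMm * PointsPerMm;
        return (left, right, bottom, top);
    }

    static void Main()
    {
        // Assumption: US Letter page, 792 points tall. Measurements are made up.
        var area = FromTopLeftMm(xMm: 113, yMm: 15, widthMm: 42, heightMm: 5.5, pageHeightPt: 792);
        Console.WriteLine(
            $"left={area.Left:F1} right={area.Right:F1} bottom={area.Bottom:F1} top={area.Top:F1}");
    }
}
```

The resulting values can then be assigned to areaLeft, areaRight, areaBottom and areaTop in the sample. Check the actual page size of your document before relying on the assumed height.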


Run the Application
Build and run your application. If everything is set up correctly, the text found inside the specified area of the first page is printed to the console, followed by a completion message.
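
If the output is empty, the area test itself is worth a look. The sample keeps a text element only when all four edges of its bounds lie strictly inside the area, so text that straddles an edge is skipped. The sketch below contrasts that rule with a looser overlap test. The Box struct is a stand-in written for this illustration, not the library's rectangle type, and the coordinates of the two test boxes are made up.

```csharp
using System;

// Stand-in bounding box for illustration; not a PDF .Net type.
readonly struct Box
{
    public Box(double left, double bottom, double right, double top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }
}

static class AreaTests
{
    // Same rule as the sample: every edge must lie strictly inside the area.
    public static bool IsInside(Box b, Box area) =>
        b.Left > area.Left && b.Right < area.Right &&
        b.Bottom > area.Bottom && b.Top < area.Top;

    // Looser alternative: any overlap with the area counts.
    public static bool Overlaps(Box b, Box area) =>
        b.Left < area.Right && b.Right > area.Left &&
        b.Bottom < area.Top && b.Top > area.Bottom;

    static void Main()
    {
        var area = new Box(320, 734, 440, 750);       // area used in the sample
        var inside = new Box(330, 736, 400, 748);     // fully within the area
        var straddling = new Box(300, 736, 400, 748); // crosses the left edge

        Console.WriteLine(IsInside(inside, area));     // True
        Console.WriteLine(IsInside(straddling, area)); // False
        Console.WriteLine(Overlaps(straddling, area)); // True
    }
}
```

Which rule fits depends on the document: strict containment avoids picking up neighbouring text, while the overlap test is more forgiving when the area is drawn tightly around the target.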

Additional Features

Extracting images from PDF files.
Converting PDF to other formats like DOCX, HTML, and images.
Merging and splitting PDF files.
Adding and reading interactive forms.

Conclusion

With SautinSoft's PDF .Net library, a C# application can extract content from a PDF document based on specified boundaries. The sample above enumerates the content elements of a page and prints the text elements whose bounds fall within the defined rectangle.

If you need a new code example or have a question, email us at support@sautinsoft.com.

Questions and suggestions from you are always welcome!

We have been developing .NET components since 2002 and work with the PDF, DOCX, RTF, HTML, XLSX and image formats. If you need help creating, modifying or converting documents in these formats, we can write a code example for you free of charge.
