Reading text from a specific rectangular area in C# and VB.NET

Extracting text by coordinates from PDF documents is useful for tasks such as data extraction, form field analysis, and content filtering. You define a rectangle on a page and retrieve only the text that falls inside it. This is particularly helpful for automating data processing, document analysis, and information retrieval when the information you need sits at a known position on the page.

Below is a step-by-step guide to extracting text at given coordinates from PDF documents using PDF.Net.

Step-by-Step Guide

  1. Create a New Project

    Open Visual Studio and create a new Console Application project.

  2. Add PDF.Net Reference

    Download the PDF.Net library and add it to your project. You can do this by right-clicking on your project in the Solution Explorer, selecting "Add Reference," and browsing to the PDF.Net DLL.

  3. Write the Code to Extract Content

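    The code loads the PDF document, walks through the content elements of the first page, and prints every text element whose bounds lie inside the target rectangle. The full listing is given in the next step.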

  4. Complete code

    using System;
    using System.IO;
    using SautinSoft;
    using SautinSoft.Pdf;
    using SautinSoft.Pdf.Content;
    
    namespace Sample
    {
        class Sample
        {
            /// <summary>
            /// Read text from a specific rectangular area of a PDF page.
            /// </summary>
            /// <remarks>
            /// Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/read-text-from-pdf-files.php
            /// </remarks>
            static void Main(string[] args)
            {
                // Path to the input PDF file
                string pdfFile = Path.GetFullPath(@"..\..\..\Asset Recovery Evaluation.pdf");
    
    
                // Define the boundaries of the rectangular area
                float areaLeft = 320;
                float areaRight = 440;
                float areaBottom = 734;
                float areaTop = 750;
    
                try
                {
                    // Load the PDF document
                    using (var document = PdfDocument.Load(pdfFile))
                    {
                        // Retrieve the first page object
                        var page = document.Pages[0];
    
                        // Retrieve text content elements that are inside the specified area on the first page
                        var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
                        while (contentEnumerator.MoveNext())
                        {
                            if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                            {
                                var textElement = (PdfTextContent)contentEnumerator.Current;
                                var bounds = textElement.Bounds;
                                contentEnumerator.Transform.Transform(bounds);
    
                                if (bounds.Left > areaLeft && bounds.Right < areaRight && bounds.Bottom > areaBottom && bounds.Top < areaTop)
                                {
                                    Console.WriteLine(textElement.ToString());
                                }
                            }
                        }
                    }
                    Console.WriteLine("Text extraction from the specified area completed successfully!");
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error: " + ex.Message);
                }
    
            }
        }
    }
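The filter in the listing above keeps a text element only when its bounds lie strictly inside the rectangle on all four sides, so text that straddles an edge is skipped. The following is a minimal, library-independent sketch of that test next to a looser overlap test. The `Rect` type and the `AreaFilter` methods are hypothetical helpers written for illustration; they are not part of PDF.Net.

```csharp
using System;

// Hypothetical helper type for illustration; it is not part of PDF.Net.
public readonly struct Rect
{
    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }

    public Rect(double left, double bottom, double right, double top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }
}

public static class AreaFilter
{
    // True only when 'inner' lies strictly inside 'area' on all four sides,
    // mirroring the condition used in the listing.
    public static bool IsStrictlyInside(Rect inner, Rect area) =>
        inner.Left > area.Left && inner.Right < area.Right &&
        inner.Bottom > area.Bottom && inner.Top < area.Top;

    // True when the two rectangles share any interior area, so text that
    // only partly overlaps the target rectangle is accepted as well.
    public static bool Overlaps(Rect a, Rect b) =>
        a.Left < b.Right && a.Right > b.Left &&
        a.Bottom < b.Top && a.Top > b.Bottom;
}

public static class AreaFilterDemo
{
    public static void Main()
    {
        var area = new Rect(320, 734, 440, 750);        // rectangle from the listing
        var inside = new Rect(330, 736, 400, 748);      // fully inside the area
        var straddling = new Rect(300, 736, 350, 748);  // crosses the left edge

        Console.WriteLine(AreaFilter.IsStrictlyInside(inside, area));      // True
        Console.WriteLine(AreaFilter.IsStrictlyInside(straddling, area));  // False
        Console.WriteLine(AreaFilter.Overlaps(straddling, area));          // True
    }
}
```

Strict containment suits fields with fixed positions. The overlap test may be the better choice when a text run can extend past the rectangle, at the cost of picking up neighbouring text.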

  5. Run the Application

    Build and run your application. If everything is set up correctly, each text element found inside the specified rectangle on the first page is printed to the console, followed by a completion message.
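If the program prints only the completion message, the rectangle may not match the page's coordinate system. The sketch below assumes the area values are PDF user-space units (points, 1/72 inch) measured from the bottom-left corner of the page, which is the PDF default. That is an assumption about how the numbers in the listing were chosen, not something this guide states. The `PageUnits` class and its methods are hypothetical helpers, not part of PDF.Net, and the page height of 792 points (US Letter) is only an example value.

```csharp
using System;

// Hypothetical helpers for illustration; they are not part of PDF.Net.
public static class PageUnits
{
    private const double PointsPerInch = 72.0;
    private const double MillimetersPerInch = 25.4;

    // Converts a length in millimeters to PDF points.
    public static double MillimetersToPoints(double millimeters) =>
        millimeters / MillimetersPerInch * PointsPerInch;

    // Converts a distance measured down from the top edge of the page
    // into a Y coordinate measured up from the bottom edge.
    public static double FromTopToBottomOrigin(double distanceFromTop, double pageHeight) =>
        pageHeight - distanceFromTop;
}

public static class PageUnitsDemo
{
    public static void Main()
    {
        const double pageHeight = 792; // assumed: US Letter height in points

        // A box whose top edge is 42 pt and whose bottom edge is 58 pt
        // below the top of the page, as a viewer's ruler might report it.
        double areaTop = PageUnits.FromTopToBottomOrigin(42, pageHeight);
        double areaBottom = PageUnits.FromTopToBottomOrigin(58, pageHeight);

        Console.WriteLine($"areaBottom = {areaBottom}, areaTop = {areaTop}");
        // prints: areaBottom = 734, areaTop = 750

        Console.WriteLine(PageUnits.MillimetersToPoints(25.4)); // prints: 72
    }
}
```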

Additional Features

    PDF.Net offers various other features for handling PDF documents, such as:
  • Extracting images from PDF files.
  • Converting PDF to other formats like DOCX, HTML, and images.
  • Merging and splitting PDF files.
  • Adding and reading interactive forms.

Conclusion

Extracting content from PDF documents based on specified boundaries in C# can be achieved efficiently with SautinSoft's PDF.Net library. It lets developers locate and extract text or other elements within a PDF by defining the boundaries of the area that contains them.


If you need a new code example or have a question, email us at support@sautinsoft.com.

Questions and suggestions from you are always welcome!

We have been developing .NET components since 2002. We know the PDF, DOCX, RTF, HTML, XLSX, and image formats. If you need any assistance with creating, modifying, or converting documents in various formats, we can help you. We will write any code example for you absolutely free.