How to Extract Text from PDF Documents using C#

SautinSoft.Pdf can read PDF files from C# or VB.NET applications at very high speeds; it can read the text of a 1,000 page PDF file (almost 500,000 words) in just 3 seconds.

Text extraction is fairly easy to perform. With a simple API and just a few lines of code, the entire text content from a PDF file can be extracted in a single String, ready for your further processing.

Text extraction from PDF documents underpins tasks such as data mining, information retrieval, content analysis, and document management. Once the text has been extracted automatically, it can be searched, edited, analyzed, and repurposed like any other string. This is useful whether you are a researcher, a data analyst, a content creator, or a developer working with textual information stored in PDF format.
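Once the text is available as a string, tasks such as searching and analysis are ordinary .NET string processing. The following is a minimal sketch and not part of PDF.Net: the `TextStats` class and its method names are hypothetical, and the sample string merely stands in for text extracted from a PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of PDF.Net): basic analysis of already-extracted text.
static class TextStats
{
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Counts whitespace-separated words.
    public static int CountWords(string text) =>
        text.Split(Separators, StringSplitOptions.RemoveEmptyEntries).Length;

    // Returns the 1-based numbers of the lines that contain the keyword (case-insensitive).
    public static List<int> FindLines(string text, string keyword) =>
        text.Split('\n')
            .Select((line, index) => (line: line, number: index + 1))
            .Where(x => x.line.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            .Select(x => x.number)
            .ToList();
}

class Program
{
    static void Main()
    {
        // Stand-in for the text returned by the extraction step.
        string text = "Invoice 1024\nTotal due: 300 USD\nThank you";

        Console.WriteLine(TextStats.CountWords(text));                            // prints 8
        Console.WriteLine(string.Join(",", TextStats.FindLines(text, "total")));  // prints 2
    }
}
```

Keeping the analysis separate from the extraction means the same helper can be reused regardless of which pages or documents the text came from.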

Below is a step-by-step guide on how to extract text from PDF documents using PDF.Net.

Input file: simple text.pdf

Step-by-Step Guide

  1. Create a New Project

    Open Visual Studio and create a new Console Application project.

  2. Add PDF.Net Reference

    Install the PDF .Net package from NuGet.

  3. Write the Code to Extract Text

    Below is a sample code snippet to extract text from a PDF document:

    using System;
    using System.IO;
    using SautinSoft;
    using SautinSoft.Pdf;
    using SautinSoft.Pdf.Content;

    namespace Sample
    {
        class Sample
        {
            /// <summary>
            /// Extract text from each page of a PDF document.
            /// </summary>
            /// <remarks>
            /// Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/read-text-from-pdf-files.php
            /// </remarks>
            static void Main(string[] args)
            {
                // Path to the input PDF file
                //string pdfFile = @"C:\path\to\your\document.pdf";
                string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
                try
                {
                    // Load the document; the using block disposes it when done
                    using (var document = PdfDocument.Load(pdfFile))
                    {
                        // Extract and print the text of each page in turn
                        foreach (var page in document.Pages)
                        {
                            var text = page.Content.GetText(new PdfTextOptions
                            {
                                FontFace = new PdfFontFace("Consolas"),
                                Order = PdfTextOrder.Reading,
                                Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.Blank | PdfTextWhitespaces.NewLine
                            }).ToString();
                            Console.WriteLine(text);
                        }
                    }
                }
                catch (Exception ex)
                {
                    // Report load or extraction failures (for example, a missing file)
                    Console.WriteLine("Error: " + ex.Message);
                }
            }
        }
    }

  4. Run the Application

    Build and run your application. If everything is set up correctly, the text from the specified PDF file will be extracted.
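The sample prints each page to the console. To obtain the whole document as a single string, as mentioned in the introduction, the per-page text can be accumulated in a `StringBuilder` and then saved to a file. The sketch below reuses only the `PdfDocument.Load`, `Pages`, and `Content.GetText` calls that appear in the sample. The `ExtractAllText` helper and the output file name are illustrative assumptions, and the code needs the PDF .Net package to compile.

```csharp
using System;
using System.IO;
using System.Text;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class ExtractToFile
{
    // Hypothetical helper: concatenates the text of every page into one string.
    static string ExtractAllText(string pdfPath)
    {
        var builder = new StringBuilder();
        using (var document = PdfDocument.Load(pdfPath))
        {
            foreach (var page in document.Pages)
            {
                // Same call as in the sample, limited here to the reading-order option.
                var text = page.Content.GetText(new PdfTextOptions
                {
                    Order = PdfTextOrder.Reading
                }).ToString();

                // One page per block, separated by a line break.
                builder.AppendLine(text);
            }
        }
        return builder.ToString();
    }

    static void Main(string[] args)
    {
        string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
        string allText = ExtractAllText(pdfFile);

        // Output file name is an assumption for illustration.
        File.WriteAllText("simple text.txt", allText);
        Console.WriteLine($"Extracted {allText.Length} characters.");
    }
}
```

Returning the text from a helper, rather than printing inside the loop, keeps extraction separate from whatever is done with the result afterwards.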

Additional Features

    PDF.Net offers various other features for handling PDF documents, such as:
  • Extracting images from PDF files.
  • Converting PDF to other formats like DOCX, HTML, and images.
  • Merging and splitting PDF files.
  • Adding and reading interactive forms.

Conclusion

Extracting text from PDF documents using PDF.Net is a simple and efficient process. With just a few lines of code, you can integrate powerful PDF text extraction capabilities into your applications. Whether you are working on a small project or a large-scale application, PDF.Net provides the tools you need to handle PDF documents effectively.

