Read text from PDF files in C# and VB.NET

SautinSoft.Pdf can read PDF files from C# or VB.NET applications at very high speeds; it can read the text of a 1,000 page PDF file (almost 500,000 words) in just 3 seconds.

Text extraction is straightforward. With a simple API and just a few lines of code, the entire text content of a PDF file can be extracted into a single string, ready for further processing.

The following example shows how to easily read the text content of each page of a PDF document.

Complete code

using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

namespace Sample
{
    class Sample
    {
        static void Main(string[] args)
        {
            string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
            
            using (var document = PdfDocument.Load(pdfFile))
            {
                foreach (var page in document.Pages)
                {
                    Console.WriteLine(page.Content.ToString());
                }
            }
        }
    }
}
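
The example above prints each page separately. As a minimal sketch of collecting the whole document into a single string and checking how long the extraction takes on your own files, the per-page text can be appended to a StringBuilder and timed with a Stopwatch. It uses only the SautinSoft.Pdf calls already shown above; the variable names and the output format are illustrative, and the timing you see will depend on your document and machine.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class Program
{
    static void Main()
    {
        // Same sample file as in the example above; replace with your own path.
        string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");

        var allText = new StringBuilder();
        int pageCount = 0;
        var stopwatch = Stopwatch.StartNew();

        using (var document = PdfDocument.Load(pdfFile))
        {
            foreach (var page in document.Pages)
            {
                // Append each page's text, followed by a line break.
                allText.AppendLine(page.Content.ToString());
                pageCount++;
            }
        }

        stopwatch.Stop();

        // The whole document's text as one string, ready for further processing.
        string text = allText.ToString();
        Console.WriteLine($"Read {pageCount} page(s), {text.Length} characters in {stopwatch.Elapsed.TotalSeconds:0.00} s.");
    }
}
```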



Reading additional information about a text

SautinSoft.Pdf simplifies content manipulation on PDF pages by representing content as a sequence of parsed or compiled elements such as text, paths, and external objects (images and forms). See the Content Streams and Resources help page for more information.

The PdfTextContent element can be used to extract additional information such as text bounds, fonts, and colors, as shown in the following example.

Complete code

using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class Program
{
	static void Main()
	{
		string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
		// Iterate through all PDF pages and through each page's content elements,
		// and retrieve only the text content elements.
		using (var document = PdfDocument.Load(pdfFile))
		{
			foreach (var page in document.Pages)
			{
				var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
				while (contentEnumerator.MoveNext())
				{
					if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
					{
						var textElement = (PdfTextContent)contentEnumerator.Current;
						var text = textElement.ToString();
						var font = textElement.Format.Text.Font;
						var color = textElement.Format.Fill.Color;
						var bounds = textElement.Bounds;

						contentEnumerator.Transform.Transform(ref bounds);
						// Read the text content element's additional information.
						Console.WriteLine($"Unicode text: {text}");
						Console.WriteLine($"Font name: {font.Face.Family.Name}");
						Console.WriteLine($"Font size: {font.Size}");
						Console.WriteLine($"Font style: {font.Face.Style}");
						Console.WriteLine($"Font weight: {font.Face.Weight}");
						if (color.TryGetRgb(out double red, out double green, out double blue))
						{
							Console.WriteLine($"Color: Red={red}, Green={green}, Blue={blue}");
						}
						Console.WriteLine($"Bounds: Left={bounds.Left:0.00}, Bottom={bounds.Bottom:0.00}, Right={bounds.Right:0.00}, Top={bounds.Top:0.00}");
						Console.WriteLine();
					}
				}
			}
		}
	}
}
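
Once the font of each text element is available, it can be aggregated. The following is a rough sketch, not an official sample, that counts how many text elements use each font family and size. It relies only on the members used in the example above (Format.Text.Font, Face.Family.Name, Size); the dictionary, the key format, and the variable names are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class Program
{
    static void Main()
    {
        string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");

        // Maps "family, size" to the number of text elements that use it.
        var fontUsage = new Dictionary<string, int>();

        using (var document = PdfDocument.Load(pdfFile))
        {
            foreach (var page in document.Pages)
            {
                var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
                while (contentEnumerator.MoveNext())
                {
                    // Skip paths, images, and other non-text elements.
                    if (contentEnumerator.Current.ElementType != PdfContentElementType.Text)
                        continue;

                    var textElement = (PdfTextContent)contentEnumerator.Current;
                    var font = textElement.Format.Text.Font;

                    // Build the key by interpolation so it works whatever the member types are.
                    string key = $"{font.Face.Family.Name}, {font.Size}";
                    fontUsage.TryGetValue(key, out int count);
                    fontUsage[key] = count + 1;
                }
            }
        }

        foreach (var entry in fontUsage)
            Console.WriteLine($"{entry.Key}: {entry.Value} text element(s)");
    }
}
```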



Reading text from a specific rectangular area

SautinSoft.Pdf allows you to extract text from a specific rectangular area of a PDF document. To do this, define the boundaries of the area of interest and retrieve only the PdfTextContent elements within it, as shown in the following example.

Complete code

using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class Program
{
	static void Main()
	{
		string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
		var pageIndex = 0;
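		// PDF coordinates grow upward from the bottom-left corner, so areaBottom must be less than areaTop.
		double areaLeft = 300, areaRight = 520, areaBottom = 510, areaTop = 720;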
		using (var document = PdfDocument.Load(pdfFile))
		{
			// Retrieve first page object.
			var page = document.Pages[pageIndex];
			// Retrieve text content elements that are inside specified area on the first page.
			var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
			while (contentEnumerator.MoveNext())
			{
				if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
				{
					var textElement = (PdfTextContent)contentEnumerator.Current;
					var bounds = textElement.Bounds;
					contentEnumerator.Transform.Transform(ref bounds);

					if (bounds.Left > areaLeft && bounds.Right < areaRight &&
						bounds.Bottom > areaBottom && bounds.Top < areaTop)
					{
						Console.Write(textElement.ToString());
					}
				}
			}
		}
	}
}
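
The comparison in the example above keeps only text elements that lie entirely inside the area. If text that merely overlaps the area should also be included, an intersection test can be used instead. The sketch below illustrates both tests with a hypothetical Area struct that is not part of SautinSoft.Pdf; it assumes PDF-style coordinates in which Bottom is less than Top, and the sample rectangles are made up.

```csharp
using System;

// Hypothetical helper, not part of SautinSoft.Pdf: an axis-aligned rectangle
// in PDF-style coordinates, where Bottom is less than Top.
struct Area
{
    public double Left, Bottom, Right, Top;

    public Area(double left, double bottom, double right, double top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // True when 'other' lies strictly inside this area (the test used in the example above).
    public bool Contains(Area other) =>
        other.Left > Left && other.Right < Right &&
        other.Bottom > Bottom && other.Top < Top;

    // True when 'other' overlaps this area at all.
    public bool Intersects(Area other) =>
        other.Left < Right && other.Right > Left &&
        other.Bottom < Top && other.Top > Bottom;
}

class Program
{
    static void Main()
    {
        var area = new Area(300, 510, 520, 720);
        var inside = new Area(310, 600, 400, 612);      // fully within the area
        var straddling = new Area(500, 600, 560, 612);  // crosses the right edge

        Console.WriteLine(area.Contains(inside));        // True
        Console.WriteLine(area.Contains(straddling));    // False
        Console.WriteLine(area.Intersects(straddling));  // True
    }
}
```

To apply this to the extraction loop, the transformed bounds of each text element would be copied into an Area and checked with Contains or Intersects, depending on how partially covered text should be treated.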




If you need a new code example or have a question, email us at support@sautinsoft.com or ask in the online chat (bottom-right corner of this page).



Questions and suggestions from you are always welcome!

We have been developing .NET components since 2002. We know the PDF, DOCX, RTF, HTML, XLSX, and image formats. If you need any assistance with creating, modifying, or converting documents in these formats, we can help you. We will write any code example for you absolutely free.