SautinSoft.Pdf reads PDF files very quickly from your C# or VB.NET application. It can read a 1,000-page PDF file full of text (almost 500,000 words) in just three seconds.
Text extraction is straightforward. Using a simple API and just a few lines of code, you can retrieve the entire text content of a PDF file as a single String, ready for further processing.
The following example shows how to read the text content of each page in a PDF document.
using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

namespace Sample
{
    class Sample
    {
        static void Main(string[] args)
        {
            // Resolve the sample file path relative to the current working directory.
            string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");

            using (var document = PdfDocument.Load(pdfFile))
            {
                // Print the text content of each page.
                foreach (var page in document.Pages)
                {
                    Console.WriteLine(page.Content.ToString());
                }
            }
        }
    }
}
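If you need the whole document as one string rather than page by page, you can collect each page's text and join it afterwards. The sketch below is one possible way to do that in plain C#. `PageTextJoiner`, `Join`, and `CountWords` are hypothetical helper names, not part of SautinSoft.Pdf, and the form-feed page separator is an assumption you can change.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SautinSoft.Pdf.
static class PageTextJoiner
{
    // Joins per-page text into one string. Pages are separated by a form feed
    // (assumed separator) so that page boundaries stay recoverable.
    public static string Join(IEnumerable<string> pageTexts)
    {
        return string.Join("\f", pageTexts);
    }

    // Counts whitespace-separated words, treating the form feed as whitespace.
    public static int CountWords(string text)
    {
        var separators = new[] { ' ', '\t', '\r', '\n', '\f' };
        return text.Split(separators, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

In the example above, you could add each `page.Content.ToString()` result to a `List<string>` inside the `foreach` loop and pass that list to `PageTextJoiner.Join`.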
SautinSoft.Pdf simplifies PDF page content operations by representing the content as a sequence of parsed, or compiled, elements, such as text, path, and external objects (images and forms). For more information, see the Content Streams and Resources help page.
The PdfTextContent elements can be used to extract additional information about the text, such as its bounds, font, and color, as shown in the next example.
using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class Program
{
    static void Main()
    {
        string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");

        // Iterate through all PDF pages and through each page's content elements,
        // and retrieve only the text content elements.
        using (var document = PdfDocument.Load(pdfFile))
        {
            foreach (var page in document.Pages)
            {
                var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
                while (contentEnumerator.MoveNext())
                {
                    if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                    {
                        var textElement = (PdfTextContent)contentEnumerator.Current;
                        var text = textElement.ToString();
                        var font = textElement.Format.Text.Font;
                        var color = textElement.Format.Fill.Color;
                        var bounds = textElement.Bounds;
                        contentEnumerator.Transform.Transform(ref bounds);

                        // Read the text content element's additional information.
                        Console.WriteLine($"Unicode text: {text}");
                        Console.WriteLine($"Font name: {font.Face.Family.Name}");
                        Console.WriteLine($"Font size: {font.Size}");
                        Console.WriteLine($"Font style: {font.Face.Style}");
                        Console.WriteLine($"Font weight: {font.Face.Weight}");
                        if (color.TryGetRgb(out double red, out double green, out double blue))
                            Console.WriteLine($"Color: Red={red}, Green={green}, Blue={blue}");
                        Console.WriteLine($"Bounds: Left={bounds.Left:0.00}, Bottom={bounds.Bottom:0.00}, Right={bounds.Right:0.00}, Top={bounds.Top:0.00}");
                        Console.WriteLine();
                    }
                }
            }
        }
    }
}
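The red, green, and blue values returned by `TryGetRgb` are doubles. If you want to log or compare colors in the familiar `#RRGGBB` notation, a small conversion helper may help. This is a minimal sketch that assumes each component lies in the 0 to 1 range, as is usual for PDF DeviceRGB colors. `ColorFormat` and `ToHex` are hypothetical names, not part of SautinSoft.Pdf.

```csharp
using System;

// Hypothetical helper, not part of SautinSoft.Pdf.
static class ColorFormat
{
    // Converts RGB components (assumed to be in the 0..1 range) to "#RRGGBB".
    public static string ToHex(double red, double green, double blue)
    {
        return $"#{ToByte(red):X2}{ToByte(green):X2}{ToByte(blue):X2}";
    }

    // Clamps a component to 0..1 and scales it to an integer in 0..255.
    private static int ToByte(double component)
    {
        double clamped = Math.Max(0.0, Math.Min(1.0, component));
        return (int)Math.Round(clamped * 255.0);
    }
}
```

Inside the `if (color.TryGetRgb(...))` branch of the example above, you could then print `ColorFormat.ToHex(red, green, blue)` instead of the three separate values.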
With SautinSoft.Pdf, you can extract a PDF document's text from a specific rectangular area. To do this, you define the bounds of the targeted area and retrieve only the PdfTextContent elements that are within it, as shown in the next example.
using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

class Program
{
    static void Main()
    {
        string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
        var pageIndex = 0;

        // Bounds of the targeted area. Bottom must be less than Top,
        // otherwise the filter below can never match an element.
        double areaLeft = 300, areaRight = 520, areaBottom = 510, areaTop = 720;

        using (var document = PdfDocument.Load(pdfFile))
        {
            // Retrieve first page object.
            var page = document.Pages[pageIndex];

            // Retrieve text content elements that are inside specified area on the first page.
            var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
            while (contentEnumerator.MoveNext())
            {
                if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                {
                    var textElement = (PdfTextContent)contentEnumerator.Current;
                    var bounds = textElement.Bounds;
                    contentEnumerator.Transform.Transform(ref bounds);

                    // Keep only elements whose bounds lie entirely inside the area.
                    if (bounds.Left > areaLeft && bounds.Right < areaRight &&
                        bounds.Bottom > areaBottom && bounds.Top < areaTop)
                    {
                        Console.Write(textElement.ToString());
                    }
                }
            }
        }
    }
}
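The condition in the example above keeps a text element only when its bounds lie entirely inside the targeted area, so an element that straddles the border is skipped. If you would rather keep every element that touches the area, an overlap test is one alternative. The sketch below shows both tests on plain coordinates. `Area`, `Contains`, and `Intersects` are hypothetical names, not part of SautinSoft.Pdf, and the code assumes that Left is less than or equal to Right and Bottom is less than or equal to Top.

```csharp
using System;

// Hypothetical helper type, not part of SautinSoft.Pdf.
readonly struct Area
{
    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }

    public Area(double left, double bottom, double right, double top)
    {
        // Reject inverted rectangles early; they would silently match nothing.
        if (left > right || bottom > top)
            throw new ArgumentException("Expected left <= right and bottom <= top.");
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    // True when 'other' lies strictly inside this area (the test used in the example).
    public bool Contains(Area other) =>
        other.Left > Left && other.Right < Right &&
        other.Bottom > Bottom && other.Top < Top;

    // True when 'other' overlaps this area at all, even partially.
    public bool Intersects(Area other) =>
        other.Left < Right && other.Right > Left &&
        other.Bottom < Top && other.Top > Bottom;
}
```

To use it with the example, you could build one `Area` from the targeted bounds and another from each element's `bounds.Left`, `bounds.Bottom`, `bounds.Right`, and `bounds.Top`, then choose `Contains` or `Intersects` depending on how strict the selection should be.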
If you need a new code example or have a question, email us at support@sautinsoft.com.