Click or drag to resize

PdfFocusCOCROptionseOCRMode Enumeration

Represent OCR modes: Disable (default), All images, Automatic.

Namespace: SautinSoft
Assembly: SautinSoft.PdfFocus (in SautinSoft.PdfFocus.dll) Version: 2024.8.6
Syntax
public enum eOCRMode
Members
Member nameValueDescription
Disabled0 Don't make OCR (optical recognizing) for images at all. The images will be placed into resulting document as is.
AllImages1 Perform the OCR (optical recognizing) for every image. Note, the component will consider the every image as textual data scanned or photographed and try to recognize it. In any case (successfully or failing) recognizing (OCR) the all images will NOT be placed in the resulting document.
Auto2 Perform the OCR (optical recognizing) for images that looks as scanned or photographed text. Such images (after performing OCR) will be placed in the resulting document as text. Other images will be placed in the resulting document as images.
Example
Perform OCR using free Tesseract SDK in C#
using System.IO;
using SautinSoft;
using System;

namespace Example
{
    class Program
    {
        static void Main(string[] args)
        {
            // Before starting, we recommend to get a free 100-day key:
            // https://sautinsoft.com/start-for-free/

            // Apply the key here:
            // SautinSoft.PdfFocus.SetLicense("...");

            // Note: Please rebuild the project to restore Nuget packages.
            LoadScannedPdf();
        }

        /// <summary>
        /// Load a scanned PDF document with help of Tesseract OCR (free OCR library) and save the result as DOCX document.
        /// </summary>
        static void LoadScannedPdf()
        {
            // Here we'll load a scanned PDF document (perform OCR) containing a text on English, Russian and Vietnamese.
            // Next save the OCR result as a new DOCX document.

            // First steps:

            // 1. Download data files for English, Russian and Vietnamese languages.
            // Please download the files: eng.traineddata, rus.traineddata and vie.traineddata.
            // From here (good and fast): https://github.com/tesseract-ocr/tessdata_fast
            // or (best and slow): https://github.com/tesseract-ocr/tessdata_best

            // 2. Copy the files: eng.traineddata, rus.traineddata and vie.traineddata to
            // the folder "tessdata" in the Project root.

            // 3. Be sure that the folder "tessdata" also contains "pdf.ttf" file.

            // Let's start:
            string inpFile = Path.GetFullPath(@"..\..\..\scan.pdf");
            string outFile = "Result.docx";

            PdfFocus f = new PdfFocus();
            f.OCROptions.Mode = PdfFocus.COCROptions.eOCRMode.AllImages;

            f.OpenPdf(inpFile);
            bool result = false;
            if (f.PageCount > 0)
            {
                result = f.ToWord(outFile) == 0;
            }
            // Open the result for demonstration purposes.
            if (result)
            {                
                System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(outFile) { UseShellExecute = true });
            }
            else
                Console.WriteLine("Conversion failed!");
        }
    }
}
Perform OCR using free Tesseract SDK in VB.Net
Imports System.IO
Imports SautinSoft
Imports System

Namespace Example
    Friend Class Sample
        Shared Sub Main(ByVal args() As String)
            ' Before starting, we recommend to get a free 100-day key:
            ' https://sautinsoft.com/start-for-free/

            ' Apply the key here:
            ' SautinSoft.PdfFocus.SetLicense("...");

            ' Note: Please rebuild the project to restore Nuget packages.
            LoadScannedPdf()
        End Sub

        ''' <summary>
        ''' Load a scanned PDF document with help of Tesseract OCR (free OCR library) and save the result as DOCX document.
        ''' </summary>
        Private Shared Sub LoadScannedPdf()
            ' Here we'll load a scanned PDF document (perform OCR) containing a text on English, Russian and Vietnamese.
            ' Next save the OCR result as a new DOCX document.

            ' First steps:

            ' 1. Download data files for English, Russian and Vietnamese languages.
            ' Please download the files: eng.traineddata, rus.traineddata and vie.traineddata.
            ' From here (good and fast): https://github.com/tesseract-ocr/tessdata_fast
            ' or (best and slow): https://github.com/tesseract-ocr/tessdata_best

            ' 2. Copy the files: eng.traineddata, rus.traineddata and vie.traineddata to
            ' the folder "tessdata" in the Project root.

            ' 3. Be sure that the folder "tessdata" also contains "pdf.ttf" file.

            ' Let's start:
            Dim inpFile As String = Path.GetFullPath("..\..\..\scan.pdf")
            Dim outFile As String = "Result.docx"

            Dim f As New PdfFocus()
            f.OCROptions.Mode = PdfFocus.COCROptions.eOCRMode.AllImages

            f.OpenPdf(inpFile)
            Dim result As Boolean = False
            If f.PageCount > 0 Then
                result = f.ToWord(outFile) = 0
            End If
            ' Open the result for demonstration purposes.
            If result Then
                System.Diagnostics.Process.Start(New System.Diagnostics.ProcessStartInfo(outFile) With {.UseShellExecute = True})
            Else
                Console.WriteLine("Conversion failed!")
            End If
        End Sub
    End Class
End Namespace
See Also