How to use the OCR option on Linux/macOS


As you know, PDF Focus .Net supports OCR. Deploying it on Windows requires no extra effort. On Linux/macOS, however, some additional setup is needed. Once it is done, scanned PDF documents can be converted with good quality.

So, let's get started, step by step.

  1. Download the latest version of PDF Focus .Net from NuGet into your project.
  2. Add the following additional dependencies to your Visual Studio project:

    <PackageReference Include="SkiaSharp.NativeAssets.Linux" Version="2.88.7" />
    <PackageReference Include="SkiaSharp.NativeAssets.macOS" Version="2.88.7" />
    <PackageReference Include="HarfBuzzSharp.NativeAssets.Linux" Version="*" />
    <PackageReference Include="HarfBuzzSharp.NativeAssets.macOS" Version="*" />
    <PackageReference Include="System.Reflection.Emit" Version="*" />

    An example of a full *.csproj file:

  3. To install Tesseract 5.x, run the following command (shown for Debian/Ubuntu-based Linux, which uses apt):

    sudo apt install tesseract-ocr

    Tesseract uses Leptonica, a pedagogically-oriented open source library containing software that is broadly useful for image processing and image analysis applications. Install it with:

    sudo apt install libleptonica-dev

    If you wish to install the Developer Tools, which can be used for training, run the following command:

    sudo apt install libtesseract-dev

    These instructions are for Linux, but they can also be applied to other UNIX-like operating systems.
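    The apt commands above apply to Debian/Ubuntu-style systems. As a minimal sketch, the helper below prints a suggested install command for the current platform without running it. The macOS branch assumes Homebrew is installed, which this article does not state, and the function name tesseract_install_cmd is made up for illustration.

```shell
#!/bin/sh
# Print a suggested Tesseract install command for a given kernel name
# (the output of `uname -s`). Nothing is installed by this script.
# Assumption: macOS users have Homebrew; the article does not mention it.
tesseract_install_cmd() {
    case "$1" in
        Linux)  echo "sudo apt install tesseract-ocr libleptonica-dev" ;;
        Darwin) echo "brew install tesseract" ;;
        *)      echo "unsupported platform: $1" >&2; return 1 ;;
    esac
}

# Show the suggestion for the machine this script runs on.
tesseract_install_cmd "$(uname -s)" || true
```

    Printing the command instead of executing it keeps the sketch safe to run anywhere; copy the printed line into your terminal once it looks right for your system.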
  • Now add the code that converts a scanned PDF into an editable Word document:
  • Make sure that you specify the correct path to the “tessdata” folder. “OcrLanguage” must match the language of your original PDF document: Eng, Ger, Fra, etc. Once your project is launched, all dependencies will be downloaded and additional folders will be added to the debug folder.
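    A wrong “tessdata” path is easy to miss until the conversion runs, so it can help to check the folder first. The sketch below tests whether a folder contains a Tesseract language data file, which is named <code>.traineddata (for example eng.traineddata). The ./tessdata path and the function name has_traineddata are placeholders, and how an “OcrLanguage” value maps to a file name is not stated in this article, so the language code is passed in explicitly.

```shell
#!/bin/sh
# Succeed if DIR contains Tesseract language data for LANG_CODE,
# i.e. a regular file named <LANG_CODE>.traineddata.
has_traineddata() {
    dir=$1
    lang_code=$2
    [ -f "$dir/$lang_code.traineddata" ]
}

# Example check. "./tessdata" is a hypothetical location: use the
# same path that your project passes to the library.
if has_traineddata "./tessdata" "eng"; then
    echo "eng language data found"
else
    echo "eng language data missing from ./tessdata"
fi
```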

    Important! Because Tesseract is configured for Windows by default, you need to manually add two files to the “x64” folder:

    libleptonica-1.82.0.so
    libtesseract50.so

    You can download them from the Internet or take them from our full code sample for OCR on Linux/macOS.
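    Copying the two files can be scripted. The following is a rough sketch, not part of the product: copy_native_libs is a made-up helper, and both directory arguments are placeholders that you supply. The file names are written in lowercase because Linux file names are case-sensitive; adjust them if your copies are named differently.

```shell
#!/bin/sh
# Copy the two native libraries named in this article into <out_dir>/x64.
# src_dir: folder where you saved the .so files (your choice).
# out_dir: the build output folder that should contain "x64".
copy_native_libs() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir/x64" || return 1
    for lib in libleptonica-1.82.0.so libtesseract50.so; do
        if [ ! -f "$src_dir/$lib" ]; then
            echo "missing: $src_dir/$lib" >&2
            return 1
        fi
        cp "$src_dir/$lib" "$out_dir/x64/" || return 1
    done
}

# Usage (hypothetical paths, uncomment and adapt):
#   copy_native_libs ./native-libs ./bin/Debug/net8.0
```

    The function stops at the first missing file, so a half-populated “x64” folder is reported immediately instead of surfacing later as a load error.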

  • After completing the steps above, run the final commands in the terminal: "dotnet restore" and then "dotnet run".
  • The result of the conversion is a Word file that you can edit and save.
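    If "dotnet restore" or "dotnet run" fails straight away, a missing tool is a common cause. As a small sanity check, assuming both the dotnet and tesseract commands are expected on PATH (the article only requires the Tesseract libraries, so the second check is optional), the sketch below lists whichever tools cannot be found. The helper name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH,
# one per line. Prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools dotnet tesseract)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "dotnet and tesseract found; next: dotnet restore && dotnet run"
fi
```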

    You can download the full C# code sample directly from GitHub: The link


    If you need a new code example or have a question, email us at support@sautinsoft.com.


