Extract right-to-left data from documents in C# and .NET

Processing right-to-left (RTL) documents is a common need for international businesses, government agencies, educational institutions, and many other organizations. Virtual assistants, automated translation systems, search engines, and archival storage systems all depend on accurate data extraction from RTL-language content. This article shows how to handle this task in C# and .NET with the SautinSoft Document.NET SDK.

Many documents, especially those in Arabic, Hebrew, and other right-to-left (RTL) languages, require special handling during processing. Their conventions and structure may differ from those typical of left-to-right (LTR) text.
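As a first step, it often helps to know whether a piece of text contains RTL characters at all. The sketch below is a rough approximation based on Unicode block ranges for Hebrew and the Arabic family of scripts. It is not part of Document.NET; the names `RtlText`, `IsRtlChar`, and `ContainsRtl` are hypothetical helpers introduced for illustration, and a production system would use the character directionality classes of the Unicode Bidirectional Algorithm instead.

```csharp
using System;
using System.Linq;

static class RtlText
{
    // Rough check: true for UTF-16 code units in blocks used by RTL scripts.
    public static bool IsRtlChar(char c) =>
        (c >= '\u0590' && c <= '\u08FF')      // Hebrew, Arabic, Syriac, Thaana, NKo and extensions
        || (c >= '\uFB1D' && c <= '\uFDFF')   // Hebrew and Arabic presentation forms A
        || (c >= '\uFE70' && c <= '\uFEFC');  // Arabic presentation forms B

    // True if any character of the string falls in one of the ranges above.
    public static bool ContainsRtl(string text) => text.Any(IsRtlChar);
}

class Program
{
    static void Main()
    {
        // Plain Latin text and digits.
        Console.WriteLine(RtlText.ContainsRtl("Invoice 2024"));                      // False

        // Latin text followed by a Hebrew word (written with escapes to keep the source unambiguous).
        Console.WriteLine(RtlText.ContainsRtl("Invoice \u05E9\u05DC\u05D5\u05DD"));  // True
    }
}
```

A check like this can be used to route documents to an RTL-aware processing path before conversion.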

Typical tasks that involve extracting text and data from RTL documents:

  • Automating the processing of large volumes of documents;
  • Implementing a search and analysis system;
  • Migrating data between systems supporting different languages and text directions;
  • Ensuring the accurate operation of OCR and recognition systems, where text directionality is important.

Without proper processing, there is a risk of losing important data or misinterpreting the content.
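One concrete way content gets misinterpreted in search and migration scenarios: text taken from RTL documents can carry invisible Unicode direction controls (for example U+200E, U+200F, or the U+202A to U+202E embedding and override characters), and an exact-match search then fails even though the visible text looks identical. The following is a minimal sketch of stripping those characters before indexing. The `BidiCleanup` class and its methods are hypothetical names introduced here, not part of Document.NET, and whether stripping is appropriate depends on whether the text will later be displayed.

```csharp
using System;
using System.Text;

static class BidiCleanup
{
    // True for invisible direction controls: left-to-right and right-to-left marks,
    // the Arabic letter mark, embeddings/overrides, and isolates.
    public static bool IsBidiControl(char c) =>
        c == '\u200E' || c == '\u200F' || c == '\u061C'
        || (c >= '\u202A' && c <= '\u202E')
        || (c >= '\u2066' && c <= '\u2069');

    // Returns a copy of the text without direction controls.
    public static string StripBidiControls(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!IsBidiControl(c)) sb.Append(c);
        }
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        // A Hebrew word wrapped in direction marks, followed by a year.
        string extracted = "\u200F\u05E9\u05DC\u05D5\u05DD\u200E 2024";
        string query = "\u05E9\u05DC\u05D5\u05DD 2024";

        string cleaned = BidiCleanup.StripBidiControls(extracted);

        Console.WriteLine(extracted.Length);          // 11
        Console.WriteLine(cleaned.Length);            // 9
        Console.WriteLine(extracted.Contains(query)); // False
        Console.WriteLine(cleaned.Contains(query));   // True
    }
}
```

The characters stay in logical order throughout; only the invisible controls are removed, so the cleaned string is suitable for indexing and comparison.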


Complete code

using SautinSoft.Document;
using System;
using System.IO;
using System.Linq;

namespace Sample
{
    class Sample
    {
        static void Main(string[] args)
        {
            // Get your free trial key here:   
            // https://sautinsoft.com/start-for-free/

            ConvertRTLcontent();
        }

        /// <summary>
        /// How to convert documents with Right-To-Left content to HTML.
        /// </summary>
        /// <remarks>
        /// Details: https://sautinsoft.com/products/document/help/net/developer-guide/convert-documents-with-right-to-left-content-to-html.php
        /// </remarks>
        public static void ConvertRTLcontent()
        {
            string sourcePath = @"..\..\..\RTL.docx";
            string destPath = "RTL.html";
            
            // Load a document with Arabic, Hindi, and Hebrew content.
            DocumentCore dc = DocumentCore.Load(sourcePath);
           
            // Save the document as HTML.
            dc.Save(destPath, new HtmlFixedSaveOptions());

            // Show the source and the dest documents.
            System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(sourcePath) { UseShellExecute = true });
            System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(destPath) { UseShellExecute = true });
        }
    }
}


Imports SautinSoft.Document
Imports System
Imports System.IO
Imports System.Linq

Namespace Sample
	Friend Class Sample
		Shared Sub Main(ByVal args() As String)
			' Get your free trial key here:   
			' https://sautinsoft.com/start-for-free/

			ConvertRTLcontent()
		End Sub

		''' <summary>
		''' How to convert documents with Right-To-Left content to HTML.
		''' </summary>
		''' <remarks>
		''' Details: https://sautinsoft.com/products/document/help/net/developer-guide/convert-documents-with-right-to-left-content-to-html.php
		''' </remarks>
		Public Shared Sub ConvertRTLcontent()
			Dim sourcePath As String = "..\..\..\RTL.docx"
			Dim destPath As String = "RTL.html"

			' Load a document with Arabic, Hindi, and Hebrew content.
			Dim dc As DocumentCore = DocumentCore.Load(sourcePath)

			' Save the document as HTML.
			dc.Save(destPath, New HtmlFixedSaveOptions())

			' Show the source and the dest documents.
			System.Diagnostics.Process.Start(New System.Diagnostics.ProcessStartInfo(sourcePath) With {.UseShellExecute = True})
			System.Diagnostics.Process.Start(New System.Diagnostics.ProcessStartInfo(destPath) With {.UseShellExecute = True})
		End Sub
	End Class
End Namespace
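The listings above convert the document to HTML. If the goal is the raw text rather than a rendered page, a sketch along the following lines may be enough. It reuses `DocumentCore.Load` from the listing and assumes that `dc.Content.ToString()` returns the plain text of the whole document; that call and the `RTL.txt` output name are assumptions here, so check them against the Document.NET API reference before relying on them.

```csharp
using System.IO;
using System.Text;
using SautinSoft.Document;

class ExtractRtlText
{
    static void Main()
    {
        // Same sample file as in the conversion listing.
        string sourcePath = @"..\..\..\RTL.docx";

        // Load the document exactly as in the conversion listing.
        DocumentCore dc = DocumentCore.Load(sourcePath);

        // Assumption: Content.ToString() yields the document's plain text.
        string text = dc.Content.ToString();

        // Write as UTF-8 so Arabic and Hebrew characters are stored without loss.
        File.WriteAllText("RTL.txt", text, Encoding.UTF8);
    }
}
```

Saving as UTF-8 keeps the characters in logical order; how they are displayed right-to-left is then up to the viewer or the consuming system.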



If you need a new code example or have a question, email us at support@sautinsoft.com.



Questions and suggestions from you are always welcome!

We have been developing .NET components since 2002. We work with the PDF, DOCX, RTF, HTML, XLSX, and image formats. If you need assistance with creating, modifying, or converting documents in these formats, we can help. We will write any code example for you free of charge.