Convert PDF to separate HTML pages in C# and .NET
Document conversion is one of the most in-demand tasks in automated information processing. Converting PDF files to HTML, a format that is convenient to view and edit, is an especially common need. In this article, we'll look at how to implement this on the .NET platform using the PDF Focus .NET component from SautinSoft. We'll explain the advantages of this approach and explore its practical use cases.
The practical application of this functionality covers a wide range of tasks:
- Creating online documentation libraries or e-books, where each PDF file is divided into individual HTML pages.
- Automating the archiving and retrieval of information within documents.
- Integrating with learning systems, where training materials are presented as converted PDF pages.
- Developing web interfaces that display documents without the need to install additional programs or plugins.
How conversion works: basic concepts and sample code
The goal is to take a single PDF file and split it into individual HTML pages – page by page – while preserving the structure and formatting.
What is the benefit of this approach?
- Modularity: Each page is converted into a separate HTML file, significantly simplifying maintenance and updates.
- Automation: The conversion can be integrated into CRM, CMS, or automated document processing systems.
- High conversion quality: The library's internal algorithms preserve structure, fonts, and graphics.
- Easy integration: The library's API is well-suited for use in ASP.NET projects, Windows applications, and services.
What are some interesting aspects to consider?
- Large PDF processing: When working with very large files, it is recommended to use asynchronous or streaming methods to avoid out-of-memory errors.
- API customization: You can set additional processing parameters, such as handling scanned pages (OCR) and controlling the element structure.
- Security: When processing confidential data, protect both the file paths and the files themselves.
- Licensing: PDF Focus is a commercial product, so a license is required.
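The points about large files and path protection can be sketched as follows. This is a minimal illustration, not part of PDF Focus .NET: the `PageWriter` class and its `ResolvePagePath` and `SavePagesAsync` methods are hypothetical names, and the `renderPage` delegate stands in for whatever call produces the HTML of one page. It assumes .NET Core 2.0 or later, where `File.WriteAllTextAsync` is available.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of PDF Focus .NET.
public static class PageWriter
{
    // Builds the full path for a page file and rejects names
    // that would resolve to a location outside outputDir.
    public static string ResolvePagePath(string outputDir, string fileName)
    {
        string root = Path.GetFullPath(outputDir);
        string separator = Path.DirectorySeparatorChar.ToString();
        if (!root.EndsWith(separator, StringComparison.Ordinal))
            root += separator;

        string candidate = Path.GetFullPath(Path.Combine(root, fileName));
        if (!candidate.StartsWith(root, StringComparison.Ordinal))
            throw new ArgumentException("File name escapes the output directory.", nameof(fileName));
        return candidate;
    }

    // Renders and writes one page at a time, so only one page's
    // HTML string is held by this helper at any moment.
    public static async Task<int> SavePagesAsync(
        int pageCount, Func<int, string> renderPage, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        int written = 0;
        for (int page = 1; page <= pageCount; page++)
        {
            string html = renderPage(page);
            string path = ResolvePagePath(outputDir, $"Page{page}.html");
            await File.WriteAllTextAsync(path, html);
            written++;
        }
        return written;
    }
}
```

With the converter object `f` and the `htmlDir` directory from the complete code in this article, a call might look like `await PageWriter.SavePagesAsync(f.PageCount, p => f.ToHtml(p, p), htmlDir.FullName);`.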
Complete code (C#)
using System;
using System.IO;

namespace Sample
{
    class Sample
    {
        static void Main(string[] args)
        {
            // Before starting, we recommend getting a free key:
            // https://sautinsoft.com/start-for-free/

            // Apply the key here:
            // SautinSoft.PdfFocus.SetLicense("...");

            // Convert PDF to separate HTML files.
            // Each PDF page is converted to its own HTML document.
            string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
            DirectoryInfo htmlDir = new DirectoryInfo("htmls");
            if (!htmlDir.Exists)
                htmlDir.Create();

            SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();

            // Store images as separate files instead of embedding them in the HTML.
            f.HtmlOptions.IncludeImageInHtml = false;

            // Path (must exist) to a directory to store images after converting.
            f.HtmlOptions.ImageFolder = htmlDir.FullName;

            f.OpenPdf(pdfFile);

            if (f.PageCount > 0)
            {
                // Convert each PDF page to a separate HTML document:
                // Page1.html, Page2.html ... PageN.html.
                for (int page = 1; page <= f.PageCount; page++)
                {
                    f.HtmlOptions.Title = $"Page {page}";
                    f.HtmlOptions.ImageSubFolder = String.Format("page{0}_images", page);

                    // Convert only the current page.
                    string htmlString = f.ToHtml(page, page);

                    // Save htmlString to a file.
                    string htmlFile = Path.Combine(htmlDir.FullName, $"Page{page}.html");
                    File.WriteAllText(htmlFile, htmlString);

                    // Open only the first and last pages.
                    if (page == 1 || page == f.PageCount)
                    {
                        // Open the result for demonstration purposes.
                        System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(htmlFile) { UseShellExecute = true });
                    }
                }
            }
        }
    }
}
Complete code (VB.NET)
Imports System
Imports System.IO

Namespace Sample
    Friend Class Sample
        Shared Sub Main(ByVal args() As String)
            ' Before starting, we recommend getting a free key:
            ' https://sautinsoft.com/start-for-free/

            ' Apply the key here:
            ' SautinSoft.PdfFocus.SetLicense("...")

            ' Convert PDF to separate HTML files.
            ' Each PDF page is converted to its own HTML document.
            Dim pdfFile As String = Path.GetFullPath("..\..\..\simple text.pdf")
            Dim htmlDir As New DirectoryInfo("htmls")
            If Not htmlDir.Exists Then
                htmlDir.Create()
            End If

            Dim f As New SautinSoft.PdfFocus()

            ' Store images as separate files instead of embedding them in the HTML.
            f.HtmlOptions.IncludeImageInHtml = False

            ' Path (must exist) to a directory to store images after converting.
            f.HtmlOptions.ImageFolder = htmlDir.FullName

            f.OpenPdf(pdfFile)

            If f.PageCount > 0 Then
                ' Convert each PDF page to a separate HTML document:
                ' Page1.html, Page2.html ... PageN.html.
                For page As Integer = 1 To f.PageCount
                    f.HtmlOptions.Title = $"Page {page}"
                    f.HtmlOptions.ImageSubFolder = String.Format("page{0}_images", page)

                    ' Convert only the current page.
                    Dim htmlString As String = f.ToHtml(page, page)

                    ' Save htmlString to a file.
                    Dim htmlFile As String = Path.Combine(htmlDir.FullName, $"Page{page}.html")
                    File.WriteAllText(htmlFile, htmlString)

                    ' Open only the first and last pages.
                    If page = 1 OrElse page = f.PageCount Then
                        ' Open the result for demonstration purposes.
                        System.Diagnostics.Process.Start(New System.Diagnostics.ProcessStartInfo(htmlFile) With {.UseShellExecute = True})
                    End If
                Next page
            End If
        End Sub
    End Class
End Namespace
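For the online documentation library or e-book use case, the per-page files written by the code above (Page1.html, Page2.html, and so on) usually need an entry page that links to them. The following is a minimal sketch of one way to generate it. The `PageIndex` class and `BuildIndexHtml` method are hypothetical names for illustration and are not part of PDF Focus .NET; only standard .NET types are used.

```csharp
using System.Net;
using System.Text;

// Hypothetical helper for illustration; not part of PDF Focus .NET.
public static class PageIndex
{
    // Builds a simple HTML index linking to Page1.html ... Page{pageCount}.html,
    // matching the file names used when saving the converted pages.
    public static string BuildIndexHtml(string title, int pageCount)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<!DOCTYPE html>");
        sb.AppendLine("<html><head><meta charset=\"utf-8\">");
        // Encode the title so characters such as < or & cannot break the markup.
        sb.AppendLine($"<title>{WebUtility.HtmlEncode(title)}</title></head><body>");
        sb.AppendLine("<ul>");
        for (int page = 1; page <= pageCount; page++)
        {
            sb.AppendLine($"<li><a href=\"Page{page}.html\">Page {page}</a></li>");
        }
        sb.AppendLine("</ul>");
        sb.AppendLine("</body></html>");
        return sb.ToString();
    }
}
```

After the conversion loop, the index could be saved next to the pages, for example with `File.WriteAllText(Path.Combine(htmlDir.FullName, "index.html"), PageIndex.BuildIndexHtml("simple text", f.PageCount));`.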
If you need a new code example or have a question, email us at support@sautinsoft.com.