How to perform OCR for a PDF document using C# and VB.NET

25 Nov 2022 · 4 minutes to read

Essential PDF supports Optical Character Recognition (OCR) with the help of Google’s Tesseract OCR engine. With a few lines of code, a scanned PDF document containing a raster image can be converted into a searchable and selectable PDF document.

NOTE

Starting with v20.1.0.x, if you reference Syncfusion OCR processor assemblies from the trial setup or from the NuGet feed, you must also include a license key in your projects. Refer to the Syncfusion licensing documentation to learn how to register a Syncfusion license key in your application.

To use the Syncfusion OCR processor library in your application, add references to the following assemblies.

Syncfusion assemblies

  1. Syncfusion.Compression.Base.dll
  2. Syncfusion.Pdf.Base.dll
  3. Syncfusion.OcrProcessor.Base.dll

Tesseract assemblies

  • SyncfusionTesseract.dll (Tesseract Engine Version 4.0)
  • liblept168.dll (Leptonica image processing library used by Tesseract engine)
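
Because the Tesseract assemblies are loaded from a folder on disk rather than referenced by the project, it can help to confirm they are present before starting OCR. The following is a minimal sketch using only System.IO; `TesseractBinaryCheck` and `FindMissing` are hypothetical names introduced for illustration and are not part of the Syncfusion API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the Syncfusion API.
public static class TesseractBinaryCheck
{
    // Returns the names in requiredFiles that do not exist in binariesFolder.
    // An empty list means every listed file was found.
    public static List<string> FindMissing(string binariesFolder, params string[] requiredFiles)
    {
        var missing = new List<string>();
        foreach (string fileName in requiredFiles)
        {
            if (!File.Exists(Path.Combine(binariesFolder, fileName)))
            {
                missing.Add(fileName);
            }
        }
        return missing;
    }
}
```

For example, `TesseractBinaryCheck.FindMissing("TesseractBinaries/", "liblept168.dll")` returns an empty list when that file is in the folder. Pass every file name from the Tesseract assemblies list to check them all.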

Steps to perform OCR on an entire PDF document programmatically

1. Create a new C# Windows Forms application project.

2. Install the Syncfusion.Pdf.OCR.WinForms NuGet package as a reference in your .NET Framework application from NuGet.org.

3. Include the following namespaces in the Form1.cs file.

C#

using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;

VB.NET

Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.OCRProcessor

4. The Tesseract assemblies are not added as references. They must be kept on the local machine, and the location of the assemblies is passed as a parameter to the OCR processor.

C#

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");

VB.NET

Dim processor As New OCRProcessor("TesseractBinaries/")

5. Place the Tesseract language data (e.g., eng.traineddata) on the local system and provide its path to the OCR processor.

C#

// lDoc is the PdfLoadedDocument to process
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
processor.PerformOCR(lDoc, @"TessData/");

VB.NET

' lDoc is the PdfLoadedDocument to process
Dim processor As New OCRProcessor("TesseractBinaries/")
processor.PerformOCR(lDoc, "TessData/")

6. Use the following code snippet to perform OCR on an entire PDF document.

C#

// Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor("TesseractBinaries/4.0/x86/"))
{
    // Load the PDF document
    PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");

    // Set the OCR language to process
    processor.Settings.Language = Languages.English;

    // Set the Tesseract version
    processor.Settings.TesseractVersion = TesseractVersion.Version4_0;

    // Perform OCR by providing the PDF document and the Tesseract language data path
    processor.PerformOCR(loadedDocument, "Tessdata/");

    // Save the OCR-processed PDF document to disk
    loadedDocument.Save("Sample.pdf");
    loadedDocument.Close(true);
}

VB.NET

' Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
Using processor As OCRProcessor = New OCRProcessor("TesseractBinaries/4.0/x86/")

    ' Load the PDF document
    Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")

    ' Set the OCR language to process
    processor.Settings.Language = Languages.English

    ' Set the Tesseract version
    processor.Settings.TesseractVersion = TesseractVersion.Version4_0

    ' Perform OCR by providing the PDF document and the Tesseract language data path
    processor.PerformOCR(loadedDocument, "Tessdata/")

    ' Save the OCR-processed PDF document to disk
    loadedDocument.Save("Sample.pdf")
    loadedDocument.Close(True)

End Using
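
The sample above hardcodes the x86 binaries folder and assumes the language data is already in place. The following sketch shows one way to resolve both paths before constructing the processor. It assumes an x64 folder sits beside the x86 one (this article does not state that) and that language files follow the `<code>.traineddata` naming seen in step 5. `OcrPaths`, `BinariesFolderFor`, and `HasLanguageData` are hypothetical names, not part of the Syncfusion API.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Syncfusion API.
public static class OcrPaths
{
    // Builds "<versionRoot>/x64/" or "<versionRoot>/x86/" with a trailing slash,
    // matching the path style used in the sample above.
    // Assumption: an "x64" folder exists beside "x86".
    public static string BinariesFolderFor(string versionRoot, bool is64BitProcess)
    {
        string root = versionRoot.TrimEnd('/', '\\');
        return root + "/" + (is64BitProcess ? "x64" : "x86") + "/";
    }

    // True when "<languageCode>.traineddata" exists in the language data folder.
    public static bool HasLanguageData(string tessDataFolder, string languageCode)
    {
        return File.Exists(Path.Combine(tessDataFolder, languageCode + ".traineddata"));
    }
}
```

A caller could pass `Environment.Is64BitProcess` as the second argument to `BinariesFolderFor`, and check `OcrPaths.HasLanguageData("Tessdata/", "eng")` before calling `PerformOCR`, so that a missing file is reported up front rather than during processing.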

You can download a complete working sample from GitHub.

Executing the program produces a PDF document that contains selectable text.