How to perform OCR for a PDF document using C# and VB.NET
20 Jan 2025 | 4 minutes to read
Essential® PDF provides support for Optical Character Recognition with the help of Google’s Tesseract OCR engine. With a few lines of code, a scanned PDF document containing a raster image is converted into a searchable and selectable PDF document.
NOTE
Starting with v20.1.0.x, if you reference Syncfusion® OCR processor assemblies from the trial setup or from the NuGet feed, you must also include a license key in your projects. Refer to the Syncfusion® licensing documentation to learn how to register a license key in your application.
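As a minimal sketch of the registration mentioned in the note, the key is typically registered once at application startup, before any Syncfusion® component is used. The placement in Program.cs, the Form1 class name (taken from the steps in this article), and the placeholder key string are assumptions; the snippet also assumes the Syncfusion.Licensing assembly is referenced, which the NuGet package normally brings in as a dependency.

```csharp
using System;
using System.Windows.Forms;
using Syncfusion.Licensing;

static class Program
{
    [STAThread]
    static void Main()
    {
        // Register the license key before any Syncfusion component is created.
        // "YOUR LICENSE KEY" is a placeholder, not a real key.
        SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

        Application.EnableVisualStyles();
        Application.SetCompatibleTextRenderingDefault(false);
        // Form1 is the default form of the Windows Forms project created in the steps of this article.
        Application.Run(new Form1());
    }
}
```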
To use the Syncfusion® OCR processor library in your application, add references to the following assemblies.
Syncfusion assemblies
- Syncfusion.Compression.Base.dll
- Syncfusion.Pdf.Base.dll
- Syncfusion.OcrProcessor.Base.dll
Tesseract assemblies
- Syncfusion.Tesseract.dll (Tesseract Engine Version 4.0)
- liblept168.dll (Leptonica image processing library used by Tesseract engine)
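Because the Tesseract files are loaded from a folder at run time rather than referenced by the project, a missing file usually surfaces only when OCR starts. The following is a minimal sketch of a pre-flight check that uses only the .NET base class library. The class and method names (OcrPrerequisites, FindMissingFiles) are hypothetical helpers, not part of the Syncfusion® API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class OcrPrerequisites
{
    // Returns the names from requiredFiles that are not present in folder.
    // If the folder itself does not exist, every name is reported as missing.
    public static List<string> FindMissingFiles(string folder, IEnumerable<string> requiredFiles)
    {
        return requiredFiles
            .Where(name => !File.Exists(Path.Combine(folder, name)))
            .ToList();
    }
}
```

A call such as OcrPrerequisites.FindMissingFiles("TesseractBinaries/", new[] { "Syncfusion.Tesseract.dll", "liblept168.dll" }) returns an empty list when both files from the list above are in place. The file names are taken from that list, so adjust them to match the names actually on disk.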
Steps to perform OCR on an entire PDF document programmatically
1. Create a new C# Windows Forms application project.
2. Install the Syncfusion.Pdf.OCR.WinForms NuGet package as a reference to your .NET Framework application from NuGet.org.
3. Include the following namespaces in the Form1.cs file.
using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;
Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.OCRProcessor
4. Tesseract assemblies are not added as references. They must be kept on the local machine, and the location of the assemblies is passed as a parameter to the OCR processor.
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
Dim processor As New OCRProcessor("TesseractBinaries/")
5. Place the Tesseract language data (e.g., eng.traineddata) on the local system and provide its path to the OCR processor. In the following snippet, lDoc is the PdfLoadedDocument instance to be processed.
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
processor.PerformOCR(lDoc, @"TessData/");
Dim processor As New OCRProcessor("TesseractBinaries/")
processor.PerformOCR(lDoc, "TessData/")
6. Use the following code snippet to perform OCR on an entire PDF document.
// Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor("TesseractBinaries/4.0/x86/"))
{
    // Load the PDF document
    PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
    // Set the OCR language to process
    processor.Settings.Language = Languages.English;
    // Set the Tesseract version
    processor.Settings.TesseractVersion = TesseractVersion.Version4_0;
    // Process OCR by providing the PDF document and the Tesseract data path
    processor.PerformOCR(loadedDocument, "Tessdata/");
    // Save the OCR-processed PDF document to disk
    loadedDocument.Save("Sample.pdf");
    // Close the document and release its resources
    loadedDocument.Close(true);
}
' Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
Using processor As OCRProcessor = New OCRProcessor("TesseractBinaries/4.0/x86/")
    ' Load the PDF document
    Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
    ' Set the OCR language to process
    processor.Settings.Language = Languages.English
    ' Set the Tesseract version
    processor.Settings.TesseractVersion = TesseractVersion.Version4_0
    ' Process OCR by providing the PDF document and the Tesseract data path
    processor.PerformOCR(loadedDocument, "Tessdata/")
    ' Save the OCR-processed PDF document to disk
    loadedDocument.Save("Sample.pdf")
    ' Close the document and release its resources
    loadedDocument.Close(True)
End Using
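The snippets above hard-code the x86 binaries folder and the Tessdata folder. Since liblept168.dll is a native library, a 64-bit process generally cannot load the 32-bit build, so it can help to derive both paths in one place. The following is a sketch under stated assumptions: the OcrPaths class and its methods are hypothetical helpers, and the sibling "x64" folder is an assumption (only "x86" appears in this article).

```csharp
using System;
using System.IO;

public static class OcrPaths
{
    // Picks the binaries subfolder that matches the bitness of the running process.
    // "x86" comes from the article; the "x64" folder name is an assumption.
    // A trailing slash is appended to mirror the path style used in the article.
    public static string BinariesFolder(string versionRoot, bool is64BitProcess)
    {
        return Path.Combine(versionRoot, is64BitProcess ? "x64" : "x86") + "/";
    }

    // Builds the expected path of a language data file,
    // e.g. ("Tessdata/", "eng") gives "Tessdata/eng.traineddata".
    public static string LanguageDataFile(string tessDataFolder, string languageCode)
    {
        return Path.Combine(tessDataFolder, languageCode + ".traineddata");
    }
}
```

For example, OcrPaths.BinariesFolder("TesseractBinaries/4.0/", Environment.Is64BitProcess) can be passed to the OCRProcessor constructor, and File.Exists(OcrPaths.LanguageDataFile("Tessdata/", "eng")) can be checked before calling PerformOCR.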
You can download a complete working sample from GitHub.
By executing the program, you will get a PDF document that contains selectable text, as follows.