How to perform OCR for a PDF document using C# and VB.NET
26 Apr 2024, 4 minutes to read
Essential PDF provides support for Optical Character Recognition with the help of Google’s Tesseract OCR engine. With a few lines of code, a scanned PDF document containing a raster image is converted into a searchable and selectable PDF document.
NOTE
Starting with v20.1.0.x, if you reference the Syncfusion OCR processor assemblies from the trial setup or from the NuGet feed, you must also include a license key in your projects. Refer to the Syncfusion licensing documentation to learn how to register a Syncfusion license key in your application to use our components.
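For orientation, the license registration mentioned in the note is usually a single call made at application startup, before any Syncfusion component is created. The following is a minimal sketch for a Windows Forms project, assuming the Syncfusion.Licensing assembly is referenced; the key string is a placeholder, and Form1 is the default form of the project.

```csharp
using System;
using System.Windows.Forms;
using Syncfusion.Licensing;

static class Program
{
    [STAThread]
    static void Main()
    {
        //Placeholder: replace with the license key generated for your Syncfusion account
        SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

        Application.EnableVisualStyles();
        Application.SetCompatibleTextRenderingDefault(false);
        Application.Run(new Form1());
    }
}
```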
To use the Syncfusion OCR processor library in your application, add references to the following assemblies.
Syncfusion assemblies
- Syncfusion.Compression.Base.dll
- Syncfusion.Pdf.Base.dll
- Syncfusion.OcrProcessor.Base.dll
Tesseract assemblies
- SyncfusionTesseract.dll (Tesseract Engine Version 4.0)
- liblept168.dll (Leptonica image processing library used by the Tesseract engine)
Steps to perform OCR on an entire PDF document programmatically
1. Create a new C# Windows Forms application project.
2. Install the Syncfusion.Pdf.OCR.WinForms NuGet package as a reference in your .NET Framework application from NuGet.org.
3. Include the following namespaces in the Form1.cs file.
C#
using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;
VB.NET
Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.OCRProcessor
4. The Tesseract assemblies are not added as references. They must be kept on the local machine, and the location of the assemblies is passed as a parameter to the OCR processor.
C#
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
VB.NET
Dim processor As New OCRProcessor("TesseractBinaries/")
5. Place the Tesseract language data (e.g., eng.traineddata) on the local system and pass the path of its folder to the OCR processor.
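C#
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
//lDoc is the loaded PDF document (PdfLoadedDocument) to process
processor.PerformOCR(lDoc, @"TessData/");
VB.NET
Dim processor As New OCRProcessor("TesseractBinaries/")
'lDoc is the loaded PDF document (PdfLoadedDocument) to process
processor.PerformOCR(lDoc, "TessData/")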
6. Use the following code snippet to perform OCR on an entire PDF document.
C#
//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor("TesseractBinaries/4.0/x86/"))
{
    //Load the PDF document
    PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
    //Set the OCR language to process
    processor.Settings.Language = Languages.English;
    //Set the Tesseract version
    processor.Settings.TesseractVersion = TesseractVersion.Version4_0;
    //Perform OCR by providing the PDF document and the Tesseract language data folder
    processor.PerformOCR(loadedDocument, "Tessdata/");
    //Save the OCR-processed PDF document to disk
    loadedDocument.Save("Sample.pdf");
    //Close the document and release its resources
    loadedDocument.Close(true);
}
VB.NET
'Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
Using processor As OCRProcessor = New OCRProcessor("TesseractBinaries/4.0/x86/")
    'Load the PDF document
    Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument("Input.pdf")
    'Set the OCR language to process
    processor.Settings.Language = Languages.English
    'Set the Tesseract version
    processor.Settings.TesseractVersion = TesseractVersion.Version4_0
    'Perform OCR by providing the PDF document and the Tesseract language data folder
    processor.PerformOCR(loadedDocument, "Tessdata/")
    'Save the OCR-processed PDF document to disk
    loadedDocument.Save("Sample.pdf")
    'Close the document and release its resources
    loadedDocument.Close(True)
End Using
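Because the Tesseract binaries and the language data are loaded from folders at run time rather than referenced at build time, a missing file only surfaces when OCR is performed. The following is a minimal sketch of a pre-check that could run before the OCR processor is created. OcrPreflight and FindMissingFiles are hypothetical helper names, not part of the Syncfusion API; the file names are the ones mentioned in this article, and only System.IO is used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OcrPreflight
{
    //Files this article says must be present in the Tesseract binaries folder
    private static readonly string[] RequiredBinaries =
    {
        "SyncfusionTesseract.dll",
        "liblept168.dll"
    };

    //Returns the paths of the required files that are missing.
    //An empty list means both folders contain the expected files.
    public static List<string> FindMissingFiles(
        string binariesDir, string tessDataDir, string languageFile = "eng.traineddata")
    {
        var missing = new List<string>();

        //Check each native binary in the binaries folder
        foreach (string name in RequiredBinaries)
        {
            string path = Path.Combine(binariesDir, name);
            if (!File.Exists(path))
            {
                missing.Add(path);
            }
        }

        //Check the language data file in the language data folder
        string dataPath = Path.Combine(tessDataDir, languageFile);
        if (!File.Exists(dataPath))
        {
            missing.Add(dataPath);
        }

        return missing;
    }
}
```

For example, calling OcrPreflight.FindMissingFiles("TesseractBinaries/4.0/x86/", "Tessdata/") before step 6 and reporting any returned paths gives a clearer message than a failure raised during OCR itself.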
You can download a complete working sample from GitHub.
Executing the program produces a PDF document that contains searchable and selectable text.