Perform OCR in Linux

3 Jul 2024

The Syncfusion .NET OCR library extracts text from scanned PDFs and images in Linux applications using Google’s Tesseract Optical Character Recognition engine.

Prerequisites

Install the following Linux dependencies on the machine where OCR is performed.

sudo apt-get update
sudo apt-get install libgdiplus
sudo apt-get install libc6-dev
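To confirm that the prerequisites are in place before continuing, a check along the following lines may help. This is a minimal sketch, assuming a Debian or Ubuntu system where dpkg is available; the missing_packages helper is a hypothetical name introduced here for illustration and is not part of any Syncfusion tooling.

```shell
#!/bin/sh
# Print the name of every given package that dpkg does not report as installed.
# Assumption: a Debian/Ubuntu system. If dpkg itself is absent, every package
# is reported as missing.
missing_packages() {
    for pkg in "$@"; do
        if ! dpkg -s "$pkg" >/dev/null 2>&1; then
            printf '%s\n' "$pkg"
        fi
    done
}

missing=$(missing_packages libgdiplus libc6-dev)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing packages:" $missing
else
    echo "All OCR prerequisites are installed."
fi
```

If any package is listed as missing, rerun the corresponding apt-get install command above.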

Steps to perform OCR in a .NET Core application on Linux

Step 1: Execute the following command in the Linux terminal to create a new .NET Core Console application.

dotnet new console


Step 2: Install the Syncfusion.PDF.OCR.Net NuGet package as a reference to your .NET Core application from NuGet.org.

dotnet add package Syncfusion.PDF.OCR.Net -v xx.x.x.xx -s https://www.nuget.org/


Step 3: Include the following namespaces in the Program.cs file.

using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

NOTE

From version 21.1.x onward, the default configuration includes the TesseractBinaries and Tesseract language data folder paths, so you no longer need to provide these paths explicitly.

Step 4: Add the following code sample to perform OCR on the entire PDF document using the PerformOCR method of the OCRProcessor class.

string docPath = "input.pdf";

//Initialize the OCR processor
using (OCRProcessor processor = new OCRProcessor())
//Open the input PDF as a read-only stream
using (FileStream stream = new FileStream(docPath, FileMode.Open, FileAccess.Read))
{
    //Load the PDF document
    PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);

    //Set the language used for OCR
    processor.Settings.Language = Languages.English;
    //Perform OCR on the loaded PDF document
    processor.PerformOCR(lDoc);

    //Save the OCR-processed PDF document to disk
    using (MemoryStream streamData = new MemoryStream())
    {
        lDoc.Save(streamData);
        File.WriteAllBytes("Output.pdf", streamData.ToArray());
    }
    //Close the document and release its resources
    lDoc.Close(true);
}

Step 5: Execute the following command to restore the NuGet packages.

dotnet restore


Step 6: Execute the following command in the terminal to build the application.

dotnet build


Step 7: Execute the following command in the terminal to run the application.

dotnet run


Running the program produces the OCR-processed PDF document. The output, Output.pdf, is saved in the same folder as the Program.cs file.
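Steps 5 to 7 and a check on the result can also be chained in one script. The following is a minimal sketch, assuming it is run from the project folder and that the program writes Output.pdf there; the output_ok helper is a hypothetical name introduced here for illustration.

```shell
#!/bin/sh
# Succeeds only if the given file exists and is not empty.
output_ok() {
    [ -s "$1" ]
}

if command -v dotnet >/dev/null 2>&1; then
    # && stops the chain at the first command that fails.
    if dotnet restore && dotnet build && dotnet run && output_ok Output.pdf; then
        echo "Output.pdf was written."
    else
        echo "A step failed, or Output.pdf is missing or empty." >&2
    fi
else
    echo "The dotnet SDK was not found on PATH." >&2
fi
```

The check only confirms that a non-empty file was written; it does not verify that the OCR text layer is present in the PDF.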

A complete working sample can be downloaded from GitHub.