Working with Optical Character Recognition (OCR)

3 Jul 2024, 7 minutes to read

Optical character recognition (OCR) is a technology used to convert scanned paper documents in the form of PDF files or images into searchable and editable data.

The Syncfusion OCR processor library performs OCR on scanned PDF documents and images with the help of Google’s Tesseract Optical Character Recognition engine.

The Syncfusion OCR processor library works on various platforms: Azure App Services, Azure Functions, AWS Textract, Docker, WinForms, WPF, Blazor, ASP.NET MVC, and ASP.NET Core on Windows, macOS, and Linux.

NOTE

Starting with v20.1.0.x, if you reference the Syncfusion OCR processor assemblies from the trial setup or the NuGet feed, you must also include a license key in your project. Refer to the Syncfusion licensing documentation to learn how to register the license key in your application.
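
As a minimal sketch of the registration step mentioned in the note, the following call registers a key at application startup. It assumes the Syncfusion.Licensing package is referenced by the project; the key string is a placeholder, not a real key.

```csharp
using Syncfusion.Licensing;

// Register the key once at application startup, before creating any
// Syncfusion object. "YOUR LICENSE KEY" is a placeholder (assumption).
SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");
```

Placing the call at the very start of the program (for example, the first line of Program.cs) keeps it ahead of any OCR processor creation.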

Key features

  • Create a searchable PDF from scanned PDF.
  • Zonal text extraction from the scanned PDF.
  • Preserve Unicode characters.
  • Extract text from the image.
  • Create a searchable PDF from large scanned PDF documents.
  • Create a searchable PDF from rotated scanned PDF.
  • Get OCRed text and its bounds from a scanned PDF document.
  • Native call.
  • Customizing the temp folder.
  • Performing OCR with different Page Segmentation Mode.
  • Performing OCR with different OCR Engine Mode.
  • White List.
  • Black List.
  • Image into searchable PDF or PDF/A.
  • Improved accessibility.
  • Post-processing.
  • Compatible with .NET Framework 4.5 and above.
  • Compatible with .NET Core 2.0 and above.
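
Several of the features above are configured through the Settings object of the OCR processor. The following is a hedged sketch, not a definitive reference: the property and enum names PageSegment, PageSegMode.AutoOsd, OCREngineMode.LSTMOnly, WhiteList, and TempFolder are assumptions based on the feature names in this list and should be checked against the API reference for your version. The file names and the whitelist characters are placeholders.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

using (OCRProcessor processor = new OCRProcessor())
using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    // Load the scanned PDF (placeholder file name).
    PdfLoadedDocument document = new PdfLoadedDocument(input);

    // Language setting, as used elsewhere on this page.
    processor.Settings.Language = Languages.English;

    // ASSUMED names below; verify each against the API reference.
    processor.Settings.PageSegment = PageSegMode.AutoOsd;      // page segmentation mode
    processor.Settings.OCREngineMode = OCREngineMode.LSTMOnly; // OCR engine mode
    processor.Settings.WhiteList = "0123456789";               // recognize only these characters
    processor.Settings.TempFolder = Path.GetTempPath();        // custom temp folder

    // Run OCR with the configured settings and save the result.
    processor.PerformOCR(document);
    using (FileStream output = new FileStream("Output.pdf", FileMode.Create, FileAccess.ReadWrite))
    {
        document.Save(output);
    }
    document.Close(true);
}
```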

Install .NET OCR library

You can include the OCR library in your project using either of the following two approaches.

  • NuGet Package Required (Recommended)
  • Assemblies Required

NOTE

Starting with v21.1.x, the package structure has changed for the Syncfusion OCR processor library referenced from the NuGet feed. The TesseractBinaries and Tesseract language data paths are added automatically, so you do not need to add them manually.

Directly install the NuGet package to your application from nuget.org.

Platform(s): NuGet Package
  • (.NET Core, .NET 5, .NET 6 and .NET 7) Windows, Linux and Mac: Syncfusion.PDF.OCR.NET.nupkg
  • Windows Forms: Syncfusion.Pdf.OCR.WinForms.nupkg
  • WPF: Syncfusion.Pdf.OCR.Wpf.nupkg
  • ASP.NET: Syncfusion.Pdf.OCR.AspNet.nupkg
  • ASP.NET MVC4: Syncfusion.Pdf.OCR.AspNet.Mvc4.nupkg
  • ASP.NET MVC5: Syncfusion.Pdf.OCR.AspNet.Mvc5.nupkg
  • ASP.NET Core (.NET 5, .NET 6 and .NET 7) Windows, Linux and Mac: Syncfusion.PDF.OCR.Net.Core.nupkg

Assemblies Required

Get the following required assemblies by downloading the OCR library installer. Download and install the OCR library for Windows, Linux, and Mac respectively. Please refer to the advanced installation steps for more details.

Syncfusion assemblies

Platform(s) Assemblies
Windows Forms, WPF, ASP.NET, and ASP.NET MVC
  • Syncfusion.OCRProcessor.Base.dll
  • Syncfusion.Pdf.Base.dll
  • Syncfusion.Compression.Base.dll
.NET Standard 2.0
  • Syncfusion.OCRProcessor.Portable.dll
  • Syncfusion.PdfImaging.Portable.dll
  • Syncfusion.Pdf.Portable.dll
  • Syncfusion.Compression.Portable.dll
  • SkiaSharp package
.NET 5/.NET 6
  • Syncfusion.OCRProcessor.NET.dll
  • Syncfusion.PdfImaging.NET.dll
  • Syncfusion.Pdf.NET.dll
  • Syncfusion.Compression.NET.dll
  • SkiaSharp package

Prerequisites

The Syncfusion OCR processor internally uses Tesseract libraries to perform OCR, so please copy the necessary Tessdata and TesseractBinaries folders from the NuGet package folder to the project folder to use the OCR feature.
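
As a hedged sketch of the copy step above, the following helper checks at startup that both folders are present under a given project folder. The class and method names (OcrPrerequisites, FindMissingFolders) are hypothetical, the folder names are taken from the paragraph above, and only standard .NET file APIs are used; the helper does not call the OCR library.

```csharp
using System.IO;
using System.Linq;

public static class OcrPrerequisites
{
    // Folder names as written in this section; adjust them if your copy differs.
    public static readonly string[] RequiredFolders = { "Tessdata", "TesseractBinaries" };

    // Returns the required folders that do not exist under projectFolder.
    public static string[] FindMissingFolders(string projectFolder)
    {
        return RequiredFolders
            .Where(name => !Directory.Exists(Path.Combine(projectFolder, name)))
            .ToArray();
    }
}
```

For example, calling OcrPrerequisites.FindMissingFolders(AppContext.BaseDirectory) before creating the OCR processor lets the application report a missing folder by name. Folder name matching is case-sensitive on most Linux file systems, so the names in RequiredFolders must match the copied folders exactly there.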

Prerequisites for Windows

Please refer to the following code sample for Windows.

  • C#
  • //Initialize the OCR processor.
    OCRProcessor processor = new OCRProcessor();
    //Perform OCR on the loaded PDF document (lDoc is a PdfLoadedDocument).
    processor.PerformOCR(lDoc);

    Download the language packages from the following link.
    https://github.com/tesseract-ocr/tessdata

    NOTE

    Starting with v16.1.0.24, OCR is no longer part of Essential Studio. It is available as a separate package (OCR Processor) under the Add-On section at the following link.
    https://www.syncfusion.com/downloads/latest-version

    Prerequisites for Linux

    Install the “libgdiplus” and “libc6-dev” packages. Please refer to the following commands to install the packages.

  • Shell
  • sudo apt-get update
    sudo apt-get install libgdiplus
    sudo apt-get install libc6-dev

    Please refer to the following code sample for Linux.

  • C#
  • //Initialize the OCR processor.
    OCRProcessor processor = new OCRProcessor();
    //Perform OCR on the loaded PDF document (lDoc is a PdfLoadedDocument).
    processor.PerformOCR(lDoc);

    Download the language packages from the following link.
    https://github.com/tesseract-ocr/tessdata

    Prerequisites for Mac

    Install the “libgdiplus” and “tesseract” packages on the Mac machine where the OCR operations occur. Please refer to the following commands to install these packages.

  • Shell
  • brew install mono-libgdiplus
    brew install tesseract

    Please refer to the following code sample for Mac.

  • C#
  • //Initialize the OCR processor.
    OCRProcessor processor = new OCRProcessor();
    //Perform OCR on the loaded PDF document (lDoc is a PdfLoadedDocument).
    processor.PerformOCR(lDoc);

    Download the language packages from the following link.
    https://github.com/tesseract-ocr/tessdata
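
The per-platform prerequisites above can be summarized in a small lookup, sketched here under stated assumptions: the class and method names (OcrNativePrerequisites, ForPlatform, ForCurrentPlatform) are hypothetical, the package names are the ones used in the install commands on this page, and the helper only reports which packages are expected; it does not install or detect them.

```csharp
using System;
using System.Runtime.InteropServices;

public static class OcrNativePrerequisites
{
    // Maps an operating system to the native packages this page lists for it.
    public static string[] ForPlatform(OSPlatform platform)
    {
        if (platform == OSPlatform.Linux)
            return new[] { "libgdiplus", "libc6-dev" };
        if (platform == OSPlatform.OSX)
            return new[] { "mono-libgdiplus", "tesseract" };
        // No native packages are listed for Windows on this page.
        return Array.Empty<string>();
    }

    // Resolves the list for the operating system the process is running on.
    public static string[] ForCurrentPlatform()
    {
        foreach (OSPlatform platform in new[] { OSPlatform.Linux, OSPlatform.OSX, OSPlatform.Windows })
        {
            if (RuntimeInformation.IsOSPlatform(platform))
                return ForPlatform(platform);
        }
        return Array.Empty<string>();
    }
}
```

A startup log line built from OcrNativePrerequisites.ForCurrentPlatform() can remind operators which packages to install before the first OCR call.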

    Get Started with OCR

    To quickly get started with extracting text from scanned PDF documents in .NET using the Syncfusion OCR processor Library, refer to this video tutorial:

    Perform OCR using C#

    Integrating the OCR processor library in any .NET application is simple. Please refer to the following steps to perform OCR in your .NET application.

    Steps to perform OCR on an entire PDF document in a .NET application

    Step 1: Create a new .NET console application.

    In the project configuration window, name your project and select Next.

    Step 2: Install the Syncfusion.PDF.OCR.NET NuGet package as a reference to your .NET application from nuget.org.

    Step 3: To process languages other than English, download the OCR language data from the following link.

    Tesseract language data

    Step 4: Include the following namespace in your class file.

  • C#
  • using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    Step 5: Use the following code sample to perform OCR on the entire PDF document using the PerformOCR method of the OCRProcessor class in the Program.cs file.

  • C#
  • //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    //Open the existing PDF file as a read-only stream; it is disposed with the block.
    using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
    {
        //Load the existing PDF document from the stream.
        PdfLoadedDocument pdfLoadedDocument = new PdfLoadedDocument(stream);
        //Set OCR language to process.
        processor.Settings.Language = Languages.English;
        //Process OCR by providing the PDF document.
        processor.PerformOCR(pdfLoadedDocument);
        //Create file stream.
        using (FileStream outputFileStream = new FileStream(@"Output.pdf", FileMode.Create, FileAccess.ReadWrite))
        {
            //Save the PDF document to file stream.
            pdfLoadedDocument.Save(outputFileStream);
        }
        //Close the document.
        pdfLoadedDocument.Close(true);
    }

    Executing the program produces the output PDF document, Output.pdf.

    A complete working sample can be downloaded from GitHub.

    Perform OCR in Linux

    The Syncfusion .NET OCR library supports performing OCR in Linux. Refer to this section for more information about performing OCR on an entire PDF document in Linux.

    Perform OCR in Docker

    The Syncfusion .NET OCR library supports performing OCR in Docker. Refer to this section for more information about performing OCR on an entire PDF document in Docker.

    Perform OCR in Mac

    The Syncfusion .NET OCR library supports performing OCR on Mac. Refer to this section for more information about performing OCR on an entire PDF document on Mac.

    Perform OCR in ASP.NET Core

    The Syncfusion .NET OCR library supports performing OCR in ASP.NET Core. Refer to this section for more information about performing OCR on an entire PDF document in ASP.NET Core.

    Perform OCR in ASP.NET MVC

    The Syncfusion .NET OCR library supports performing OCR in ASP.NET MVC. Refer to this section for more information about performing OCR on an entire PDF document in ASP.NET MVC.

    Perform OCR in Blazor

    The Syncfusion .NET OCR library supports performing OCR in Blazor. Refer to this section for more information about performing OCR on an entire PDF document in Blazor.

    Perform OCR in Azure

    The Syncfusion .NET OCR library supports performing OCR in Azure. Refer to this section for more information about performing OCR on an entire PDF document in Azure.

    Perform OCR in Azure Vision

    The Syncfusion .NET OCR library supports performing OCR with Azure Vision (external engine). Refer to this section for more information about performing OCR on an entire PDF document.

    Perform OCR in AWS Textract

    The Syncfusion .NET OCR library supports performing OCR with AWS Textract. Refer to this section for more information about performing OCR on an entire PDF document in AWS.

    Features

    Refer to this section for more information about features in PDF OCR. Get the details, code examples and demo from this section.

    Troubleshooting

    Refer to this section for troubleshooting PDF OCR failures.