Perform OCR in ASP.NET MVC

3 Jul 2024 · 3 minutes to read

The Syncfusion .NET OCR library extracts text from scanned PDFs and images in an ASP.NET MVC application with the help of Google’s Tesseract Optical Character Recognition engine.

Steps to perform OCR on an entire PDF document in ASP.NET MVC

Step 1: Create a new C# ASP.NET Web Application (.NET Framework) project.
ASP.NET MVC application creation

Step 2: In the project configuration window, name your project and click Create.
ASP.NET MVC project configuration

Step 3: Install the Syncfusion.Pdf.OCR.AspNet.Mvc5 NuGet package from NuGet.org as a reference to your application.
OCR ASP.NET MVC NuGet package installation

NOTE

Starting with version 21.1.x, the TesseractBinaries and Tesseract language data folder paths are included in the default configuration, so you no longer need to provide these paths explicitly.

Starting with v16.2.0.x, if you reference Syncfusion assemblies from a trial setup or from the NuGet feed, you must also add the “Syncfusion.Licensing” assembly reference and include a license key in your project. Refer to the Syncfusion licensing documentation to learn how to register the license key in your application.
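As a rough illustration of the license registration mentioned in the note, the key is typically registered once at application startup, before any Syncfusion component is used. The sketch below assumes the default Global.asax.cs generated by the ASP.NET MVC template; the class name MvcApplication comes from that template, and "YOUR LICENSE KEY" is a placeholder, not a real key.

```csharp
using System.Web;

// Global.asax.cs (class name assumed from the default MVC project template).
public class MvcApplication : HttpApplication
{
    protected void Application_Start()
    {
        // Register the Syncfusion license key before any Syncfusion component is created.
        // "YOUR LICENSE KEY" is a placeholder; substitute the key issued for your account.
        Syncfusion.Licensing.SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

        // The registrations generated by the project template
        // (areas, filters, routes, bundles) remain here unchanged.
    }
}
```

Registering in Application_Start means the key is set once per application domain rather than on every request.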

Step 4: Include the following namespaces in the HomeController.cs file.

using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

Step 5: Add a new button in the Index.cshtml as follows.

@using (Html.BeginForm("PerformOCR", "Home", FormMethod.Post))
{
   <div>
      <input type="submit" value="Perform OCR" style="width:150px;height:27px" />
   </div>
}

Step 6: Add a new action method named PerformOCR in the HomeController.cs file and use the following code sample to perform OCR on the entire PDF document using the PerformOCR method of the OCRProcessor class.

public ActionResult PerformOCR()
{
   //Initialize the OCR processor.
   using (OCRProcessor processor = new OCRProcessor())
   //Open the input PDF file. A relative path resolves against the process working directory, so an absolute path may be more reliable in a web application.
   using (FileStream fileStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
   {
      //Load a PDF document.
      PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);
      //Set OCR language to process.
      processor.Settings.Language = Languages.English;
      processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
      //Process OCR by providing the PDF document.
      processor.PerformOCR(lDoc);
      //Save the document and send it to the browser as a download.
      lDoc.Save("Output.pdf", HttpContext.ApplicationInstance.Response, Syncfusion.Pdf.HttpReadType.Save);
      //Close the document.
      lDoc.Close(true);
   }
   return View();
}
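The action above writes the PDF directly to the HTTP response. A possible alternative, shown here only as a sketch, is to save the processed document to a MemoryStream and return it as a file result, which keeps the action in the usual MVC return-value style. The method name PerformOCRAsFile and the location ~/App_Data/Input.pdf are assumptions made for illustration; they do not come from the original sample.

```csharp
using System.IO;
using System.Web.Mvc;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

public class HomeController : Controller
{
    // Hypothetical variant of the PerformOCR action that returns the PDF as a file result.
    public ActionResult PerformOCRAsFile()
    {
        byte[] pdfBytes;

        // Assumed input location; adjust to wherever Input.pdf is stored in your project.
        string inputPath = Server.MapPath("~/App_Data/Input.pdf");

        using (OCRProcessor processor = new OCRProcessor())
        using (FileStream input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        using (MemoryStream output = new MemoryStream())
        {
            // Load the scanned PDF and run OCR with the same settings as the sample above.
            PdfLoadedDocument lDoc = new PdfLoadedDocument(input);
            processor.Settings.Language = Languages.English;
            processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
            processor.PerformOCR(lDoc);

            // Save the processed document into memory, then release the document.
            lDoc.Save(output);
            lDoc.Close(true);

            // Copy the bytes out before the stream is disposed.
            pdfBytes = output.ToArray();
        }

        // Send the bytes to the browser as a download named Output.pdf.
        return File(pdfBytes, "application/pdf", "Output.pdf");
    }
}
```

Copying the result into a byte array avoids depending on the stream's position or lifetime after the document is closed, at the cost of holding the whole PDF in memory.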

By executing the program, you will get a PDF document as follows.
OCR ASP.NET MVC output PDF document
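The note in Step 3 says that versions before 21.1.x required the Tesseract binaries and language data paths to be provided explicitly. The sketch below shows one way that might look, using the OCRProcessor constructor that takes a binaries path and the PerformOCR overload that takes a language data path. The folder names ~/TesseractBinaries/ and ~/Tessdata/ and the input location are assumptions; point them at wherever these files live in your project, and check the API reference for your installed version.

```csharp
using System.IO;
using System.Web.Mvc;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

public class HomeController : Controller
{
    public ActionResult PerformOCR()
    {
        // Assumed folder locations inside the web project; adjust as needed.
        string tesseractBinariesPath = Server.MapPath("~/TesseractBinaries/");
        string tessDataPath = Server.MapPath("~/Tessdata/");
        string inputPath = Server.MapPath("~/App_Data/Input.pdf");

        // Pass the Tesseract binaries folder to the OCR processor explicitly.
        using (OCRProcessor processor = new OCRProcessor(tesseractBinariesPath))
        using (FileStream input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument lDoc = new PdfLoadedDocument(input);
            processor.Settings.Language = Languages.English;

            // Pass the language data folder together with the document.
            processor.PerformOCR(lDoc, tessDataPath);

            // Save the document and send it to the browser as a download.
            lDoc.Save("Output.pdf", HttpContext.ApplicationInstance.Response, Syncfusion.Pdf.HttpReadType.Save);
            lDoc.Close(true);
        }
        return View();
    }
}
```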

A complete working sample can be downloaded from GitHub.

Click here to explore the rich set of Syncfusion PDF library features.