Perform OCR in ASP.NET Core

3 Jul 2024

The Syncfusion .NET OCR library extracts text from scanned PDFs and images in an ASP.NET Core application, using Google’s Tesseract Optical Character Recognition engine.

Steps to perform OCR on an entire PDF document in an ASP.NET Core application

Step 1: Create a new C# ASP.NET Core Web Application project.
[Screenshot: Create ASP.NET Core Web application]

Step 2: In the configuration windows, name your project and click Next.
[Screenshot: ASP.NET Core project configuration 1]
[Screenshot: ASP.NET Core project configuration 2]

Step 3: Install the Syncfusion.PDF.OCR.NET NuGet package from NuGet.org as a reference in your application.
[Screenshot: PDF OCR ASP.NET Core NuGet package]

NOTE

  1. From version 21.1.x onward, the default configuration already includes the TesseractBinaries and Tesseract language data folder paths, so you no longer need to provide these paths explicitly.
  2. Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, you must also add a reference to the “Syncfusion.Licensing” assembly and include a license key in your project. Refer to the Syncfusion licensing documentation to learn how to register the license key in your application.
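As a rough illustration of the licensing note, the key is typically registered once at application startup, before any Syncfusion component is used. The sketch below assumes a project that uses top-level statements in Program.cs and that the Syncfusion.Licensing package is referenced. The key string is a placeholder, not a real value.

```csharp
// Program.cs (assumed location; older project templates would use Startup.cs instead)
using Syncfusion.Licensing;

// Register the license key before any Syncfusion component is created.
// "YOUR LICENSE KEY" is a placeholder; substitute the key issued for your account.
SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");
```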

Step 4: A default controller named HomeController.cs is added when the ASP.NET Core MVC project is created. Include the following namespaces in that HomeController.cs file.

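  • C#
  • //System.IO provides FileStream and MemoryStream (already available when implicit usings are enabled).
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;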

Step 5: Add a new button in index.cshtml as follows.

  • CSHTML
  • @{Html.BeginForm("PerformOCR", "Home", FormMethod.Post);
       {
          <div>
             <input type="submit" value="Perform OCR" style="width:150px;height:27px" />
          </div>
       }
       Html.EndForm();
    }

Step 6: Add a new action method named PerformOCR in HomeController.cs and use the following code sample to perform OCR on the entire PDF document using the PerformOCR method of the OCRProcessor class.

  • C#
  • //PerformOCR action in HomeController.cs.
    public IActionResult PerformOCR()
    {
       //Initialize the OCR processor.
       using (OCRProcessor processor = new OCRProcessor())
       //Open the input PDF; the using statement releases the file handle when the block ends.
       using (FileStream fileStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
       {
          //Load a PDF document.
          PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);
          //Set OCR language to process.
          processor.Settings.Language = Languages.English;
          //Process OCR by providing the PDF document.
          processor.PerformOCR(lDoc);
          //Create memory stream.
          MemoryStream stream = new MemoryStream();
          //Save the document to memory stream.
          lDoc.Save(stream);
          lDoc.Close();
          //Set the position as '0'.
          stream.Position = 0;
          //Download the PDF document in the browser.
          FileStreamResult fileStreamResult = new FileStreamResult(stream, "application/pdf");
          fileStreamResult.FileDownloadName = "Sample.pdf";
          return fileStreamResult;
       }
    }
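The "Set the position as '0'" step in the sample matters because saving to a MemoryStream leaves its position at the end of the written data, so anything that reads the stream afterwards would otherwise see no bytes. The following standalone sketch uses only the .NET base class library to show that behavior. The class name and the sample bytes are illustrative, not part of the Syncfusion sample.

```csharp
using System;
using System.IO;
using System.Text;

class RewindDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        using var stream = new MemoryStream();

        // Stand-in for document.Save(stream): write a few bytes.
        byte[] payload = Encoding.UTF8.GetBytes("%PDF-1.7");
        stream.Write(payload, 0, payload.Length);

        // The position is now at the end, so a read returns zero bytes.
        int readWithoutRewind = stream.Read(new byte[payload.Length], 0, payload.Length);

        // Rewinding makes the full content readable again.
        stream.Position = 0;
        int readAfterRewind = stream.Read(new byte[payload.Length], 0, payload.Length);

        Console.WriteLine($"{readWithoutRewind} {readAfterRewind}"); // prints "0 8"
    }
}
```

The same reasoning applies to the stream handed to FileStreamResult: it is read from its current position, so it needs to be rewound after the document is saved into it.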

By executing the program, you will get a PDF document as follows.
[Screenshot: OCR ASP.NET Core output]

A complete working sample can be downloaded from GitHub.

See the Syncfusion PDF library documentation to explore its rich set of features.