Perform OCR in ASP.NET MVC

11 Jan 2023 | 3 minutes to read

The Syncfusion .NET OCR library extracts text from scanned PDFs and images in an ASP.NET MVC application with the help of Google’s Tesseract Optical Character Recognition engine.

Steps to perform OCR on an entire PDF document in ASP.NET MVC

Step 1: Create a new C# ASP.NET Web Application (.NET Framework) project.

Step 2: In the project configuration window, name your project and click Create.

Step 3: Install the Syncfusion.Pdf.OCR.AspNet.Mvc5 NuGet package from NuGet.org as a reference in your application.

Step 4: The Tesseract assemblies are not added as a reference. They must be kept on the local machine, and their location is passed as a parameter to the OCR processor.

  • C#
  • OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");

    Step 5: Place the Tesseract language data (e.g., eng.traineddata) on the local system and provide its path to the OCR processor. Language data for other languages is available from the following link.

    Tesseract language data

  • C#
  • OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
    processor.PerformOCR(lDoc, @"TessData/");
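
Relative folder names such as the ones above resolve against the process's current working directory, which for a web application hosted in IIS may not be the application folder. A minimal sketch of one way to build absolute paths, assuming the TesseractBinaries and TessData folders are deployed under the application root. The `OcrPaths` class and its method names are hypothetical helpers, not part of the Syncfusion API.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the Syncfusion API).
public static class OcrPaths
{
    // Combines the application base directory with a relative folder name and
    // returns an absolute path that ends with a directory separator.
    // Assumption: the folder is deployed under the application root.
    public static string Resolve(string relativeFolder)
    {
        string baseDirectory = AppDomain.CurrentDomain.BaseDirectory;
        string fullPath = Path.GetFullPath(Path.Combine(baseDirectory, relativeFolder));
        if (!Directory.Exists(fullPath))
        {
            throw new DirectoryNotFoundException("OCR folder not found: " + fullPath);
        }
        string separator = Path.DirectorySeparatorChar.ToString();
        return fullPath.EndsWith(separator, StringComparison.Ordinal)
            ? fullPath
            : fullPath + separator;
    }

    // Returns true when <language>.traineddata (e.g. eng.traineddata)
    // exists in the given language data folder.
    public static bool HasLanguageData(string tessDataFolder, string language)
    {
        return File.Exists(Path.Combine(tessDataFolder, language + ".traineddata"));
    }
}
```

With such a helper, the calls above could be written as `new OCRProcessor(OcrPaths.Resolve("TesseractBinaries"))` and `processor.PerformOCR(lDoc, OcrPaths.Resolve("TessData"))`, and `OcrPaths.HasLanguageData` could be checked first to fail early when the language file is missing.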

    Step 6: Include the following namespaces in the HomeController.cs file.

  • C#
  • using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    Step 7: Add a new button in the Index.cshtml as follows.

  • C#
  • @using (Html.BeginForm("PerformOCR", "Home", FormMethod.Post))
    {
       <div>
          <input type="submit" value="Perform OCR" style="width:150px;height:27px" />
       </div>
    }

    Step 8: Add a new action method named PerformOCR in the HomeController.cs file and use the following code sample to perform OCR on the entire PDF document using the PerformOCR method of the OCRProcessor class.

  • C#
  • public ActionResult PerformOCR()
    {
       //Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll).
       using (OCRProcessor processor = new OCRProcessor("TesseractBinaries/3.05/x86/"))
       //Open the input PDF as a stream; the using statement disposes it afterwards.
       using (FileStream fileStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
       {
          //Load a PDF document.
          PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);
          //Set OCR language to process.
          processor.Settings.Language = Languages.English;
          //Match the Tesseract version to the binaries supplied above.
          processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
          //Process OCR by providing the PDF document and Tesseract data.
          processor.PerformOCR(lDoc, "Tessdata/");
          //Save the document and send it to the browser.
          lDoc.Save("Output.pdf", HttpContext.ApplicationInstance.Response, Syncfusion.Pdf.HttpReadType.Save);
          //Close the document.
          lDoc.Close(true);
          return View();
       }
    }
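
As an alternative to writing to the HTTP response inside the action, the processed document can be saved to a stream and returned as a file result. A minimal sketch, assuming the same folder layout and input file as the sample above. The action name `PerformOCRToFile` is hypothetical, and `PdfLoadedDocument.Save(Stream)` is used here in place of the response-based overload.

```csharp
using System.IO;
using System.Web.Mvc;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

public class HomeController : Controller
{
    // Hypothetical action: runs OCR and returns the result as a file download.
    public ActionResult PerformOCRToFile()
    {
        using (OCRProcessor processor = new OCRProcessor("TesseractBinaries/3.05/x86/"))
        using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        using (MemoryStream output = new MemoryStream())
        {
            // Load the scanned PDF and configure OCR as in the sample above.
            PdfLoadedDocument lDoc = new PdfLoadedDocument(input);
            processor.Settings.Language = Languages.English;
            processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
            processor.PerformOCR(lDoc, "Tessdata/");

            // Save into memory instead of writing to the HTTP response directly.
            lDoc.Save(output);
            lDoc.Close(true);

            // Copy the bytes out before the stream is disposed.
            return File(output.ToArray(), "application/pdf", "Output.pdf");
        }
    }
}
```

Returning a file result keeps the action free of direct response handling, which tends to make it easier to unit test. The whole document is buffered in memory, so this approach suits moderately sized PDFs.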

    Executing the program produces the OCR-processed PDF document, Output.pdf.

    A complete working sample can be downloaded from GitHub.