Perform OCR in Azure using C#

11 Jan 2023, 9 minutes to read

The Syncfusion .NET OCR library is used to extract text from scanned PDFs and images in Azure with the help of Google’s Tesseract Optical Character Recognition engine.

Azure App Service Windows

Steps to perform OCR on the entire PDF document in Azure App Service

Step 1: Create a new ASP.NET Core MVC application.

Step 2: In the configuration window, name your project and click Next.

Step 3: Install the Syncfusion.PDF.OCR.NET NuGet package from NuGet.org as a reference to your .NET Core application.

Step 4: Tesseract assemblies are not added as a reference. They must be kept on the local machine, and the location of the assemblies is passed as a parameter to the OCR processor.

  • C#
  • OCRProcessor processor = new OCRProcessor(@"Tesseractbinaries/Windows");
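A relative path such as the one above is resolved against the process's current working directory, which may differ between a local run and a hosted one. The following is a minimal sketch, not part of the Syncfusion API, that resolves the folder against an explicit base directory and fails early when it is missing. The `OcrPaths` class and `ResolveFolder` method are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Syncfusion): resolves an OCR resource
// folder against a known base directory instead of the working directory.
public static class OcrPaths
{
    public static string ResolveFolder(string baseDirectory, string relativeFolder)
    {
        // Build an absolute, normalized path from the base directory.
        string fullPath = Path.GetFullPath(Path.Combine(baseDirectory, relativeFolder));

        // Fail early with a clear message if the folder was not deployed.
        if (!Directory.Exists(fullPath))
        {
            throw new DirectoryNotFoundException("OCR resource folder not found: " + fullPath);
        }

        return fullPath;
    }
}
```

Under these assumptions, the result could be passed to the constructor shown above, for example `new OCRProcessor(OcrPaths.ResolveFolder(AppContext.BaseDirectory, "Tesseractbinaries/Windows"))`, provided the binaries are copied to the application's output folder.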

    Step 5: Place the Tesseract language data (e.g., eng.traineddata) on the local system and provide its path to the OCR processor. Language data for other languages is available at the following link.

    Tesseract language data

  • C#
  • OCRProcessor processor = new OCRProcessor(@"Tesseractbinaries/Windows");
    processor.PerformOCR(lDoc, "tessdata/");
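Because the language data is an ordinary file on disk, its presence can be checked before OCR is attempted. The sketch below assumes the `<language code>.traineddata` naming seen in `eng.traineddata`. The `TessData` class and `HasLanguage` method are hypothetical names, not part of the Syncfusion API.

```csharp
using System.IO;

// Hypothetical helper: checks whether the language data file for a given
// Tesseract language code (for example "eng") exists in the tessdata folder.
public static class TessData
{
    public static bool HasLanguage(string tessdataFolder, string languageCode)
    {
        // Language data files are named "<code>.traineddata", e.g. eng.traineddata.
        string dataFile = Path.Combine(tessdataFolder, languageCode + ".traineddata");
        return File.Exists(dataFile);
    }
}
```

A call such as `TessData.HasLanguage("tessdata/", "eng")` returning `false` would suggest that the file was not placed in, or not deployed to, the expected folder.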

    Step 6: Add a new button in index.cshtml as follows.

  • C#
  • @{
        Html.BeginForm("PerformOCR", "Home", FormMethod.Get);
        {
            <br />
            <div>
                <input type="submit" value="Perform OCR" style="width:150px;height:27px" />
            </div>
        }
        Html.EndForm();
    }

    Step 7: Include the following namespaces in the HomeController.cs file.

  • C#
  • using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;
    using Microsoft.AspNetCore.Hosting;
    using System.IO;

    Step 8: Add the following code to perform OCR on the entire PDF document using the PerformOCR method of the OCRProcessor class.

  • C#
  • public IActionResult PerformOCR()
    {
        //Initialize the OCR processor with the Tesseract binaries folder path.
        using (OCRProcessor processor = new OCRProcessor("Tesseractbinaries/Windows/"))
        {
            //Load a PDF document.
            PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
            //Set the OCR language to process.
            processor.Settings.Language = Languages.English;
            //Perform OCR with the input document and tessdata (language packs).
            processor.PerformOCR(lDoc, "Tessdata/");
            //Save the document to a stream, then release the document.
            MemoryStream stream = new MemoryStream();
            lDoc.Save(stream);
            lDoc.Close(true);
            return File(stream.ToArray(), System.Net.Mime.MediaTypeNames.Application.Pdf, "OCR_Azure.pdf");
        }
    }

    Step 9: Run the application on the local machine and verify that OCR is performed on the PDF document.

    Steps to publish as Azure App Service

    Step 1: Right-click the project and click Publish.

    Step 2: Create a new profile in the publish target window.

    Step 3: Create an App Service using an Azure subscription and select a hosting plan based on the environment.

    Step 4: Configure the Hosting plan.

    Step 5: After creating a profile, click Publish.

    The published webpage now opens in the browser. Click the Perform OCR button to perform OCR on a PDF document.

    A complete working sample for performing OCR on a PDF document in Azure App Service on Windows can be downloaded from GitHub.

    Azure Functions

    Steps to perform OCR on the entire PDF document in Azure Functions

    Step 1: Create an Azure Functions project.

    Step 2: Select the Azure Functions framework and choose the HTTP trigger.

    Step 3: Install the Syncfusion.PDF.OCR.NET NuGet package from NuGet.org as a reference to your .NET Core application.

    Step 4: Tesseract assemblies are not added as a reference. They must be kept on the local machine, and the location of the assemblies is passed as a parameter to the OCR processor.

  • C#
  • OCRProcessor processor = new OCRProcessor(@"Tesseractbinaries/Windows");

    Step 5: Place the Tesseract language data (e.g., eng.traineddata) on the local system and provide its path to the OCR processor. Language data for other languages is available at the following link.

    Tesseract language data

  • C#
  • OCRProcessor processor = new OCRProcessor(@"Tesseractbinaries/Windows");
    processor.PerformOCR(lDoc, "tessdata/");

    Step 6: Include the following namespaces in the Function1.cs file to perform OCR for a PDF document using C#.

  • C#
  • using System;
    using System.IO;
    using System.Threading.Tasks;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Graphics;
    using Syncfusion.Pdf;
    using System.Net.Http;
    using Syncfusion.Pdf.Parsing;
    using System.Net.Http.Headers;
    using System.Net;
    using Microsoft.Azure.WebJobs.Host;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;

    Step 7: Add the following code sample to the Function1 class to perform OCR on a PDF document using the PerformOCR method of the OCRProcessor class in Azure Functions.

  • C#
  • [FunctionName("Function1")]
    public static async Task<HttpResponseMessage> Run([HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequestMessage req, TraceWriter log, ExecutionContext executionContext)
    {
        MemoryStream ms = new MemoryStream();
        try
        {
            string path = Path.GetFullPath(Path.Combine(executionContext.FunctionAppDirectory, "bin\\Tesseractbinaries\\Windows"));
            OCRProcessor processor = new OCRProcessor(path);
            //Open the input PDF read-only; the deployed function folder may not be writable.
            FileStream stream = new FileStream(Path.Combine(executionContext.FunctionAppDirectory, "Data", "Input.pdf"), FileMode.Open, FileAccess.Read);
            //Load a PDF document.
            PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
            //Set OCR language to process.
            processor.Settings.Language = Languages.English;
            //Perform OCR with input document and tessdata (Language packs).
            string ocr = processor.PerformOCR(lDoc, Path.Combine(executionContext.FunctionAppDirectory, "tessdata"));            
            //Save the PDF document.  
            lDoc.Save(ms);
            ms.Position = 0;
        }
        catch (Exception ex)
        {
            //Create a new PDF document to report the error.
            PdfDocument document = new PdfDocument();
            //Add a page to the document.
            PdfPage page = document.Pages.Add();
            //Create PDF graphics for the page.
            PdfGraphics graphics = page.Graphics;
            //Set the standard font.
            PdfFont font = new PdfStandardFont(PdfFontFamily.Helvetica, 6);
            //Draw the text.
            graphics.DrawString(ex.ToString(), font, PdfBrushes.Black, new Syncfusion.Drawing.PointF(0, 0));
            ms = new MemoryStream();
            //Save the PDF document.  
            document.Save(ms);
        }
        HttpResponseMessage response = new HttpResponseMessage(HttpStatusCode.OK);
        response.Content = new ByteArrayContent(ms.ToArray());
        response.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment")
        {
            FileName = "Output.pdf"
        };
        response.Content.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/pdf");
        return response;
    }

    Step 8: Run the function on the local machine and verify that OCR is performed on the PDF document.

    Steps to publish as Azure Functions

    Step 1: Right-click the project and click Publish. Then, create a new profile in the Publish window and create the Azure Function App with a consumption plan.

    Step 2: After creating the profile, click Publish.

    Step 3: Go to the Azure portal and select Function Apps. After the service is running, click Get function URL > Copy. The copied URL carries the function key as a query string. Paste the URL into a new browser tab to download the resulting PDF document.
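The same request can be made from code rather than from a browser. The sketch below assumes a separate console client on .NET Core 3.1 or later and relies on Azure Functions accepting the function key in a `code` query parameter for functions declared with `AuthorizationLevel.Function`. The `FunctionClient` class, its method names, and any URL or key passed to it are hypothetical placeholders.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical client helper for calling the published function.
public static class FunctionClient
{
    // Appends the function key as the "code" query parameter,
    // preserving any query string already present on the URL.
    public static Uri BuildFunctionUri(string functionUrl, string functionKey)
    {
        var builder = new UriBuilder(functionUrl);
        string keyPair = "code=" + Uri.EscapeDataString(functionKey);
        // The Query getter includes a leading '?', so strip it before appending.
        string existing = builder.Query.TrimStart('?');
        builder.Query = string.IsNullOrEmpty(existing) ? keyPair : existing + "&" + keyPair;
        return builder.Uri;
    }

    // Sends a GET request to the function and writes the returned PDF to disk.
    public static async Task DownloadPdfAsync(HttpClient client, Uri functionUri, string outputPath)
    {
        using (HttpResponseMessage response = await client.GetAsync(functionUri))
        {
            // Throws if the function returned a non-success status code.
            response.EnsureSuccessStatusCode();
            byte[] pdfBytes = await response.Content.ReadAsByteArrayAsync();
            await File.WriteAllBytesAsync(outputPath, pdfBytes);
        }
    }
}
```

If the URL copied from the portal already contains the key, it can be passed to `DownloadPdfAsync` directly as `new Uri(copiedUrl)` without calling `BuildFunctionUri`.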

    A complete working sample can be downloaded from GitHub.