OCR Processor Troubleshooting

12 Oct 2023

Tesseract has not been initialized exception

Exception: Tesseract has not been initialized exception.
Reason: The exception may occur if the Tesseract binaries and tessdata files are unavailable at the provided paths.
Solution 1: Set the correct paths to the Tesseract binaries and to the tessdata folder, including all of its files and subfolders. The tessdata folder name is case-sensitive and must not be changed.

  • C#
  • //TesseractBinaries: path of the folder containing the Tesseract binaries.
    OCRProcessor processor = new OCRProcessor(@"TesseractBinaries/");
    
    //TessData: path of the folder containing the language packs.
    //lDoc is the loaded PDF document to process.
    processor.PerformOCR(lDoc, @"TessData/");
    Solution 2: Ensure that the tessdata files are version 3.02, because the OCR processor is built with Tesseract version 3.02.
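Before changing any code, it can help to confirm that the tessdata path really points at a folder containing language files. The following is a minimal shell sketch, not part of the OCR library: `check_tessdata` is a hypothetical helper name, and the check assumes that Tesseract language files carry the `.traineddata` extension.

```shell
# check_tessdata DIR
# Hypothetical helper (not part of any library): reports whether DIR exists
# and holds at least one Tesseract language file (*.traineddata).
check_tessdata() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing folder: $dir"
    return 1
  fi
  # Count language files directly inside the folder (subfolders are ignored).
  count=$(find "$dir" -maxdepth 1 -type f -name '*.traineddata' | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "no .traineddata files in: $dir"
    return 1
  fi
  echo "ok: $count language file(s) in $dir"
}
```

Run it against the same path that is passed to the OCR processor, for example `check_tessdata TessData` from the application folder.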

    Exception has been thrown by the target of an invocation

    Exception: Exception has been thrown by the target of an invocation.
    Reason: The Tesseract binaries are not in the required folder structure.
    Solution: Ensure that the Tesseract binaries follow the structure listed below.

    The tessdata and Tesseract binaries folders are automatically added to the bin folder of the application. The assemblies should be laid out as follows:

    1. bin\Debug\net7.0\runtimes\win-x64\native\leptonica-1.80.0.dll, libSyncfusionTesseract.dll
    2. bin\Debug\net7.0\runtimes\win-x86\native\leptonica-1.80.0.dll, libSyncfusionTesseract.dll
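The layout listed above can be checked with a short script. This is a sketch under stated assumptions: `check_native_layout` is a hypothetical helper name, the file names are taken from the list above, and the script is meant to run from a shell such as Git Bash or WSL, where the output folder is written with forward slashes.

```shell
# check_native_layout OUT_DIR
# Hypothetical helper: OUT_DIR is the build output folder (for example
# bin/Debug/net7.0). Prints one line per missing native file and returns
# non-zero when any file is missing.
check_native_layout() {
  out_dir="$1"
  missing=0
  for arch in win-x64 win-x86; do
    for dll in leptonica-1.80.0.dll libSyncfusionTesseract.dll; do
      path="$out_dir/runtimes/$arch/native/$dll"
      if [ ! -f "$path" ]; then
        echo "missing: $path"
        missing=1
      fi
    done
  done
  return "$missing"
}
```

For example, `check_native_layout bin/Debug/net7.0` prints nothing when all four files are present.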
    Reason 1: The exception may occur when the Tesseract binaries or tessdata files used by the OCR processor are missing or mismatched.
    Reason 2: The exception may occur when the VC++ 2015 redistributable files are missing on the machine where the OCR processor runs.
    Solution: Install the VC++ 2015 redistributable files on that machine. Select both files on the download page and install them.
    Refer to the following screenshot:
    [Screenshot: Visual C++ 2015 Redistributable file]

    Download link: Visual C++ 2015 Redistributable file

    Can’t be opened because the developer’s identity cannot be confirmed

    Exception: Can't be opened because the developer's identity cannot be confirmed.
    Reason: This error may occur during the initial loading of the OCR processor in Mac environments.
    Solution: To resolve this issue, refer to this link for more details.

    The OCR processor doesn’t process languages other than English

    Exception: The OCR processor doesn't process languages other than English.
    Reason: This issue may occur when the input image contains text in other languages and the tessdata language files for those languages are unavailable.
    Solution: Essential PDF supports all the languages that the Tesseract engine supports in the OCR processor. The dictionary packs for the languages can be downloaded from the following online location:
    https://code.google.com/p/tesseract-ocr/downloads/list
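After downloading a language pack, it is worth confirming that the file sits in the tessdata folder under the expected name. The sketch below assumes that each language is stored as `<code>.traineddata` (for example `deu.traineddata` for German); `has_language` is a hypothetical helper name, not part of the OCR library.

```shell
# has_language TESSDATA_DIR LANG_CODE
# Hypothetical helper: checks that the language file for LANG_CODE
# (for example "deu") exists inside TESSDATA_DIR.
has_language() {
  tessdata_dir="$1"
  lang_code="$2"
  if [ -f "$tessdata_dir/$lang_code.traineddata" ]; then
    echo "found: $tessdata_dir/$lang_code.traineddata"
  else
    echo "missing: $tessdata_dir/$lang_code.traineddata"
    return 1
  fi
}
```

For example, `has_language TessData deu` reports whether the German pack is available before OCR is attempted.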

    You must also set the corresponding language code in the OCRProcessor.Settings.Language property.
    For example, to perform optical character recognition in German, set the property as follows:
    processor.Settings.Language = "deu";

    Text is not recognized properly when performing OCR on a PDF document with low-quality images

    Issue: Text is not recognized properly when performing OCR on a PDF document with low-quality images.
    Reason: Low-quality images in the input PDF document may cause this issue.
    Solution: Using the best tessdata can improve the OCR results. For more information, refer to the following link:
    https://github.com/tesseract-ocr/tessdata_best

    Note:

    For better performance, use the fast tessdata available at the following link:
    https://github.com/tesseract-ocr/tessdata_fast
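A single language file can be fetched from either repository on the command line. The sketch below only builds the download URL. The raw-file URL layout and the `main` branch name are assumptions about how these repositories are organized, so verify them in a browser first; `traineddata_url` is a hypothetical helper name.

```shell
# traineddata_url VARIANT LANG_CODE
# Hypothetical helper: VARIANT is "best" or "fast", LANG_CODE is a Tesseract
# language code such as "eng". The URL layout and the "main" branch name
# are assumptions, not confirmed by this document.
traineddata_url() {
  variant="$1"
  lang_code="$2"
  case "$variant" in
    best|fast) ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  echo "https://github.com/tesseract-ocr/tessdata_${variant}/raw/main/${lang_code}.traineddata"
}

# Example download into a local tessdata folder (requires network access):
#   mkdir -p tessdata
#   curl -fL -o tessdata/eng.traineddata "$(traineddata_url best eng)"
```

Choosing between the two is a trade-off: the best data targets accuracy, the fast data targets speed.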

    OCR not working on Mac: Exception has been thrown by the target of an invocation

    Issue: "Syncfusion.Pdf.PdfException: Exception has been thrown by the target of an invocation" on a Mac machine.
    Reason: The problem occurs due to a mismatch in the dependency package versions on the Mac machine.
    Solution: Install and use Tesseract 5 on the Mac machine. The following steps install Tesseract 5 and integrate it into an OCR processing workflow.

    1. Execute the following command to install Tesseract 5.

  • Shell
  • brew install tesseract


    If Homebrew (the "brew" command) is not installed on your machine, you can install it using the following command.

  • Shell
  • /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"


    2. Once Tesseract 5 is installed, copy the location of the Tesseract folder that holds the binaries and set it as the Tesseract binaries path when initializing the OCR processor. Refer to the example code below:

  • C#
  • //Initialize the OCR processor by providing the path of the Tesseract binaries.
    //The version folder (5.3.2 here) depends on the installed Tesseract version.
    using (OCRProcessor processor = new OCRProcessor("/opt/homebrew/Cellar/tesseract/5.3.2/lib"))


    3. Set the TessDataPath to the tessdata folder in the bin folder. Refer to the example code below:
  • C#
  • //Namespaces required by this example.
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    //Initialize the OCR processor by providing the path of the Tesseract binaries.
    using (OCRProcessor processor = new OCRProcessor("/opt/homebrew/Cellar/tesseract/5.3.2/lib"))
    //Open the input PDF file; the stream is disposed when the block ends.
    using (FileStream fileStream = new FileStream("../../../Input.pdf", FileMode.Open, FileAccess.Read))
    {
        //Load a PDF document.
        PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);
        //Set OCR language to process.
        processor.Settings.Language = Languages.English;
        //Set the path of the tessdata folder.
        processor.TessDataPath = "runtimes/tessdata";
        //Process OCR by providing the PDF document.
        processor.PerformOCR(lDoc);
        //Create file stream.
        using (FileStream outputFileStream = new FileStream("Output.pdf", FileMode.Create, FileAccess.ReadWrite))
        {
            //Save the PDF document to file stream.
            lDoc.Save(outputFileStream);
        }
        //Close the document.
        lDoc.Close(true);
    }
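The path in step 2 contains a version number that changes with each Tesseract release. As a sketch, the lib folder can instead be located from the Homebrew prefix. `find_tesseract_lib` is a hypothetical helper name. Whether the OCR processor accepts the prefix path reported by `brew --prefix tesseract` in place of the versioned Cellar path is an assumption to verify on your machine.

```shell
# find_tesseract_lib PREFIX
# Hypothetical helper: prints PREFIX/lib when that folder contains a
# libtesseract dynamic library, and returns non-zero otherwise.
find_tesseract_lib() {
  lib_dir="$1/lib"
  for f in "$lib_dir"/libtesseract*.dylib; do
    if [ -f "$f" ]; then
      echo "$lib_dir"
      return 0
    fi
  done
  echo "no libtesseract*.dylib under: $lib_dir" >&2
  return 1
}

# On a Mac with Homebrew:
#   find_tesseract_lib "$(brew --prefix tesseract)"
```

The printed folder is the value to pass to the OCRProcessor constructor.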

    Method PerformOCR() causes problems and ignores the tesseract files under WSL.

    Issue: The PerformOCR() method causes problems and ignores the Tesseract files under WSL.
    Reason: The Tesseract binaries are missing in WSL.
    Solution: Install and use Leptonica and Tesseract on your machine. The following steps install both:

    1. Install Leptonica.
  • Shell
  • sudo apt-get install libleptonica-dev




    2. Install Tesseract.
  • Shell
  • sudo apt-get install tesseract-ocr-eng




    3. Copy the binaries (liblept.so and libtesseract.so) to the folder named in the missing files exception, inside the project location.
  • Shell
  • cp /usr/lib/x86_64-linux-gnu/liblept.so /home/syncfusion/linuxdockersample/linuxdockersample/bin/Debug/net7.0/liblept1753.so
    cp /usr/lib/x86_64-linux-gnu/libtesseract.so.4 /home/syncfusion/linuxdockersample/linuxdockersample/bin/Debug/net7.0/libSyncfusionTesseract.so
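The two copy commands in step 3 can be wrapped so that a missing source library is reported before anything is copied. This is a sketch only: `copy_native_libs` is a hypothetical helper name, and the source and target file names are taken from the commands in step 3.

```shell
# copy_native_libs SRC_DIR DEST_DIR
# Hypothetical helper: copies liblept.so and libtesseract.so.4 from SRC_DIR
# into DEST_DIR under the file names used in step 3.
copy_native_libs() {
  src_dir="$1"
  dest_dir="$2"
  # Stop early when a source library is absent, before copying anything.
  for f in liblept.so libtesseract.so.4; do
    if [ ! -f "$src_dir/$f" ]; then
      echo "missing source library: $src_dir/$f" >&2
      return 1
    fi
  done
  cp "$src_dir/liblept.so" "$dest_dir/liblept1753.so" &&
    cp "$src_dir/libtesseract.so.4" "$dest_dir/libSyncfusionTesseract.so"
}

# Example, using the paths from step 3:
#   copy_native_libs /usr/lib/x86_64-linux-gnu \
#     /home/syncfusion/linuxdockersample/linuxdockersample/bin/Debug/net7.0
```

If the helper reports a missing source library, repeat steps 1 and 2 before copying again.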