tesseract.md
57KB

OCR in .NET using Tesseract

Tesseract has been around the block a few times. It was originally developed in the 80s by HP and released as open source software by Google in 2005. The most current version is Tesseract 5️⃣ which we can install via nuget package. Are there better options out there? For sure, but in terms of compatability not really. The more powerful OCR engines that can detect handwriting usually require a GPU. That is a big ask if you are trying to host the application yourself.

Setup

Getting started is easy on .NET

It is the first thing to pop up when searching for it on Nuget

tesseract_nuget.png

Note: After installation, you'll notice a x64 and x86 folder in your project, this is part of the nuget package and is required for operation. That means you'll need to deploy the folders along with your application

We now need a language file, you can download those from here or you can grab the english one directly from here

You need to place these language files within a /tessdata folder (it is a hard coded requirement that the language files reside within this folder name)

trained_data_file.png

Make sure to set the build output to copy if newer

The build folder should look like this:

build_output.png

That's it! We should be ready to use the engine now

Basic Usage

Here is a basic usage example

using System;
using Tesseract;

namespace Tesseract_Testing
{    
    internal class Program    
    {        
        public static void Main(string[] args)        
        {            
            string imagePath = @".\testfiles\Random Text.tiff";
            using (var engine = new TesseractEngine(@".\tessdata", "eng", EngineMode.Default))            
            {                
                using (var img = Pix.LoadFromFile(imagePath))                
                {                    
                    using (var page = engine.Process(img))                    
                    {                        
                        Console.WriteLine(page.GetText());                    
                    }                
                }            
            }        
        }    
    }
}

All we are doing here is loading our test file and printing the OCR results to the console

test_file.png

test_file_output.png

Trained Data Files

There are 100 languages that can be OCR'd with Tesseract

Each language is trained into a traineddata file and there are 3 levels of trained data available

The same languages and scripts are available from https://github.com/tesseract-ocr/tessdata_best. tessdata_best provides slow language and script models. These models are needed for training. They also can give better OCR results, but the recognition takes much more time.

Both tessdata_fast and tessdata_best only support the LSTM OCR engine.