Training-Tesseract-OCR-5-in-Docker-Containers

Training Tesseract 5 in Docker

This guide provides step-by-step instructions for training Tesseract 5 in a Docker container. Docker allows you to create a reproducible environment for training Tesseract OCR models. By following the steps outlined below, you can set up a Docker container with Ubuntu, install Tesseract 5 and the necessary training tools, obtain training data, organize the data, and start the training process.

Create Ubuntu container

Open the terminal.
Pull the Ubuntu Docker image:
```
docker pull ubuntu
```
If you are interested in a specific version, you can specify it:
```
docker pull ubuntu:22.04
```
Run the Docker image:
```
docker run -ti --rm ubuntu /bin/bash
```
Note: By default, the Docker Ubuntu image does not have the lsb_release command available. You can use the cat command to check the OS information instead.
Check the OS version:
```
cat /etc/os-release
```
If the lsb-release package is not installed, update the package sources and install it:
```
apt update && apt install lsb-release
```
Verify the OS version again:
```
lsb_release -a
```
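Because /etc/os-release is a list of shell-style KEY=value assignments, it can also be read from a script when lsb_release is unavailable. The following is a minimal sketch, not part of the original procedure: the function name os_version is hypothetical, and the optional file argument exists only so the function can be pointed at a different file.

```shell
# Print "<ID> <VERSION_ID>" from an os-release file (default: /etc/os-release).
# os_version is a hypothetical helper name, not part of any tool.
os_version() {
    file=${1:-/etc/os-release}
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Source the file in a subshell so its variables do not leak into the caller.
    (
        . "$file"
        printf '%s %s\n' "$ID" "$VERSION_ID"
    )
}
```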
To share files between your host system and the Docker container, first create a directory named Docker_Share in the container’s terminal:
```
mkdir -p Docker_Share
```
Verify that the directory was created:
```
ls
```
In a separate terminal on your host machine, look up the ID of the running container:
```
docker ps
```
Make note of the container ID.
Save the Docker container state as a new image:
```
docker commit -p container_id new_image_name
```
For example:
```
docker commit -p 3409ehfu384f myubuntu
```
Replace container_id with the ID of the container obtained in the previous step, and new_image_name with the desired name for the new image.
Verify that the new image was created:
```
docker images
```
Stop the Docker container:
```
docker stop container_id
```
Replace container_id with the ID of the container obtained earlier.
Restart the container with the shared data:
```
docker run -ti -v /host/machine/dir:/Docker_Share image_name /bin/bash
```
For example:
```
docker run -ti -v C:\training_data:/Docker_Share myubuntu /bin/bash
```
Replace /host/machine/dir with the path of the host directory you want to share with the container, and image_name with the name of the image created in the previous step. The trailing /bin/bash starts the container with an interactive shell.
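
The docker run invocation above can be wrapped in a small function so that the host directory and image name become arguments. This is only a sketch: run_shared and the DRY_RUN variable are hypothetical names introduced here, and the dry-run mode simply prints the command instead of starting a container.

```shell
# Start a container from <image> with <host_dir> mounted at /Docker_Share.
# run_shared and DRY_RUN are hypothetical names used for illustration.
run_shared() {
    host_dir=$1
    image=$2
    # Build the same command line as in the step above.
    set -- docker run -ti -v "$host_dir:/Docker_Share" "$image" /bin/bash
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"      # show the command without running it
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 run_shared /home/me/training_data myubuntu` prints the command so it can be checked before it is run for real.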

Install Tesseract 5 in the container

In the container’s terminal, update the package sources and install Git:
```
apt update && apt install git
```
Clone the Tesseract repository:
```
git clone https://github.com/tesseract-ocr/tesseract.git
```
Verify that the tesseract directory was created:
```
ls
```

Install the compiler, build tools, and auxiliary libraries required to build Tesseract (the base Ubuntu image does not ship g++ or make):
```
apt update && apt install g++ make autoconf automake libtool pkg-config libpng-dev libjpeg8-dev libtiff5-dev zlib1g-dev libwebpdemux2 libwebp-dev libopenjp2-7-dev libgif-dev libarchive-dev libcurl4-openssl-dev libicu-dev libpango1.0-dev libcairo2-dev libleptonica-dev
```

Navigate to the /tesseract directory:
```
cd /tesseract
```
Run the autogen.sh script:
```
./autogen.sh
```
Run the configure script:
```
./configure
```
Build and install Tesseract OCR 5:
```
make
make install
ldconfig
```
Install the Tesseract training tools:
```
make training
make training-install
```
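Before moving on, it may be worth confirming that the binaries actually landed on the PATH. The sketch below uses a hypothetical helper, check_tools, that relies only on the shell builtin `command -v`; the tool names passed to it are up to you.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 only if every command is found.
# check_tools is a hypothetical helper name.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call after this step might be `check_tools tesseract lstmtraining combine_tessdata text2image unicharset_extractor`.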

Clone the tesstrain repository:

```
git clone https://github.com/tesseract-ocr/tesstrain.git
```

Navigate to the tesstrain directory:
```
cd /tesseract/tesstrain
```

Install wget and the required Python libraries:

```
apt update && apt install wget python3-pip
pip install -r requirements.txt
```

Fetch language data:
```
make tesseract-langdata
```

Get Training Data

To train a Tesseract OCR model, you need the following training data:

[lang].[font].exp[number].tif (image of a single line of text)
[lang].[font].exp[number].gt.txt (ground truth text file)

For example:

chi_tra.DFKai.exp0.tif
chi_tra.DFKai.exp0.gt.txt

Optional training data includes:

[lang].[font].exp[number].box

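The .box files record the positions of characters in the image. They are optional because tesstrain can generate them from the images and ground truth text (see the %.box rules in its Makefile); supplying your own may improve the training process and model accuracy.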

Move all the training data into the directory shared with the Docker container. For example, if your shared directory on the host machine is C:\training_data, place all the .gt.txt, .tif, and .box files in that directory.
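
Since every image needs a ground truth file with the same base name, a quick pairing check can catch missing files before training starts. This is a minimal sketch under the naming scheme described above; check_pairs is a hypothetical helper name, and it only looks at .tif images.

```shell
# List every .tif in a directory that has no matching .gt.txt file.
# Returns 0 if all images are paired, 1 otherwise.
# check_pairs is a hypothetical helper name.
check_pairs() {
    dir=${1:-.}
    status=0
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue          # glob matched nothing
        gt=${img%.tif}.gt.txt
        if [ ! -f "$gt" ]; then
            echo "missing ground truth: $gt"
            status=1
        fi
    done
    return "$status"
}
```

Inside the container, `check_pairs /Docker_Share` would report any image that lacks its .gt.txt counterpart.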

Organize Training Data

Copy the training data from the shared directory to the appropriate location:
```
cp -r /Docker_Share /tesseract/tesstrain/data/[lang].[font]-ground-truth
```
Replace [lang].[font] with the appropriate language and font information.
Download the traineddata files you need from the tessdata_best repository. Whatever language you are training, download eng.traineddata as well as the traineddata file for that language. For example, if you are training Chinese Traditional (chi_tra), download both eng.traineddata and chi_tra.traineddata.
Move the downloaded traineddata files into the shared directory. For example, move eng.traineddata and chi_tra.traineddata to C:\training_data on the host machine.
Move the traineddata files to the default training directory:
```
mv /Docker_Share/*.traineddata /usr/local/share/tessdata/
```
Now your training data is organized and ready for training the new model.
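
One caveat with the `cp -r` step: if the destination directory already exists, cp nests a Docker_Share subdirectory inside it instead of copying the files directly. The sketch below is an alternative that copies only the training files by extension; stage_ground_truth is a hypothetical helper name, and the extensions are the ones listed in this guide.

```shell
# Copy only training files (.tif, .gt.txt, .box) from src into dest,
# creating dest if needed. Other files (e.g. *.traineddata) are not copied.
# stage_ground_truth is a hypothetical helper name.
stage_ground_truth() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*.tif "$src"/*.gt.txt "$src"/*.box; do
        [ -e "$f" ] || continue            # skip patterns that matched nothing
        cp "$f" "$dest"/ || return 1
    done
    return 0
}
```

For example: `stage_ground_truth /Docker_Share /tesseract/tesstrain/data/chi_tra.DFKai-ground-truth`.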

Start training

Navigate to the training directory:
```
cd /tesseract/tesstrain
```

If you have .box files and want to avoid overwriting them during the training process, modify the Makefile:

```
apt update && apt install nano
cd /tesseract/tesstrain
nano Makefile
```

Locate the lines starting with %.box and comment them out.

Original lines:

```
%.box: %.png %.gt.txt
    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.png" -t "$*.gt.txt" > "$@"

%.box: %.bin.png %.gt.txt
    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.bin.png" -t "$*.gt.txt" > "$@"

%.box: %.nrm.png %.gt.txt
    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.nrm.png" -t "$*.gt.txt" > "$@"

%.box: %.raw.png %.gt.txt
    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.raw.png" -t "$*.gt.txt" > "$@"

%.box: %.tif %.gt.txt
    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
```

Modified lines:

```
# %.box: %.png %.gt.txt
#    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.png" -t "$*.gt.txt" > "$@"

# %.box: %.bin.png %.gt.txt
#    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.bin.png" -t "$*.gt.txt" > "$@"

# %.box: %.nrm.png %.gt.txt
#    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.nrm.png" -t "$*.gt.txt" > "$@"

# %.box: %.raw.png %.gt.txt
#    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.raw.png" -t "$*.gt.txt" > "$@"

# %.box: %.tif %.gt.txt
#    PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
```

Press Ctrl + O and then Enter to save the modified Makefile. Press Ctrl + X to exit the editor.
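
If you would rather not edit the file by hand, the same change can be sketched with sed. This is an assumption-laden shortcut, not part of the original procedure: comment_box_rules is a hypothetical helper name, and it assumes each %.box rule is followed by exactly one recipe line, as in the listing above. It keeps a copy of the original as Makefile.bak.

```shell
# Comment out every "%.box:" pattern rule and the single recipe line after it.
# Assumes each rule has exactly one recipe line.
# A backup of the original is kept as <file>.bak.
# comment_box_rules is a hypothetical helper name.
comment_box_rules() {
    makefile=${1:-Makefile}
    cp "$makefile" "$makefile.bak" || return 1
    # On a matching rule line: prefix it with "# ", move to the next
    # line (the recipe), and prefix that with "# " as well.
    sed -e '/^%\.box:/{' -e 's/^/# /' -e 'n' -e 's/^/# /' -e '}' \
        "$makefile.bak" > "$makefile"
}
```

Run it as `comment_box_rules /tesseract/tesstrain/Makefile`, then compare the result against Makefile.bak (for example with diff) before training.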

Start training a new model:
```
make training MODEL_NAME=[lang].[font] TESSDATA=/usr/local/share/tessdata
```
Replace [lang].[font] with the appropriate language and font information.
If you want to fine-tune an existing model, use the START_MODEL parameter:
```
make training MODEL_NAME=[lang].[font] START_MODEL=[lang] TESSDATA=/usr/local/share/tessdata
```
Replace [lang].[font] with the appropriate language and font information.
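After training, you can find the traineddata of the new model in the default output path (depending on the tesstrain version, the final model may instead be written one level up, as data/[lang].[font].traineddata):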
```
cd /tesseract/tesstrain/data/[lang].[font]
ls
```
Replace [lang].[font] with the appropriate language and font information.
Copy the traineddata of the new model to the shared directory:
```
cp /tesseract/tesstrain/data/[lang].[font]/[lang].[font].traineddata /Docker_Share
```
Replace [lang].[font] with the appropriate language and font information.
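
Because the exact location of the finished model may differ between tesstrain versions (an assumption, not something this guide confirms), it can help to list the candidates before copying. A minimal sketch, with find_traineddata as a hypothetical helper name:

```shell
# List files named <model>.traineddata at most two levels below data_dir.
# find_traineddata is a hypothetical helper name.
find_traineddata() {
    data_dir=$1
    model=$2
    find "$data_dir" -maxdepth 2 -type f -name "$model.traineddata" | sort
}
```

For example, `find_traineddata /tesseract/tesstrain/data chi_tra.DFKai` prints every matching file; comparing sizes and timestamps with `ls -l` may help identify the one produced at the end of training.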

The traineddata file will now be available in the shared directory on your host machine.

Reference

For detailed steps and additional information, please refer to the following resources:

How to Run Ubuntu as a Docker Container
[Compilation guide for various platforms tessdoc](https://tesseract-ocr.github.io/tessdoc/Compiling.html)
[GitHub - tesseract-ocr/tesstrain: Train Tesseract LSTM with make](https://github.com/tesseract-ocr/tesstrain)