This guide provides step-by-step instructions for training Tesseract 5 in a Docker container. Docker allows you to create a reproducible environment for training Tesseract OCR models. By following the steps outlined below, you can set up a Docker container with Ubuntu, install Tesseract 5 and the necessary training tools, obtain training data, organize the data, and start the training process.
Open the terminal.
Pull the Ubuntu Docker image:
docker pull ubuntu
If you are interested in a specific version, you can specify it:
docker pull ubuntu:22.04
Run the Docker image:
docker run -ti --rm ubuntu /bin/bash
Note: By default, the Docker Ubuntu image does not have the lsb_release command available. You can use the cat command to check the OS information instead.
Check the OS version:
cat /etc/os-release
If the lsb-release package is not installed, update the package sources and install it:
apt update && apt install lsb-core
Verify the OS version again:
lsb_release -a
Create a shared directory between your host system and the Docker container:
In the container’s terminal, create a directory named Docker_Share:
mkdir -p Docker_Share
Verify that the directory was created:
ls
In a separate terminal on your host machine, check the current running container ID:
docker ps
Make note of the container ID.
Save the Docker container state as a new image:
docker commit -p container_id new_image_name
For example:
docker commit -p 3409ehfu384f myubuntu
Replace container_id with the ID of the container obtained in the previous step, and new_image_name with the desired name for the new image.
Verify that the new image was created:
docker images
Stop the Docker container:
docker stop container_id
Replace container_id with the ID of the container obtained earlier.
Restart the container with the shared data:
docker run -ti -v /host/machine/dir:/Docker_Share image_name /bin/bash
For example:
docker run -ti -v C:\training_data:/Docker_Share myubuntu /bin/bash
Replace /host/machine/dir with the directory path on your host machine that you want to share with the container, image_name with the name of the new image created in the previous step, and /bin/bash to start the container with a terminal.
In the container’s terminal, update the package sources and install Git:
apt update && apt install git
Clone the Tesseract repository:
git clone https://github.com/tesseract-ocr/tesseract.git
Verify that the tesseract directory was created:
ls
Install auxiliary libraries required for Tesseract:
apt update && apt install autoconf automake libtool pkg-config libpng-dev libjpeg8-dev libtiff5-dev zlib1g-dev libwebpdemux2 libwebp-dev libopenjp2-7-dev libgif-dev libarchive-dev libcurl4-openssl-dev libicu-dev libpango1.0-dev libcairo2-dev libleptonica-dev
Navigate to the /tesseract directory:
cd /tesseract
Run the autogen.sh script:
./autogen.sh
Run the configure script:
./configure
Build and install Tesseract OCR 5:
make
make install
ldconfig
Install the Tesseract training tools:
make training
make training-install
Clone the tesstrain repository:
git clone https://github.com/tesseract-ocr/tesstrain.git
Navigate to the tesstrain directory:
cd /tesseract/tesstrain
Install wget and the required Python libraries:
apt update && apt install wget python3-pip
pip install -r requirements.txt
Fetch language data:
make tesseract-langdata
To train a Tesseract OCR model, you need the following training data:
For example:
Optional training data includes:
The .box files contain information about character positions in the image, improving the training process and model accuracy.
Move all the training data into the directory shared with the Docker container. For example, if your shared directory on the host machine is C:\training_data, place all the .gt.txt, .tif, and .box files in that directory.
Copy the training data from the shared directory to the appropriate location:
cp -r /Docker_Share /tesseract/tesstrain/data/[lang].[font]-ground-truth
Replace [lang].[font] with the appropriate language and font information.
Download the traineddata files you need from the tessdata_best repository. Make sure to download the eng.traineddata file for any language you are training. For example, if you are training Chinese Traditional (chi_tra), download the chi_tra.traineddata file.
Move the downloaded traineddata files into the shared directory. For example, move eng.traineddata and chi_tra.traineddata to C:\training_data on the host machine.
Move the traineddata files to the default training directory:
mv /Docker_Share/*.traineddata /usr/local/share/tessdata/
Now your training data is organized and ready for training the new model.
Navigate to the training directory:
cd /tesseract/tesstrain
If you have .box files and want to avoid overwriting them during the training process, modify the Makefile:
apt update && apt install nano
cd /tesseract/tesstrain
nano Makefile
Locate the lines starting with %.box and comment them out.
Original lines:
%.box: %.png %.gt.txt
PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.png" -t "$*.gt.txt" > "$@"
%.box: %.bin.png %.gt.txt
PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.bin.png" -t "$*.gt.txt" > "$@"
%.box: %.nrm.png %.gt.txt
PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.nrm.png" -t "$*.gt.txt" > "$@"
%.box: %.raw.png %.gt.txt
PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.raw.png" -t "$*.gt.txt" > "$@"
%.box: %.tif %.gt.txt
PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
Modified lines:
# %.box: %.png %.gt.txt
# PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.png" -t "$*.gt.txt" > "$@"
# %.box: %.bin.png %.gt.txt
# PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.bin.png" -t "$*.gt.txt" > "$@"
# %.box: %.nrm.png %.gt.txt
# PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.nrm.png" -t "$*.gt.txt" > "$@"
# %.box: %.raw.png %.gt.txt
# PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.raw.png" -t "$*.gt.txt" > "$@"
# %.box: %.tif %.gt.txt
# PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
Press Ctrl + O and then Enter to save the modified Makefile. Press Ctrl + X to exit the editor.
Start training a new model:
make training MODEL_NAME=[lang].[font] TESSDATA=/usr/local/share/tessdata
Replace [lang].[font] with the appropriate language and font information.
If you want to fine-tune an existing model, use the START_MODEL parameter:
make training MODEL_NAME=[lang].[font] START_MODEL=[lang] TESSDATA=/usr/local/share/tessdata
Replace [lang].[font] with the appropriate language and font information.
After training, you can find the traineddata of the new model in the default output path:
cd /tesseract/tesstrain/data/[lang].[font]
ls
Replace [lang].[font] with the appropriate language and font information.
Copy the traineddata of the new model to the shared directory:
cp /tesseract/tesstrain/data/[lang].[font]/[lang].[font].traineddata /Docker_Share
Replace [lang].[font] with the appropriate language and font information.
The traineddata file will now be available in the shared directory on your host machine.
For detailed steps and additional information, please refer to the following resources:
| [Compilation guide for various platforms | tessdoc](https://tesseract-ocr.github.io/tessdoc/Compiling.html) |