## Text recognition
### Data preparation
PaddleOCR supports two data formats: `lmdb`, used for training on public data sets and debugging algorithms, and `General Data`, used for training on your own data.

Please set up the dataset as follows:

The default storage path for training data is `PaddleOCR/train_data`. If you already have a data set on disk, just create a soft link to the data set directory:
```
ln -sf <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
```
* Data download

If you do not have a data set locally, you can download it from the official website: [icdar2015](http://rrc.cvc.uab.es/?ch=4&com=downloads). You can also refer to [DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) to download the lmdb format datasets required by the benchmark.

* Use your own dataset:

If you want to use your own data for training, organize it as described below.

- Training set

First put the training images in the same folder (train_images), and use a txt file (rec_gt_train.txt) to record each image path and its label.

* Note: by default, the image path and the image label must be separated by `\t`. Any other separator will cause a training error.
```
" Image file name             Image annotation "

train_data/train_0001.jpg	简单可依赖
train_data/train_0002.jpg	用科技让复杂的世界更简单
```
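
The tab-separated label format above can be produced and checked with a few lines of Python. This is only a sketch: the helper names `format_label_line`, `parse_label_line`, and `write_label_file` are hypothetical and are not part of PaddleOCR.

```python
from pathlib import Path
from typing import Iterable, Tuple


def format_label_line(image_path: str, label: str) -> str:
    """Join an image path and its label with a single tab."""
    if "\t" in image_path or "\t" in label:
        raise ValueError("paths and labels must not contain tab characters")
    return f"{image_path}\t{label}"


def parse_label_line(line: str) -> Tuple[str, str]:
    """Split one label-file line into (image_path, label)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one tab separator, got: {line!r}")
    return parts[0], parts[1]


def write_label_file(samples: Iterable[Tuple[str, str]], out_path: str) -> None:
    """Write (image_path, label) pairs as a UTF-8 label file."""
    lines = [format_label_line(path, label) for path, label in samples]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Example: a space-separated line is rejected, a tab-separated one is accepted.
print(parse_label_line("train_data/train_0001.jpg\t简单可依赖"))
```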
PaddleOCR provides label files for training on the icdar2015 dataset, which can be downloaded as follows:
```
# Training set label
wget -P ./train_data/ic15_data https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt
# Test set label
wget -P ./train_data/ic15_data https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt
```
The final training set should have the following file structure:
```
|-train_data
    |-ic15_data
        |- rec_gt_train.txt
        |- train
            |- word_001.png
            |- word_002.jpg
            |- word_003.jpg
            | ...
```
- Test set

As with the training set, the test set needs a folder containing all images (test) and a rec_gt_test.txt label file. The structure of the test set is as follows:
```
|-train_data
    |-ic15_data
        |- rec_gt_test.txt
        |- test
            |- word_001.jpg
            |- word_002.jpg
            |- word_003.jpg
            | ...
```
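
Before training, it can help to confirm that every image listed in a label file actually exists in the layout shown above. The sketch below assumes that the paths in the label file are relative to a root directory you pass in; `find_missing_images` is a hypothetical helper, not a PaddleOCR function.

```python
from pathlib import Path
from typing import List


def find_missing_images(label_file: str, image_root: str) -> List[str]:
    """Return image paths listed in label_file that do not exist under image_root."""
    missing = []
    with open(label_file, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # The image path is everything before the first tab.
            image_path = line.split("\t", 1)[0]
            if not (Path(image_root) / image_path).is_file():
                missing.append(image_path)
    return missing


# Example call (paths are assumptions based on the structure above):
# find_missing_images("train_data/ic15_data/rec_gt_test.txt", "train_data/ic15_data")
```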
- Dictionary

Finally, a dictionary ({word_dict_name}.txt) must be provided so that, during training, every character that appears can be mapped to a dictionary index.

The dictionary therefore needs to contain all the characters that you want to be recognized correctly. {word_dict_name}.txt must be written in the following format and saved with `utf-8` encoding:
```
l
d
a
d
r
n
```
In word_dict.txt, each line holds a single character, and the position of the line (counting from 0) is the numeric index of that character. With the dictionary above, "and" is mapped to [2 5 1].

`ppocr/utils/ppocr_keys_v1.txt` is a Chinese dictionary with 6623 characters, and `ppocr/utils/ic15_dict.txt` is an English dictionary with 36 characters. You can use them as needed.

To use a custom dictionary file, modify the `character_dict_path` field in `configs/rec/rec_icdar15_train.yml` and set `character_type` to `ch`.
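
As a rough illustration of the character-to-index mapping, the sketch below reproduces the "and" example from the six-line dictionary shown earlier. It is not PaddleOCR's own implementation: the function names are hypothetical, and resolving the duplicated `d` to its first occurrence is an assumption made to match the [2 5 1] result; the library may handle duplicates differently.

```python
from typing import Dict, List


def read_dict_file(path: str) -> List[str]:
    """Read a UTF-8 dictionary file with one character per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.rstrip("\n")]


def build_char_index(chars: List[str]) -> Dict[str, int]:
    """Map each character to the 0-based index of its first occurrence."""
    char_to_index: Dict[str, int] = {}
    for index, char in enumerate(chars):
        char_to_index.setdefault(char, index)
    return char_to_index


def encode(text: str, char_to_index: Dict[str, int]) -> List[int]:
    """Convert text to dictionary indexes; fail loudly on uncovered characters."""
    unknown = sorted(set(text) - set(char_to_index))
    if unknown:
        raise KeyError(f"characters missing from dictionary: {unknown}")
    return [char_to_index[c] for c in text]


example_dict = ["l", "d", "a", "d", "r", "n"]
print(encode("and", build_char_index(example_dict)))  # [2, 5, 1]
```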
### Start training
PaddleOCR provides training, evaluation, and prediction scripts. This section uses the CRNN recognition model as an example.

First download the pretrained model. You can download a trained model and fine-tune it on the icdar2015 data:
```
cd PaddleOCR/
# Download the pre-trained model of MobileNetV3
wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/rec_mv3_none_bilstm_ctc.tar
# Decompress model parameters
cd pretrain_models
tar -xf rec_mv3_none_bilstm_ctc.tar && rm -rf rec_mv3_none_bilstm_ctc.tar
```
Start training:
```
# Set PYTHONPATH
export PYTHONPATH=$PYTHONPATH:.
# GPU training supports single-card and multi-card training; specify the card numbers through CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Train on the icdar15 English data
python3 tools/train.py -c configs/rec/rec_icdar15_train.yml
```
PaddleOCR supports alternating training and evaluation. You can modify `eval_batch_step` in `configs/rec/rec_icdar15_train.yml` to set the evaluation frequency; by default, evaluation runs every 500 iterations. During evaluation, the model with the best accuracy is saved as `output/rec_CRNN/best_accuracy` by default.

If the validation set is large, evaluation will be time-consuming. It is recommended to reduce the evaluation frequency, or to evaluate after training.

* Tip: You can use the `-c` parameter to select any of the model configurations under the `configs/rec/` path for training. The recognition algorithms supported by PaddleOCR are:

| Configuration file | Algorithm name | backbone | trans | seq | pred |
| :--------: | :-------: | :-------: | :-------: | :-----: | :-----: |
| rec_chinese_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc |
| rec_icdar15_train.yml | CRNN | Mobilenet_v3 large 0.5 | None | BiLSTM | ctc |
| rec_mv3_none_bilstm_ctc.yml | CRNN | Mobilenet_v3 large 0.5 | None | BiLSTM | ctc |
| rec_mv3_none_none_ctc.yml | Rosetta | Mobilenet_v3 large 0.5 | None | None | ctc |
| rec_mv3_tps_bilstm_ctc.yml | STARNet | Mobilenet_v3 large 0.5 | tps | BiLSTM | ctc |
| rec_mv3_tps_bilstm_attn.yml | RARE | Mobilenet_v3 large 0.5 | tps | BiLSTM | attention |
| rec_r34_vd_none_bilstm_ctc.yml | CRNN | Resnet34_vd | None | BiLSTM | ctc |
| rec_r34_vd_none_none_ctc.yml | Rosetta | Resnet34_vd | None | None | ctc |
| rec_r34_vd_tps_bilstm_attn.yml | RARE | Resnet34_vd | tps | BiLSTM | attention |
| rec_r34_vd_tps_bilstm_ctc.yml | STARNet | Resnet34_vd | tps | BiLSTM | ctc |

For training on Chinese data, `rec_chinese_lite_train.yml` is recommended. If you want to try other algorithms on a Chinese data set, modify the configuration file as described below.

Take `rec_mv3_none_none_ctc.yml` as an example:
```
Global:
  ...
  # Modify image_shape to fit long text
  image_shape: [3, 32, 320]
  ...
  # Modify character type
  character_type: ch
  # Use a custom dictionary; to change the dictionary, point this path to the new dictionary file
  character_dict_path: ./ppocr/utils/ppocr_keys_v1.txt
  ...
  # Modify reader type
  reader_yml: ./configs/rec/rec_chinese_reader.yml
  ...

...
```
**Note that the configuration file used for prediction/evaluation must be the same as the one used for training.**
### Evaluation
The evaluation data set can be changed by modifying the `label_file_path` setting of `EvalReader` in `configs/rec/rec_icdar15_reader.yml`.
```
export CUDA_VISIBLE_DEVICES=0
# GPU evaluation; Global.checkpoints is the weight to be tested
python3 tools/eval.py -c configs/rec/rec_chinese_lite_train.yml -o Global.checkpoints={path/to/weights}/best_accuracy
```
### Prediction

* Training engine prediction

A model trained with PaddleOCR can be used for quick prediction with the following script.

The path of the image to predict is set in `infer_img`, and the weights are specified via `-o Global.checkpoints`:
```
# Predict English results
python3 tools/infer_rec.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints={path/to/weights}/best_accuracy TestReader.infer_img=doc/imgs_words/en/word_1.png
```
Input image:

![](./imgs_words/en/word_1.png)

Get the prediction result of the input image:
```
infer_img: doc/imgs_words/en/word_1.png
     index: [19 24 18 23 29]
     word : joint
```
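
The `index` line can be read back into the `word` line by looking each index up in the dictionary. The sketch below assumes that the 36-character English dictionary lists the digits 0-9 followed by the lowercase letters a-z; that order is an assumption, but it is consistent with the indexes printed above. `decode` is a hypothetical helper, not a PaddleOCR function.

```python
import string
from typing import Sequence

# Assumed dictionary order: digits 0-9, then lowercase a-z (36 characters).
ASSUMED_IC15_CHARS = string.digits + string.ascii_lowercase


def decode(indexes: Sequence[int], chars: str = ASSUMED_IC15_CHARS) -> str:
    """Map a sequence of dictionary indexes back to text."""
    return "".join(chars[i] for i in indexes)


print(decode([19, 24, 18, 23, 29]))  # joint
```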
The configuration file used for prediction must be the same as the one used for training. For example, if you trained the Chinese model with `python3 tools/train.py -c configs/rec/rec_chinese_lite_train.yml`, you can use the following command to predict with the Chinese model:
```
# Predict Chinese results
python3 tools/infer_rec.py -c configs/rec/rec_chinese_lite_train.yml -o Global.checkpoints={path/to/weights}/best_accuracy TestReader.infer_img=doc/imgs_words/ch/word_1.jpg
```
Input image:

![](./imgs_words/ch/word_1.jpg)

Get the prediction result of the input image:
```
infer_img: doc/imgs_words/ch/word_1.jpg
     index: [2092  177  312 2503]
     word : 韩国小馆
```