Merge pull request #923 from tink2123/add_dict_folder

Add dict and corpus folder
xiaoting 2020-10-13 15:25:12 +08:00 committed by GitHub
commit 65a472cd7c
16 changed files with 40 additions and 26 deletions

@ -12,7 +12,7 @@ Global:
image_shape: [3, 32, 320]
max_text_length: 25
character_type: french
character_dict_path: ./ppocr/utils/french_dict.txt
character_dict_path: ./ppocr/utils/dict/french_dict.txt
loss_type: ctc
distort: true
use_space_char: false

@ -12,7 +12,7 @@ Global:
image_shape: [3, 32, 320]
max_text_length: 25
character_type: german
character_dict_path: ./ppocr/utils/german_dict.txt
character_dict_path: ./ppocr/utils/dict/german_dict.txt
loss_type: ctc
distort: true
use_space_char: false

@ -12,7 +12,7 @@ Global:
image_shape: [3, 32, 320]
max_text_length: 25
character_type: japan
character_dict_path: ./ppocr/utils/japan_dict.txt
character_dict_path: ./ppocr/utils/dict/japan_dict.txt
loss_type: ctc
distort: true
use_space_char: false

@ -12,7 +12,7 @@ Global:
image_shape: [3, 32, 320]
max_text_length: 25
character_type: korean
character_dict_path: ./ppocr/utils/korean_dict.txt
character_dict_path: ./ppocr/utils/dict/korean_dict.txt
loss_type: ctc
distort: true
use_space_char: false
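
The four config hunks above make the same change: the dictionary file moves from `ppocr/utils/` into `ppocr/utils/dict/`. As a rough sketch of how a local config could be brought in line with that change, the snippet below rewrites the old path form into the new one. The helper name `migrate_dict_path` and the `MOVED` tuple are hypothetical and not part of PaddleOCR; the language list is taken only from the hunks shown here.

```python
import re

# Languages whose dictionaries move in the hunks above (assumption: only these four).
MOVED = ("french", "german", "japan", "korean")

# Matches the old location, e.g. ppocr/utils/french_dict.txt
_OLD_PATH = re.compile(r"ppocr/utils/(%s)_dict\.txt" % "|".join(MOVED))


def migrate_dict_path(text: str) -> str:
    """Rewrite ppocr/utils/<lang>_dict.txt to ppocr/utils/dict/<lang>_dict.txt.

    Paths that already point into dict/ do not match the pattern,
    so running the function twice changes nothing.
    """
    return _OLD_PATH.sub(r"ppocr/utils/dict/\1_dict.txt", text)


old_line = "character_dict_path: ./ppocr/utils/french_dict.txt"
print(migrate_dict_path(old_line))
```

Dictionaries that stay in place in this commit, such as `ic15_dict.txt`, are left untouched by the pattern.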

@ -221,11 +221,11 @@ demo/cxx/ocr/
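1. ppocr_keys_v1.txt is the Chinese dictionary file. If the nb model in use recognizes English, digits, or another language, replace it with the dictionary for that language.
PaddleOCR stores several dictionaries under ppocr/utils/, including: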
```
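french_dict.txt # French dictionary
german_dict.txt # German dictionary
dict/french_dict.txt # French dictionary
dict/german_dict.txt # German dictionary
ic15_dict.txt # English dictionary
japan_dict.txt # Japanese dictionary
korean_dict.txt # Korean dictionary
dict/japan_dict.txt # Japanese dictionary
dict/korean_dict.txt # Korean dictionary
ppocr_keys_v1.txt # Chinese dictionary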
```

@ -185,11 +185,11 @@ demo/cxx/ocr/
If the nb model is used to recognize English or another language, the dictionary file should be replaced with the dictionary for that language.
PaddleOCR provides a variety of dictionaries under ppocr/utils/, including:
```
french_dict.txt # french
german_dict.txt # german
dict/french_dict.txt # french
dict/german_dict.txt # german
ic15_dict.txt # english
japan_dict.txt # japan
korean_dict.txt # korean
dict/japan_dict.txt # japan
dict/korean_dict.txt # korean
ppocr_keys_v1.txt # chinese
```
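
To make the new layout concrete, here is a minimal sketch that maps a language label to the dictionary path listed above. The `DICT_FILES` table and the `dict_path` helper are illustrative assumptions, not part of PaddleOCR; the file names and labels are copied from the list in this hunk.

```python
from pathlib import PurePosixPath

UTILS_DIR = PurePosixPath("ppocr/utils")

# Relative locations after this commit, as listed in the hunk above.
DICT_FILES = {
    "french": "dict/french_dict.txt",
    "german": "dict/german_dict.txt",
    "english": "ic15_dict.txt",
    "japan": "dict/japan_dict.txt",
    "korean": "dict/korean_dict.txt",
    "chinese": "ppocr_keys_v1.txt",
}


def dict_path(language: str) -> str:
    """Return the dictionary path for a language label from the list above."""
    try:
        return str(UTILS_DIR / DICT_FILES[language])
    except KeyError:
        raise ValueError(f"no dictionary listed for {language!r}") from None


print(dict_path("korean"))
```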

@ -325,7 +325,7 @@ python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words_en/word_336.png
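You need to specify the font used for visualization via `--vis_font_path`. Fonts for less widely used languages are provided by default under the `doc/` path, for example for Korean recognition: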
```
python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/utils/korean_dict.txt" --vis_font_path="doc/korean.ttf"
python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/utils/dict/korean_dict.txt" --vis_font_path="doc/korean.ttf"
```
![](../imgs_words/korean/1.jpg)

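@ -120,19 +120,19 @@ word_dict.txt has one character per line, mapping characters to numeric indexes,
`ppocr/utils/ic15_dict.txt` is an English dictionary containing 36 characters
`ppocr/utils/french_dict.txt` is a French dictionary containing 118 characters
`ppocr/utils/dict/french_dict.txt` is a French dictionary containing 118 characters
`ppocr/utils/japan_dict.txt` is a Japanese dictionary containing 4399 characters
`ppocr/utils/dict/japan_dict.txt` is a Japanese dictionary containing 4399 characters
`ppocr/utils/korean_dict.txt` is a Korean dictionary containing 3636 characters
`ppocr/utils/dict/korean_dict.txt` is a Korean dictionary containing 3636 characters
`ppocr/utils/german_dict.txt` is a German dictionary containing 131 characters
`ppocr/utils/dict/german_dict.txt` is a German dictionary containing 131 characters
You can use them as needed.
The current multilingual models are still at the demo stage; we will keep optimizing the models and adding languages. **You are very welcome to provide us with dictionaries and fonts for other languages.**
If you are willing, you can submit dictionary files to [utils](../../ppocr/utils), and we will thank you in the Repo.
If you are willing, you can submit dictionary files to [dict](../../ppocr/utils/dict) and corpus files to [corpus](../../ppocr/utils/corpus), and we will thank you in the Repo.
- Custom dictionary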
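@ -269,7 +269,7 @@ PaddleOCR also provides multilingual, under the `configs/rec/multi_languages` path
Global:
...
# Add a custom dictionary; if you modify the dictionary, point the path to the new dictionary
character_dict_path: ./ppocr/utils/french_dict.txt
character_dict_path: ./ppocr/utils/dict/french_dict.txt
# Add data augmentation during training
distort: true
# Recognize spaces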

@ -330,7 +330,7 @@ If you need to predict other language models, when using inference model predict
You need to specify the font used for visualization through `--vis_font_path`. Fonts for less widely used languages are provided by default under the `doc/` path, for example for Korean recognition:
```
python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/ utils/korean_dict.txt" --vis_font_path="doc/korean.ttf"
python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/utils/dict/korean_dict.txt" --vis_font_path="doc/korean.ttf"
```
![](../imgs_words/korean/1.jpg)
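
The same invocation can be assembled programmatically, which makes it harder to mistype the dictionary path. This is only a sketch: `build_rec_command` is a hypothetical helper, the flags are the ones shown in the command above, and it assumes the font and dictionary follow the `<language>` naming used in that command.

```python
import shlex


def build_rec_command(image_dir, rec_model_dir, language, font_path):
    """Build the predict_rec.py argument list for a language-specific model.

    Assumes the dictionary lives at ppocr/utils/dict/<language>_dict.txt,
    which holds for the four dictionaries moved in this commit.
    """
    return [
        "python3",
        "tools/infer/predict_rec.py",
        f"--image_dir={image_dir}",
        f"--rec_model_dir={rec_model_dir}",
        f"--rec_char_type={language}",
        f"--rec_char_dict_path=ppocr/utils/dict/{language}_dict.txt",
        f"--vis_font_path={font_path}",
    ]


cmd = build_rec_command(
    "./doc/imgs_words/korean/1.jpg",
    "./your inference model",  # placeholder from the original command
    "korean",
    "doc/korean.ttf",
)
# shlex.join quotes arguments containing spaces so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether.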

@ -112,18 +112,18 @@ In `word_dict.txt`, there is a single word in each line, which maps characters a
`ppocr/utils/ic15_dict.txt` is an English dictionary with 36 characters
`ppocr/utils/french_dict.txt` is a French dictionary with 118 characters
`ppocr/utils/dict/french_dict.txt` is a French dictionary with 118 characters
`ppocr/utils/japan_dict.txt` is a Japanese dictionary with 4399 characters
`ppocr/utils/dict/japan_dict.txt` is a Japanese dictionary with 4399 characters
`ppocr/utils/korean_dict.txt` is a Korean dictionary with 3636 characters
`ppocr/utils/dict/korean_dict.txt` is a Korean dictionary with 3636 characters
`ppocr/utils/german_dict.txt` is a German dictionary with 131 characters
`ppocr/utils/dict/german_dict.txt` is a German dictionary with 131 characters
You can use them as needed.
The current multi-language models are still at the demo stage; we will continue to optimize the models and add languages. **You are very welcome to provide us with dictionaries and fonts in other languages.**
If you like, you can submit the dictionary file to [utils](../../ppocr/utils) and we will thank you in the Repo.
If you like, you can submit the dictionary file to [dict](../../ppocr/utils/dict) or the corpus file to [corpus](../../ppocr/utils/corpus), and we will thank you in the Repo.
To customize the dict file, please modify the `character_dict_path` field in `configs/rec/rec_icdar15_train.yml` and set `character_type` to `ch`.
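
The hunk above describes a dictionary file as one character per line, with each character mapped to a numeric index. A minimal loader under that description might look like the sketch below. `load_char_dict` is a hypothetical helper, and the choice to number characters from 0 in file order is an assumption; PaddleOCR's own loader may reserve extra indexes (for example for a CTC blank).

```python
def load_char_dict(path):
    """Read a one-character-per-line dictionary and map each character to its line index.

    Only the line terminator is stripped, so a line holding a single space
    is kept as a valid character. Empty lines are skipped.
    """
    with open(path, encoding="utf-8") as f:
        chars = [line.rstrip("\r\n") for line in f]
    chars = [c for c in chars if c != ""]
    return {c: i for i, c in enumerate(chars)}
```

Calling `load_char_dict("ppocr/utils/dict/french_dict.txt")` from the repository root would then return the mapping for the French dictionary, assuming the file is laid out as described.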
@ -259,7 +259,7 @@ Global:
...
# Add a custom dictionary, if you modify the dictionary
# please point the path to the new dictionary
character_dict_path: ./ppocr/utils/french_dict.txt
character_dict_path: ./ppocr/utils/dict/french_dict.txt
# Add data augmentation during training
distort: true
# Identify spaces

@ -0,0 +1,6 @@
# Waiting for your contribution
PaddleOCR welcomes contributions of multilingual corpora, which we use to synthesize more data and optimize the models.
If you are interested, you can submit corpus text to this directory, named {language}_corpus.txt.
The PaddleOCR team thanks you for your contribution.
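
As a small illustration of the `{language}_corpus.txt` naming rule, the sketch below derives the file name from a language label. `corpus_filename` is a hypothetical helper; lowercasing the label and restricting it to letters, digits, `_` and `-` are assumptions, not rules stated in this README.

```python
import re

# Assumption: labels are short identifiers such as "french" or "korean".
_LABEL = re.compile(r"[a-z0-9_-]+")


def corpus_filename(language: str) -> str:
    """Return the corpus file name for a language, following {language}_corpus.txt."""
    label = language.strip().lower()
    if not _LABEL.fullmatch(label):
        raise ValueError(f"unsupported language label: {language!r}")
    return f"{label}_corpus.txt"


print(corpus_filename("French"))
```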

@ -0,0 +1,8 @@
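# Corpus contributions welcome
PaddleOCR warmly welcomes contributions of multilingual corpora, which we use to synthesize more data and optimize the models.
If you are interested, you can submit corpus text to this directory, named {language}_corpus.txt. The PaddleOCR team thanks you for your contribution.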