Merge pull request #923 from tink2123/add_dict_folder
Add dict and corpus folder
commit 65a472cd7c
@@ -12,7 +12,7 @@ Global:
   image_shape: [3, 32, 320]
   max_text_length: 25
   character_type: french
-  character_dict_path: ./ppocr/utils/french_dict.txt
+  character_dict_path: ./ppocr/utils/dict/french_dict.txt
   loss_type: ctc
   distort: true
   use_space_char: false
@@ -12,7 +12,7 @@ Global:
   image_shape: [3, 32, 320]
   max_text_length: 25
   character_type: german
-  character_dict_path: ./ppocr/utils/german_dict.txt
+  character_dict_path: ./ppocr/utils/dict/german_dict.txt
   loss_type: ctc
   distort: true
   use_space_char: false
@@ -12,7 +12,7 @@ Global:
   image_shape: [3, 32, 320]
   max_text_length: 25
   character_type: japan
-  character_dict_path: ./ppocr/utils/japan_dict.txt
+  character_dict_path: ./ppocr/utils/dict/japan_dict.txt
   loss_type: ctc
   distort: true
   use_space_char: false
@@ -12,7 +12,7 @@ Global:
   image_shape: [3, 32, 320]
   max_text_length: 25
   character_type: korean
-  character_dict_path: ./ppocr/utils/korean_dict.txt
+  character_dict_path: ./ppocr/utils/dict/korean_dict.txt
   loss_type: ctc
   distort: true
   use_space_char: false
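
The four config hunks above make the same one-line change: the language dictionary moves from `ppocr/utils/` into `ppocr/utils/dict/`. For configs or scripts outside this commit that still use the old paths, the rewrite could be sketched roughly as below. The helper name `migrate_dict_paths` and the `MOVED_LANGUAGES` tuple are hypothetical and not part of PaddleOCR; the sketch assumes only the four dictionaries touched in this commit have moved.

```python
import re

# Assumption: only these four language dictionaries moved into ppocr/utils/dict/,
# as shown in the hunks of this commit. Other dictionary files are left alone.
MOVED_LANGUAGES = ("french", "german", "japan", "korean")

# Matches an old-style path such as "ppocr/utils/french_dict.txt".
_OLD_PATH = re.compile(
    r"(ppocr/utils/)((?:%s)_dict\.txt)" % "|".join(MOVED_LANGUAGES)
)


def migrate_dict_paths(text: str) -> str:
    """Rewrite old dictionary paths in `text` to the new dict/ subfolder.

    Paths that already point into ppocr/utils/dict/ do not match the
    pattern, so running the function twice changes nothing further.
    """
    return _OLD_PATH.sub(r"\1dict/\2", text)
```

Applied to the text of a config file, this would turn `./ppocr/utils/korean_dict.txt` into `./ppocr/utils/dict/korean_dict.txt` and leave every other line untouched.
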
@@ -221,11 +221,11 @@ demo/cxx/ocr/
 1. ppocr_keys_v1.txt is the Chinese dictionary file. If the nb model you use is for English and digits or for another language, replace it with the dictionary for that language.
 PaddleOCR stores several dictionaries under ppocr/utils/, including:
 ```
-french_dict.txt # French dictionary
-german_dict.txt # German dictionary
+dict/french_dict.txt # French dictionary
+dict/german_dict.txt # German dictionary
 ic15_dict.txt # English dictionary
-japan_dict.txt # Japanese dictionary
-korean_dict.txt # Korean dictionary
+dict/japan_dict.txt # Japanese dictionary
+dict/korean_dict.txt # Korean dictionary
 ppocr_keys_v1.txt # Chinese dictionary
 ```
 
@@ -185,11 +185,11 @@ demo/cxx/ocr/
 If the nb model is used for English recognition or other language recognition, dictionary file should be replaced with a dictionary of the corresponding language.
 PaddleOCR provides a variety of dictionaries under ppocr/utils/, including:
 ```
-french_dict.txt # french
-german_dict.txt # german
+dict/french_dict.txt # french
+dict/german_dict.txt # german
 ic15_dict.txt # english
-japan_dict.txt # japan
-korean_dict.txt # korean
+dict/japan_dict.txt # japan
+dict/korean_dict.txt # korean
 ppocr_keys_v1.txt # chinese
 ```
 
@@ -325,7 +325,7 @@ python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words_en/word_336.png
 Specify the font used for visualization with `--vis_font_path`. Fonts for minority languages are provided by default under the `doc/` path, for example for Korean recognition:
 
 ```
-python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/utils/korean_dict.txt" --vis_font_path="doc/korean.ttf"
+python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/utils/dict/korean_dict.txt" --vis_font_path="doc/korean.ttf"
 ```
 ![](../imgs_words/korean/1.jpg)
 
@@ -120,19 +120,19 @@ word_dict.txt has one character per line, mapping characters to numeric indices,
 
 `ppocr/utils/ic15_dict.txt` is an English dictionary with 36 characters,
 
-`ppocr/utils/french_dict.txt` is a French dictionary with 118 characters
+`ppocr/utils/dict/french_dict.txt` is a French dictionary with 118 characters
 
-`ppocr/utils/japan_dict.txt` is a French dictionary with 4399 characters
+`ppocr/utils/dict/japan_dict.txt` is a French dictionary with 4399 characters
 
-`ppocr/utils/korean_dict.txt` is a French dictionary with 3636 characters
+`ppocr/utils/dict/korean_dict.txt` is a French dictionary with 3636 characters
 
-`ppocr/utils/german_dict.txt` is a French dictionary with 131 characters
+`ppocr/utils/dict/german_dict.txt` is a French dictionary with 131 characters
 
 
 You can use them as needed.
 
 The current multi-language models are still at the demo stage; we will keep optimizing the models and adding languages. **You are very welcome to provide us with dictionaries and fonts for other languages**,
-If you like, you can submit dictionary files to [utils](../../ppocr/utils), and we will thank you in the Repo.
+If you like, you can submit dictionary files to [dict](../../ppocr/utils/dict) and corpus files to [corpus](../../ppocr/utils/corpus), and we will thank you in the Repo.
 
 - Custom dictionary
 
@@ -269,7 +269,7 @@ PaddleOCR also provides multi-language support; under the `configs/rec/multi_languages` path
 Global:
   ...
   # Add a custom dictionary; if you modify the dictionary, point the path to the new one
-  character_dict_path: ./ppocr/utils/french_dict.txt
+  character_dict_path: ./ppocr/utils/dict/french_dict.txt
   # Add data augmentation during training
   distort: true
   # Recognize spaces
@@ -330,7 +330,7 @@ If you need to predict other language models, when using inference model predict
 You need to specify the visual font path through `--vis_font_path`. There are small language fonts provided by default under the `doc/` path, such as Korean recognition:
 
 ```
-python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/ utils/korean_dict.txt" --vis_font_path="doc/korean.ttf"
+python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/korean/1.jpg" --rec_model_dir="./your inference model" --rec_char_type="korean" --rec_char_dict_path="ppocr/utils/dict/korean_dict.txt" --vis_font_path="doc/korean.ttf"
 ```
 ![](../imgs_words/korean/1.jpg)
 
@@ -112,18 +112,18 @@ In `word_dict.txt`, there is a single word in each line, which maps characters a
 
 `ppocr/utils/ic15_dict.txt` is an English dictionary with 63 characters
 
-`ppocr/utils/french_dict.txt` is a French dictionary with 118 characters
+`ppocr/utils/dict/french_dict.txt` is a French dictionary with 118 characters
 
-`ppocr/utils/japan_dict.txt` is a French dictionary with 4399 characters
+`ppocr/utils/dict/japan_dict.txt` is a French dictionary with 4399 characters
 
-`ppocr/utils/korean_dict.txt` is a French dictionary with 3636 characters
+`ppocr/utils/dict/korean_dict.txt` is a French dictionary with 3636 characters
 
-`ppocr/utils/german_dict.txt` is a French dictionary with 131 characters
+`ppocr/utils/dict/german_dict.txt` is a French dictionary with 131 characters
 
 You can use it on demand.
 
 The current multi-language model is still in the demo stage and will continue to optimize the model and add languages. **You are very welcome to provide us with dictionaries and fonts in other languages**,
-If you like, you can submit the dictionary file to [utils](../../ppocr/utils) and we will thank you in the Repo.
+If you like, you can submit the dictionary file to [dict](../../ppocr/utils/dict) or corpus file to [corpus](../../ppocr/utils/corpus) and we will thank you in the Repo.
 
 
 To customize the dict file, please modify the `character_dict_path` field in `configs/rec/rec_icdar15_train.yml` and set `character_type` to `ch`.
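
The dictionary format this hunk documents (one character per line, with each character mapped to a numeric index) can be illustrated with a small loader. This is a minimal sketch under stated assumptions, not PaddleOCR's own loading code: the function names `build_char_index` and `load_char_dict` are hypothetical, indices are taken to be zero-based line positions, and blank or duplicate lines are skipped. The real training code may number characters differently, for example by reserving an index for a special token.

```python
from typing import Dict, Iterable


def build_char_index(lines: Iterable[str]) -> Dict[str, int]:
    """Map each dictionary entry to a zero-based index in file order.

    Assumptions (not taken from PaddleOCR): blank lines are ignored and
    a repeated character keeps the index of its first occurrence.
    """
    mapping: Dict[str, int] = {}
    for line in lines:
        char = line.rstrip("\r\n")  # strip only the line terminator
        if char and char not in mapping:
            mapping[char] = len(mapping)
    return mapping


def load_char_dict(path: str, encoding: str = "utf-8") -> Dict[str, int]:
    """Read a dictionary file such as ppocr/utils/dict/french_dict.txt."""
    with open(path, encoding=encoding) as handle:
        return build_char_index(handle)
```

Calling `load_char_dict("ppocr/utils/dict/korean_dict.txt")` from the repository root would return the character-to-index mapping for the Korean dictionary, provided the file exists at the new location introduced by this commit.
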
@@ -259,7 +259,7 @@ Global:
   ...
   # Add a custom dictionary, if you modify the dictionary
   # please point the path to the new dictionary
-  character_dict_path: ./ppocr/utils/french_dict.txt
+  character_dict_path: ./ppocr/utils/dict/french_dict.txt
   # Add data augmentation during training
   distort: true
   # Identify spaces
@@ -0,0 +1,6 @@
+# Waiting for your contribution
+
+PaddleOCR welcomes you to provide multilingual corpus for us to synthesize more data to optimize the model.
+
+If you are interested, you can submit the corpus text to this directory and name it with {language}_corpus.txt.
+PaddleOCR thanks for your contribution.
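
The naming rule in the new README (`{language}_corpus.txt`) is simple enough to check mechanically before submitting a file. The sketch below is only an illustration: `corpus_language` is a hypothetical helper, and it assumes the language part consists of ASCII letters, which the README does not specify.

```python
import re
from typing import Optional

# Assumption: the {language} part is made of ASCII letters only,
# e.g. "french" or "korean". The README does not define this precisely.
_CORPUS_NAME = re.compile(r"(?P<language>[A-Za-z]+)_corpus\.txt")


def corpus_language(filename: str) -> Optional[str]:
    """Return the language part of a corpus file name, or None if the
    name does not follow the {language}_corpus.txt convention."""
    match = _CORPUS_NAME.fullmatch(filename)
    return match.group("language") if match else None
```

A contributor could run this on a file name before opening a pull request; a `None` result means the name should be changed to match the convention.
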
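@@ -0,0 +1,8 @@
+# Corpus contributions welcome
+
+PaddleOCR warmly welcomes multilingual corpus contributions, which we use to synthesize more data and optimize the models.
+
+If you are interested, you can submit corpus text to this directory, named {language}_corpus.txt. The PaddleOCR team thanks you for your contribution.
+
+
+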