English | [中文](README_CN.md)
# DBNet and DBNet++
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->
> DBNet: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
> DBNet++: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)
## 1. Introduction
### DBNet
DBNet is a segmentation-based scene text detection method. Segmentation-based methods are gaining popularity for scene
text detection purposes as they can more accurately describe scene text of various shapes, such as curved text.
The drawback of current segmentation-based SOTA methods is binarization post-processing (converting probability
maps into text bounding boxes), which often requires a manually set threshold (reducing prediction accuracy)
and complex pixel-grouping algorithms (incurring a considerable time cost during inference).
To eliminate the problems described above, DBNet integrates an adaptive threshold called Differentiable Binarization (DB)
into the architecture. DB simplifies post-processing and enhances the performance of text detection. Moreover, it can be
removed in the inference stage without sacrificing performance.[[1](#references)]
<p align="center"><img alt="Figure 1. Overall DBNet architecture" src="https://user-images.githubusercontent.com/16683750/225589619-d50c506c-e903-4f59-a316-8b62586c73a9.png" width="800"/></p>
<p align="center"><em>Figure 1. Overall DBNet architecture</em></p>
The overall architecture of DBNet is presented in _Figure 1._ It consists of multiple stages:
1. Feature extraction from a backbone at different scales. ResNet-50 is used as a backbone, and features are extracted
from stages 2, 3, 4, and 5.
2. The extracted features are upscaled and summed up with the previous stage features in a cascade fashion.
3. The resulting features are upscaled once again to match the size of the largest feature map (from stage 2) and
concatenated along the channel axis.
4. Then, the final feature map (shown in dark blue) is used to predict both the probability and threshold maps by
applying a 3×3 convolutional operator and two de-convolutional operators with stride 2.
5. The probability and threshold maps are merged into one approximate binary map by the Differentiable binarization
module. The approximate binary map is used to generate text bounding boxes.
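The Differentiable Binarization step in the last stage is a simple element-wise function of the probability map P and threshold map T. A minimal NumPy sketch (the amplifying factor k = 50 is the value used in the paper; function and variable names are illustrative):

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map: B = 1 / (1 + exp(-k * (P - T))).

    Unlike a hard threshold step, this sigmoid-like function is
    differentiable, so the threshold map T can be learned jointly
    with the probability map P during training.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Pixels well above the learned threshold map to ~1, well below to ~0.
prob = np.array([[0.9, 0.1]])
thresh = np.array([[0.5, 0.5]])
binary = differentiable_binarization(prob, thresh)
```

Because k is large, the function closely approximates a hard binarization at inference time, which is why the DB module can be dropped after training.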
### DBNet++
DBNet++ is an extension of DBNet and thus replicates its architecture. The only difference is that instead of
concatenating the extracted and scaled backbone features as DBNet does, DBNet++ fuses those features adaptively with
the Adaptive Scale Fusion (ASF) module (Figure 2). ASF improves the scale robustness of the network by
fusing features of different scales adaptively, which distinctly strengthens DBNet++'s ability to detect text
instances of diverse scales.[[2](#references)]
<p align="center"><img alt="Figure 2. Overall DBNet++ architecture" src="https://user-images.githubusercontent.com/16683750/236786997-13823b9c-ecaa-4bc5-8037-71299b3baffe.png" width="800"/></p>
<p align="center"><em>Figure 2. Overall DBNet++ architecture</em></p>
<p align="center"><img alt="Figure 3. Detailed architecture of the Adaptive Scale Fusion module" src="https://user-images.githubusercontent.com/16683750/236787093-c0c78d8f-e4f4-4c5e-8259-7120a14b0e31.png" width="700"/></p>
<p align="center"><em>Figure 3. Detailed architecture of the Adaptive Scale Fusion module</em></p>
ASF consists of two attention modules, stage-wise attention and spatial attention, where the latter is integrated into
the former as shown in Figure 3. The stage-wise attention module learns the weights of the feature maps of
different scales, while the spatial attention module learns the attention across the spatial dimensions. The
combination of these two modules leads to scale-robust feature fusion.
DBNet++ performs better in detecting text instances of diverse scales, especially for large-scale text instances where
DBNet may generate inaccurate or discrete bounding boxes.
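The core idea of stage-wise weighting can be reduced to a per-pixel softmax over the stages. A simplified NumPy sketch (the real ASF module uses learned convolutions and a sigmoid spatial-attention branch; the softmax fusion below is an illustrative reduction, and all names are assumptions):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_scale_fusion(features, stage_logits):
    """Fuse N same-size stage feature maps with per-pixel stage weights.

    features:     (N, C, H, W) upscaled feature maps from N backbone stages.
    stage_logits: (N, H, W) attention logits, e.g. produced by a small
                  convolutional attention head (not shown here).
    Returns a (C, H, W) weighted fusion instead of plain concatenation.
    """
    weights = softmax(stage_logits, axis=0)            # per-pixel weights over stages
    return (features * weights[:, None, :, :]).sum(axis=0)

# Two 1-channel, 1x1 stage maps with equal logits fuse to their mean.
feats = np.array([[[[1.0]]], [[[3.0]]]])   # (2, 1, 1, 1)
logits = np.zeros((2, 1, 1))
fused = adaptive_scale_fusion(feats, logits)  # -> [[[2.0]]]
```

In contrast to channel-wise concatenation, this lets the network emphasize the stage whose receptive field best matches the local text scale at each spatial location.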
## 2. General purpose models
Here we present general purpose models that were trained on a wide variety of tasks (real-world photos, street views, documents, etc.) and challenges (straight texts, curved texts, long text lines, etc.) with two primary languages: Chinese and English. These models can be used off the shelf in your applications or to initialize your own models.
The models were trained on 12 public datasets (CTW, LSVT, RCTW-17, TextOCR, etc.) that contain a wide range of images. The training set has 153,511 images and the validation set has 9,786 images.<br/>
The test set consists of 598 images manually selected from the above-mentioned datasets.
<div align="center">
| **Model** | **Context** | **Backbone** | **Languages** | **F-score on Our Test Set** | **Throughput** | **Download** |
|-----------|----------------|--------------|-------------------|:---------------------------:|----------------|----------------------------------------------------------------------------------------------------------|
| DBNet | D910x8-MS2.0-G | ResNet-50 | Chinese + English | 83.41% | 256 img/s | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50_ch_en_general-a5dbb141.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50_ch_en_general-a5dbb141-912f0a90.mindir) |
| DBNet++ | D910x4-MS2.0-G | ResNet-50 | Chinese + English | 84.30% | 104 img/s | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnetpp_resnet50_ch_en_general-884ba5b9.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnetpp_resnet50_ch_en_general-884ba5b9-b3f52398.mindir) |
</div>
> The input shapes of the exported DBNet and DBNet++ MindIR files in the links above are `(1,3,736,1280)` and `(1,3,1152,2048)`, respectively.
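Since the exported MindIR expects a fixed input shape, images must be resized and normalized to NCHW float32 before inference. A minimal NumPy sketch (the ImageNet mean/std values and the nearest-neighbor resize are illustrative assumptions; the actual MindOCR preprocessing pipeline may differ):

```python
import numpy as np

def prepare_input(image, target_h=736, target_w=1280):
    """Convert an HWC uint8 image to the fixed (1, 3, H, W) float32 input.

    Uses index-mapping nearest-neighbor resize to stay dependency-free;
    a real pipeline would typically use a proper interpolating resize.
    """
    h, w = image.shape[:2]
    ys = (np.arange(target_h) * h // target_h).clip(0, h - 1)
    xs = (np.arange(target_w) * w // target_w).clip(0, w - 1)
    resized = image[ys][:, xs].astype(np.float32) / 255.0
    # ImageNet normalization (assumed values).
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    normed = (resized - mean) / std
    return normed.transpose(2, 0, 1)[None]  # HWC -> (1, 3, H, W)

batch = prepare_input(np.zeros((100, 200, 3), dtype=np.uint8))
# batch.shape == (1, 3, 736, 1280), matching the DBNet MindIR input shape
```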
## 3. Results
DBNet and DBNet++ were trained on the ICDAR2015, MSRA-TD500, SCUT-CTW1500, Total-Text, and MLT2017 datasets. In addition, we conducted pre-training on the SynthText dataset and provide a download link for the pretrained weights. All training results are as follows:
### ICDAR2015
<div align="center">
| **Model** | **Context** | **Backbone** | **Pretrained** | **Recall** | **Precision** | **F-score** | **Train T.** | **Throughput** | **Recipe** | **Download** |
|---------------------|----------------|---------------|----------------|------------|---------------|-------------|--------------|----------------|-------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DBNet | D910x1-MS2.0-G | MobileNetV3 | ImageNet | 76.31% | 78.27% | 77.28% | 10 s/epoch | 100 img/s | [yaml](db_mobilenetv3_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_mobilenetv3-62c44539.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_mobilenetv3-62c44539-f14c6a13.mindir) |
| DBNet               | D910x8-MS2.3-G | MobileNetV3   | ImageNet       | 76.22%     | 77.98%        | 77.09%      | 1.1 s/epoch  | 960 img/s      | [yaml](db_mobilenetv3_icdar15_8p.yaml) | Coming soon |