# hocr-spec-python
Validate hOCR documents closely against the hOCR specification.
<!-- BEGIN-MARKDOWN-TOC -->
* [Rationale](#rationale)
* [Installation](#installation)
* [Command line interface](#command-line-interface)
* [API example](#api-example)
<!-- END-MARKDOWN-TOC -->
## Rationale
[hOCR](https://github.com/kba/hocr-spec) is a flavor of HTML for encoding
the results of Optical Character Recognition (OCR) engines. It is supported by
most OCR engines, such as
[tesseract](https://github.com/tesseract-ocr/tesseract),
[ocropus/ocropy](https://github.com/tmbdev/ocropy) and
[kraken](https://github.com/mittagessen/kraken).
The [hOCR specification](https://github.com/kba/hocr-spec) is at once
very simple (hOCR is just HTML) and hard to implement, owing to its terseness and
lack of up-to-date code samples. This project aims to implement the rules
defined by the spec from the ground up, serving as both a validation tool and a
reference implementation. It is meant to help hOCR implementers and to support
tools like [hocr-tools](https://github.com/tmbdev/hocr-tools).
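
To make the "hOCR is just HTML" point concrete, here is a minimal sketch that uses only the Python standard library, not this package. The sample markup, the `HocrElementCollector` class and the `parse_bbox` helper are hypothetical names made up for illustration. The sketch relies on the spec's convention that typed elements (`ocr_page`, `ocr_line`, `ocrx_word`) carry properties such as `bbox` in their `title` attribute.

```python
from html.parser import HTMLParser

# Hypothetical sample document, kept deliberately small.
SAMPLE_HOCR = """<html><body>
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 50">
    <span class="ocrx_word" title="bbox 10 20 120 50">Hello</span>
    <span class="ocrx_word" title="bbox 130 20 300 50">world</span>
  </span>
</div>
</body></html>"""


def parse_bbox(title):
    """Return the bbox property of a title attribute as a tuple of ints, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class HocrElementCollector(HTMLParser):
    """Collect (class name, bbox) pairs for every element with an ocr* class."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        title = attributes.get("title") or ""
        for class_name in (attributes.get("class") or "").split():
            if class_name.startswith("ocr"):
                self.elements.append((class_name, parse_bbox(title)))


collector = HocrElementCollector()
collector.feed(SAMPLE_HOCR)
for class_name, bbox in collector.elements:
    print(class_name, bbox)
```

Reading the markup is the easy part, as the sketch suggests. Checking which classes, properties and metadata are allowed where is what the validator in this project is for.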
## Installation
Use pip:
```sh
# System-wide:
sudo pip install hocr-spec
# For current user:
pip install --user hocr-spec
```
From source:
```sh
git clone https://github.com/kba/hocr-spec-python
cd hocr-spec-python
# System-wide:
sudo python setup.py install
# For current user:
python setup.py install --user
```
## Command line interface
<!-- BEGIN-EVAL echo; ./hocr-spec -h |sed 's/^/ /' -->

    usage: hocr-spec [-h] [--format {text,bool,ansi,xml}]
                     [--profile {relaxed,standard}]
                     [--implicit_capabilities CAPABILITY]
                     [--skip-check {attributes,classes,metadata,properties}]
                     [--parse-strict] [--silent]
                     sources [sources ...]

    positional arguments:
      sources               hOCR file to check or '-' to read from STDIN

    optional arguments:
      -h, --help            show this help message and exit
      --format {text,bool,ansi,xml}, -f {text,bool,ansi,xml}
                            Report format
      --profile {relaxed,standard}, -p {relaxed,standard}
                            Validation profile
      --implicit_capabilities CAPABILITY, -C CAPABILITY
                            Enable this capability. Use '*' to enable all
                            capabilities. In addition to the 'ocr*' classes, you
                            can use ['ocrp_dir', 'ocrp_font', 'ocrp_lang',
                            'ocrp_nlp', 'ocrp_poly']
      --skip-check {attributes,classes,metadata,properties}, -X {attributes,classes,metadata,properties}
                            Skip one check
      --parse-strict        Parse HTML with less tolerance for errors
      --silent, -s          Don't produce any output but signal success with exit
                            code.

<!-- END-EVAL -->
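
Because `--silent` reports the result through the exit code, the command is easy to script. The following is a rough sketch, not part of this package: `build_command` and `is_valid` are hypothetical helper names, and the sketch assumes that `hocr-spec` is on the `PATH` and that an exit status of 0 means the file passed validation.

```python
import subprocess


def build_command(sources, profile=None, skip_check=None):
    """Assemble an hocr-spec command line from options listed in the help text."""
    command = ["hocr-spec", "--silent"]
    if profile is not None:
        command += ["--profile", profile]
    if skip_check is not None:
        command += ["--skip-check", skip_check]
    return command + list(sources)


def is_valid(source, **options):
    """Run hocr-spec on one file. Assumption: exit status 0 means success."""
    return subprocess.run(build_command([source], **options)).returncode == 0


# Only prints the command; call is_valid() once hocr-spec is installed.
print(build_command(["sample.hocr"], profile="relaxed", skip_check="metadata"))
```

Keeping command construction separate from execution lets the argument list be inspected or logged before anything runs.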
## API example
```python
from hocr_spec import HocrValidator
validator = HocrValidator()
report = validator.validate('/path/to/sample.hocr')
print(report.format('xml'))
# <report valid='false'>...</report>
```