PyPI download | ocrmypdf-8.1.0.tar.gz resource - CSDN Library


173 views, uploaded 2022-01-28 23:06:57, 7.12 MB, GZ

309 files in total

bin: 166

py: 51

pdf: 38

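PyPI download | ocrmypdf-8.1.0.tar.gz: an overview of OCR and a Python package

Python is popular with developers for its readable syntax and extensive library support. PyPI (the Python Package Index) is the official repository for Python software, where users can search for, download, and share Python packages. This article describes a resource downloaded from PyPI, `ocrmypdf-8.1.0.tar.gz`, and two concepts that appear in the resource's tags: ZooKeeper and cloud native.

`ocrmypdf-8.1.0.tar.gz` is the archive from PyPI for version 8.1.0 of `ocrmypdf`, an OCR (Optical Character Recognition) tool. OCR recognizes and extracts text from images, so that scanned documents and pictures become searchable and their text can be copied. `ocrmypdf` is an open-source Python package designed for PDF documents: it adds an OCR text layer to scanned PDF pages, which makes their content machine-readable and much easier to process.

Inside the `ocrmypdf-8.1.0` archive we can expect to find the `setup.py` install script, a `README` file describing the project, a `LICENSE` file stating the terms of use, the source code, and possibly a test suite. After unpacking the archive, running `pip install .` in the extracted directory installs `ocrmypdf` into the local environment. The project's README also lists Ghostscript, Tesseract OCR, QPDF, and Leptonica as external requirements, so these must be installed separately. Once installed, OCR is performed by running the `ocrmypdf` command, which the README describes as a scriptable command line program.

The tags mention "Zookeeper", Apache's distributed coordination service. ZooKeeper is commonly used to manage configuration, naming, cluster synchronization, and group services in distributed systems, helping to keep a distributed environment consistent and reliable. `ocrmypdf` itself has no direct connection to ZooKeeper, but a large or complex deployment could use ZooKeeper to manage and coordinate multiple `ocrmypdf` service instances.

The other tag, "cloud native", refers to an approach to building and running applications that emphasizes microservices, containers, declarative APIs, continuous delivery, and a DevOps culture. `ocrmypdf` fits this environment reasonably well. For example, it can be packaged in a Docker container (the README notes that a Docker image is available) for lightweight, cross-platform deployment and integration with other microservices. On a cloud-native orchestration platform such as Kubernetes, it could be scheduled and managed as a standalone service that handles OCR jobs.

In short, `ocrmypdf-8.1.0.tar.gz` provides OCR for the scanned images inside PDF documents, while the ZooKeeper and cloud-native tags concern how such a tool might be deployed efficiently and reliably in distributed, cloud-based environments.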
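As a rough illustration of the "what is inside the archive" step described above, the following sketch counts the regular files in a `.tar.gz` by extension using Python's standard `tarfile` module. The helper names `count_extensions` and `make_demo_archive` are made up for this example, and the demo archive contains placeholder file names rather than the real contents of `ocrmypdf-8.1.0.tar.gz`.

```python
import io
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(tar: tarfile.TarFile) -> Counter:
    """Count regular files in an open tar archive by file extension."""
    counts = Counter()
    for member in tar.getmembers():
        if not member.isfile():
            continue  # skip directories, symlinks, and other non-file entries
        suffix = PurePosixPath(member.name).suffix.lstrip(".") or "(none)"
        counts[suffix] += 1
    return counts


def make_demo_archive() -> bytes:
    """Build a tiny in-memory .tar.gz with hypothetical file names (demo only)."""
    names = (
        "demo-1.0/setup.py",
        "demo-1.0/README.md",
        "demo-1.0/LICENSE",
        "demo-1.0/src/demo/__init__.py",
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in names:
            data = b"placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    # Demo on the in-memory archive; for a real download, use
    # tarfile.open("path/to/archive.tar.gz", mode="r:gz") instead.
    with tarfile.open(fileobj=io.BytesIO(make_demo_archive()), mode="r:gz") as tar:
        for extension, count in count_extensions(tar).most_common():
            print(extension, count)
```

Listing an archive this way before installing is a cheap sanity check that it really contains `setup.py`, a README, and a license file.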


PyPI download | ocrmypdf-8.1.0.tar.gz (309 files)

hocr.bin 111KB

hocr.bin 107KB

hocr.bin 78KB

hocr.bin 47KB

hocr.bin 44KB

hocr.bin 33KB

hocr.bin 20KB

hocr.bin 18KB

pdf.bin 12KB

pdf.bin 11KB

pdf.bin 10KB

pdf.bin 8KB

pdf.bin 6KB

pdf.bin 5KB

txt.bin 4KB

pdf.bin 4KB

hocr.bin 4KB

pdf.bin 4KB

pdf.bin 3KB

txt.bin 3KB

pdf.bin 3KB

hocr.bin 2KB

txt.bin 2KB

txt.bin 1KB

hocr.bin 1KB

hocr.bin 702B

txt.bin 526B

txt.bin 425B

txt.bin 362B

txt.bin 351B

txt.bin 181B

stdout.bin 123B

stdout.bin 122B

stdout.bin 119B

stderr.bin 79B

stderr.bin 78B

txt.bin 77B

stderr.bin 55B

309 entries in total

OCRmyPDF
========

[![Travis build status][travis]](https://travis-ci.org/jbarlow83/OCRmyPDF) [![PyPI version][pypi]](https://pypi.org/project/ocrmypdf/) ![Homebrew version][homebrew] ![ReadTheDocs][docs]

[travis]: https://travis-ci.org/jbarlow83/OCRmyPDF.svg?branch=master "Travis build status"
[pypi]: https://img.shields.io/pypi/v/ocrmypdf.svg "PyPI version"
[homebrew]: https://img.shields.io/homebrew/v/ocrmypdf.svg "Homebrew version"
[docs]: https://readthedocs.org/projects/ocrmypdf/badge/?version=latest "RTD"

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

```bash
ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
```

[See the release notes for details on the latest changes](https://ocrmypdf.readthedocs.io/en/latest/release_notes.html).
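Because the tool is a scriptable command line program, one way to drive it from Python is to assemble the argument list and hand it to `subprocess`. The sketch below uses only the options shown in the example above. The function names `build_ocrmypdf_command` and `run_ocrmypdf` are hypothetical helpers, not part of OCRmyPDF.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    rotate_pages: bool = False,
    deskew: bool = False,
    title: Optional[str] = None,
    jobs: Optional[int] = None,
    output_type: Optional[str] = None,
) -> List[str]:
    """Assemble an ocrmypdf argument list from the options shown in the README."""
    # Multiple languages are joined with "+", as in "-l eng+fra".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocrmypdf(cmd: List[str]) -> int:
    """Run the command and return its exit code.

    Raises FileNotFoundError if the ocrmypdf executable is not on PATH.
    """
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    cmd = build_ocrmypdf_command(
        "input_scanned.pdf",
        "output_searchable.pdf",
        languages=["eng", "fra"],
        rotate_pages=True,
        deskew=True,
        title="My PDF",
        jobs=4,
        output_type="pdfa",
    )
    # Print the shell-quoted command instead of running it, so the demo
    # works even where ocrmypdf is not installed.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list avoids shell quoting problems with values such as a title that contains spaces.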
Main features
-------------

- Generates a searchable [PDF/A](https://en.wikipedia.org/?title=PDF/A) file from a regular PDF
- Places OCR text accurately below the image to ease copy / paste
- Keeps the exact resolution of the original embedded images
- When possible, inserts OCR information as a "lossless" operation without disrupting any other content
- Optimizes PDF images, often producing files smaller than the input file
- If requested deskews and/or cleans the image before performing OCR
- Validates input and output files
- Distributes work across all available CPU cores
- Uses [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) engine
- Supports more than [100 languages](https://github.com/tesseract-ocr/tessdata) recognized by Tesseract
- Battle-tested on thousands of PDFs, a test suite and continuous integration

For details: please consult the [documentation](https://ocrmypdf.readthedocs.io/en/latest/).

Motivation
----------

I searched the web for a free command line tool to OCR PDF files on Linux/UNIX: I found many, but none of them were really satisfying.

- Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
- Or they did not handle accents and multilingual characters
- Or they changed the resolution of the embedded images
- Or they generated ridiculously large PDF files
- Or they crashed when trying to OCR
- Or they did not produce valid PDF files
- On top of that none of them produced PDF/A files (format dedicated for long time storage)

...so I decided to develop my own tool.

Installation
------------

Linux, UNIX, and macOS are supported. Windows is not directly supported but there is a Docker image available that runs on Windows.
Users of Debian 9 or later or Ubuntu 16.10 or later may simply

```bash
apt-get install ocrmypdf
```

and users of Fedora 29 or later may simply

```bash
dnf install ocrmypdf
```

and macOS users with Homebrew may simply

```bash
brew install ocrmypdf
```

For everyone else, [see our documentation](https://ocrmypdf.readthedocs.io/en/latest/installation.html) for installation steps.

Languages
---------

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

```bash
# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack
```

You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.

Documentation and support
-------------------------

Once ocrmypdf is installed, the built-in help which explains the command syntax and options can be accessed via:

```bash
ocrmypdf --help
```

Our [documentation is served on Read the Docs](https://ocrmypdf.readthedocs.io/en/latest/index.html).

If you detect an issue, please:

- Check whether your issue is already known
- If no problem report exists on GitHub, please create one here: <https://github.com/jbarlow83/OCRmyPDF/issues>
- Describe your problem thoroughly
- Append the console output of the script when running the debug mode (`-v 1` option)
- If possible provide your input PDF file as well as the content of the temporary folder (using a file sharing service like Dropbox)

Requirements
------------

Runs on CPython 3.5, 3.6 and 3.7. Requires external program installations of Ghostscript, Tesseract OCR, QPDF, and Leptonica. ocrmypdf is pure Python, but uses CFFI to portably generate library bindings.
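A quick pre-flight check for the requirements above could look like the following sketch. The executable names in `ASSUMED_EXECUTABLES` are assumptions (`gs` is the usual Ghostscript binary name on Linux and macOS). Leptonica is a library rather than a program, so this check does not cover it. The helper names are hypothetical.

```python
import shutil
import sys
from typing import Callable, Iterable, List, Optional, Sequence

# Assumed executable names for Ghostscript, Tesseract OCR, and QPDF.
ASSUMED_EXECUTABLES = ("gs", "tesseract", "qpdf")


def missing_programs(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def python_version_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """True if the interpreter is a version the README lists (3.5 to 3.7)."""
    return (3, 5) <= tuple(version_info[:2]) <= (3, 7)


if __name__ == "__main__":
    missing = missing_programs(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed executables were found.")
    if not python_version_supported():
        print("This Python version is outside the range listed for ocrmypdf 8.1.0.")
```

The `which` parameter is injectable only so the lookup can be replaced in tests. In normal use the default `shutil.which` is sufficient.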
Press & Media
-------------

- [Going paperless with OCRmyPDF](https://medium.com/@ikirichenko/going-paperless-with-ocrmypdf-e2f36143f46a)
- [Converting a scanned document into a compressed searchable PDF with redactions](https://medium.com/@treyharris/converting-a-scanned-document-into-a-compressed-searchable-pdf-with-redactions-63f61c34fe4c)
- [c't 1-2014, page 59](http://heise.de/-2279695): Detailed presentation of OCRmyPDF v1.0 in the leading German IT magazine c't
- [heise Open Source, 09/2014: Texterkennung mit OCRmyPDF](http://heise.de/-2356670)

License
-------

The OCRmyPDF software is licensed under the GNU GPLv3. Certain files are covered by other licenses, as noted in their source files.

The license for each test file varies, and is noted in tests/resources/README.rst. The documentation is licensed under Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA 4.0).

OCRmyPDF versions prior to 6.0 were licensed under the MIT License.

Disclaimer
----------

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
