## Installation
MinerU
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
```
Third-party software
```bash
# install
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
pip install einops==0.7.0
pip install transformers-stream-generator==0.0.5
pip install accelerate==0.33.0
# uninstall
pip uninstall transformer-engine
```
## Environment Configuration
```bash
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
For instructions on obtaining a `DASHSCOPE_API_KEY`, refer to the [DashScope documentation](https://help.aliyun.com/zh/dashscope/opening-service).
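
Before running the scripts, it can help to confirm that all four variables are set. The following is a minimal sketch, not part of the project: the helper name `missing_env_vars` is hypothetical, and it only checks that each variable is non-empty, not that the credentials are valid.

```python
import os

# The four variables exported in the block above.
REQUIRED_VARS = ("DASHSCOPE_API_KEY", "ES_USER", "ES_PASSWORD", "ES_URL")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means every variable is present.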
## Usage
### Data Ingestion
```bash
python data_ingestion.py -p some.pdf  # load data from a single PDF
# or
python data_ingestion.py -p /opt/data/some_pdf_directory/  # load data from every PDF in the directory
```
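The `-p` option accepts either one file or a directory. As a rough illustration of that behavior, the sketch below resolves such a path into a list of PDF files. The helper `collect_pdfs` is hypothetical, and the choices it makes (non-recursive search, case-insensitive `.pdf` suffix) are assumptions rather than a description of what `data_ingestion.py` does internally.

```python
from pathlib import Path


def collect_pdfs(path_or_directory):
    """Return a sorted list of PDF paths for a single file or a directory."""
    path = Path(path_or_directory)
    if path.is_dir():
        # Assumption: only the top level of the directory is scanned.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if path.is_file() and path.suffix.lower() == ".pdf":
        return [path]
    # Anything else (missing path, non-PDF file) yields no documents.
    return []
```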
### Query
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## Example
````bash
# Start the es service
docker compose up -d
# or
docker-compose up -d
# Set environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
# Ingest data
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# Ask a question
python query.py -q 'how about the rights of men'
## outputs
Please answer the question based on the content within ```:
```
I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
```
My question is:how about the rights of men。
question: how about the rights of men
answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
````
## Development
`MinerU` provides a `RAG` integration interface, allowing users to specify a single input `pdf` file or a directory. `MinerU` will automatically parse the input files and return an iterable interface for retrieving the data.
### API Interface
```python
from pathlib import Path

# ElementRelation is assumed to be importable from the same module as Node.
from magic_pdf.integrations.rag.type import ElementRelation, Node


class RagPageReader:
    def get_rel_map(self) -> list[ElementRelation]:
        # Retrieve the relationships between nodes
        pass

    ...


class RagDocumentReader:
    ...


class DataReader:
    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        pass

    def get_documents_count(self) -> int:
        """Get the number of pdf documents"""
        pass

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """Retrieve the parsed content of a specific pdf"""
        pass

    def get_document_filename(self, idx: int) -> Path:
        """Retrieve the path of a specific pdf"""
        pass
```
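The three `DataReader` accessors are index-based, so a caller would typically loop from `0` to `get_documents_count() - 1`. The sketch below shows that loop under stated assumptions: `summarize_documents` and `FakeReader` are hypothetical names introduced only for illustration, and a `None` result is assumed to mean the document could not be parsed. The stand-in reader exists so the example runs without MinerU installed.

```python
from pathlib import Path


def summarize_documents(reader):
    """Return (filename, parsed_ok) pairs for every document the reader holds."""
    summary = []
    for idx in range(reader.get_documents_count()):
        filename = reader.get_document_filename(idx)
        result = reader.get_document_result(idx)
        # The declared return type is `RagDocumentReader | None`; None is
        # assumed here to mean that parsing produced no result.
        summary.append((str(filename), result is not None))
    return summary


class FakeReader:
    """Hypothetical stand-in exposing the same three methods as DataReader."""

    def get_documents_count(self):
        return 2

    def get_document_filename(self, idx):
        return Path(f"doc{idx}.pdf")

    def get_document_result(self, idx):
        return object() if idx == 0 else None


print(summarize_documents(FakeReader()))
```

Because the function relies only on the three method names, a real `DataReader` instance could be passed in place of `FakeReader`.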
Type Definitions
```python
from pydantic import BaseModel, Field


class Node(BaseModel):
    category_type: CategoryType = Field(description='Category')  # Category
    text: str | None = Field(description='Text content', default=None)
    image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None)
    anno_id: int = Field(description='Unique ID', default=-1)
    latex: str | None = Field(description='LaTeX output for equations or tables', default=None)
    html: str | None = Field(description='HTML output for tables', default=None)
Tables can be stored in one of three formats: image, LaTeX, or HTML.
`anno_id` is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the `get_rel_map` method. Users can use `anno_id` to link nodes and construct a RAG index that includes node relationships.
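
One way to link nodes by `anno_id` is to build a lookup table and then walk the relation list. The sketch below uses stand-in dataclasses instead of the real `Node` and `ElementRelation` types so that it runs on its own. The field names `source_anno_id`, `target_anno_id`, and `relation` are assumptions about what a relation record carries, not confirmed parts of the MinerU API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeStub:
    """Hypothetical stand-in for Node; only the fields used here."""
    anno_id: int
    text: Optional[str] = None


@dataclass
class RelationStub:
    """Hypothetical stand-in for ElementRelation (field names assumed)."""
    source_anno_id: int
    target_anno_id: int
    relation: str


def build_relation_index(nodes, relations):
    """Map each source anno_id to a list of (relation, target_node) pairs."""
    by_id = {node.anno_id: node for node in nodes}
    index = defaultdict(list)
    for rel in relations:
        source = by_id.get(rel.source_anno_id)
        target = by_id.get(rel.target_anno_id)
        if source is None or target is None:
            continue  # skip relations that reference unknown nodes
        index[source.anno_id].append((rel.relation, target))
    return dict(index)
```

An index of this shape lets a retrieval step fetch, for example, the table body that belongs to a matched caption.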
### Node Relationship Matrix
| | image_body | table_body |
| -------------- | ---------- | ---------- |
| image_caption | sibling | |
| table_caption | | sibling |
| table_footnote | | sibling |
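
The matrix above can also be expressed as a small lookup table, which may be convenient when checking relations in code. This is only a sketch: the category names are taken from the table headers as plain strings, whereas the library represents categories with `CategoryType`, and `relation_between` is a hypothetical helper.

```python
# Each key is (row category, column category) from the matrix above.
RELATION_MATRIX = {
    ("image_caption", "image_body"): "sibling",
    ("table_caption", "table_body"): "sibling",
    ("table_footnote", "table_body"): "sibling",
}


def relation_between(row_category, column_category):
    """Return the relation named in the matrix, or None for an empty cell."""
    return RELATION_MATRIX.get((row_category, column_category))
```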