LaTeX_OCR: Math Formula Recognition (LaTeX Image Recognition Resource)

77 files in total · py: 35 · gif: 10 · txt: 9 · JupyterNotebook

Uploaded 2021-05-04 23:41:42 · 44.46MB ZIP · 101 views


LaTeX_OCR-master.zip (77 files)

LaTeX_OCR-master

visualization.zip 6.62MB

.gitignore 132B

.ipynb_checkpoints

visualize_attention-checkpoint.ipynb 468KB

art

architecture.jpg 111KB

predict.png 90KB

visualization_12_long.gif 3.13MB

visualization_6_short.gif 3.05MB

visualization_data.images_test.6.gif 2.49MB

visualization_6_long.gif 2.71MB

6.png 98KB

visualization_long.gif 3.26MB

12.png 99KB

visualization_data.images_test.2.gif 4.15MB

visualization_12_short.gif 3.44MB

visualization_prediction_short.gif 3.81MB

visualization_14_long.gif 3.26MB

visualization_14_short.gif 3.81MB

14.png 90KB

requirements.txt 179B

data

val.formulas.norm.txt 1.26MB

test.formulas.norm.txt 1.41MB

small.formulas

val.norm.txt 5KB

test.norm.txt 5KB

train.norm.txt 8KB

train.formulas.norm.txt 11.35MB

model

components

greedy_decoder_cell.py 2KB

seq2seq_torch.py 18KB

__init__.py 0B

attention_mechanism.py 6KB

DenseNet.py 119B

attention_cell.py 4KB

beam_search_decoder_cell.py 15KB

dynamic_decode.py 3KB

SimpleCNN.py 2KB

ResNet.py 398B

positional.py 3KB

evaluation

__init__.py 0B

image.py 3KB

text.py 4KB

img2seq_torch.py 11KB

__init__.py 0B

base.py 6KB

base_torch.py 10KB

utils

__init__.py 0B

general.py 6KB

image.py 7KB

lr_schedule.py 4KB

data_generator.py 8KB

text.py 5KB

img2seq.py 11KB

decoder.py 5KB

encoder.py 4KB

evaluate_txt.py 2KB

evaluate_img.py 2KB

gh-md-toc 8KB

LICENSE.txt 11KB

visualize_attention.ipynb 468KB

README.md 11KB

configs

vocab.json 159B

data_small.json 1018B

training.json 387B

data.json 985B

README.md 3KB

vocab_small.json 173B

training_small.json 390B

model.json 428B

build.py 1KB

dirty

t.py 3KB

decoder.ipynb 3KB

a.txt 282KB

test.py 108B

test.ipynb 20KB

visualize_attention.py 8KB

.vscode

settings.json 105B

makefile 2KB

predict.py 2KB

train.py 3KB

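# LaTeX OCR

* [1. Environment Setup](#1-environment-setup)
   * [Linux](#linux)
   * [Mac](#mac)
* [2. Training](#2-training)
   * [Small dataset: build, train, evaluate](#small-dataset-build-train-evaluate)
   * [Full dataset: build, train, evaluate](#full-dataset-build-train-evaluate)
* [3. Visualization](#3-visualization)
   * [Visualizing training](#visualizing-training)
   * [Visualizing prediction](#visualizing-prediction)
* [4. Evaluation](#4-evaluation)
* [5. Implementation Details](#5-implementation-details)
   * [Overview](#overview)
   * [Data collection and preprocessing](#data-collection-and-preprocessing)
   * [Model construction](#model-construction)
* [6. Pitfalls](#6-pitfalls)
   * [GPU-accelerated training on Windows 10](#gpu-accelerated-training-on-windows-10)
   * [How to visualize the attention layer](#how-to-visualize-the-attention-layer)
* [Acknowledgements](#acknowledgements)

Seq2Seq + Attention + Beam Search.

![](./art/6.png)

![](./art/visualization_6_short.gif)

![](./art/12.png)

![](./art/visualization_12_short.gif)

![](./art/14.png)

![](./art/visualization_14_short.gif)

Architecture

![](./art/architecture.jpg)

## 1. Environment Setup

1. Python 3.5 + TensorFlow 1.12.2
2. latex (LaTeX to PDF)
3. ghostscript (image processing)
4. magick (PDF to PNG)

### Linux

One-step install

```shell
make install-linux
```

Or

1. Install the project dependencies

   ```shell
   virtualenv env35 --python=python3.5
   source env35/bin/activate
   pip install -r requirements.txt
   ```

2. Install latex (LaTeX to PDF)

   ```shell
   sudo apt-get install texlive-latex-base
   sudo apt-get install texlive-latex-extra
   ```

3. Install ghostscript

   ```shell
   sudo apt-get update
   sudo apt-get install ghostscript
   sudo apt-get install libgs-dev
   ```

4. Install [magick](https://www.imagemagick.org/script/install-source.php) (PDF to PNG)

   ```shell
   wget http://www.imagemagick.org/download/ImageMagick.tar.gz
   tar -xvf ImageMagick.tar.gz
   # Build in a subshell so the cleanup below runs from the original directory
   (cd ImageMagick-7.* && \
     ./configure --with-gslib=yes && \
     make && \
     sudo make install && \
     sudo ldconfig /usr/local/lib)
   rm ImageMagick.tar.gz
   rm -r ImageMagick-7.*
   ```

### Mac

One-step install

```shell
make install-mac
```

Or

1. Install the project dependencies

   ```shell
   sudo pip install -r requirements.txt
   ```

2. Install LaTeX yourself

3. Install [magick](https://www.imagemagick.org/script/install-source.php) (PDF to PNG)

   ```shell
   wget http://www.imagemagick.org/download/ImageMagick.tar.gz
   tar -xvf ImageMagick.tar.gz
   # Build in a subshell so the cleanup below runs from the original directory
   (cd ImageMagick-7.* && \
     ./configure --with-gslib=yes && \
     make && \
     sudo make install)
   rm ImageMagick.tar.gz
   rm -r ImageMagick-7.*
   ```

## 2. Training

### Small dataset: build, train, evaluate

A small dataset of 100 samples is provided for quick testing. Generating the training images from the formulas under `./data/small.formulas/` takes only about 2 minutes.

One-step training

```
make small
```

Or

1. Build the dataset

   Render images from the LaTeX formulas, save the formula-to-image mapping file, and build the vocabulary. __This only needs to be run once.__

   ```shell
   # default
   python build.py
   # or
   python build.py --data=configs/data_small.json --vocab=configs/vocab_small.json
   ```

2. Train

   ```
   # default
   python train.py
   # or
   python train.py --data=configs/data_small.json --vocab=configs/vocab_small.json --training=configs/training_small.json --model=configs/model.json --output=results/small/
   ```

3. Evaluate the predicted formulas (text)

   ```
   # default
   python evaluate_txt.py
   # or
   python evaluate_txt.py --results=results/small/
   ```

4. Evaluate the rendered formula images

   ```
   # default
   python evaluate_img.py
   # or
   python evaluate_img.py --results=results/small/
   ```

### Full dataset: build, train, evaluate

Rendering the 70,000+ formula images from the formulas takes `2`-`3` hours.

One-step training

```
make full
```

Or

1. Build the dataset

   Render images from the LaTeX formulas, save the formula-to-image mapping file, and build the vocabulary. __This only needs to be run once.__

   ```
   python build.py --data=configs/data.json --vocab=configs/vocab.json
   ```

2. Train

   ```
   python train.py --data=configs/data.json --vocab=configs/vocab.json --training=configs/training.json --model=configs/model.json --output=results/full/
   ```

3. Evaluate the predicted formulas (text)

   ```
   python evaluate_txt.py --results=results/full/
   ```

4. Evaluate the rendered formula images

   ```
   python evaluate_img.py --results=results/full/
   ```

## 3. Visualization

### Visualizing training

Use tensorboard to visualize the training process.

Small dataset

```
cd results/small
tensorboard --logdir ./
```

Full dataset

```
cd results/full
tensorboard --logdir ./
```

### Visualizing prediction

Open `visualize_attention.ipynb` and step through how the model predicts a LaTeX formula.

Or run

```shell
# default
python visualize_attention.py
# or
python visualize_attention.py --image=data/images_test/6.png --vocab=configs/vocab.json --model=configs/model.json --output=results/full/
```

This writes attention maps of the prediction process under `--output`.

## 4. Evaluation

| Metric | Training score | Test score |
| :-------------: | :------: | :------: |
| perplexity | 1.39 | 1.44 |
| EditDistance | 81.68 | 80.45 |
| BLEU-4 | 78.21 | 75.42 |
| ExactMatchScore | 13.93 | 12.44 |

For perplexity, the closer to 1 the better; for the other three metrics, higher is better. ExactMatchScore is still fairly low; with further training it should be able to exceed 70. Our machine is not very powerful, and training takes too much time.

## 5. Implementation Details

### Overview

First, we collect enough formulas and normalize them so that a vocabulary can be split out easily. We then use a script to render images from the normalized formulas, relying on latex, ghostscript, and magick, and record which formula produced which image in a formula-to-image mapping file. This gives us three datasets: the normalized formula set, the image set, and the formula-to-image mapping set, plus one by-product: the LaTeX vocabulary. This vocabulary sets the upper bound of the model: a predicted formula can only be composed of tokens in the vocabulary, and nothing outside it can ever appear.

Next, we build the model. It has three parts: the data generator, the neural network model, and the usage scripts.

The data generator reads the formula-to-image mapping file and supplies the model with (formula, image) tuples of matrices.

The neural network model is Seq2Seq + Attention + Beam Search. The Seq2Seq encoder is a CNN and the decoder is an LSTM. The attention layer sits between the encoder and the decoder: there is a flattening step on the way from the encoder to the decoder, and that is where attention is inserted. Alongside attention we also insert a custom op that exports the attention data for visualization.

The usage scripts comprise the build, training, test, prediction, evaluation, and visualization scripts. For usage, see the command lines above. During training, the learning rate is adjusted dynamically according to the epoch. The decoder can be either `lstm` or `gru`; change it in `configs/model.json`. The final output can be produced with either `beam_search` or `greedy`, which is also set in `configs/model.json`.

### Data collection and preprocessing

All we need are correct LaTeX formulas. Because a script can render LaTeX into images, no image data is required. Originally we planned to crawl papers from [arXiv](https://arxiv.org/) and extract their LaTeX formulas with regular expressions, but we found that this work had already been done, so we used the existing formula data.

[im2latex-100k , arXiv:1609.04938](https://zenodo.org/record/56198#.XKMMU5gzZBB)

With the LaTeX formula data in hand, the next step is normalization.

> Why normalize: without normalization, the vocabulary could only be built char-wise. LaTeX has many commands with a fixed character sequence, such as `\lim`, so the model would have to spend extra neurons memorizing these patterns, which degrades its performance and also causes training…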
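
The char-wise versus token-wise point above can be illustrated with a minimal sketch. It assumes that a normalized formula is a string of whitespace-separated tokens (as in the `*.formulas.norm.txt` files); the helper names `tokenize`, `build_vocab`, and `encode`, and the special tokens `<pad>`, `<unk>`, `<end>`, are hypothetical and are not taken from this project's code.

```python
from collections import Counter


def tokenize(formula):
    """Split a normalized formula into tokens (assumes whitespace-separated tokens)."""
    return formula.split()


def build_vocab(formulas, min_count=1):
    """Map every token seen at least min_count times to an integer id."""
    counts = Counter(tok for f in formulas for tok in tokenize(f))
    tokens = sorted(t for t, c in counts.items() if c >= min_count)
    specials = ["<pad>", "<unk>", "<end>"]  # hypothetical special tokens
    return {tok: i for i, tok in enumerate(specials + tokens)}


def encode(formula, vocab):
    """Turn a formula into ids; tokens outside the vocabulary become <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(t, unk) for t in tokenize(formula)] + [vocab["<end>"]]


formulas = [r"\lim _ { x \to 0 } x ^ { 2 }", r"x ^ { 2 } + y"]
vocab = build_vocab(formulas)

# Token-wise: "\lim" is a single symbol. Char-wise it would be four symbols
# ("\", "l", "i", "m") whose order the model would have to learn.
token_ids = encode(r"\lim _ { x }", vocab)
print(len(tokenize(r"\lim")), len(r"\lim"))
```

Because decoding can only emit ids that exist in `vocab`, this also shows why the vocabulary bounds what the model can predict.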

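To make the metrics in section 4 more concrete, here is a minimal sketch of perplexity, exact match, and an edit-distance score over token lists. The function names are hypothetical, and the normalization in `edit_score` (one minus total Levenshtein distance over total maximum length) is an assumption; the project's own `evaluate_txt.py` may compute and scale these values differently, for example as percentages.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per reference token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def exact_match(references, hypotheses):
    """Fraction of predictions identical to their reference token list."""
    matches = sum(r == h for r, h in zip(references, hypotheses))
    return matches / len(references)


def edit_distance(a, b):
    """Levenshtein distance between two token lists (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_score(references, hypotheses):
    """Assumed normalization: 1 - total distance / total max length."""
    pairs = list(zip(references, hypotheses))
    total_dist = sum(edit_distance(r, h) for r, h in pairs)
    total_len = sum(max(len(r), len(h)) for r, h in pairs)
    return 1.0 - total_dist / total_len


refs = [r"x ^ { 2 }".split(), r"\lim _ { x }".split()]
hyps = [r"x ^ { 2 }".split(), r"\lim _ { y }".split()]
print(exact_match(refs, hyps), edit_score(refs, hyps))
```

A model that assigns probability 1 to every reference token has log-probabilities of 0 and a perplexity of exactly 1, which is why values closer to 1 are better, while the other scores grow as predictions get closer to the references.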