# Image to LaTeX
## Dataset
- http://lstm.seas.harvard.edu/latex/data/
A general-purpose, deep learning-based system to decompile an image into presentational markup. For example, we can infer the LaTeX or HTML source from a rendered image.
<p align="center"><img src="http://lstm.seas.harvard.edu/latex/network.png" width="400"></p>
An example input is a rendered LaTeX formula:
<p align="center"><img src="http://lstm.seas.harvard.edu/latex/results/website/images/119b93a445-orig.png"></p>
The goal is to infer the LaTeX formula that can render such an image:
d s _ { 1 1 } ^ { 2 } = d x ^ { + } d x ^ { - } + l _ { p } ^ { 9 } \frac { p _ { - } } { r ^ { 7 } } \delta ( x ^ { - } ) d x ^ { - } d x ^ { - } + d x _ { 1 } ^ { 2 } + \; \cdots \; + d x _ { 9 } ^ { 2 }
Our model employs a convolutional network for text and layout recognition in tandem with an attention-based neural machine translation system. The use of attention additionally provides an alignment from the generated markup to the original source image:
<p align="center"><img src="http://lstm.seas.harvard.edu/latex/mathex.png"></p>
See [our website](http://lstm.seas.harvard.edu/latex/) for a complete interactive version of this visualization over the test set. Our paper (http://arxiv.org/pdf/1609.04938v1.pdf) provides more technical details of this model.
What You Get Is What You See: A Visual Markup Decompiler
Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush
# Prerequisites
Most of the code is written in [Torch](http://torch.ch), with Python for preprocessing.
### Torch
#### Model
The following Lua libraries are required for the main model.
* tds
* class
* nn
* nngraph
* cunn
* cudnn
* cutorch
Note that currently we only support **GPU** since we use cudnn in the CNN part.
#### Preprocess
* Pillow
* numpy
Optional: We use Node.js and KaTeX for preprocessing [Installation](https://nodejs.org/en/)
##### pdflatex [Installation](https://www.tug.org/texlive/)
Pdflatex is used for rendering LaTeX during evaluation.
##### ImageMagick convert [Installation](http://www.imagemagick.org/script/index.php)
Convert is used for rendering LaTeX during evaluation.
##### Webkit2png [Installation](http://www.paulhammond.org/webkit2png/)
Webkit2png is used for rendering HTML during evaluation.
#### Evaluate
Python image-based evaluation
* python-Levenshtein
* matplotlib
* Distance
wget http://lstm.seas.harvard.edu/latex/third_party/Distance-0.1.3.tar.gz
tar zxf Distance-0.1.3.tar.gz
cd distance; sudo python setup.py install
##### Perl [Installation](https://www.perl.org/)
Perl is used for evaluating BLEU score.
# Usage
We assume that the working directory is `im2markup` throughout this document.
The task is to convert an image into its presentational markup, so we need to specify a `data_base_dir` storing the images and a `label_path` storing all labels (e.g., LaTeX formulas). We also need to specify a `data_path` for the training (or test) data samples. The format of `data_path` should look like:
<img_name1> <label_idx1>
<img_name2> <label_idx2>
<img_name3> <label_idx3>
where `<label_idx>` denotes the line index of the label (starting from 0).
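As a rough illustration of how the two files relate, the Python sketch below resolves each `data_path` line to its label. The function name `load_samples` and the in-memory example lists are assumptions made for this illustration; they are not part of the repository.

```python
def load_samples(data_lines, label_lines):
    """Pair each image name with the label found at its 0-based line index."""
    samples = []
    for line in data_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        img_name, label_idx = line.split()
        samples.append((img_name, label_lines[int(label_idx)]))
    return samples


# Hypothetical contents of a label file and a data file.
labels = ["a + b", "x ^ { 2 }", "\\frac { 1 } { 2 }"]
data = ["img0.png 2", "img1.png 0"]
print(load_samples(data, labels))
```

Because the index is a line number, any change to how the label file is split into lines shifts every label after that point.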
## Quick Start (Math-to-LaTeX Toy Example)
To get started, we provide a toy Math-to-LaTeX example. A larger dataset, [im2latex-100k-dataset](https://zenodo.org/record/56198#.V2p0KTXT6eA), uses the same format but contains many more samples.
### Preprocess
The provided dataset contains an image folder and a formula list (`.lst`) file. The formula list file uses Unix newlines (`\n`) as line separators, so under Python 3.x the file-opening code in `scripts/preprocessing/preprocess_formulas.py` must be modified.
As mentioned on the dataset website, the file should be opened with `open(formula_lst_dir, newline='\n')`. With that change alone, however, I got a Unicode error at line 7489 (on a Mac Pro). The following steps fixed it:
- Open the formula file in vim, e.g. `vim formula.lst`.
- Type `:set fileencoding=utf-8`, then save and quit with `:wq`.
- Change the call to `open('formula.lst', newline='\n', encoding='ISO-8859-1')`.
- The script then runs successfully.
- Any questions? Please create an issue.
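The effect of the two `open` arguments can be seen in a small, self-contained sketch. The helper name `read_formulas` and the demonstration bytes are assumptions for this example, not code from the repository.

```python
import os
import tempfile


def read_formulas(path):
    """Read one formula per line, splitting only on '\\n'."""
    # newline='\n' disables universal-newline translation, so a stray '\r'
    # inside a formula does not start a new line and shift the indices.
    # ISO-8859-1 maps every byte to a character, so decoding cannot fail.
    with open(path, newline="\n", encoding="ISO-8859-1") as f:
        return [line.rstrip("\n") for line in f]


# Tiny demonstration file; the bytes are made up for this example.
with tempfile.NamedTemporaryFile(delete=False, suffix=".lst") as tmp:
    tmp.write(b"a + b\nx \r y\n\xe9 = 1\n")
formulas = read_formulas(tmp.name)
os.remove(tmp.name)
print(len(formulas))  # 3 lines; the '\r' stays inside the second formula
```

Note that ISO-8859-1 never raises a decoding error, but it will silently misread bytes that were actually written in another encoding, so it is a pragmatic workaround rather than a guarantee of correct characters.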
The images in the dataset contain a LaTeX formula rendered on a full page. To accelerate training, we need to preprocess the images.
python scripts/preprocessing/preprocess_images.py --input-dir data/sample/images --output-dir data/sample/images_processed
The above command will crop the formula area, and group images of similar sizes to facilitate batching.
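The grouping idea can be sketched in a few lines of Python. The bucket sizes, the function names `pick_bucket` and `group_by_bucket`, and the example image sizes are all hypothetical; the actual script may use different sizes and logic.

```python
from collections import defaultdict

# Hypothetical (width, height) buckets, sorted from small to large.
BUCKETS = [(120, 50), (240, 100), (480, 160)]


def pick_bucket(width, height, buckets=BUCKETS):
    """Return the first bucket that fits the image, or None if none fits."""
    for bucket_w, bucket_h in buckets:
        if width <= bucket_w and height <= bucket_h:
            return (bucket_w, bucket_h)
    return None


def group_by_bucket(sizes):
    """Map each bucket to the names of the images that fall into it."""
    groups = defaultdict(list)
    for name, (width, height) in sizes.items():
        groups[pick_bucket(width, height)].append(name)
    return dict(groups)


# Made-up sizes of cropped images.
sizes = {"a.png": (100, 40), "b.png": (110, 45), "c.png": (300, 90)}
print(group_by_bucket(sizes))
```

Images that share a bucket can be padded to the same size and stacked into one batch, which is why grouping helps batching.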
Next, the LaTeX formulas need to be tokenized or normalized.
python scripts/preprocessing/preprocess_formulas.py --mode normalize --input-file data/sample/formulas.lst --output-file data/sample/formulas.norm.lst
The above command will normalize the formulas. Note that this command will produce some error messages since some formulas cannot be parsed by the KaTeX parser.
Then we need to prepare the train, validation and test files. We exclude large images from the training and validation sets, and we also ignore formulas with too many tokens or with grammar errors.
python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/train.lst --output-path data/sample/train_filter.lst
python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/validate.lst --output-path data/sample/validate_filter.lst
python scripts/preprocessing/preprocess_filter.py --no-filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/test.lst --output-path data/sample/test_filter.lst
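The kind of check applied with `--filter` might look like the following sketch. The function name `keep_sample` is hypothetical, the thresholds are borrowed from the flags of the training command in this document rather than from the filter script itself, and treating an empty normalized formula as a parse failure is an assumption.

```python
def keep_sample(image_size, formula, max_width=500, max_height=160, max_tokens=150):
    """Return True if the sample passes the size and length checks."""
    width, height = image_size
    if width > max_width or height > max_height:
        return False  # image too large
    tokens = formula.split()
    if not tokens or len(tokens) > max_tokens:
        return False  # empty (assumed parse failure) or too many tokens
    return True


# Made-up samples: (image size, normalized formula).
print(keep_sample((200, 50), "a + b"))
print(keep_sample((600, 50), "a + b"))
```

The test split is produced with `--no-filter`, so no samples are dropped there and evaluation covers the whole test set.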
Finally, we generate the vocabulary from the training set. All tokens occurring no more than once will be excluded from the vocabulary.
python scripts/preprocessing/generate_latex_vocab.py --data-path data/sample/train_filter.lst --label-path data/sample/formulas.norm.lst --output-file data/sample/latex_vocab.txt
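The frequency cutoff can be illustrated with a minimal sketch, assuming formulas are whitespace-separated token strings as in the example near the top of this document. The function name `build_vocab` and the sample formulas are hypothetical, and the real script's output order may differ.

```python
from collections import Counter


def build_vocab(formulas, min_count=1):
    """Return sorted tokens that occur more than `min_count` times."""
    counts = Counter(token for formula in formulas for token in formula.split())
    return sorted(token for token, n in counts.items() if n > min_count)


# Made-up training formulas: '^', '2' and '+' occur only once and are dropped.
train_formulas = ["x ^ { 2 }", "x + y", "{ y }"]
print(build_vocab(train_formulas))
```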
### Train
For a complete set of parameters, run
th src/train.lua -h
The most important parameters for training are `data_base_dir`, which specifies where the images live; `data_path`, the training file; `label_path`, the LaTeX formulas; `val_data_path`, the validation file; and `vocab_file`, the vocabulary file with one token per line.
th src/train.lua -phase train -gpu_id 1 \
-model_dir model \
-input_feed -prealloc \
-data_base_dir data/sample/images_processed/ \
-data_path data/sample/train_filter.lst \
-val_data_path data/sample/validate_filter.lst \
-label_path data/sample/formulas.norm.lst \
-vocab_file data/sample/latex_vocab.txt \
-max_num_tokens 150 -max_image_width 500 -max_image_height 160 \
-batch_size 20 -beam_size 1
By default, the log file is written to `log.txt`. It records the training and validation perplexities. `model_dir` specifies where the models are saved. The default parameters are optimized for the full dataset. To overfit on this toy example, use the flags `-learning_rate 0.05`, `-lr_decay 1.0` and `-num_epochs 30`; after 30 epochs, the training perplexity reaches around 1.1, while the validation perplexity only reaches around 17.
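For readers unfamiliar with the metric, the sketch below computes perplexity under its standard definition, the exponential of the mean negative log-likelihood per token. That the training log uses exactly this definition is an assumption, and the function name `perplexity` is made up for the example.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# If the model assigns probability 0.5 to every token, perplexity is about 2.
log_probs = [math.log(0.5)] * 4
print(perplexity(log_probs))
```

Under this definition, a perplexity near 1 means the model is almost certain of each token, which is why a training value around 1.1 alongside a much higher validation value indicates overfitting.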
### Test
After training, we can load a model and evaluate it on the test dataset. We provide a model trained on the [im2latex-100k-dataset](https://zenodo.org/record/56198#.V2p0KTXT6eA).
mkdir -p model/latex; wget -P model/latex/ http://lstm.seas.harvard.edu/latex/model/latex/final-model
Now we can load the model and run it on the test set. Note that in order to output the predictions, the `-visualize` flag must be set.
th src/train.lua -phase test -gpu_id 1 -load_model -model_di
