UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Jiabo Ye*, Anwen Hu*, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, Fei Huang

*Equal Contribution

Instruction-tuning dataset

Download the jsonl files and images from Mizukiluke/ureader-instruction-1.0.

The jsonl files can be placed in ureader_json/, and the images can be organized as follows:

ureader_images
├── ChartQA
├── DUE_Benchmark
│   ├── DeepForm
│   ├── DocVQA
│   ├── InfographicsVQA
│   ├── KleisterCharity
│   ├── TabFact
│   └── WikiTableQuestions
├── TextCaps
├── TextVQA
└── VisualMRC
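Before training, it can help to confirm that the images were unpacked into the layout above. The following is a minimal sketch, not part of the repository: the function name `missing_image_dirs` and the default root `ureader_images` (taken from the tree) are assumptions for illustration.

```python
from pathlib import Path

# Sub-directories listed in the tree above, relative to ureader_images/.
EXPECTED_DIRS = [
    "ChartQA",
    "DUE_Benchmark/DeepForm",
    "DUE_Benchmark/DocVQA",
    "DUE_Benchmark/InfographicsVQA",
    "DUE_Benchmark/KleisterCharity",
    "DUE_Benchmark/TabFact",
    "DUE_Benchmark/WikiTableQuestions",
    "TextCaps",
    "TextVQA",
    "VisualMRC",
]


def missing_image_dirs(root="ureader_images"):
    """Return the expected sub-directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Image layout looks complete.")
```

An empty list means every directory from the tree is present; the check does not inspect the files inside them.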

Checkpoint

The checkpoint is available on the Hugging Face model hub.

Training, Inference and Evaluation

Environment

Follow mPLUG-Owl to prepare your environment.

We validated the code with:

PyTorch 1.13.1
CUDA 11.7
transformers 4.29.1
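To compare a local environment against the versions listed above, a small check along these lines may be useful. It is a sketch under assumptions: the helper names are hypothetical, the PyPI distribution name for PyTorch is taken to be `torch`, and the CUDA version is not checked here.

```python
from importlib import metadata

# Versions the authors report validating with (CUDA is checked separately).
VALIDATED = {"torch": "1.13.1", "transformers": "4.29.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected=VALIDATED, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Compare only the part before any local suffix, e.g. "1.13.1+cu117".
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in version_mismatches().items():
        print(f"{package}: validated with {wanted}, found {found}")
```

A mismatch does not necessarily mean the code will fail; it only flags a difference from the validated setup.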

Training

Download the mPLUG-Owl checkpoint from https://huggingface.co/MAGAer13/mplug-owl-llama-7b and place it in checkpoints/mplug-owl-llama-7b.
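One possible way to fetch the checkpoint programmatically is the `snapshot_download` function from the `huggingface_hub` package. This is a sketch rather than the repository's own tooling: it assumes `huggingface_hub` is installed (it is not listed above), and the wrapper name `download_checkpoint` is hypothetical.

```python
REPO_ID = "MAGAer13/mplug-owl-llama-7b"
TARGET_DIR = "checkpoints/mplug-owl-llama-7b"


def download_checkpoint(repo_id=REPO_ID, target_dir=TARGET_DIR):
    """Download every file of a Hub model repository into target_dir."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the local path that holds the downloaded files.
    return snapshot_download(repo_id=repo_id, local_dir=target_dir)
```

Calling `download_checkpoint()` from the repository root should leave the files where the training scripts expect them, assuming network access to the Hub.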

For A100 80G

bash scripts/train_it.sh

For V100 32G

bash scripts/train_it_v100.sh

If you encounter NaN losses, pulling the latest version of the repository may help. We have uncommented the loss_mask to prevent NaN losses caused by overlong text inputs.
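As general background, and as an assumption rather than a description of this repository's code: one common way overlong inputs produce NaN is that truncation removes every supervised token, so the loss is averaged over zero positions. The pure-Python sketch below, with the hypothetical name `masked_mean_loss`, shows a masked average that guards against this case.

```python
def masked_mean_loss(token_losses, loss_mask, eps=1e-8):
    """Average per-token losses over positions where loss_mask is 1."""
    if len(token_losses) != len(loss_mask):
        raise ValueError("token_losses and loss_mask must have equal length")
    total = sum(loss * mask for loss, mask in zip(token_losses, loss_mask))
    count = sum(loss_mask)
    # Clamp the denominator: with an all-zero mask, an unguarded tensor
    # division would compute 0/0, which is NaN in floating-point arithmetic.
    return total / max(count, eps)
```

With a mask of `[0, 1, 1]` only the last two losses are averaged, and with an all-zero mask the function returns 0.0 instead of an undefined value.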

Inference

We provide an interface for building the model and processor in pipeline/interface.py. Refer to pipeline/evaluation.py for more specific usage.

An offline demo can be started with python -m app.

Evaluation

Install Java for pycocoevalcap:

sudo apt update
sudo apt install default-jdk
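To confirm that the installation succeeded before starting a long evaluation run, a quick check such as the following may help. It is an illustrative sketch; the function name `java_available` is hypothetical.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        # `java -version` writes its report to stderr rather than stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        print(result.stderr.strip() or result.stdout.strip())
    else:
        print("java not found; install a JDK before running the evaluation.")
```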

Download benchmark_files.zip and unzip it to benchmark_files.
Download ureader_json.zip and unzip it to ureader_json.
Pull the checkpoint from https://huggingface.co/Mizukiluke/ureader-v1/tree/main to checkpoints/ureader (if you have connection issues with Hugging Face, we provide a zip download link), or use --eval_checkpoint to specify the weights to evaluate.

The evaluation consists of two stages.

In the first stage, we export the model output by running NPROC_PER_NODE=1 bash scripts/eval/eval_benchmark.sh. You can also set distributed environment variables to enable distributed inference.

In the second stage, we evaluate the model output by running python -m pipeline.eval_utils.run_evaluation.
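The two stages can also be chained from a single Python entry point. The sketch below only wraps the two commands given above; the function names are hypothetical, and it assumes it is run from the repository root with the checkpoint and benchmark files already in place.

```python
import os
import subprocess
import sys


def evaluation_commands():
    """Return the two evaluation stages as (argv, extra_env) pairs, in order."""
    return [
        # Stage 1: export model outputs on a single process.
        (["bash", "scripts/eval/eval_benchmark.sh"], {"NPROC_PER_NODE": "1"}),
        # Stage 2: score the exported outputs.
        ([sys.executable, "-m", "pipeline.eval_utils.run_evaluation"], {}),
    ]


def run_evaluation(repo_root="."):
    """Run both stages from repo_root, stopping at the first failure."""
    for argv, extra_env in evaluation_commands():
        env = {**os.environ, **extra_env}
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so stage 2 never runs on incomplete stage 1 output.
        subprocess.run(argv, cwd=repo_root, env=env, check=True)
```

Stopping at the first failure is a deliberate choice here: scoring only makes sense once the export stage has finished cleanly.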

Citation

If you find this work useful, consider giving this repository a star and citing our papers as follows:

@misc{ye2023ureader,
      title={UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model}, 
      author={Jiabo Ye and Anwen Hu and Haiyang Xu and Qinghao Ye and Ming Yan and Guohai Xu and Chenliang Li and Junfeng Tian and Qi Qian and Ji Zhang and Qin Jin and Liang He and Xin Alex Lin and Fei Huang},
      year={2023},
      eprint={2310.05126},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{ye2023mplugdocowl,
      title={mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding},
      author={Jiabo Ye and Anwen Hu and Haiyang Xu and Qinghao Ye and Ming Yan and Yuhao Dan and Chenlin Zhao and Guohai Xu and Chenliang Li and Junfeng Tian and Qian Qi and Ji Zhang and Fei Huang},
      year={2023},
      eprint={2307.02499},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
