Multimodal (text + markup language) pre-training for Document AI
MarkupLM is a simple but effective multimodal pre-training method that jointly models text and markup language for visually-rich document understanding and information extraction tasks, such as webpage QA and webpage information extraction. MarkupLM achieves state-of-the-art results on multiple datasets, including WebSRC and SWDE. For more details, please refer to our paper:
MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding. Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. Preprint.
The overview of our framework is as follows:
And the core XPath Embedding Layer is as follows:
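To make the XPath input more concrete, here is a minimal sketch of how each text node of an HTML page can be paired with its XPath, and how that XPath can be split into per-depth tag names and subscripts before being embedded. It uses only the Python standard library. The class `XPathCollector`, the helper `xpath_to_units`, the padding values, and the "always 1-based subscript" convention are illustrative assumptions, not the preprocessing code shipped with MarkupLM.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are counted but not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class XPathCollector(HTMLParser):
    """Collects (text, xpath) pairs for every non-empty text node (illustrative)."""

    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, subscript) from the root to the current node
        self.child_counts = [{}]   # per open element: number of children seen per tag
        self.nodes = []            # collected (text, xpath) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            xpath = "".join(f"/{t}[{i}]" for t, i in self.stack)
            self.nodes.append((text, xpath))


def xpath_to_units(xpath, max_depth=6, pad_tag="<pad>", pad_sub=0):
    """Split an XPath into per-depth tag names and subscripts, padded to max_depth.

    The padding values and max_depth are assumptions made for this sketch.
    """
    tags, subs = [], []
    for step in (s for s in xpath.split("/") if s):
        name, _, rest = step.partition("[")
        tags.append(name)
        subs.append(int(rest.rstrip("]")))
    tags = tags[:max_depth] + [pad_tag] * max(0, max_depth - len(tags))
    subs = subs[:max_depth] + [pad_sub] * max(0, max_depth - len(subs))
    return tags, subs


collector = XPathCollector()
collector.feed("<html><body><div><p>Hello</p><p>World</p></div></body></html>")
for text, xpath in collector.nodes:
    print(text, xpath, xpath_to_units(xpath))
```

In this sketch the two paragraphs receive the XPaths `/html[1]/body[1]/div[1]/p[1]` and `/html[1]/body[1]/div[1]/p[2]`, so tokens from sibling nodes differ only in the last subscript. Each depth position then contributes one tag unit and one subscript unit to an embedding lookup.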
******* New Nov 22nd, 2021: Initial release of pre-trained models and fine-tuning code for MarkupLM *******
We pre-train MarkupLM on a subset of the CommonCrawl dataset.
Name | HuggingFace |
---|---|
MarkupLM-Base | microsoft/markuplm-base |
MarkupLM-Large | microsoft/markuplm-large |
An example might be `model = markuplm.from_pretrained("microsoft/markuplm-base")`.
conda create -n markuplmft python=3.7
conda activate markuplmft
git clone https://github.com/microsoft/unilm.git
cd unilm
cd markuplm
pip install -r requirements.txt
pip install -e .
Download the dataset from the official website.
Extract release.zip to /Path/To/WebSRC.
Download dataset_split.json from this link and put it into /Path/To/WebSRC.
cd ./examples/fine_tuning/run_websrc
python dataset_generation.py --root_dir /Path/To/WebSRC --version websrc1.0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run.py \
--train_file /Path/To/WebSRC/websrc1.0_train_.json \
--predict_file /Path/To/WebSRC/websrc1.0_dev_.json \
--root_dir /Path/To/WebSRC \
--model_name_or_path microsoft/markuplm-large \
--output_dir /Your/Output/Path \
--do_train \
--do_eval \
--eval_all_checkpoints \
--per_gpu_train_batch_size 8 \
--warmup_ratio 0.1 \
--num_train_epochs 5
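When adapting the command above to a different number of GPUs, it helps to work out the effective batch size and the number of warmup steps. The sketch below does this arithmetic under two assumptions that are not confirmed by this README: that the per-GPU batch size is multiplied by the number of visible GPUs, and that `--warmup_ratio` is a fraction of the total number of optimizer steps. The function name `training_schedule` and the example count of 100,000 training features are hypothetical.

```python
import math


def training_schedule(num_examples, per_gpu_batch_size, num_gpus,
                      num_epochs, warmup_ratio, grad_accum_steps=1):
    """Return (effective_batch_size, total_steps, warmup_steps).

    Assumes the effective batch is per-GPU batch x GPUs x accumulation steps,
    and that warmup covers `warmup_ratio` of all optimizer steps.
    """
    effective_batch = per_gpu_batch_size * num_gpus * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Settings from the command above: 8 GPUs, batch size 8 per GPU, 5 epochs,
# warmup ratio 0.1. The number of training features (100_000) is made up.
print(training_schedule(100_000, per_gpu_batch_size=8, num_gpus=8,
                        num_epochs=5, warmup_ratio=0.1))
```

With fewer GPUs, `--per_gpu_train_batch_size` can be raised so that the effective batch size stays the same, memory permitting.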
Download the dataset from the official website.
Update: the above website is down; please use this backup.
Unzip swde.zip and extract everything in /sourceCode. Make sure that folders such as auto / book / camera ... sit directly under this directory; we refer to this path as /Path/To/SWDE.
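Before running the packing scripts, a quick check that the extracted directory has the expected layout can save a failed run. The following is a minimal sketch; the helper name `missing_verticals` is hypothetical, and the list only contains the vertical names mentioned in this README (auto, book, camera, nbaplayer), so it is not a complete list of SWDE verticals.

```python
from pathlib import Path

# Vertical folder names mentioned in this README; extend as needed.
EXPECTED_VERTICALS = ("auto", "book", "camera", "nbaplayer")


def missing_verticals(swde_root, expected=EXPECTED_VERTICALS):
    """Return the expected vertical folders that are absent under swde_root."""
    root = Path(swde_root)
    return [name for name in expected if not (root / name).is_dir()]


missing = missing_verticals("/Path/To/SWDE")
if missing:
    print("Missing vertical folders:", ", ".join(missing))
else:
    print("All expected vertical folders were found.")
```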
cd ./examples/fine_tuning/run_swde
python pack_data.py \
--input_swde_path /Path/To/SWDE \
--output_pack_path /Path/To/SWDE/swde.pickle
python prepare_data.py \
--input_groundtruth_path /Path/To/SWDE/groundtruth \
--input_pickle_path /Path/To/SWDE/swde.pickle \
--output_data_path /Path/To/Processed_SWDE
The processed data is then available in /Path/To/Processed_SWDE.
Take seed=1, vertical=nbaplayer as an example.
CUDA_VISIBLE_DEVICES=0,1 python run.py \
--root_dir /Path/To/Processed_SWDE \
--vertical nbaplayer \
--n_seed 1 \
--n_pages 2000 \
--prev_nodes_into_account 4 \
--model_name_or_path microsoft/markuplm-base \
--output_dir /Your/Output/Path \
--do_train \
--do_eval \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--num_train_epochs 10 \
--learning_rate 2e-5 \
--save_steps 1000000 \
--warmup_ratio 0.1 \
--overwrite_output_dir
Some of the baseline results are from Chen et al., 2021.
Model | EM | F1 | POS |
---|---|---|---|
H-PLM (RoBERTa-Large) | 69.57 | 74.13 | 85.93 |
H-PLM (ELECTRA-Large) | 70.12 | 74.14 | 86.33 |
V-PLM (ELECTRA-Large) | 73.22 | 76.16 | 87.06 |
MarkupLM-Large | 74.43 | 80.54 | 90.15 |
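For readers unfamiliar with the EM and F1 columns, the sketch below shows how exact match and token-level F1 are commonly computed for extractive QA, following the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles). This is an illustrative assumption: the official WebSRC evaluation script may normalize answers differently, and POS (path overlap score) is not covered here.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Price: $20", "price 20"))
print(token_f1("20 dollars", "20"))
```

The dataset-level scores are then the averages of these per-question values, reported as percentages.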
The metric is page-level F1.
Model \ #Seed Sites | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Render-Full (Hao et al., 2011) | 84.30 | 86.00 | 86.80 | 88.40 | 88.60 |
FreeDOM-Full (Lin et al., 2020) | 82.32 | 86.36 | 90.49 | 91.29 | 92.56 |
SimpDOM (Zhou et al., 2021) | 83.06 | 88.96 | 91.63 | 92.84 | 93.75 |
MarkupLM-Large | 85.71 | 93.57 | 96.12 | 96.71 | 97.37 |
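As a rough illustration of page-level F1, the sketch below scores a single attribute over a set of pages: a page counts as correct when the extracted value equals the ground-truth value, precision is taken over pages with a prediction, and recall over pages with a ground-truth value. This is a simplified assumption (one value per page and attribute, exact string comparison); the evaluation code in the fine-tuning scripts may differ in these details. The function name and the page identifiers are hypothetical.

```python
def page_level_f1(predictions, ground_truth):
    """Page-level F1 for one attribute.

    Both arguments map a page id to the extracted value, or to None when
    nothing was extracted / annotated for that page.
    """
    predicted_pages = [p for p, v in predictions.items() if v is not None]
    gold_pages = [p for p, v in ground_truth.items() if v is not None]
    correct = sum(
        1
        for p in predicted_pages
        if ground_truth.get(p) is not None and predictions[p] == ground_truth[p]
    )
    precision = correct / len(predicted_pages) if predicted_pages else 0.0
    recall = correct / len(gold_pages) if gold_pages else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: one correct page, one wrong value, one missed page.
preds = {"page1": "A", "page2": "B", "page3": None}
gold = {"page1": "A", "page2": "C", "page3": "D"}
print(page_level_f1(preds, gold))
```

In this made-up example precision is 1/2 and recall is 1/3, giving an F1 of 0.4.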
If you find MarkupLM useful in your research, please cite the following paper:
@article{li2021markuplm,
  title={MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding},
  author={Junlong Li and Yiheng Xu and Lei Cui and Furu Wei},
  year={2021},
  eprint={2110.08518},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project. Microsoft Open Source Code of Conduct
For help or issues using MarkupLM, please submit a GitHub issue.
For other communications related to MarkupLM, please contact Lei Cui ([email protected]) or Furu Wei ([email protected]).