LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Code
- github.com/microsoft/unilm — official, mentioned in paper (PyTorch, ★ 22,060)
- github.com/huggingface/transformers (PyTorch, ★ 158,292)
- github.com/PaddlePaddle/PaddleOCR (PaddlePaddle, ★ 72,845)
- github.com/facebookresearch/data2vec_vision (PyTorch, ★ 80)
- github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/layoutlmv2 (PaddlePaddle, ★ 0)
- github.com/pwc-1/Paper-9/tree/main/layoutlmv2 (MindSpore, ★ 0)
- github.com/MindSpore-scientific/code-7/tree/main/LayoutLMv2 (framework not listed, ★ 0)
- github.com/MindSpore-scientific-2/code-14/tree/main/layoutlmv2 (MindSpore, ★ 0)
- github.com/MS-P3/code3/tree/main/layoutlmv2 (MindSpore, ★ 0)
Abstract
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). The model and code are publicly available at https://aka.ms/layoutlmv2.
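The huggingface/transformers repository listed above ships a LayoutLMv2 integration. The following is a minimal inference sketch, assuming the public `microsoft/layoutlmv2-base-uncased` checkpoint, an illustrative label count, and that `pytesseract` and `detectron2` are installed (the processor runs OCR on the page image and the model applies a visual backbone); it is not the authors' training or evaluation code.

```python
# Minimal LayoutLMv2 token-classification sketch via Hugging Face transformers.
# Checkpoint name and num_labels are illustrative assumptions, not taken from the paper.
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    num_labels=7,  # e.g. BIO tags for FUNSD header/question/answer entities
)

image = Image.open("form.png").convert("RGB")      # hypothetical scanned form image
encoding = processor(image, return_tensors="pt")   # OCR words + bounding boxes + resized image

outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)            # per-token label ids
```

For entity extraction on a dataset such as FUNSD, the classification head would first be fine-tuned on labeled forms; the pre-trained checkpoint only provides the multi-modal encoder.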
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| RVL-CDIP | LayoutLMv2 (LARGE) | Accuracy | 95.64 | — | Unverified |
| RVL-CDIP | LayoutLMv2 (BASE) | Accuracy | 95.25 | — | Unverified |
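As a starting point for verifying the RVL-CDIP rows above, one could evaluate a fine-tuned checkpoint through the same `transformers` integration. The sketch below is illustrative only: the checkpoint name, the 16-class label count, and the `predict` helper are assumptions, and the base checkpoint must be fine-tuned on the RVL-CDIP training split before it approaches the reported accuracies.

```python
# Hedged sketch of single-page document classification for RVL-CDIP (16 classes).
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    num_labels=16,  # RVL-CDIP document categories
)
model.eval()

def predict(image_path: str) -> int:
    """Return the predicted class id for one scanned page."""
    image = Image.open(image_path).convert("RGB")
    encoding = processor(image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    return int(logits.argmax(-1))
```

Test-set accuracy would then be the fraction of pages whose predicted class matches the ground-truth label.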