LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Code
- github.com/microsoft/unilm — official, mentioned in paper (PyTorch, ★ 22,060)
- github.com/huggingface/transformers (PyTorch, ★ 158,292)
- github.com/PaddlePaddle/PaddleOCR (PaddlePaddle, ★ 72,845)
- github.com/facebookresearch/data2vec_vision (PyTorch, ★ 80)
- github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/layoutlmv2 (PaddlePaddle, ★ 0)
- github.com/pwc-1/Paper-9/tree/main/layoutlmv2 (MindSpore, ★ 0)
- github.com/MindSpore-scientific/code-7/tree/main/LayoutLMv2 (framework not listed, ★ 0)
- github.com/MindSpore-scientific-2/code-14/tree/main/layoutlmv2 (MindSpore, ★ 0)
- github.com/MS-P3/code3/tree/main/layoutlmv2 (MindSpore, ★ 0)
Abstract
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). The model and code are publicly available at https://aka.ms/layoutlmv2.
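The huggingface/transformers repository listed above ships a LayoutLMv2 integration. The following is a minimal inference sketch, assuming the public `microsoft/layoutlmv2-base-uncased` checkpoint, an illustrative label count, and that `pytesseract` and `detectron2` are installed (the processor runs OCR on the page image and the model applies a visual backbone); it is not the authors' training or evaluation code.

```python
# Minimal LayoutLMv2 token-classification sketch via Hugging Face transformers.
# Checkpoint name and num_labels are illustrative assumptions, not taken from the paper.
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    num_labels=7,  # e.g. BIO tags for FUNSD header/question/answer entities
)

image = Image.open("form.png").convert("RGB")      # hypothetical scanned form image
encoding = processor(image, return_tensors="pt")   # OCR words + bounding boxes + resized image

outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)            # per-token label ids
```

For entity extraction on a dataset such as FUNSD, the classification head would first be fine-tuned on labeled forms; the pre-trained checkpoint only provides the multi-modal encoder.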
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| RVL-CDIP | LayoutLMv2 (LARGE) | Accuracy | 95.64 | — | Unverified |
| RVL-CDIP | LayoutLMv2 (BASE) | Accuracy | 95.25 | — | Unverified |
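As a starting point for verifying the RVL-CDIP rows above, one could evaluate a fine-tuned checkpoint through the same `transformers` integration. The sketch below is illustrative only: the checkpoint name, the 16-class label count, and the `predict` helper are assumptions, and the base checkpoint must be fine-tuned on the RVL-CDIP training split before it approaches the reported accuracies.

```python
# Hedged sketch of single-page document classification for RVL-CDIP (16 classes).
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    num_labels=16,  # RVL-CDIP document categories
)
model.eval()

def predict(image_path: str) -> int:
    """Return the predicted class id for one scanned page."""
    image = Image.open(image_path).convert("RGB")
    encoding = processor(image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    return int(logits.argmax(-1))
```

Test-set accuracy would then be the fraction of pages whose predicted class matches the ground-truth label.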