StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

2023-03-01

Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang



Code

github.com/PaddlePaddle/VIMER/tree/main/StrucTexT/v2
Official implementation (Paddle)

Abstract

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of our pre-training tasks are reconstructing the pixels of masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics in comparison to the masked image modeling that usually predicts the masked image patches. Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
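The text region-level masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_text_regions`, the nested-list grayscale image format, the `mask_ratio` and `fill` parameters, and the exclusive box convention `(x0, y0, x1, y1)` are all assumptions made for illustration. It only shows the idea of picking a random subset of word bounding boxes and blanking the pixels inside them; in pre-training, the original pixels and the word tokens of the selected regions would serve as the reconstruction targets.

```python
import random


def mask_text_regions(image, word_boxes, mask_ratio=0.3, fill=0, seed=None):
    """Blank out a random subset of word regions (illustrative sketch only).

    image: 2D list of pixel rows (hypothetical grayscale format).
    word_boxes: list of (x0, y0, x1, y1) boxes; x1 and y1 are exclusive.
    mask_ratio: assumed fraction of word boxes to mask (not from the paper).
    Returns (masked_image, masked_indices); the input image is not modified.
    """
    rng = random.Random(seed)
    n_mask = round(len(word_boxes) * mask_ratio)
    masked_indices = sorted(rng.sample(range(len(word_boxes)), n_mask))

    masked = [row[:] for row in image]  # copy each row so the input stays intact
    height = len(masked)
    width = len(masked[0]) if height else 0

    for i in masked_indices:
        x0, y0, x1, y1 = word_boxes[i]
        # Clip the box to the image bounds before overwriting pixels.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill
    return masked, masked_indices


# Toy example: a 4x8 white image with four non-overlapping word boxes.
image = [[255] * 8 for _ in range(4)]
boxes = [(0, 0, 2, 2), (3, 0, 5, 2), (0, 2, 4, 4), (5, 2, 8, 4)]
masked, idx = mask_text_regions(image, boxes, mask_ratio=0.5, seed=0)
```

Here `idx` lists which word boxes were masked, so a training loop could look up the matching pixel crops and tokens as targets for the two prediction objectives.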

Tasks

Document Image Classification, Image Classification, Language Modeling, Masked Language Modeling, Optical Character Recognition (OCR), Semantic Entity Labeling

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
RVL-CDIP	StrucTexTv2 (large)	Accuracy	94.62	—	Unverified
RVL-CDIP	StrucTexTv2 (small)	Accuracy	93.4	—	Unverified
