DocFormer: End-to-End Transformer for Document Understanding

2021-06-22 · ICCV 2021 · Code Available

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha


Code

github.com/shabie/docformer
PyTorch · ★ 288

Abstract

We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks that encourage multi-modal interaction. It uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text with visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
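The idea of sharing spatial embeddings across modalities can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head layout, the additive spatial term in the attention scores, the final sum of the two branches, and all names (`attention_with_spatial`, `w_sq`, `w_sk`) are assumptions made for illustration, and random arrays stand in for learned text, visual, and bounding-box embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_spatial(features, spatial, w_q, w_k, w_v, w_sq, w_sk):
    """Single-head self-attention whose scores include a spatial term.

    Assumption: content scores (q.k) and spatial scores (sq.sk) are
    simply added before the softmax. The paper's layer may differ.
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    sq, sk = spatial @ w_sq, spatial @ w_sk
    scores = (q @ k.T + sq @ sk.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8

# Random stand-ins for learned per-token embeddings (hypothetical data).
text = rng.normal(size=(n_tokens, d_model))
visual = rng.normal(size=(n_tokens, d_model))
spatial = rng.normal(size=(n_tokens, d_model))  # one box embedding per token

# Each modality has its own content projections (q, k, v) ...
text_w = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
visual_w = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
# ... but both reuse the same spatial input and spatial projections.
w_sq = rng.normal(size=(d_model, d_model))
w_sk = rng.normal(size=(d_model, d_model))

text_out = attention_with_spatial(text, spatial, *text_w, w_sq, w_sk)
visual_out = attention_with_spatial(visual, spatial, *visual_w, w_sq, w_sk)

# Assumed fusion step: sum the two branches token by token.
fused = text_out + visual_out
print(fused.shape)
```

Because both branches receive the same `spatial` array and the same `w_sq` and `w_sk`, a text token and a visual feature at the same position contribute an identical spatial score, which is one way the correlation between text and visual tokens described above could arise.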

Tasks

Document Image Classification, Document Understanding

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
RVL-CDIP	DocFormer (base)	Accuracy	96.17	—	Unverified
RVL-CDIP	DocFormer (large)	Accuracy	95.5	—	Unverified
