Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka
Code
- github.com/uakarsh/TiLT-Implementation (PyTorch) ★ 18
Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
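The abstract's central idea is representing layout as an attention bias: pairwise spatial relationships between token bounding boxes are injected directly into the Transformer's attention scores rather than into the token embeddings. The sketch below illustrates that idea under illustrative assumptions; the class name, bucketing scheme, and sizes are hypothetical and simplified (T5-style relative buckets applied to 2D offsets), not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class Layout2DAttentionBias(nn.Module):
    """Hypothetical sketch of layout-as-attention-bias.

    Maps pairwise horizontal/vertical offsets between token box centers
    to learned per-head biases that are added to the QK^T attention scores.
    Bucketing and hyperparameters are illustrative, not TILT's exact ones.
    """

    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: int = 128):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        # Separate learned bias tables for horizontal and vertical offsets.
        self.h_bias = nn.Embedding(num_buckets, num_heads)
        self.v_bias = nn.Embedding(num_buckets, num_heads)

    def _bucket(self, rel: torch.Tensor) -> torch.Tensor:
        # Signed, log-spaced bucketing: nearby offsets get finer resolution.
        half = self.num_buckets // 2
        sign = (rel > 0).long() * half
        mag = rel.abs().clamp(max=self.max_distance - 1).float()
        denom = torch.log1p(torch.tensor(float(self.max_distance - 1)))
        scaled = (torch.log1p(mag) / denom * (half - 1)).long()
        return sign + scaled

    def forward(self, x_centers: torch.Tensor, y_centers: torch.Tensor) -> torch.Tensor:
        # x_centers, y_centers: (seq_len,) token bounding-box centers in pixels.
        dx = x_centers[None, :] - x_centers[:, None]  # (seq, seq) signed offsets
        dy = y_centers[None, :] - y_centers[:, None]
        bias = self.h_bias(self._bucket(dx)) + self.v_bias(self._bucket(dy))
        # (seq, seq, heads) -> (heads, seq, seq), ready to add to attention logits.
        return bias.permute(2, 0, 1)
```

In use, the returned tensor would be broadcast over the batch dimension and summed with the raw attention logits before the softmax, leaving the rest of a pretrained encoder-decoder Transformer unchanged — which is what lets the approach reuse a standard T5-style backbone.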
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| RVL-CDIP | TILT-Large | Accuracy | 95.52 | — | Unverified |
| RVL-CDIP | TILT-Base | Accuracy | 95.25 | — | Unverified |