SOTAVerified

DocVQA: A Dataset for VQA on Document Images

2020-07-01

Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar


Abstract

We present a new dataset for Visual Question Answering (VQA) on document images, called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. We present a detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension, and report several baseline results by adapting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models particularly need to improve on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org


Benchmark Results

Dataset       Model                                       Metric    Claimed  Verified  Status
DocVQA test   Human                                       ANLS      0.94     -         Unverified
DocVQA test   BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline  ANLS      0.67     -         Unverified
DocVQA val    BERT LARGE Baseline                         Accuracy  54.48    -         Unverified
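The ANLS column above refers to Average Normalized Levenshtein Similarity, the evaluation metric used by the DocVQA benchmark: each prediction is scored as 1 minus the normalized edit distance to the closest ground-truth answer, zeroed out when the distance reaches a threshold of 0.5, then averaged over questions. A minimal sketch of that computation (function names and the lowercasing/stripping normalization are illustrative, not the official evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions:  list of predicted answer strings, one per question
    gold_answers: list of lists of acceptable ground-truth strings
    tau:          threshold above which a normalized distance scores 0
    """
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            if nl < tau:  # close enough to count as a (partial) match
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)
```

For example, an exact match scores 1.0, a prediction differing by one character in four scores 0.75, and a completely different answer scores 0.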

Reproductions