Efficient layout-aware pretraining for multimodal form understanding
Anonymous
Abstract
Layout-aware language models have been used to create multimodal representations for documents available only in image form, achieving high accuracy on document understanding tasks. However, the large number of parameters in the resulting models makes training and deploying them prohibitively expensive without access to high-performance accelerators with large memory capacity. We propose an alternative approach that creates efficient representations without the need for a neural visual backbone. This yields an 80% reduction in parameter count compared to the smallest SOTA model, greatly expanding applicability. Despite using only 2.5% of the training data, we show competitive performance on two form understanding tasks: semantic labeling and link prediction.