SOTAVerified

OTE: Exploring Accurate Scene Text Recognition Using One Token

2024-01-01CVPR 2024Code Available0· sign in to hype

Jianjun Xu, Yuxin Wang, Hongtao Xie, Yongdong Zhang

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

In this paper we propose a novel framework to fully exploit the potential of a single vector for scene text recognition (STR). Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to represent scene text images we prove that just one token is enough to characterize the entire text image and achieve accurate text recognition. Based on this insight we introduce a new paradigm for STR called One Token rEcognizer (OTE). Specifically we implement an image-to-vector encoder to extract the fine-grained global semantics eliminating the need for sequential features. Furthermore an elegant yet potent vector-to-sequence decoder is designed to adaptively diffuse global semantics to corresponding character locations enabling both autoregressive and non-autoregressive decoding schemes. By executing decoding within a high-level representational space our vector-to-sequence (V2S) approach avoids the alignment issues between visual tokens and character embeddings prevalent in traditional sequence-to-sequence methods. Remarkably due to introducing character-wise fine-grained information such global tokens also boost the performance of scene text retrieval tasks. Extensive experiments on synthetic and real datasets demonstrate the effectiveness of our method by achieving new state-of-the-art results on various public STR benchmarks. Our code is available at https://github.com/Xu-Jianjun/OTE.

Tasks

Reproductions