
Self-supervised Character-to-Character Distillation for Text Recognition

2022-11-01 · ICCV 2023 · Code Available

Tongkun Guan, Wei Shen, Xue Yang, Qi Feng, Zekun Jiang, Xiaokang Yang


Abstract

When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits recognition performance. Therefore, learning robust text feature representations from unlabeled real images via self-supervised learning is a promising solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between the two augmented views of an image. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code is available at https://github.com/TongkunGuan/CCD.
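The pairwise character alignment described in the abstract relies on the known geometric transforms applied to produce the two augmented views. A minimal sketch of that coordinate bookkeeping (not the authors' implementation; the affine matrices and character centers below are hypothetical) is to lift each 2x3 affine augmentation to homogeneous form and map view-1 coordinates into view-2 coordinates via the relative transform:

```python
import numpy as np

def to_homogeneous(A):
    """Lift a 2x3 affine matrix to 3x3 homogeneous form."""
    return np.vstack([A, [0.0, 0.0, 1.0]])

def relative_transform(A1, A2):
    """Transform taking view-1 coordinates to view-2 coordinates.

    A point x in the original image appears at A1 @ x in view 1 and
    A2 @ x in view 2, so view 1 -> view 2 is A2 @ inv(A1).
    """
    return to_homogeneous(A2) @ np.linalg.inv(to_homogeneous(A1))

def warp_points(T, pts):
    """Apply a 3x3 homogeneous transform to an (N, 2) array of points."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return (homo @ T.T)[:, :2]

# Hypothetical augmentations: view 1 is a small rotation plus shift,
# view 2 is a scale plus shift.
theta = np.deg2rad(10)
A1 = np.array([[np.cos(theta), -np.sin(theta), 2.0],
               [np.sin(theta),  np.cos(theta), 1.0]])
A2 = np.array([[1.2, 0.0, -3.0],
               [0.0, 1.2,  5.0]])

# Hypothetical character centers (e.g., produced by a character
# segmentation module) detected in view 1, in view-1 coordinates.
chars_view1 = np.array([[10.0, 4.0], [18.0, 4.5], [26.0, 5.0]])

# Matching positions in view 2, so character-level features from the
# two views can be paired despite aggressive geometric augmentation.
T = relative_transform(A1, A2)
chars_view2 = warp_points(T, chars_view1)
```

In practice, CCD applies this kind of alignment to dense character regions rather than single center points, which is what lets it use large geometric augmentations without breaking feature correspondence.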

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CUTE80 | CCD-ViT-Small (ARD_2.8M) | Accuracy | 98.3 | — | Unverified |
| CUTE80 | CCD-ViT-Tiny (ARD_2.8M) | Accuracy | 95.8 | — | Unverified |
| CUTE80 | CCD-ViT-Base (ARD_2.8M) | Accuracy | 98.3 | — | Unverified |
| HOST | CCD-ViT-Base | 1:1 Accuracy | 77.3 | — | Unverified |
| ICDAR2013 | CCD-ViT-Tiny (ARD_2.8M) | Accuracy | 97.5 | — | Unverified |
| ICDAR2013 | CCD-ViT-Base (ARD_2.8M) | Accuracy | 98.3 | — | Unverified |
| ICDAR2013 | CCD-ViT-Small (ARD_2.8M) | Accuracy | 98.3 | — | Unverified |
| IIIT5k | CCD-ViT-Tiny (ARD_2.8M) | Accuracy | 97.1 | — | Unverified |
| IIIT5k | CCD-ViT-Small (ARD_2.8M) | Accuracy | 98 | — | Unverified |
| IIIT5k | CCD-ViT-Base (ARD_2.8M) | Accuracy | 98 | — | Unverified |
| SVT | CCD-ViT-Base (ARD_2.8M) | Accuracy | 97.8 | — | Unverified |
| SVT | CCD-ViT-Tiny (ARD_2.8M) | Accuracy | 96 | — | Unverified |
| SVT | CCD-ViT-Small (ARD_2.8M) | Accuracy | 96.4 | — | Unverified |
| SVTP | CCD-ViT-Base | Accuracy | 96.1 | — | Unverified |
| SVTP | CCD-ViT-Small | Accuracy | 92.7 | — | Unverified |
| SVTP | CCD-ViT-Tiny | Accuracy | 91.6 | — | Unverified |
| WOST | CCD-ViT-Base | 1:1 Accuracy | 86 | — | Unverified |
