SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
Zhentao Tan, Ben Xue, Jian Jia, Junhao Wang, Wencai Ye, Shaoyun Shi, MingJie Sun, Wenjin Wu, Quan Chen, Peng Jiang
Abstract
This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. Firstly, to obtain compact latent representations, we decouple images or videos into spatial and temporal dimensions, translating visual information into learnable spatial and temporal query tokens through a Cross-attention Query AutoEncoder (CQAE). Secondly, to complement visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings, leveraging the rich semantics of the language modality. Finally, to enhance training stability and convergence, we introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only 25% of the tokens used in previous state-of-the-art video tokenizers, and boosts video generation results by 32.9% w.r.t. gFVD. With the same number of tokens, it significantly improves video and image reconstruction results, by 57.1% w.r.t. rFVD on UCF-101 and 37.2% w.r.t. rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
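The two core ideas above — compressing patch features into a small set of learnable query tokens via cross-attention, then discretizing them against a codebook built from LLM embeddings — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: all sizes (256 patches, 16 queries, a 1000-entry codebook) and the random stand-ins for learned weights and LLM embeddings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Each query token attends over all patch features and aggregates them."""
    # queries: (n_query, d), features: (n_patch, d)
    attn = softmax(queries @ features.T / np.sqrt(queries.shape[-1]))
    return attn @ features  # (n_query, d): one compact latent per query

# Hypothetical sizes: 256 spatial patch features compressed into 16 query tokens.
d = 32
patch_feats = rng.standard_normal((256, d))       # encoder output (stand-in)
spatial_queries = rng.standard_normal((16, d))    # learnable parameters in practice
latents = cross_attention(spatial_queries, patch_feats)

# Vector quantization: snap each latent to its nearest codebook entry.
# The paper derives the codebook from off-the-shelf LLM embeddings;
# random vectors stand in for them here.
codebook = rng.standard_normal((1000, d))
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
token_ids = dists.argmin(axis=1)                  # one discrete token id per query
quantized = codebook[token_ids]                   # (16, d) quantized latents

print(latents.shape, token_ids.shape)             # (16, 32) (16,)
```

The compression ratio is set by the query count: 16 discrete tokens stand in for 256 patches here, independent of input resolution, which is how query-based tokenizers decouple token budget from patch count.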