Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering
Zhihao Yao, Jixuan Yin, Bo Li
Abstract
Short text clustering has gained significant attention in the data mining community. However, the limited valuable information contained in short texts often leads to low-discriminative representations, increasing the difficulty of clustering. This paper proposes a novel short text clustering framework, called Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering (POTA), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, POTA first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a semantic consistency regularization term into an optimal transport (OT) problem. Solving this OT problem yields reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT formulation can adaptively estimate cluster distributions, making POTA well-suited to datasets with varying degrees of imbalance. We then use the pseudo-labels to guide contrastive learning, which generates discriminative representations and achieves efficient clustering. Extensive experiments demonstrate that POTA outperforms state-of-the-art methods. The code is available at: https://github.com/YZH0905/POTA-STC/tree/main.
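To make the idea of OT-based pseudo-labeling concrete, the following is a minimal sketch of the generic entropic-OT (Sinkhorn) baseline, not the POTA formulation itself. It assumes uniform sample and cluster marginals, so it omits both the attention-based semantic consistency term and the adaptive cluster distribution estimation described above. The function name `sinkhorn_pseudo_labels` and the parameters `epsilon` and `n_iters` are hypothetical and chosen for illustration only.

```python
def sinkhorn_pseudo_labels(probs, epsilon=0.5, n_iters=50):
    """Balanced entropic-OT pseudo-labeling (illustrative baseline only).

    probs: N rows, each holding K predicted cluster probabilities.
    Returns the transport plan (N x K) and one hard label per sample.
    Assumption: uniform marginals (1/N per sample, 1/K per cluster).
    """
    n, k = len(probs), len(probs[0])
    # Gibbs kernel for the cost -log p: exp(log p / epsilon) = p ** (1 / epsilon).
    plan = [[max(p, 1e-12) ** (1.0 / epsilon) for p in row] for row in probs]
    row_target, col_target = 1.0 / n, 1.0 / k
    for _ in range(n_iters):
        # Scale columns so every cluster receives mass 1/K.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(k)]
        plan = [[row[j] * col_target / col_sums[j] for j in range(k)]
                for row in plan]
        # Scale rows so every sample carries mass 1/N.
        plan = [[v * row_target / sum(row) for v in row] for row in plan]
    # Hard pseudo-label: the cluster that receives most of the sample's mass.
    labels = [max(range(k), key=row.__getitem__) for row in plan]
    return plan, labels


# Every sample prefers cluster 0, so a plain argmax would put all four there.
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
plan, labels = sinkhorn_pseudo_labels(probs)
print(labels)  # [0, 0, 1, 1]
```

In this toy input, the balance constraint moves the two least confident samples to cluster 1, which illustrates how sample-to-cluster global structure can override per-sample predictions. The uniform cluster marginal is exactly the assumption that a method targeting imbalanced datasets would need to relax.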