A Survey on Bridging VLMs and Synthetic Data
Mohammad Ghiasvand Mohammadkhani, Saeedeh Momtazi, Hamid Beigy
Abstract
Vision-language models (VLMs) have significantly advanced multimodal AI by learning joint representations of visual and textual data. However, their progress is hindered by the difficulty of acquiring high-quality, aligned datasets, including issues of cost, privacy, and scarcity. Synthetic data, created with generative AI (which can itself include VLMs), offers a scalable and cost-effective solution to these challenges. This paper presents the first comprehensive survey on bridging VLMs and synthetic data, examining both the role of synthetic data in training VLMs and the role of VLMs in generating synthetic data. We first provide a preliminary overview of the architectures of two representative VLMs; then, drawing on a large body of prior work, we offer an extensive survey of previously proposed methodologies and potential future directions in this area.