SOTAVerified

multimodal generation

Multimodal generation is the task of producing outputs that span multiple modalities, such as images, text, and sound. It typically relies on deep learning models trained on data covering several modalities, so that the generated output is informed by more than one type of data.

For example, a multimodal generation model could be trained to caption images, taking visual information as input and producing text as output. Such a model could learn to identify objects in the image and describe them in natural language, while also accounting for context and the relationships between the objects.

Multimodal generation also covers other applications, such as generating realistic images from textual descriptions or generating audio descriptions of video content. Because these models draw on complementary information from each modality, their output can be more accurate and comprehensive, which makes them useful for a wide range of applications.

Title	Date	Tasks	Status	Hype
RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction	Dec 24, 2024	Image Generation, multimodal generation	Unverified	0
D-Judge: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance	Dec 23, 2024	multimodal generation	Code Available	0
LMFusion: Adapting Pretrained Language Models for Multimodal Generation	Dec 19, 2024	Image Generation, multimodal generation	Unverified	0
Multimodal Latent Language Modeling with Next-Token Diffusion	Dec 11, 2024	Image Generation, Language Modeling	Code Available	0
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation	Nov 27, 2024	Image Generation, multimodal generation	Code Available	1
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis	Nov 26, 2024	Decoder, multimodal generation	Unverified	0
Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines	Nov 25, 2024	multimodal generation, RAG	Code Available	1
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains	Nov 22, 2024	Benchmarking, Caption Generation	Unverified	0
A Survey on Vision Autoregressive Model	Nov 13, 2024	3D Generation, Benchmarking	Unverified	0
A Survey of Emerging Approaches and Advances in Video Generation	Nov 9, 2024	Image to Video Generation, Language Modeling	Unverified	0
