SOTAVerified

Caption Generation

Papers

Showing 76-100 of 310 papers

Title | Status | Hype
Fine-Grained Video Captioning through Scene Graph Consolidation | - | 0
LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models | - | 0
Enhancing Chest X-ray Classification through Knowledge Injection in Cross-Modality Learning | - | 0
FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning | - | 0
Expertized Caption Auto-Enhancement for Video-Text Retrieval | Code | 0
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023 | - | 0
MAMS: Model-Agnostic Module Selection Framework for Video Captioning | - | 0
Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing | - | 0
Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing | - | 0
Multi-LLM Collaborative Caption Generation in Scientific Documents | Code | 0
Time Series Language Model for Descriptive Caption Generation | - | 0
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning | - | 0
Multimodal Preference Data Synthetic Alignment with Reward Model | Code | 0
Learning from Massive Human Videos for Universal Humanoid Pose Control | - | 0
From Simple to Professional: A Combinatorial Controllable Image Captioning Agent | Code | 0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | - | 0
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | - | 0
Everything is a Video: Unifying Modalities through Next-Frame Prediction | - | 0
Grounded Video Caption Generation | - | 0
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Code | 0
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning | - | 0
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer | - | 0
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | - | 0
Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion | Code | 0
See It All: Contextualized Late Aggregation for 3D Dense Captioning | - | 0
Page 4 of 13

No leaderboard results yet.