SOTAVerified

Caption Generation

Papers

Showing 1–50 of 310 papers

Title | Status | Hype
----- | ------ | ----
GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning | | 0
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Code | 2
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning | Code | 2
EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits | | 0
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation | | 0
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion | Code | 2
VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Code | 1
NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID | | 0
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | | 0
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks | | 0
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives | | 0
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Code | 1
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | | 0
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training | | 0
3D CoCa: Contrastive Learners are 3D Captioners | Code | 0
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention | | 0
Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering | | 0
LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images | | 0
Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition | Code | 1
Large-scale Pre-training for Grounded Video Caption Generation | Code | 1
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification | | 0
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | | 0
Fine-Grained Video Captioning through Scene Graph Consolidation | | 0
LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models | | 0
Enhancing Chest X-ray Classification through Knowledge Injection in Cross-Modality Learning | | 0
FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning | | 0
Expertized Caption Auto-Enhancement for Video-Text Retrieval | Code | 0
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023 | | 0
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models | Code | 4
MAMS: Model-Agnostic Module Selection Framework for Video Captioning | | 0
Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing | | 0
Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing | | 0
Multi-LLM Collaborative Caption Generation in Scientific Documents | Code | 0
Time Series Language Model for Descriptive Caption Generation | | 0
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning | | 0
Multimodal Preference Data Synthetic Alignment with Reward Model | Code | 0
Learning from Massive Human Videos for Universal Humanoid Pose Control | | 0
From Simple to Professional: A Combinatorial Controllable Image Captioning Agent | Code | 0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | | 0
AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Code | 2
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | | 0
Everything is a Video: Unifying Modalities through Next-Frame Prediction | | 0
Grounded Video Caption Generation | | 0
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Code | 2
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension | Code | 1
MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations | Code | 1
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Code | 0
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning | | 0
Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training | Code | 2