SOTAVerified

Caption Generation

Papers

Showing 26-50 of 310 papers

| Title | Status | Hype |
| --- | --- | --- |
| NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment | Code | 1 |
| VLIS: Unimodal Language Models Guide Multimodal Language Generation | Code | 1 |
| Self-supervised Cross-view Representation Reconstruction for Change Captioning | Code | 1 |
| RECAP: Retrieval-Augmented Audio Captioning | Code | 1 |
| MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response | Code | 1 |
| Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning | Code | 1 |
| Transferable Decoding with Visual Entities for Zero-Shot Image Captioning | Code | 1 |
| Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation | Code | 1 |
| Visual Commonsense-aware Representation Network for Video Captioning | Code | 1 |
| EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning | Code | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Code | 1 |
| Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches | Code | 1 |
| GL-RG: Global-Local Representation Granularity for Video Captioning | Code | 1 |
| Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds | Code | 1 |
| Injecting Semantic Concepts into End-to-End Image Captioning | Code | 1 |
| Controllable Video Captioning with an Exemplar Sentence | Code | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Code | 1 |
| Topic Scene Graph Generation by Attention Distillation from Caption | Code | 1 |
| COSMic: A Coherence-Aware Generation Metric for Image Descriptions | Code | 1 |
| End-to-End Dense Video Captioning with Parallel Decoding | Code | 1 |
| Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization | Code | 1 |
| Connecting What to Say With Where to Look by Modeling Human Attention Traces | Code | 1 |
| Towards Accurate Text-based Image Captioning with Content Diversity Exploration | Code | 1 |
| Human-like Controllable Image Captioning with Verb-specific Semantic Roles | Code | 1 |
| Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | Code | 1 |

No leaderboard results yet.