SOTAVerified

Caption Generation

Papers

Showing 51-75 of 310 papers

Title | Status | Hype
Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network | Code | 1
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption | Code | 1
Improving Image Captioning with Better Use of Captions | Code | 1
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs | Code | 1
Deep Reinforcement Learning For Sequence to Sequence Models | Code | 1
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks | Code | 1
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation | Code | 1
Video captioning with recurrent networks based on frame- and video-level features and visual content classification | Code | 1
Microsoft COCO Captions: Data Collection and Evaluation Server | Code | 1
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | Code | 1
GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning | - | 0
EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits | - | 0
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation | - | 0
NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID | - | 0
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | - | 0
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks | - | 0
Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives | - | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | - | 0
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training | - | 0
3D CoCa: Contrastive Learners are 3D Captioners | Code | 0
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention | - | 0
Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering | - | 0
LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images | - | 0
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification | - | 0
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | - | 0
Page 3 of 13

No leaderboard results yet.