| D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding | Dec 2, 2021 | 3D dense captioning3D visual grounding | —Unverified | 0 |
| BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving | Jan 2, 2024 | Autonomous DrivingCaption Generation | —Unverified | 0 |
| EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer | Sep 17, 2024 | Audio GenerationCaption Generation | —Unverified | 0 |
| FaceGemma: Enhancing Image Captioning with Facial Attributes for Portrait Images | Sep 24, 2023 | AttributeCaption Generation | —Unverified | 0 |
| Cross-modal Coherence Modeling for Caption Generation | Jul 1, 2020 | Caption Generationcontrollable image captioning | —Unverified | 0 |
| Cross-Lingual Image Caption Generation | Aug 1, 2016 | Caption GenerationDependency Parsing | —Unverified | 0 |
| CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | Aug 19, 2024 | Autonomous DrivingCaption Generation | —Unverified | 0 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | Nov 22, 2024 | BenchmarkingCaption Generation | —Unverified | 0 |
| A Deep Neural Framework for Image Caption Generation Using GRU-Based Attention Mechanism | Mar 3, 2022 | Caption GenerationDecoder | —Unverified | 0 |
| Analysis of Convolutional Decoder for Image Caption Generation | Mar 8, 2021 | Caption GenerationData Augmentation | —Unverified | 0 |