| Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Oct 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling | Apr 24, 2018 | Image Captioning, Reinforcement Learning | Code Available | 0 | 5 |
| Discourse Parsing in Videos: A Multi-modal Appraoch | Mar 6, 2019 | Discourse Parsing, Visual Dialog | Code Available | 0 | 5 |
| GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation | May 28, 2018 | Sentence, Story Generation | Code Available | 0 | 5 |
| Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology | Oct 6, 2023 | Story Generation, Visual Storytelling | Code Available | 0 | 5 |
| Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling | May 4, 2019 | AI Agent, Knowledge Graphs | Code Available | 0 | 5 |
| Knowledge-Enriched Visual Storytelling | Dec 3, 2019 | Knowledge Graphs, Story Generation | Code Available | 0 | 5 |
| Informative Visual Storytelling with Cross-modal Rules | Jul 7, 2019 | Decoder, Story Generation | Code Available | 0 | 5 |
| FLIP Reasoning Challenge | Apr 16, 2025 | Common Sense Reasoning, image-classification | Code Available | 0 | 5 |
| AESOP: Abstract Encoding of Stories, Objects, and Pictures | Jan 1, 2021 | Story Completion, Visual Storytelling | Code Available | 0 | 5 |
| Learning to Rank Visual Stories From Human Ranking Data | May 1, 2022 | Learning-To-Rank, Text Generation | Code Available | 0 | 5 |
| Visual Story Post-Editing | Jun 5, 2019 | Visual Storytelling | Code Available | 0 | 5 |
| Dixit: Interactive Visual Storytelling via Term Manipulation | Mar 6, 2019 | Decoder, Visual Storytelling | Unverified | 0 | 0 |
| Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions | Jun 1, 2025 | Visual Storytelling | Unverified | 0 | 0 |
| Diverse and Relevant Visual Storytelling with Scene Graph Embeddings | Nov 1, 2020 | Diversity, Story Generation | Unverified | 0 | 0 |
| Discourse Analysis for Evaluating Coherence in Video Paragraph Captions | Jan 17, 2022 | Video Captioning, Visual Dialog | Unverified | 0 | 0 |
| Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks | Oct 26, 2022 | Image Captioning, Language Modeling | Unverified | 0 | 0 |
| DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention | Oct 28, 2022 | Image Captioning, Language Modeling | Unverified | 0 | 0 |
| DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models | Dec 12, 2023 | Denoising, Diversity | Unverified | 0 | 0 |
| BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling | Dec 3, 2020 | Sentence, Visual Storytelling | Unverified | 0 | 0 |
| Induction and Reference of Entities in a Visual Story | Sep 15, 2019 | Sentence, Visual Storytelling | Unverified | 0 | 0 |
| Incorporating Textual Evidence in Visual Storytelling | Nov 21, 2019 | Object Recognition, Story Generation | Unverified | 0 | 0 |
| Improving Visual Storytelling with Multimodal Large Language Models | Jul 2, 2024 | Visual Storytelling | Unverified | 0 | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Mar 31, 2025 | Video Description, Video Understanding | Unverified | 0 | 0 |
| A System for Image Understanding using Sensemaking and Narrative | Jan 21, 2022 | Visual Storytelling | Unverified | 0 | 0 |