MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models | Oct 13, 2024 | Cross-Modal Retrieval, Question Answering
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Oct 9, 2024 | Audio captioning, Large Language Model
SoccerNet 2024 Challenges Results | Sep 16, 2024 | Action Spotting, Dense Video Captioning | Code available
Fine-grained length controllable video captioning with ordinal embeddings | Aug 27, 2024 | Video Captioning
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | Aug 19, 2024 | Video Captioning, Video Question Answering
Dual-path Collaborative Generation Network for Emotional Video Captioning | Aug 6, 2024 | Caption Generation, Video Captioning | Code available
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos | Jul 30, 2024 | Semantic Role Labeling, Video Captioning | Code available
Wolf: Captioning Everything with a World Summarization Framework | Jul 26, 2024 | Autonomous Driving, Mixture-of-Experts
Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance | Jul 19, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR)
EVLM: An Efficient Vision-Language Model for Visual Understanding | Jul 19, 2024 | Image Captioning, Language Modeling
https://arxiv.org/abs/2407.00634 | Jul 2, 2024 | Video Captioning, Video Description | Code available
Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks | Jun 24, 2024 | Question Answering, Text Generation
Live Video Captioning | Jun 20, 2024 | Dense Video Captioning, Live Video Captioning | Code available
GUI Action Narrator: Where and When Did That Action Take Place? | Jun 19, 2024 | Optical Character Recognition (OCR), Video Captioning
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | Jun 19, 2024 | Language Modeling, Language Modelling
A Survey of Video Datasets for Grounded Event Understanding | Jun 14, 2024 | Common Sense Reasoning, Event Extraction | Code available
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative | Jun 10, 2024 | Language Modelling, Large Language Model
Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | Jun 4, 2024 | Question Answering, Story Generation
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval
A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection) | May 2, 2024 | Acoustic Scene Classification, Event Detection
The 8th AI City Challenge | Apr 15, 2024 | Dense Video Captioning, Video Captioning
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement | Apr 3, 2024 | Dense Video Captioning, Diversity
Streaming Dense Video Captioning | Apr 1, 2024 | Dense Video Captioning, Live Video Captioning
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | Mar 24, 2024 | Dense Video Captioning, Temporal Localization
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation | Mar 8, 2024 | Articles, Hallucination