SOTAVerified

Video Description

The goal of automatic Video Description is to tell a story about events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, more recently dense video captioning has been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired, or detecting and describing important events in surveillance footage.

Source: Joint Event Detection and Description in Continuous Video Streams

Papers

Showing 51-100 of 104 papers

| Title | Status | Hype |
| --- | --- | --- |
| Attention-Based Multimodal Fusion for Video Description | | 0 |
| Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions | | 0 |
| AVD2: Accident Video Diffusion for Accident Video Description | | 0 |
| Relational Graph Learning for Grounded Video Description Generation | | 0 |
| Saarland: Vector-based models of semantic textual similarity | | 0 |
| Semantic Neighborhoods as Hypergraphs | | 0 |
| SHEF-Multimodal: Grounding Machine Translation on Images | | 0 |
| SRIUBC: Simple Similarity Features for Semantic Textual Similarity | | 0 |
| Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation | | 0 |
| Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description | | 0 |
| Technical Report: Competition Solution For Modelscope-Sora | | 0 |
| The Role of the Input in Natural Language Video Description | | 0 |
| Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time | | 0 |
| Unbox the Blackbox: Predict and Interpret YouTube Viewership Using Deep Learning | | 0 |
| Vectors of Locally Aggregated Centers for Compact Video Representation | | 0 |
| VideoA11y: Method and Dataset for Accessible Video Description | | 0 |
| VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models | | 0 |
| Video Description: A Survey of Methods, Datasets and Evaluation Metrics | | 0 |
| VideoMCC: a New Benchmark for Video Comprehension | | 0 |
| Visual-aware Attention Dual-stream Decoder for Video Captioning | | 0 |
| A Comprehensive Review on Recent Methods and Challenges of Video Description | | 0 |
| JU\_CSE\_NLP: Multi-grade Classification of Semantic Similarity between Text Pairs | | 0 |
| Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation | | 0 |
| LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | | 0 |
| MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | 0 |
| MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish | | 0 |
| Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering | | 0 |
| Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | | 0 |
| Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews) | | 0 |
| Multi Sentence Description of Complex Manipulation Action Videos | | 0 |
| NarrationBot and InfoBot: A Hybrid System for Automated Video Description | | 0 |
| Natural Language Descriptions of Human Activities Scenes: Corpus Generation and Analysis | | 0 |
| Neural Headline Generation on Abstract Meaning Representation | | 0 |
| Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model | | 0 |
| Probabilistic Soft Logic for Semantic Textual Similarity | | 0 |
| PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation | | 0 |
| JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models | Code | 0 |
| Predicting Visual Features from Text for Image and Video Caption Retrieval | Code | 0 |
| Describing Videos by Exploiting Temporal Structure | Code | 0 |
| Learn to Understand Negation in Video Retrieval | Code | 0 |
| Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents | Code | 0 |
| Memory-augmented Attention Modelling for Videos | Code | 0 |
| TGIF: A New Dataset and Benchmark on Animated GIF Description | Code | 0 |
| MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | Code | 0 |
| Adversarial Inference for Multi-Sentence Video Description | Code | 0 |
| Egocentric Video Description based on Temporally-Linked Sequences | Code | 0 |
| Video Description using Bidirectional Recurrent Neural Networks | Code | 0 |
| Edit As You Wish: Video Caption Editing with Multi-grained User Control | Code | 0 |
| Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning | Code | 0 |
| SUSTechGAN: Image Generation for Object Detection in Adverse Conditions of Autonomous Driving | Code | 0 |

No leaderboard results yet.