SOTAVerified

Video Description

The goal of automatic Video Description is to tell a story about events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, more recently dense video captioning has been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired, or detecting and describing important events in surveillance footage.

Source: Joint Event Detection and Description in Continuous Video Streams

Papers

Showing 51–100 of 104 papers

Title | Status | Hype
Efficient data-driven encoding of scene motion using Eccentricity | | 0
The Role of the Input in Natural Language Video Description | | 0
Unbox the Blackbox: Predict and Interpret YouTube Viewership Using Deep Learning | | 0
MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish | | 0
A Comprehensive Review on Recent Methods and Challenges of Video Description | | 0
Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents | Code | 0
Active Learning for Video Description With Cluster-Regularized Ensemble Ranking | | 0
Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering | | 0
VizSeq: A Visual Analysis Toolkit for Text Generation Tasks | Code | 0
Prediction and Description of Near-Future Activities in Video | | 0
End-to-End Video Captioning | | 0
Adversarial Inference for Multi-Sentence Video Description | Code | 0
A Dataset for Telling the Stories of Social Media Videos | | 0
Incorporating Background Knowledge into Video Description Generation | | 0
Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions | | 0
Bridge Video and Text with Cascade Syntactic Structure | | 0
Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | | 0
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features | Code | 0
Interpretable Video Captioning via Trajectory Structured Localization | | 0
Video Description: A Survey of Methods, Datasets and Evaluation Metrics | | 0
Incorporating Semantic Attention in Video Description Generation | | 0
Integrating both Visual and Audio Cues for Enhanced Video Caption | | 0
Attend and Interact: Higher-Order Object Interactions for Video Understanding | | 0
Predicting Visual Features from Text for Image and Video Caption Retrieval | Code | 0
Incorporating Global Visual Features into Attention-based Neural Machine Translation. | | 0
Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description | | 0
Egocentric Video Description based on Temporally-Linked Sequences | Code | 0
Attention-Based Multimodal Fusion for Video Description | | 0
Generating Video Description using Sequence-to-sequence Model with Temporal Attention | | 0
Hierarchical Boundary-Aware Neural Encoder for Video Captioning | | 0
Memory-augmented Attention Modelling for Videos | Code | 0
Neural Headline Generation on Abstract Meaning Representation | | 0
SHEF-Multimodal: Grounding Machine Translation on Images | | 0
Natural Language Descriptions of Human Activities Scenes: Corpus Generation and Analysis | | 0
VideoMCC: a New Benchmark for Video Comprehension | | 0
Bidirectional Long-Short Term Memory for Video Description | | 0
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | | 0
A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection | Code | 0
Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model | | 0
Video Description using Bidirectional Recurrent Neural Networks | Code | 0
TGIF: A New Dataset and Benchmark on Animated GIF Description | Code | 0
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text | Code | 0
Vectors of Locally Aggregated Centers for Compact Video Representation | | 0
A Multi-scale Multiple Instance Video Description Network | | 0
Describing Videos by Exploiting Temporal Structure | Code | 0
Probabilistic Soft Logic for Semantic Textual Similarity | | 0
Coherent Multi-Sentence Video Description with Variable Level of Detail | | 0
Semantic Neighborhoods as Hypergraphs | | 0
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching | | 0
Better Exploiting Motion for Better Action Recognition | | 0

No leaderboard results yet.