SOTAVerified

Video Summarization

Video Summarization aims to generate a short synopsis of a video by selecting its most informative and important parts. The summary is usually composed of either a set of representative video frames (a.k.a. video key-frames) or a set of video fragments (a.k.a. video key-fragments) stitched together in chronological order to form a shorter video. The former type of summary is known as a video storyboard, and the latter as a video skim.

Source: Video Summarization Using Deep Neural Networks: A Survey
Image credit: iJRASET
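The difference between a storyboard and a skim can be illustrated with a minimal Python sketch. It assumes that per-frame and per-fragment importance scores are already available from some model; the function names `storyboard` and `skim`, the half-open `(start, end)` fragment representation, and the frame `budget` are hypothetical choices made for illustration. The greedy fragment selection is a simplification and is not taken from any particular paper listed here.

```python
from typing import List, Tuple


def storyboard(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in chronological order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def skim(
    fragments: List[Tuple[int, int]],
    frag_scores: List[float],
    budget: int,
) -> List[Tuple[int, int]]:
    """Greedily pick high-scoring fragments that fit a frame budget.

    Each fragment is a half-open (start, end) frame range (an assumption of
    this sketch). Selected fragments are returned in chronological order,
    ready to be stitched into a shorter video.
    """
    order = sorted(range(len(fragments)), key=lambda i: frag_scores[i], reverse=True)
    chosen: List[Tuple[int, int]] = []
    used = 0
    for i in order:
        start, end = fragments[i]
        length = end - start
        if used + length <= budget:
            chosen.append(fragments[i])
            used += length
    return sorted(chosen)


if __name__ == "__main__":
    print(storyboard([0.1, 0.9, 0.3, 0.8, 0.2], k=2))
    print(skim([(0, 10), (10, 25), (25, 30), (30, 50)], [0.2, 0.9, 0.5, 0.7], budget=30))
```

A storyboard keeps isolated frames, so it loses motion and audio; a skim keeps contiguous fragments, which is why a length budget is needed in the second function.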

Papers

Showing 1-50 of 280 papers

Title | Status | Hype
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding | Code | 4
An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Code | 2
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2
Egocentric Video-Language Pretraining | Code | 2
VideoSAGE: Video Summarization with Graph Representation Learning | Code | 2
UniVTG: Towards Unified Video-Language Temporal Grounding | Code | 2
ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video | Code | 2
Combining Global and Local Attention with Positional Encoding for Video Summarization | Code | 1
Learning Discriminative Prototypes with Dynamic Time Warping | Code | 1
Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark | Code | 1
Multi-modal Summarization for Video-containing Documents | Code | 1
Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score. | Code | 1
Joint Moment Retrieval and Highlight Detection Via Natural Language Queries | Code | 1
Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1
Progressive Video Summarization via Multimodal Self-supervised Learning | Code | 1
TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple application domains | Code | 1
Unsupervised Video Summarization via Multi-source Features | Code | 1
DSNet: A Flexible Detect-to-Summarize Network for Video Summarization | Code | 1
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | Code | 1
Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score | Code | 1
Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | Code | 1
VideoSum: A Python Library for Surgical Video Summarization | Code | 1
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization | Code | 1
LTC-SUM: Lightweight Client-driven Personalized Video Summarization Framework Using 2D CNN | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
Movie Summarization via Sparse Graph Construction | Code | 1
Multimodal Summarization of User-Generated Videos | Code | 1
Query-controllable Video Summarization | Code | 1
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos | Code | 1
Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Code | 1
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Code | 1
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone | Code | 1
IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Code | 1
Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization | Code | 1
AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization | Code | 1
Discriminative Latent Semantic Graph for Video Captioning | Code | 1
Do Language Models Understand Time? | Code | 1
A Comprehensive Review of the Video-to-Text Problem | Code | 1
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization | Code | 1
Self-Attention Recurrent Summarization Network with Reinforcement Learning for Video Summarization Task | Code | 1
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames | Code | 1
Ultrasound Video Summarization using Deep Reinforcement Learning | Code | 1
VideoXum: Cross-modal Visual and Textural Summarization of Videos | Code | 1
Multi-Stream Dynamic Video Summarization | Code | 0
Query-adaptive Video Summarization via Quality-aware Relevance Estimation | Code | 0
Adaptive frame selection in two dimensional convolutional neural network action recognition | Code | 0
A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization | Code | 0
APES: Audiovisual Person Search in Untrimmed Video | Code | 0
Iterative Projection and Matching: Finding Structure-preserving Representatives and Its Application to Computer Vision | Code | 0
R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | PGL-SUM | F1-score (Canonical) | 55.6 | | Unverified
2 | RR-STG | F1-score (Canonical) | 54.5 | | Unverified
3 | DSNet | F1-score (Canonical) | 53 | | Unverified
4 | VASNet | F1-score (Canonical) | 49.71 | | Unverified
5 | M-AVS | F1-score (Canonical) | 44.4 | | Unverified
6 | CSTA | Kendall's Tau | 0.25 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RR-STG | F1-score (Canonical) | 63 | | Unverified
2 | DSNet | F1-score (Canonical) | 62.1 | | Unverified
3 | VASNet | F1-score (Canonical) | 61.42 | | Unverified
4 | M-AVS | F1-score (Canonical) | 61 | | Unverified
5 | PGL-SUM | F1-score (Canonical) | 61 | | Unverified
6 | CSTA | Kendall's Tau | 0.19 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shotluck-Holmes (3.1B) | CIDEr | 152.3 | | Unverified
2 | Shotluck-Holmes (3.1B) | CIDEr | 63.2 | | Unverified
3 | SUM-shot | CIDEr | 8.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EgoVLPv2 | F1 (avg) | 52.08 | | Unverified
2 | EgoVLP | F1 (avg) | 49.72 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PGL-SUM | MAP (50%) | 61.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VTSUM-BLIP | 1 shot Micro-F1 | 23.5 | | Unverified
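
Several of the tables above report an F1-score. As a rough illustration of the general idea only, the sketch below computes an overlap-based F1 between a predicted summary and a reference summary, each given as a binary per-frame selection vector. The function name `summary_f1` and the input representation are assumptions made for this example; the exact protocol behind each benchmark's reported numbers (for instance, how multiple reference summaries are combined, or what "Canonical" denotes) is not stated on this page and is not reproduced here.

```python
from typing import Sequence


def summary_f1(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Overlap-based F1 between two binary per-frame selection vectors.

    pred[i] and ref[i] are truthy when frame i is included in the predicted
    and reference summary, respectively (an assumption of this sketch).
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must cover the same number of frames")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    if overlap == 0:
        return 0.0  # no shared frames, so precision and recall are both zero
    precision = overlap / sum(1 for p in pred if p)
    recall = overlap / sum(1 for r in ref if r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1]
    reference = [1, 0, 0, 1, 1]
    print(summary_f1(predicted, reference))
```

The function returns a value between 0 and 1. The scores in the tables appear to be on a 0 to 100 scale, so a comparison would presumably multiply the result by 100.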