SOTAVerified

Video Summarization

Video summarization aims to generate a short synopsis of a video by selecting its most informative and important parts. The resulting summary is usually composed of either a set of representative video frames (a.k.a. video key-frames) or a set of video fragments (a.k.a. video key-fragments) stitched together in chronological order to form a shorter video. The former type of summary is known as a video storyboard, and the latter as a video skim.
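
The two summary types described above can be illustrated with a minimal sketch. It assumes that some model has already assigned an importance score to each frame or fragment; the `Fragment` class, the greedy selection rule, and the duration budget are hypothetical choices made for illustration, not the method of any paper listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    score: float  # importance score (assumed to come from some model)

    @property
    def duration(self) -> float:
        return self.end - self.start


def make_skim(fragments, budget):
    """Video skim: greedily keep the highest-scoring fragments that fit
    into `budget` seconds, then restore chronological order."""
    chosen = []
    used = 0.0
    for frag in sorted(fragments, key=lambda f: f.score, reverse=True):
        if used + frag.duration <= budget:
            chosen.append(frag)
            used += frag.duration
    return sorted(chosen, key=lambda f: f.start)


def make_storyboard(frame_scores, k):
    """Video storyboard: indices of the k highest-scoring frames,
    returned in chronological order."""
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:k])


fragments = [
    Fragment(0, 10, 0.2),
    Fragment(10, 15, 0.9),
    Fragment(15, 30, 0.5),
    Fragment(30, 34, 0.8),
]
skim = make_skim(fragments, budget=10)
storyboard = make_storyboard([0.1, 0.7, 0.3, 0.9, 0.2], k=2)
```

Greedy selection is only one simple option; a summary-length budget can also be enforced with a knapsack-style optimisation over the fragment scores.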

Source: Video Summarization Using Deep Neural Networks: A Survey
Image credit: iJRASET

Papers

Showing 1-50 of 280 papers

Title | Status | Hype
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding | Code | 4
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2
VideoSAGE: Video Summarization with Graph Representation Learning | Code | 2
Egocentric Video-Language Pretraining | Code | 2
An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Code | 2
UniVTG: Towards Unified Video-Language Temporal Grounding | Code | 2
ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video | Code | 2
Combining Global and Local Attention with Positional Encoding for Video Summarization | Code | 1
Unsupervised Video Summarization via Multi-source Features | Code | 1
Self-Attention Recurrent Summarization Network with Reinforcement Learning for Video Summarization Task | Code | 1
Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Code | 1
Progressive Video Summarization via Multimodal Self-supervised Learning | Code | 1
Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | Code | 1
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Code | 1
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | Code | 1
Query-controllable Video Summarization | Code | 1
TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple application domains | Code | 1
LTC-SUM: Lightweight Client-driven Personalized Video Summarization Framework Using 2D CNN | Code | 1
Multi-modal Summarization for Video-containing Documents | Code | 1
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames | Code | 1
Joint Moment Retrieval and Highlight Detection Via Natural Language Queries | Code | 1
Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark | Code | 1
Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score. | Code | 1
VideoSum: A Python Library for Surgical Video Summarization | Code | 1
VideoXum: Cross-modal Visual and Textural Summarization of Videos | Code | 1
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization | Code | 1
DSNet: A Flexible Detect-to-Summarize Network for Video Summarization | Code | 1
Discriminative Latent Semantic Graph for Video Captioning | Code | 1
AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization | Code | 1
Do Language Models Understand Time? | Code | 1
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone | Code | 1
Movie Summarization via Sparse Graph Construction | Code | 1
Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1
IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Code | 1
Learning Discriminative Prototypes with Dynamic Time Warping | Code | 1
A Comprehensive Review of the Video-to-Text Problem | Code | 1
Ultrasound Video Summarization using Deep Reinforcement Learning | Code | 1
Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization | Code | 1
Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score | Code | 1
Multimodal Summarization of User-Generated Videos | Code | 1
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization | Code | 1
Attention is all you need for Videos: Self-attention based Video Summarization using Universal Transformers | - | 0
A Survey on Recent Advances of Computer Vision Algorithms for Egocentric Video | - | 0
A Multi-stage deep architecture for summary generation of soccer videos | - | 0
A Survey on Patch-based Synthesis: GPU Implementation and Optimization | - | 0
A Memory Network Approach for Story-Based Temporal Summarization of 360° Videos | - | 0
A Framework towards Domain Specific Video Summarization | - | 0
Demystifying Multi-Faceted Video Summarization: Tradeoff Between Diversity, Representation, Coverage and Importance | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | PGL-SUM | F1-score (Canonical) | 55.6 | - | Unverified
2 | RR-STG | F1-score (Canonical) | 54.5 | - | Unverified
3 | DSNet | F1-score (Canonical) | 53 | - | Unverified
4 | VASNet | F1-score (Canonical) | 49.71 | - | Unverified
5 | M-AVS | F1-score (Canonical) | 44.4 | - | Unverified
6 | CSTA | Kendall's Tau | 0.25 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RR-STG | F1-score (Canonical) | 63 | - | Unverified
2 | DSNet | F1-score (Canonical) | 62.1 | - | Unverified
3 | VASNet | F1-score (Canonical) | 61.42 | - | Unverified
4 | PGL-SUM | F1-score (Canonical) | 61 | - | Unverified
5 | M-AVS | F1-score (Canonical) | 61 | - | Unverified
6 | CSTA | Kendall's Tau | 0.19 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shotluck-Holmes (3.1B) | CIDEr | 152.3 | - | Unverified
2 | Shotluck-Holmes (3.1B) | CIDEr | 63.2 | - | Unverified
3 | SUM-shot | CIDEr | 8.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EgoVLPv2 | F1 (avg) | 52.08 | - | Unverified
2 | EgoVLP | F1 (avg) | 49.72 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PGL-SUM | MAP (50%) | 61.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VTSUM-BLIP | 1 shot Micro-F1 | 23.5 | - | Unverified
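
As a rough illustration of how an overlap-based F1 score for a video summary can be computed, the sketch below compares a predicted frame selection against a single reference selection, both given as binary per-frame vectors. This is an assumption about the general shape of the metric: the function name `summary_f1` is hypothetical, and the exact protocol behind each leaderboard entry (data splits, number of annotators, how per-annotator scores are aggregated) is not stated on this page.

```python
def summary_f1(predicted, reference):
    """Frame-level F1 between two binary selection vectors.

    predicted[i] / reference[i] is truthy when frame i is in the summary.
    Returns a value in [0, 1]; multiply by 100 for a percentage.
    """
    if len(predicted) != len(reference):
        raise ValueError("selection vectors must have the same length")

    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    if overlap == 0:
        return 0.0

    precision = overlap / sum(1 for p in predicted if p)
    recall = overlap / sum(1 for r in reference if r)
    return 2 * precision * recall / (precision + recall)


# Two of the three predicted frames also appear in the reference summary.
score = summary_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```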