SOTAVerified

Video Summarization

Video Summarization aims to generate a short synopsis of a video by selecting its most informative and important parts. The summary usually consists of either a set of representative video frames (a.k.a. video key-frames) or a set of video fragments (a.k.a. video key-fragments) stitched together in chronological order to form a shorter video. The former type of summary is known as a video storyboard, and the latter as a video skim.
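The difference between a storyboard and a skim can be illustrated with a minimal sketch. It assumes that per-frame importance scores and fragment boundaries are already available; the function names `storyboard` and `skim`, the toy scores, and the selection rule (take the top-scoring items) are hypothetical choices made for illustration and do not correspond to any specific method listed on this page.

```python
from typing import List, Tuple


def storyboard(scores: List[float], k: int) -> List[int]:
    """Pick the k highest-scoring frame indices (key-frames), in chronological order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def skim(
    scores: List[float], fragments: List[Tuple[int, int]], n: int
) -> List[Tuple[int, int]]:
    """Pick the n fragments with the highest mean score (key-fragments), in chronological order.

    Each fragment is a (start, end) pair of frame indices, with end exclusive.
    """
    def mean_score(fragment: Tuple[int, int]) -> float:
        start, end = fragment
        return sum(scores[start:end]) / (end - start)

    best = sorted(fragments, key=mean_score, reverse=True)[:n]
    return sorted(best)


# Toy importance scores for an 8-frame video (assumed, not real data).
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1, 0.7, 0.6]
# Assumed fragment boundaries, e.g. from a shot detector.
fragments = [(0, 2), (2, 4), (4, 6), (6, 8)]

key_frames = storyboard(scores, k=3)           # [2, 3, 6]
key_fragments = skim(scores, fragments, n=2)   # [(2, 4), (6, 8)]
```

In practice, fragment selection is often constrained by a summary-length budget rather than a fixed count; the fixed `n` here only keeps the sketch short.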

Source: Video Summarization Using Deep Neural Networks: A Survey
Image credit: iJRASET

Papers

Showing 1-50 of 280 papers

| Title | Status | Hype |
|---|---|---|
| Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding | Code | 4 |
| An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Code | 2 |
| VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2 |
| VideoSAGE: Video Summarization with Graph Representation Learning | Code | 2 |
| ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video | Code | 2 |
| UniVTG: Towards Unified Video-Language Temporal Grounding | Code | 2 |
| Egocentric Video-Language Pretraining | Code | 2 |
| Do Language Models Understand Time? | Code | 1 |
| Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark | Code | 1 |
| Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Code | 1 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1 |
| Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score. | Code | 1 |
| Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score | Code | 1 |
| EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone | Code | 1 |
| MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos | Code | 1 |
| Joint Moment Retrieval and Highlight Detection Via Natural Language Queries | Code | 1 |
| Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1 |
| VideoXum: Cross-modal Visual and Textural Summarization of Videos | Code | 1 |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Code | 1 |
| VideoSum: A Python Library for Surgical Video Summarization | Code | 1 |
| Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization | Code | 1 |
| Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames | Code | 1 |
| MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization | Code | 1 |
| LTC-SUM: Lightweight Client-driven Personalized Video Summarization Framework Using 2D CNN | Code | 1 |
| Progressive Video Summarization via Multimodal Self-supervised Learning | Code | 1 |
| Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | Code | 1 |
| Combining Global and Local Attention with Positional Encoding for Video Summarization | Code | 1 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Code | 1 |
| Discriminative Latent Semantic Graph for Video Captioning | Code | 1 |
| Self-Attention Recurrent Summarization Network with Reinforcement Learning for Video Summarization Task | Code | 1 |
| Multimodal Summarization of User-Generated Videos | Code | 1 |
| Unsupervised Video Summarization via Multi-source Features | Code | 1 |
| TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple application domains | Code | 1 |
| Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | Code | 1 |
| A Comprehensive Review of the Video-to-Text Problem | Code | 1 |
| Learning Discriminative Prototypes with Dynamic Time Warping | Code | 1 |
| Movie Summarization via Sparse Graph Construction | Code | 1 |
| DSNet: A Flexible Detect-to-Summarize Network for Video Summarization | Code | 1 |
| AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization | Code | 1 |
| Multi-modal Summarization for Video-containing Documents | Code | 1 |
| Ultrasound Video Summarization using Deep Reinforcement Learning | Code | 1 |
| Query-controllable Video Summarization | Code | 1 |
| Convolutional Hierarchical Attention Network for Query-Focused Video Summarization | Code | 1 |
| TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness | | 0 |
| MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment | | 0 |
| Prompts to Summaries: Zero-Shot Language-Guided Video Summarization | | 0 |
| Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization | | 0 |
| TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations | | 0 |
| Unsupervised Transcript-assisted Video Summarization and Highlight Detection | | 0 |
| REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing | | 0 |

Benchmark Results

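| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PGL-SUM | F1-score (Canonical) | 55.6 | | Unverified |
| 2 | RR-STG | F1-score (Canonical) | 54.5 | | Unverified |
| 3 | DSNet | F1-score (Canonical) | 53 | | Unverified |
| 4 | VASNet | F1-score (Canonical) | 49.71 | | Unverified |
| 5 | M-AVS | F1-score (Canonical) | 44.4 | | Unverified |
| 6 | CSTA | Kendall's Tau | 0.25 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RR-STG | F1-score (Canonical) | 63 | | Unverified |
| 2 | DSNet | F1-score (Canonical) | 62.1 | | Unverified |
| 3 | VASNet | F1-score (Canonical) | 61.42 | | Unverified |
| 4 | M-AVS | F1-score (Canonical) | 61 | | Unverified |
| 5 | PGL-SUM | F1-score (Canonical) | 61 | | Unverified |
| 6 | CSTA | Kendall's Tau | 0.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Shotluck-Holmes (3.1B) | CIDEr | 152.3 | | Unverified |
| 2 | Shotluck-Holmes (3.1B) | CIDEr | 63.2 | | Unverified |
| 3 | SUM-shot | CIDEr | 8.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | EgoVLPv2 | F1 (avg) | 52.08 | | Unverified |
| 2 | EgoVLP | F1 (avg) | 49.72 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PGL-SUM | MAP (50%) | 61.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VTSUM-BLIP | 1 shot Micro-F1 | 23.5 | | Unverified |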