SOTAVerified

Video Summarization

Video summarization aims to generate a short synopsis of a video by selecting its most informative and important parts. The summary is usually composed of either a set of representative video frames (video key-frames) or a set of video fragments (video key-fragments) stitched together in chronological order to form a shorter video. The former type of summary is known as a video storyboard, and the latter as a video skim.
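To make the storyboard/skim distinction concrete, here is a minimal sketch in Python. It assumes that some model has already produced importance scores for frames and for fragments; the function names `storyboard` and `skim`, the greedy selection rule, and the frame `budget` parameter are illustrative assumptions, not a method taken from any paper listed on this page.

```python
from typing import List, Tuple


def storyboard(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in chronological order.

    `scores` is a hypothetical list of per-frame importance scores.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def skim(
    fragments: List[Tuple[int, int]],
    frag_scores: List[float],
    budget: int,
) -> List[Tuple[int, int]]:
    """Greedily pick high-scoring fragments that fit a total frame budget.

    Each fragment is a (start, end) frame range with an exclusive end.
    The greedy rule is an assumption made for this sketch; the selected
    fragments are returned in chronological order, as in a video skim.
    """
    order = sorted(range(len(fragments)), key=lambda i: frag_scores[i], reverse=True)
    chosen: List[Tuple[int, int]] = []
    used = 0
    for i in order:
        start, end = fragments[i]
        length = end - start
        if used + length <= budget:
            chosen.append(fragments[i])
            used += length
    return sorted(chosen)


if __name__ == "__main__":
    print(storyboard([0.1, 0.9, 0.3, 0.8, 0.2], k=2))
    print(skim([(0, 10), (10, 30), (30, 40)], [0.9, 0.2, 0.7], budget=20))
```

The storyboard keeps individual frames, while the skim keeps contiguous fragments, so only the skim can be played back as a shorter video.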

Source: Video Summarization Using Deep Neural Networks: A Survey
Image credit: iJRASET

Papers

Showing 1-25 of 280 papers

| Title | Status | Hype |
|---|---|---|
| Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding | Code | 4 |
| VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2 |
| ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video | Code | 2 |
| An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Code | 2 |
| VideoSAGE: Video Summarization with Graph Representation Learning | Code | 2 |
| Egocentric Video-Language Pretraining | Code | 2 |
| UniVTG: Towards Unified Video-Language Temporal Grounding | Code | 2 |
| Multi-modal Summarization for Video-containing Documents | Code | 1 |
| MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization | Code | 1 |
| Multimodal Summarization of User-Generated Videos | Code | 1 |
| LTC-SUM: Lightweight Client-driven Personalized Video Summarization Framework Using 2D CNN | Code | 1 |
| Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score. | Code | 1 |
| Discriminative Latent Semantic Graph for Video Captioning | Code | 1 |
| Movie Summarization via Sparse Graph Construction | Code | 1 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Code | 1 |
| Combining Global and Local Attention with Positional Encoding for Video Summarization | Code | 1 |
| Joint Moment Retrieval and Highlight Detection Via Natural Language Queries | Code | 1 |
| Do Language Models Understand Time? | Code | 1 |
| A Comprehensive Review of the Video-to-Text Problem | Code | 1 |
| Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score | Code | 1 |
| EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone | Code | 1 |
| Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1 |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Code | 1 |
| Convolutional Hierarchical Attention Network for Query-Focused Video Summarization | Code | 1 |
| DSNet: A Flexible Detect-to-Summarize Network for Video Summarization | Code | 1 |

Benchmark Results

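| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PGL-SUM | F1-score (Canonical) | 55.6 | | Unverified |
| 2 | RR-STG | F1-score (Canonical) | 54.5 | | Unverified |
| 3 | DSNet | F1-score (Canonical) | 53 | | Unverified |
| 4 | VASNet | F1-score (Canonical) | 49.71 | | Unverified |
| 5 | M-AVS | F1-score (Canonical) | 44.4 | | Unverified |
| 6 | CSTA | Kendall's Tau | 0.25 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RR-STG | F1-score (Canonical) | 63 | | Unverified |
| 2 | DSNet | F1-score (Canonical) | 62.1 | | Unverified |
| 3 | VASNet | F1-score (Canonical) | 61.42 | | Unverified |
| 4 | PGL-SUM | F1-score (Canonical) | 61 | | Unverified |
| 5 | M-AVS | F1-score (Canonical) | 61 | | Unverified |
| 6 | CSTA | Kendall's Tau | 0.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Shotluck-Holmes (3.1B) | CIDEr | 152.3 | | Unverified |
| 2 | Shotluck-Holmes (3.1B) | CIDEr | 63.2 | | Unverified |
| 3 | SUM-shot | CIDEr | 8.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | EgoVLPv2 | F1 (avg) | 52.08 | | Unverified |
| 2 | EgoVLP | F1 (avg) | 49.72 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PGL-SUM | MAP (50%) | 61.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VTSUM-BLIP | 1 shot Micro-F1 | 23.5 | | Unverified |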
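
Several of the results above are reported as F1-score or Kendall's Tau. As a rough illustration of what these metrics measure, the sketch below computes an overlap-based F1 between a predicted and a reference frame selection, and a plain Kendall's tau-a between two score sequences. The function names and the binary-selection input format are assumptions made for this example; the benchmarks themselves may use different evaluation protocols (for instance multiple reference summaries, a tie-aware tau variant, or scores scaled to percentages).

```python
from typing import Sequence


def summary_f1(pred: Sequence[int], ref: Sequence[int]) -> float:
    """F1 between two binary per-frame selections (1 = frame is in the summary)."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(1 for p in pred if p)
    recall = overlap / sum(1 for r in ref if r)
    return 2 * precision * recall / (precision + recall)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant (an assumption
    of this simple variant).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                total += 1  # concordant pair
            elif product < 0:
                total -= 1  # discordant pair
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(summary_f1([1, 1, 0, 0], [1, 0, 1, 0]))
    print(kendall_tau_a([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]))
```

F1 rewards overlap between the selected and reference frames, whereas Kendall's Tau compares how similarly two methods rank frame importance, which is why the two metrics sit on very different numeric scales.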