
Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCG-Bench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. VCGBench-Diverse comprises 877 videos spanning 18 broad video categories, with 4,354 QA pairs.

The evaluation is computed over five different aspects:

Correctness of information
Detail orientation
Contextual understanding
Temporal understanding
Consistency

Additionally, VCGBench-Diverse provides a breakdown of performance across three key aspects:

Dense video captioning, which assesses the ability to generate detailed and accurate descriptions of the video content,
Spatial understanding, which evaluates the capability to understand and describe the spatial relationships and settings within the video, and
Reasoning, which tests the ability to infer and explain causal relationships and actions within the video.
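
As a rough illustration of how per-question scores could be rolled up into the five aspect scores and the three-way breakdown described above, here is a minimal sketch. The record layout (`aspect`, `category`, `score` keys), the `summarize` helper, and the sample scores are all assumptions made for illustration; they are not taken from the benchmark's released evaluation code, and the assumption that a QA pair carries both an aspect and a category label may not match the actual annotation format.

```python
from collections import defaultdict
from statistics import mean

# Aspect and category names as listed in the description above.
ASPECTS = (
    "Correctness of information",
    "Detail orientation",
    "Contextual understanding",
    "Temporal understanding",
    "Consistency",
)
CATEGORIES = ("Dense video captioning", "Spatial understanding", "Reasoning")


def summarize(records):
    """Average scores per aspect and per category.

    `records` is an iterable of dicts with hypothetical keys
    'aspect', 'category' (optional) and 'score'.
    """
    by_aspect = defaultdict(list)
    by_category = defaultdict(list)
    for record in records:
        by_aspect[record["aspect"]].append(record["score"])
        # Only records that carry a category label feed the breakdown.
        if record.get("category") is not None:
            by_category[record["category"]].append(record["score"])
    aspect_means = {name: mean(scores) for name, scores in by_aspect.items()}
    category_means = {name: mean(scores) for name, scores in by_category.items()}
    return aspect_means, category_means


# Made-up scores, for illustration only.
records = [
    {"aspect": "Correctness of information", "category": "Reasoning", "score": 3},
    {"aspect": "Correctness of information", "category": "Spatial understanding", "score": 4},
    {"aspect": "Temporal understanding", "category": "Reasoning", "score": 2},
]
aspect_means, category_means = summarize(records)
```

Keeping the two groupings separate means the same scored answers can be read along either axis without re-running the evaluation.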

Title	Date	Tasks	Status	Hype
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Jun 8, 2023	Question Answering, VCGBench-Diverse	Code Available	3
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	Jun 13, 2024	Dense Video Captioning, MVBench	Code Available	3
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	Nov 14, 2023	Image-based Generative Performance Benchmarking, Language Modeling	Code Available	2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	Nov 28, 2023	3D Question Answering (3D-QA), Diagnostic	Code Available	2
VTimeLLM: Empower LLM to Grasp Video Moments	Nov 30, 2023	Dense Video Captioning, Temporal Relation Extraction	Code Available	2
