SOTAVerified

Video-based Generative Performance Benchmarking

The benchmark evaluates a generative Video Conversational Model and covers five key aspects:

Correctness of Information
Detailed Orientation
Contextual Understanding
Temporal Understanding
Consistency

We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated human-annotated question-answer pairs. We develop an evaluation pipeline in which the GPT-3.5 model assigns each generated prediction a relative score on a scale of 1 to 5.
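The scoring step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the judge reply format, the `parse_score` and `aggregate` helpers, and the assumption that the reported `mean` metric averages the five aspect scores are all hypothetical and not stated in the source. The call to the judge model is omitted; the sketch only shows how replies on the 1 to 5 scale might be parsed and averaged.

```python
import re
from collections import defaultdict
from statistics import mean

# The five aspects listed by the benchmark.
ASPECTS = (
    "Correctness of Information",
    "Detailed Orientation",
    "Contextual Understanding",
    "Temporal Understanding",
    "Consistency",
)


def parse_score(judge_reply: str) -> int:
    """Return the first standalone digit in 1..5 found in a judge reply.

    The reply format is an assumption; a real pipeline may ask the judge
    for structured output instead.
    """
    match = re.search(r"\b([1-5])\b", judge_reply)
    if match is None:
        raise ValueError(f"no 1-5 score found in: {judge_reply!r}")
    return int(match.group(1))


def aggregate(records):
    """Average judge scores per aspect, then across aspects.

    records: iterable of (aspect, judge_reply) pairs.
    Returns (per_aspect_means, overall_mean). Averaging across aspects
    is an assumption about how the leaderboard's `mean` is computed.
    """
    by_aspect = defaultdict(list)
    for aspect, reply in records:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect!r}")
        by_aspect[aspect].append(parse_score(reply))
    per_aspect = {aspect: mean(scores) for aspect, scores in by_aspect.items()}
    overall = mean(list(per_aspect.values()))
    return per_aspect, overall
```

Averaging within each aspect first keeps an aspect with more test questions from dominating the overall figure; a pipeline that pools all scores before averaging would give a different number whenever the aspects have unequal question counts.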

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PPLLaVA-7B-dpo	mean	3.73	—	Unverified
2	VLM-RLAIF	mean	3.49	—	Unverified
3	TS-LLaVA-34B	mean	3.38	—	Unverified
4	PLLaVA-34B	mean	3.32	—	Unverified
5	PPLLaVA-7B	mean	3.32	—	Unverified
6	SlowFast-LLaVA-34B	mean	3.32	—	Unverified
7	VideoGPT+	mean	3.28	—	Unverified
8	IG-VLM-GPT4v	mean	3.17	—	Unverified
9	ST-LLM-7B	mean	3.15	—	Unverified
10	VideoChat2_HD_mistral	mean	3.10	—	Unverified
