Video-based Generative Performance Benchmarking
The benchmark evaluates a generative Video Conversational Model and covers five key aspects:
- Correctness of Information
- Detail Orientation
- Contextual Understanding
- Temporal Understanding
- Consistency
We curate a test set from the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and human-annotated question-answer pairs. We develop an evaluation pipeline using GPT-3.5 that assigns each generated prediction a relative score on a scale of 1-5 for each of the five aspects; the leaderboard reports the mean over these aspects.
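The scoring step above can be sketched as follows. This is a minimal illustration, not the benchmark's exact prompts or code: the prompt wording, the JSON reply format, and all function names here are assumptions. A judge model (GPT-3.5 in the benchmark) receives the question, the ground-truth answer, and the model's prediction, and returns an integer score from 1 to 5 per aspect; the per-aspect scores are then averaged into the "mean" metric shown in the results table.

```python
import json

def build_judge_prompt(aspect, question, answer, prediction):
    """Assemble an evaluation prompt for one aspect
    (e.g. 'Correctness of Information'). Wording is illustrative."""
    return (
        f"Evaluate the following video-based QA pair for {aspect}.\n"
        f"Question: {question}\n"
        f"Correct Answer: {answer}\n"
        f"Predicted Answer: {prediction}\n"
        'Respond with JSON: {"score": <integer 1-5>}'
    )

def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's JSON reply,
    clamping to the valid range."""
    score = int(json.loads(judge_reply)["score"])
    return max(1, min(5, score))

def mean_score(per_aspect_scores):
    """Average the five per-aspect scores into the leaderboard's
    'mean' metric, rounded to two decimals."""
    return round(sum(per_aspect_scores) / len(per_aspect_scores), 2)
```

In practice `build_judge_prompt` would be sent to the judge model's API and `parse_score` applied to its reply; averaging, say, scores of 4, 3, 3, 3 and 4 across the five aspects yields a mean of 3.4.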
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PPLLaVA-7B-dpo | mean | 3.73 | — | Unverified |
| 2 | VLM-RLAIF | mean | 3.49 | — | Unverified |
| 3 | TS-LLaVA-34B | mean | 3.38 | — | Unverified |
| 4 | PLLaVA-34B | mean | 3.32 | — | Unverified |
| 5 | PPLLaVA-7B | mean | 3.32 | — | Unverified |
| 6 | SlowFast-LLaVA-34B | mean | 3.32 | — | Unverified |
| 7 | VideoGPT+ | mean | 3.28 | — | Unverified |
| 8 | IG-VLM-GPT4v | mean | 3.17 | — | Unverified |
| 9 | ST-LLM-7B | mean | 3.15 | — | Unverified |
| 10 | VideoChat2_HD_mistral | mean | 3.1 | — | Unverified |