Video-based Generative Performance Benchmarking (Correctness of Information)
This benchmark evaluates generative video conversational models on the correctness of the information in their responses. The test set is curated from the ActivityNet-200 dataset and features videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. An evaluation pipeline built on GPT-3.5 assigns each generated prediction a relative score on a scale of 1–5.
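A minimal sketch of such a GPT-assisted scoring step is shown below. Only the prompt construction and score parsing are modeled; the prompt wording and the `build_eval_prompt`/`parse_score` helpers are illustrative assumptions, not the benchmark's exact implementation, and the call to the GPT-3.5 API itself is omitted.

```python
import json
import re


def build_eval_prompt(question: str, answer: str, prediction: str) -> str:
    """Assemble a judge prompt asking the evaluator model (e.g. GPT-3.5)
    to rate the prediction's factual correctness against the reference
    answer on a 1-5 scale. Prompt wording is illustrative."""
    return (
        "Evaluate the factual correctness of the predicted answer "
        "against the ground-truth answer for the given question.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        'Reply with JSON of the form {"score": <integer 1-5>}.'
    )


def parse_score(judge_reply: str) -> int:
    """Extract the integer score from the judge's JSON reply,
    falling back to a regex scan if the JSON is malformed."""
    try:
        return int(json.loads(judge_reply)["score"])
    except (ValueError, KeyError, json.JSONDecodeError):
        match = re.search(r"\b([1-5])\b", judge_reply)
        if match is None:
            raise ValueError(f"no score found in: {judge_reply!r}")
        return int(match.group(1))
```

The per-model leaderboard entry would then be the mean of these per-question scores across the test set.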
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PPLLaVA-7B | gpt-score | 3.85 | — | Unverified |
| 2 | PLLaVA-34B | gpt-score | 3.60 | — | Unverified |
| 3 | TS-LLaVA-34B | gpt-score | 3.55 | — | Unverified |
| 4 | SlowFast-LLaVA-34B | gpt-score | 3.48 | — | Unverified |
| 5 | VideoChat2_HD_mistral | gpt-score | 3.40 | — | Unverified |
| 6 | VideoGPT+ | gpt-score | 3.27 | — | Unverified |
| 7 | ST-LLM | gpt-score | 3.23 | — | Unverified |
| 8 | MiniGPT4-video-7B | gpt-score | 3.08 | — | Unverified |
| 9 | VideoChat2 | gpt-score | 3.02 | — | Unverified |
| 10 | Chat-UniVi | gpt-score | 2.89 | — | Unverified |