Dense Video Captioning
Most natural videos contain multiple events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. Dense video captioning involves both detecting these events in time and describing each one in natural language.
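As a rough illustration of the task's output, the sketch below represents each detected event as a time span plus a caption, and computes the temporal overlap (IoU) between a predicted and a reference event. The `Event` class, the `temporal_iou` helper, and all timestamps are hypothetical and chosen only for illustration; they do not come from any particular benchmark or model listed here. Benchmarks for this task commonly use a temporal overlap of this kind to decide which predicted captions are compared against which references, though the exact protocol varies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One localized, captioned event (hypothetical structure for illustration)."""
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of the time intervals of two events."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# A dense captioning output for one video: several events that may overlap in time.
# The timestamps are made up for this example.
predictions = [
    Event(0.0, 30.0, "a man playing a piano"),
    Event(10.0, 25.0, "another man dancing"),
    Event(25.0, 30.0, "a crowd clapping"),
]

# A made-up reference annotation to compare against.
reference = Event(0.0, 20.0, "a man plays the piano")

overlaps = [temporal_iou(p, reference) for p in predictions]
for p, iou in zip(predictions, overlaps):
    print(f"{p.caption!r}: tIoU = {iou:.2f}")
```

Events are kept as independent spans rather than a partition of the timeline because, as in the piano example, several events can occur at the same time.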
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VTimeLLM | CIDEr | 27.6 | — | Unverified |
| 2 | Vid2Seq | METEOR | 17 | — | Unverified |
| 3 | ADV-INF + Global | METEOR | 16.36 | — | Unverified |
| 4 | Bi-directional+intra captioning | METEOR | 11.28 | — | Unverified |
| 5 | GVL | METEOR | 10.03 | — | Unverified |
| 6 | TSRM-CMG-HRNN+SCST | METEOR | 9.71 | — | Unverified |
| 7 | PDVC (TSP features, no SCST) | METEOR | 9.03 | — | Unverified |
| 8 | TSP | METEOR | 8.75 | — | Unverified |
| 9 | CM² | METEOR | 8.55 | — | Unverified |
| 10 | BMT | METEOR | 8.44 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HiCM² | CIDEr | 71.84 | — | Unverified |
| 2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | — | Unverified |
| 3 | Vid2Seq | CIDEr | 47.1 | — | Unverified |
| 4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | — | Unverified |
| 5 | CM² | CIDEr | 31.66 | — | Unverified |
| 6 | GVL | CIDEr | 26.52 | — | Unverified |
| 7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Vid2Seq | CIDEr | 55.7 | — | Unverified |