Dense Video Captioning
Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting the events in a video and describing each of them.
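To make the "detect and describe" output concrete, here is a minimal sketch of how a dense captioning result might be represented: a list of time segments, each with its own caption. The `Event` class, the `temporal_iou` helper, and all timestamps are hypothetical and chosen only for illustration; they are not taken from any model listed on this page. Temporal overlap (IoU) between segments is shown because it is a common way to compare predicted segments with reference segments, though the exact evaluation protocol depends on the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One detected event: a time span in seconds plus its caption (hypothetical structure)."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two time segments; 0.0 if they do not overlap."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


# Illustrative output for the piano example above; the timestamps are made up.
predictions = [
    Event(0.0, 30.0, "a man plays a piano"),
    Event(12.0, 20.0, "another man dances"),
    Event(25.0, 30.0, "a crowd claps"),
]

for event in predictions:
    print(f"[{event.start:5.1f}s - {event.end:5.1f}s] {event.caption}")
```

Events may overlap in time, which is why the output is a set of captioned segments rather than a single caption for the whole video.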
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HiCM² | CIDEr | 71.84 | — | Unverified |
| 2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | — | Unverified |
| 3 | Vid2Seq | CIDEr | 47.1 | — | Unverified |
| 4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | — | Unverified |
| 5 | CM² | CIDEr | 31.66 | — | Unverified |
| 6 | GVL | CIDEr | 26.52 | — | Unverified |
| 7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | — | Unverified |