Video Narration Captioning
Human narration is another critical factor in understanding a multi-shot video. It often supplies background knowledge and the commentator's view of the visual events. We run experiments on predicting the narration caption of a single video shot and call this task single-shot narration captioning. We adopt the same model structure as single-shot video captioning with ASR text as an additional input, except that the prediction target is the narration caption.
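As a rough illustration of the setup described above, the sketch below assembles one training example in which the ASR text is part of the input and the narration caption is the target. Everything named here (`ShotSample`, `build_example`, the `<video>` placeholder token, the prompt layout, and the default of 32 visual tokens) is a hypothetical choice made for illustration; the source does not specify the model's actual input format.

```python
from dataclasses import dataclass


@dataclass
class ShotSample:
    """One video shot with its transcript and narration (hypothetical schema)."""
    shot_id: str
    asr_text: str   # speech transcript for the shot, used as additional input
    narration: str  # ground-truth narration caption, used as prediction target


def build_example(sample: ShotSample, num_visual_tokens: int = 32) -> dict:
    """Build an (input, target) pair for single-shot narration captioning.

    The "<video>" tokens are assumed placeholders that a model would replace
    with visual features; the real token name and count are not given in the
    source.
    """
    visual = " ".join(["<video>"] * num_visual_tokens)
    prompt = f"{visual}\nASR: {sample.asr_text.strip()}\nNarration:"
    return {"input": prompt, "target": sample.narration.strip()}
```

Under these assumptions, switching from single-shot video captioning to narration captioning only changes which text field is used as `target`; the input side stays the same.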
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Ours | BLEU-4 | 18.8 | — | Unverified |
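For readers who want to check a BLEU-4 figure like the one in the table, here is a minimal sketch of corpus-level BLEU-4. It assumes whitespace tokenization, a single reference per prediction, no smoothing, and a 0 to 100 reporting scale. None of these settings is stated in the source, so scores from this sketch may differ from the claimed value, which was probably produced with a standard evaluation toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references):
    """Corpus-level BLEU-4 on a 0-100 scale (one reference per hypothesis).

    Assumptions: whitespace tokenization, no smoothing, uniform weights
    over 1- to 4-gram precisions.
    """
    matches = [0] * 4
    totals = [0] * 4
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, 5):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: each hypothesis n-gram is credited at most
            # as many times as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_precision)
```

A call such as `corpus_bleu4(predicted_narrations, reference_narrations)` takes two equal-length lists of strings; both variable names are placeholders.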