Spatio-Temporal Video Grounding
Spatio-temporal video grounding (STVG) is a task at the intersection of computer vision and natural language processing (NLP). Given a video and a textual query, the goal is to localize the spatio-temporal region the query describes: both when the described content occurs and where it appears in each frame. The task is useful for applications such as video summarization, content-based video retrieval, and video captioning.
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | TA-STVG | Val m_vIoU | 40.2 | — | Unverified |
| 2 | CG-STVG | Val m_vIoU | 39.5 | — | Unverified |
| 3 | STVGFormer | Val m_vIoU | 38.7 | — | Unverified |
| 4 | TubeDETR | Val m_vIoU | 36.4 | — | Unverified |
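The metric column above reports m_vIoU. As a rough illustration, the sketch below assumes the commonly used definition: for each sample, vIoU sums the per-frame box IoU over the frames shared by the predicted and ground-truth tubes and divides by the number of frames in their union, and m_vIoU averages vIoU over all samples. The names `box_iou`, `v_iou`, and `m_v_iou`, and the tube representation (a dict from frame index to box), are hypothetical choices for this example and are not taken from any of the listed models' code.

```python
from typing import Dict, List, Tuple

# Assumed representation: a box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Sum of per-frame IoU over shared frames, divided by the size of the frame union."""
    frames_union = set(pred) | set(gt)
    frames_inter = set(pred) & set(gt)
    if not frames_union:
        return 0.0
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


def m_v_iou(pairs: List[Tuple[Tube, Tube]]) -> float:
    """Mean vIoU over (prediction, ground truth) pairs."""
    return sum(v_iou(p, g) for p, g in pairs) / len(pairs)


# Toy example: the prediction starts one frame early and ends one frame early.
pred = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
gt = {1: (0, 0, 10, 10), 2: (0, 0, 10, 5), 3: (0, 0, 10, 10)}
print(v_iou(pred, gt))  # 0.375
```

Under this definition a prediction is penalized both for poorly placed boxes (lower per-frame IoU) and for a poorly placed temporal segment (more frames in the union than in the intersection). The function returns a fraction in [0, 1]; leaderboards typically report such scores multiplied by 100.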