Video-to-image Affordance Grounding
Given a demonstration video V and a target image I, the goal of video-to-image affordance grounding is to predict an affordance heatmap over the target image that corresponds to the hand-interacted region in the video, together with the affordance action (e.g., press, turn).
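To make the output format concrete, here is a minimal sketch of how raw model outputs might be decoded into the two predictions described above: a normalized heatmap over the target image and an action label. The names `AffordancePrediction`, `decode_prediction`, and the `ACTIONS` list are hypothetical and only for illustration (the "hold" entry in particular is an assumed label). The sketch also assumes the model emits per-pixel scores and per-action logits, which the task description does not specify.

```python
import math
from dataclasses import dataclass

# Hypothetical action vocabulary; the real label set depends on the dataset.
ACTIONS = ["press", "turn", "hold"]


@dataclass
class AffordancePrediction:
    heatmap: list  # H x W grid of non-negative values that sum to 1
    action: str    # predicted affordance action label


def softmax(values):
    """Numerically stable softmax over a flat list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(pixel_scores, action_logits):
    """Turn raw scores into a normalized heatmap and an action label.

    pixel_scores: H x W nested list of raw scores over the target image I
                  (assumed output format, not taken from any specific model).
    action_logits: one raw score per entry in ACTIONS.
    """
    width = len(pixel_scores[0])
    flat = [s for row in pixel_scores for s in row]
    probs = softmax(flat)
    # Reshape the flat probabilities back into H x W rows.
    heatmap = [probs[i:i + width] for i in range(0, len(probs), width)]
    best = max(range(len(action_logits)), key=lambda i: action_logits[i])
    return AffordancePrediction(heatmap=heatmap, action=ACTIONS[best])


# Toy 2 x 2 "image": the top-right pixel has the highest score.
pred = decode_prediction([[0.0, 2.0], [0.0, 0.0]], [0.1, 1.5, -0.3])
```

In this toy call, the heatmap mass concentrates on the top-right pixel and the action with the largest logit is selected.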
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hotspot | KLD | 1.47 | — | Unverified |
| 2 | HAG-Net (+Hand Box) | KLD | 1.41 | — | Unverified |
| 3 | Demo2Vec | KLD | 1.2 | — | Unverified |
| 4 | Afformer | KLD | 1.05 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hotspot | KLD | 1.26 | — | Unverified |
| 2 | HAG-Net (+Hand Box) | KLD | 1.21 | — | Unverified |
| 3 | Afformer | KLD | 0.97 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Demo2Vec | KLD | 2.34 | — | Unverified |
| 2 | Afformer (ResNet-50-FPN encoder) | KLD | 1.55 | — | Unverified |
| 3 | Afformer (ViTDet-B encoder) | KLD | 1.51 | — | Unverified |
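
The tables above report KLD (Kullback-Leibler divergence) between a ground-truth heatmap and a predicted heatmap, where lower values mean a closer match. A minimal sketch of how such a score could be computed is shown below. It follows the common saliency-metric convention of normalizing both maps to sum to 1 and adding a small epsilon for stability; the exact normalization and epsilon used by each listed paper are not given here, so treat those details as assumptions. The function names `normalize` and `kld` are illustrative.

```python
import math


def normalize(heatmap, eps=1e-12):
    """Scale a non-negative H x W heatmap so its values sum to (about) 1."""
    total = sum(sum(row) for row in heatmap)
    return [[v / (total + eps) for v in row] for row in heatmap]


def kld(gt, pred, eps=1e-12):
    """KL(gt || pred) summed over pixels; lower is better.

    Both inputs are H x W nested lists of non-negative values.
    The eps terms (an assumed convention) avoid log(0) and division by zero.
    """
    p = normalize(gt)
    q = normalize(pred)
    return sum(
        pv * math.log(pv / (qv + eps) + eps)
        for p_row, q_row in zip(p, q)
        for pv, qv in zip(p_row, q_row)
    )


# Toy example: compare a peaked ground truth against itself and a flat map.
gt = [[0.0, 1.0], [1.0, 2.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
score_same = kld(gt, gt)
score_uniform = kld(gt, uniform)
```

A prediction identical to the ground truth scores approximately zero, while the flat map is penalized for spreading mass away from the annotated region.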