SOTAVerified

VGSI

Given a textual goal and multiple images representing candidate events, a model must choose the one image that constitutes a reasonable step towards the given goal. A model should correctly recognize not only the specific action illustrated in an image (e.g., “turning on the oven”), but also the intent of the action (“baking fish”).
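
The multiple-choice setup described above can be sketched as picking the candidate image whose representation best matches the goal. This is only an illustration under stated assumptions, not the method of any listed paper: the goal and each image are assumed to be already encoded as plain numeric vectors by some encoder that is not shown, cosine similarity is an assumed scoring rule, and `cosine` and `choose_step` are hypothetical helper names.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_step(goal_vec: Sequence[float],
                candidate_vecs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate image that scores highest against the goal.

    Assumption: goal_vec and candidate_vecs come from a hypothetical
    text/image encoder that maps both into the same vector space.
    """
    if not candidate_vecs:
        raise ValueError("at least one candidate image is required")
    scores = [cosine(goal_vec, c) for c in candidate_vecs]
    # Ties resolve to the earliest candidate.
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors standing in for one goal and four candidate images.
goal = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.1, 0.9]]
chosen = choose_step(goal, candidates)
```

With these toy vectors the second candidate (index 1) points most nearly in the direction of the goal, so it is the one selected. Accuracy on the task would then be the fraction of examples where the selected index matches the labelled correct image.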

Papers

Showing 1 of 1 papers

Title | Status | Hype
Visual Goal-Step Inference using wikiHow | Code | 0

No leaderboard results yet.