VGSI
Given a textual goal and multiple images representing candidate events, a model must choose one image which constitutes a reason- able step towards the given goal. A model should correctly recognize not only the specific action illustrated in an image (e.g., “turning on the oven”), but also the intent of the action (“baking fish”).
Papers
No papers found.
No leaderboard results yet.