SOTAVerified

Given a textual goal and multiple images representing candidate events, a model must choose one image which constitutes a reason- able step towards the given goal. A model should correctly recognize not only the specific action illustrated in an image (e.g., “turning on the oven”), but also the intent of the action (“baking fish”).

VGSI

Papers